As part of your data wrangling process you might need to identify unique values in your dataset columns. In this tutorial we’ll learn how to use several useful pandas Series methods that help to search for distinct values in an entire column in our DataFrame.
Example Data
We will start by importing the pandas library and creating a very simple DataFrame that you can use in order to follow along.
import pandas as pd
language = ['Javascript', 'Python', 'Java', 'Java', 'Javascript']
office = ['Bangkok', 'Toronto', 'Los Angeles', 'Bangkok', None]
salary = [134.0, 127.0, 194.0, 123.0, 145.0]
data = dict(language = language, office = office, salary = salary)
hrdf = pd.DataFrame(data=data)
hrdf.head()
Different values in pandas DataFrame column
There are several ways to identify the distinct values in a column / pandas Series:
Using drop_duplicates()
We can invoke the drop_duplicates Series method and identify the unique values:
unique_lang_s = hrdf.language.drop_duplicates()
print(unique_lang)
This returns a pandas series:
0 Javascript 1 Python 2 Java Name: language, dtype: object
Using unique()
We can also use the Series unique() method.
unique_lang_arry = hrdf.language.unique()
print (unique_lang_arry)
This returns a numpy array:
['Javascript' 'Python' 'Java']
Print distinct column values as a list
We can easily export column values to a Python:
unique_languages_s = hrdf.language.drop_duplicates()
lang_lst = unique_languages_s.to_list()
print(lang_lst)
Get unique column values in sorted order
unique_languages_s.sort_values()
The Series is sorted in ascending alphabetical order:
2 Java 0 Javascript 1 Python Name: language, dtype: object
Find unique column elements with nan / None values
Columns in our DataFrame might contain empty values (NA, nan, None and as such). We can use the drop,na() Series method to get rid of empty values.
In our example, the office contains one empty value.
print(hrdf.office.unique())
The result is:
['Bangkok' 'Toronto' 'Los Angeles' None]
We can easily filter out the empty / nan values:
print(hrdf.office.dropna().unique())
Doing so will return the following array:
['Bangkok' 'Toronto' 'Los Angeles']
Count number of uniques in a Series
hrdf.language.drop_duplicates().count()
The result will be: 3
Count frequency of distinct values in a column
In the last section of this tutorial, we would like to count the number of occurrences of each separate value. We will use the value_counts() Series method.
hrdf.language.value_counts()
We’ll get the following result:
Javascript 2 Java 2 Python 1 Name: language, dtype: int64
Related learning
How to find one or multiple specific values in a pandas column?