In today’s post we’ll learn how to count distinct occurrences in one or more columns of a Pandas DataFrame.
Preparations
We will first import the Pandas library and load our DataFrame contents. In this example we will use a simple CSV (comma-separated values) file as our data source.
import pandas as pd
hr = pd.read_csv('candidates.csv')
Let’s see the DataFrame shape:
hr.shape
The result is: (20, 5)
Let’s take a look at the column index:
hr.columns
The result is: Index(['area', 'month', 'salary', 'num_candidates', 'days_to_hire'], dtype='object')
Count distinct values in a column with value_counts
We can use the Series method value_counts to find the number of occurrences of each unique value in a Pandas DataFrame column. Here’s a simple snippet that counts the occurrences in the area column:
hr['area'].value_counts()
The result will be a Series listing each unique area and its number of occurrences, sorted in descending order.
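Since the candidates.csv file isn’t shown here, a minimal sketch with a small hypothetical DataFrame illustrates what value_counts returns:

```python
import pandas as pd

# Hypothetical sample standing in for the candidates.csv data
hr = pd.DataFrame({'area': ['Python', 'R', 'Python', 'Python', 'R']})

# Count occurrences of each unique value in the 'area' column
counts = hr['area'].value_counts()
print(counts)
# 'Python' occurs 3 times and 'R' 2 times; the Series is sorted descending
```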
Count distinct values in multiple columns
If instead of a single column we want to count combinations of values across multiple columns, we first subset the DataFrame and then apply the value_counts method:
subset = ['month', 'salary']
hr[subset].value_counts(ascending=False)
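Note that DataFrame.value_counts was added in pandas 1.1.0. A self-contained sketch with hypothetical data shows how it counts each (month, salary) pair:

```python
import pandas as pd

# Hypothetical data standing in for the hr DataFrame
hr = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Jan'],
    'salary': [100, 100, 120, 110],
})

subset = ['month', 'salary']
# Counts each unique (month, salary) combination, sorted by count descending
pair_counts = hr[subset].value_counts(ascending=False)
print(pair_counts)
# ('Jan', 100) appears twice; the other two pairs appear once each
```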
Find distinct values in a groupby
Some time ago we covered this topic in a dedicated tutorial on using the groupby.size() method to count distinct occurrences after aggregating data.
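For a quick reminder, the groupby.size() approach mentioned above can be sketched as follows (the data is hypothetical):

```python
import pandas as pd

# Hypothetical data standing in for the hr DataFrame
hr = pd.DataFrame({
    'area': ['Python', 'Python', 'R'],
    'num_candidates': [5, 7, 3],
})

# size() returns the number of rows in each group
group_sizes = hr.groupby('area').size()
print(group_sizes)
# Two rows fall in the 'Python' group and one in the 'R' group
```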
Unique values in an entire DataFrame
For completeness, note that we can also call the value_counts method on the entire DataFrame. The result counts each unique row, which helps us find the number of unique observations in our data and spot duplicated rows.
hr.value_counts()
You can easily find the number of duplicated rows using the drop_duplicates DataFrame method:
num_of_dups = len(hr) - len(hr.drop_duplicates())
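As a sketch with hypothetical data, the same count can also be obtained more directly with the duplicated method:

```python
import pandas as pd

# Hypothetical frame containing one fully duplicated row
hr = pd.DataFrame({
    'area': ['Python', 'R', 'Python'],
    'month': ['Jan', 'Feb', 'Jan'],
})

# Difference between total rows and de-duplicated rows
num_of_dups = len(hr) - len(hr.drop_duplicates())

# Equivalent alternative: duplicated() flags repeat rows directly
num_of_dups_alt = int(hr.duplicated().sum())
print(num_of_dups, num_of_dups_alt)
```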