How to count distinct values in one or multiple Pandas DataFrame columns?

In today’s post we’ll learn how to count distinct occurrences in one or multiple columns of a Pandas DataFrame.

Preparations

We will first import the Pandas library and load our data into a DataFrame. In this example we will use a simple CSV (comma-separated values) file as our data source.

import pandas as pd
hr = pd.read_csv('candidates.csv')

Let’s see the DataFrame shape:

hr.shape

The result is: (20, 5)

Let’s take a look at the column index:

hr.columns

The result is: Index(['area', 'month', 'salary', 'num_candidates', 'days_to_hire'], dtype='object')
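
If you don’t have a comparable candidates.csv on hand, a small in-memory DataFrame with the same column names is enough to follow along. The values below are invented purely for illustration and won’t match the original file:

import pandas as pd

# Hypothetical stand-in for candidates.csv; only the column names match the original data.
hr = pd.DataFrame({
    'area': ['Python', 'SQL', 'Python', 'R', 'SQL'],
    'month': ['May', 'May', 'June', 'June', 'July'],
    'salary': [110, 95, 120, 100, 105],
    'num_candidates': [15, 9, 12, 7, 11],
    'days_to_hire': [30, 22, 41, 18, 25],
})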

Count distinct values in a column with value_counts

We can easily use the Series method value_counts to count the occurrences of each unique value in a Pandas DataFrame column. Here’s a simple snippet to count distinct occurrences in the area column:

hr['area'].value_counts()

The result will be a Series that lists each distinct area together with the number of rows in which it appears, sorted by frequency in descending order.
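
If all you need is how many distinct areas there are, rather than the frequency of each one, the Series nunique method returns that single number (shown here on the same hr DataFrame):

# number of distinct values in the area column
hr['area'].nunique()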

Count distinct values in multiple columns

If, instead of a single column, we would like to count distinct combinations of values across multiple columns, we’ll first need to subset the DataFrame and then apply the value_counts method:

subset = ['month', 'salary']
hr[subset].value_counts(ascending=False)
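
The result is a Series whose index is made of (month, salary) pairs. If a flat table is easier to read, the Series reset_index method can turn it back into a regular DataFrame; the column name used below is just an illustrative choice:

# unpack the MultiIndex into regular columns and name the counts column
subset_counts = hr[subset].value_counts(ascending=False)
subset_counts.reset_index(name='count')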

Find distinct values in a groupby

Some time ago we covered this topic in a dedicated tutorial on using the groupby.size() method to count distinct occurrences after aggregating data.
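
As a quick refresher, here is a minimal sketch of that approach using the same hr DataFrame; grouping by the area column is an assumption made here for illustration:

# number of rows per area
hr.groupby('area').size()

# number of distinct months recorded for each area
hr.groupby('area')['month'].nunique()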

Unique values in an entire DataFrame

For completeness, we would like to mention that we can also use the value_counts method on the entire DataFrame. The result will help us find the number of unique observations in our data and the number of duplicated rows.

hr.value_counts()
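
Since each entry of that Series corresponds to one unique row, its length gives the number of unique observations (assuming there are no missing values, which value_counts drops by default):

# number of unique rows in the DataFrame
len(hr.value_counts())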

You can easily find the number of duplicated rows using the drop_duplicates DataFrame method.

num_of_dups = len(hr) - len(hr.drop_duplicates())
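
An alternative, not part of the original snippet but equivalent, is to sum the boolean mask returned by the duplicated method, which flags every repeat of an earlier row:

# rows marked True are duplicates of an earlier row
num_of_dups = hr.duplicated().sum()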

Additional Learning