Today we will learn how to find all unique row occurrences in a Pandas DataFrame using the very powerful drop_duplicates() DataFrame method.
Create Sample Data
Let’s start by importing pandas into our Python script and initialize a simple DataFrame that you can use in order to follow along this tutorial.
import pandas as pd
#define sample data
month = ['October', 'October', 'December', 'October', 'September']
language = ['Java', 'Java', 'Python', 'Javascript', 'Python']
salary = [135.0, 135.0, 113.0, 135.0, 155.0]
#initialize dataframe from dictionary
hr = dict(month = month, language = language, salary = salary)
hrdf = pd.DataFrame(data=hr)
#view the dataframe head
hrdf.head()
Here are the DataFrame contents, note that the first and second row values are completely identical, and the 4th row differs in the value of the language column.
month | language | salary | |
---|---|---|---|
0 | October | Java | 135.0 |
1 | October | Java | 135.0 |
2 | December | Python | 113.0 |
3 | October | Javascript | 135.0 |
4 | September | Python | 155.0 |
Select unique rows in DataFrame
We can remove our duplicated rows and get the unique rows using the following snippet:
hr_uniques_df = hrdf.drop_duplicates()
We get a DataFrame built off unique row occurrences. Row number 1 which is a duplicate of row 0 was removed.
month | language | salary | |
---|---|---|---|
0 | October | Java | 135.0 |
2 | December | Python | 113.0 |
3 | October | Javascript | 135.0 |
4 | September | Python | 155.0 |
Find unique rows based on one or multiple columns
By default drop_duplicates() looks for duplicates across all columns of our DataFrame. That said, we can search for duplicates based on a subset of our columns.
hr_uniques_df = hrdf.drop_duplicates(subset = ['month', 'salary'])
Taking under consideration the month and salary columns, pandas for 3 unique rows:
month | language | salary | |
---|---|---|---|
0 | October | Java | 135.0 |
2 | December | Python | 113.0 |
4 | September | Python | 155.0 |
Get and count unique rows
We can easily count our unique rows, using the value_counts() DataFrame method.
total_uniques = hrdf.drop_duplicates(subset = ['month', 'salary']).value_counts().sum()