Can’t remove duplicates from Pandas DataFrame
In this guide we will learn how to handle the case in which, even after invoking the drop_duplicates() DataFrame method to remove non-unique records, your DataFrame still appears to contain duplicates.
Step 1: Prepare your pandas DataFrame
First off, you need to acquire the data that you will filter for unique occurrences. For this simple example we will use some random HR data.
import pandas as pd
month = ['September', 'November', 'July', 'December', 'September']
language = ['Javascript', 'Java', 'R', 'Java', 'Javascript']
salary = [185.0, 138.0, 168.0, 118.0, 130.0]
hr_campaign = dict(month = month, language = language, salary = salary)
hrdf = pd.DataFrame(data=hr_campaign)
hrdf.head()
| | month | language | salary |
|---|---|---|---|
| 0 | September | Javascript | 185.0 |
| 1 | November | Java | 138.0 |
| 2 | July | R | 168.0 |
| 3 | December | Java | 118.0 |
| 4 | September | Javascript | 130.0 |
Step 2: Identify columns to check for duplicates
We will focus on removing duplicated records based on two columns – in our case the month and the language columns. We will define a list of DataFrame columns for determining the duplicated records:
cols = ['month', 'language']
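Before dropping anything, you can preview which rows pandas flags as duplicates on those columns with DataFrame.duplicated, which returns a boolean Series (the sketch below rebuilds the same example DataFrame so it runs standalone):

```python
import pandas as pd

hrdf = pd.DataFrame({
    'month': ['September', 'November', 'July', 'December', 'September'],
    'language': ['Javascript', 'Java', 'R', 'Java', 'Javascript'],
    'salary': [185.0, 138.0, 168.0, 118.0, 130.0],
})
cols = ['month', 'language']

# True for every row that repeats an earlier (month, language) pair
mask = hrdf.duplicated(subset=cols)
print(hrdf[mask])  # the last row (September / Javascript) is the only duplicate
```

This is a quick sanity check: if the mask is all False, drop_duplicates has nothing to remove on those columns.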
Step 3: Drop duplicates on the selected columns
Our next step is to remove duplicates. We will do it based on the DataFrame subset we defined in the previous step:
hrdf.drop_duplicates(subset = cols)
The last record of our DataFrame will be removed:
| | month | language | salary |
|---|---|---|---|
| 0 | September | Javascript | 185.0 |
| 1 | November | Java | 138.0 |
| 2 | July | R | 168.0 |
| 3 | December | Java | 118.0 |
Note: If you would like to remove duplicates across all columns (the index is never considered), simply omit the subset parameter:
hrdf.drop_duplicates()
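You can also control which member of each duplicated group survives via the keep parameter. A short sketch using the same example data:

```python
import pandas as pd

hrdf = pd.DataFrame({
    'month': ['September', 'November', 'July', 'December', 'September'],
    'language': ['Javascript', 'Java', 'R', 'Java', 'Javascript'],
    'salary': [185.0, 138.0, 168.0, 118.0, 130.0],
})
cols = ['month', 'language']

# keep='first' (the default) keeps row 0; keep='last' keeps row 4 instead
last = hrdf.drop_duplicates(subset=cols, keep='last')

# keep=False drops every member of a duplicated group
none = hrdf.drop_duplicates(subset=cols, keep=False)
```

With keep='last' the September/Javascript pair survives as row 4 (salary 130.0) rather than row 0, while keep=False removes both of them.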
Step 4: Persist your results in your DataFrame
By default drop_duplicates returns a new DataFrame and leaves the original untouched – this is the most common reason duplicates "still show up" after the call. To persist the changes, use one of the following:
Using the inplace=True parameter
hrdf.drop_duplicates(subset = cols, inplace=True)
Re-assign to the original DataFrame
hrdf = hrdf.drop_duplicates(subset = cols)
Create a new DataFrame consisting of unique records
hrdf_2 = hrdf.drop_duplicates(subset = cols)
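Whichever option you pick, you can verify that the change actually stuck. A minimal end-to-end sketch, using the re-assignment approach plus the optional ignore_index parameter to renumber the surviving rows 0..n-1:

```python
import pandas as pd

hrdf = pd.DataFrame({
    'month': ['September', 'November', 'July', 'December', 'September'],
    'language': ['Javascript', 'Java', 'R', 'Java', 'Javascript'],
    'salary': [185.0, 138.0, 168.0, 118.0, 130.0],
})
cols = ['month', 'language']

# Re-assign the result; ignore_index=True resets the index to 0..n-1
hrdf = hrdf.drop_duplicates(subset=cols, ignore_index=True)

# Confirm no duplicates remain on the selected columns
assert not hrdf.duplicated(subset=cols).any()
print(hrdf)
```

If this assertion fails in your own code, check that you either re-assigned the result or passed inplace=True.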