Drop duplicates in pandas DataFrame columns not working

Can’t remove duplicates from a pandas DataFrame

In this guide we will look at what to do when your DataFrame still shows duplicates after you call the drop_duplicates DataFrame method to remove non-unique records.

Step 1: Prepare your pandas DataFrame

First, we need some data to filter for unique records. For this simple example we will use some made-up HR data.

import pandas as pd

# Sample HR data: rows 0 and 4 share the same month and language
month = ['September', 'November', 'July', 'December', 'September']
language = ['Javascript', 'Java', 'R', 'Java', 'Javascript']
salary = [185.0, 138.0, 168.0, 118.0, 130.0]
hr_campaign = dict(month=month, language=language, salary=salary)
hrdf = pd.DataFrame(data=hr_campaign)
hrdf.head()
       month    language  salary
0  September  Javascript   185.0
1   November        Java   138.0
2       July           R   168.0
3   December        Java   118.0
4  September  Javascript   130.0

Step 2: Identify columns to check for duplicates

We will remove duplicated records based on two columns: month and language. To do so, we define a list of the DataFrame columns used to determine which records are duplicates:

cols = ['month', 'language']
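Before dropping anything, it can help to look at which rows pandas treats as duplicates on those columns. The following is a minimal sketch that rebuilds the sample data so it runs on its own; the names dup_mask and dup_rows are illustrative and not part of the original example.

```python
import pandas as pd

hrdf = pd.DataFrame({
    "month": ["September", "November", "July", "December", "September"],
    "language": ["Javascript", "Java", "R", "Java", "Javascript"],
    "salary": [185.0, 138.0, 168.0, 118.0, 130.0],
})
cols = ["month", "language"]

# keep=False flags every member of a duplicated group, not only the repeats
dup_mask = hrdf.duplicated(subset=cols, keep=False)
dup_rows = hrdf[dup_mask]
print(dup_rows)  # rows 0 and 4 (September / Javascript)
```

If this selection comes back empty, no rows match on the chosen columns and drop_duplicates has nothing to remove.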

Step 3: Drop duplicates on the selected columns

Our next step is to remove duplicates. We will do it based on the DataFrame subset we defined in the previous step:

hrdf.drop_duplicates(subset=cols)

The last record of our DataFrame is removed, because by default drop_duplicates keeps the first occurrence of each duplicated group (keep='first'):

       month    language  salary
0  September  Javascript   185.0
1   November        Java   138.0
2       July           R   168.0
3   December        Java   118.0
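Which row of a duplicated group survives is controlled by the keep parameter of drop_duplicates. The sketch below rebuilds the sample data so it is self-contained and compares the three accepted values; the variable names kept_first, kept_last and kept_none are illustrative only.

```python
import pandas as pd

hrdf = pd.DataFrame({
    "month": ["September", "November", "July", "December", "September"],
    "language": ["Javascript", "Java", "R", "Java", "Javascript"],
    "salary": [185.0, 138.0, 168.0, 118.0, 130.0],
})
cols = ["month", "language"]

# Default behaviour: keep the first occurrence (row 0), drop row 4
kept_first = hrdf.drop_duplicates(subset=cols, keep="first")

# Keep the last occurrence instead (row 4), drop row 0
kept_last = hrdf.drop_duplicates(subset=cols, keep="last")

# Drop every row that belongs to a duplicated group (rows 0 and 4)
kept_none = hrdf.drop_duplicates(subset=cols, keep=False)

print(list(kept_first.index))  # [0, 1, 2, 3]
print(list(kept_last.index))   # [1, 2, 3, 4]
print(list(kept_none.index))   # [1, 2, 3]
```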

Note: To remove rows that are duplicated across all columns (the index is not taken into account), simply omit the subset parameter:

hrdf.drop_duplicates()
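On the sample data used here, calling drop_duplicates without subset removes nothing, which is one way duplicates can appear to "survive": rows 0 and 4 share the month and language but differ in salary, so they are not duplicates across all columns. A minimal self-contained sketch (the names all_cols and two_cols are illustrative):

```python
import pandas as pd

hrdf = pd.DataFrame({
    "month": ["September", "November", "July", "December", "September"],
    "language": ["Javascript", "Java", "R", "Java", "Javascript"],
    "salary": [185.0, 138.0, 168.0, 118.0, 130.0],
})

# All columns are compared: no two rows are fully identical, so all 5 remain
all_cols = hrdf.drop_duplicates()

# Only month and language are compared: row 4 repeats row 0 and is dropped
two_cols = hrdf.drop_duplicates(subset=["month", "language"])

print(len(all_cols), len(two_cols))  # 5 4
```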

Step 4: Persist your results in your DataFrame

By default, drop_duplicates returns a new DataFrame and leaves the original untouched, which is why the duplicates still show up in hrdf. To persist the changes, use one of the following:

Use the inplace=True parameter

hrdf.drop_duplicates(subset=cols, inplace=True)

Re-assign to the original DataFrame

hrdf = hrdf.drop_duplicates(subset=cols)

Create a new DataFrame consisting of unique records

hrdf_2 = hrdf.drop_duplicates(subset=cols)
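The difference between these options can be checked with a short, self-contained sketch. It rebuilds the sample data; the names rows_after_discard and result are illustrative and not part of the original example.

```python
import pandas as pd

hrdf = pd.DataFrame({
    "month": ["September", "November", "July", "December", "September"],
    "language": ["Javascript", "Java", "R", "Java", "Javascript"],
    "salary": [185.0, 138.0, 168.0, 118.0, 130.0],
})
cols = ["month", "language"]

# The returned DataFrame is discarded here, so hrdf itself is unchanged
hrdf.drop_duplicates(subset=cols)
rows_after_discard = len(hrdf)
print(rows_after_discard)  # 5

# With inplace=True the DataFrame is modified and the call returns None
result = hrdf.drop_duplicates(subset=cols, inplace=True)
print(result)     # None
print(len(hrdf))  # 4
```

Because the inplace call returns None, combining both approaches (hrdf = hrdf.drop_duplicates(subset=cols, inplace=True)) would replace the DataFrame with None; use one approach or the other.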