How to find and highlight duplicates in pandas DataFrame?

In this tutorial we’ll learn how you can easily find duplicated rows in your Pandas DataFrame records. We will use the very handy duplicated() method on the entire DataFrame, one or more columns to see the duplicates so that we can decide whether to remove those rows as you prepare your data for further analysis.

Example DataFrame

We will start by creating a very simple dataset that you can use in order to follow along:

import pandas as pd

#define a data dictionary containing columns

interview_dict = \
{'domain': ['R', 'Python', 'Python', 'R'],
'office': ['Central', 'North' , 'North', 'Central'],
 'candidates': [24, 98, 98, 24],
 'salary_expect': [162, 184, 184, 134]}

# construct the Dataframe
hr_df = pd.DataFrame (interview_dict)

Here’s our data:

domainofficecandidatessalary_expect
0RCentral24162
1PythonNorth98184
2PythonNorth98184
3RCentral24134

Show all duplicated rows in Pandas DataFrame

We can easily show duplicated rows for the entire DataFrame using the duplicated() function. Let’s break it down:

When we invoke the duplicated() method on our DataFrame, we’ll get a Series of boolean representing whether each row is duplicated or not.

hr_df.duplicated()

Here is the Series we got:

0    False
1    False
2     True
3    False
dtype: bool

Now we can simply filter our DataFrame by passing the boolean series into the brackets:

hr_df[hr_df.duplicated()]

The output will be:

domainofficecandidatessalary_expect
2PythonNorth98184

By default, all duplicated occurrences except the first one are returned. We can get all duplicates by using the parameter keep=False:

hr_df[hr_df.duplicated(keep=False )]

All duplicated rows will be returned:

domainofficecandidatessalary_expect
1PythonNorth98184
2PythonNorth98184

Find all duplicates in a Pandas Series

We can also find duplicated values by specific conditions. In this example we’ll look only on duplicated values in the domain column. We’ll simply pass the Series name into our filter as shown below:


hr_df[hr_df.duplicated('domain')]

Here’s the output:

domainofficecandidatessalary_expect
2PythonNorth98184
3RCentral24134

Show duplicates in multiple DataFrame columns

In the same fashion, we can pass a list and search for duplicated values across multiple columns:

hr_df[hr_df.duplicated(['office','salary_expect'])]

Here’s the result:

domainofficecandidatessalary_expect
2PythonNorth98184

Highlight and color duplicated rows in a DataFrame

Last topic we would like to expand is how you can create a custom function that will help you color all duplicated rows so you can more easily visualize them.

#subset all duplicated rows
dup_rows = hr_df[hr_df.duplicated(keep=False)]

#function to determine background color for each row
def high_dup_rows (row, df):
    if row.values in df.values:
            color = 'pink'
    else:
            color = 'white'
    # list comprehension with formatting strings       
    return [f"background-color :{color}" for cell in row]

Now let’s call the function:

hr_df.style.apply(high_dup_rows, df = dup_rows, axis=1)

The relevant rows are highlighted as shown below: