In this tutorial we’ll learn how to find duplicated rows in a Pandas DataFrame. We will use the handy duplicated() method on the entire DataFrame, or on one or more columns, to see the duplicates so that we can decide whether to remove those rows when preparing the data for further analysis.
Example DataFrame
We will start by creating a very simple dataset that you can use in order to follow along:
```python
import pandas as pd

# define a data dictionary containing the columns
interview_dict = {
    'domain': ['R', 'Python', 'Python', 'R'],
    'office': ['Central', 'North', 'North', 'Central'],
    'candidates': [24, 98, 98, 24],
    'salary_expect': [162, 184, 184, 134]
}

# construct the DataFrame
hr_df = pd.DataFrame(interview_dict)
```
Here’s our data:
|   | domain | office | candidates | salary_expect |
|---|--------|--------|------------|---------------|
| 0 | R      | Central | 24 | 162 |
| 1 | Python | North   | 98 | 184 |
| 2 | Python | North   | 98 | 184 |
| 3 | R      | Central | 24 | 134 |
Show all duplicated rows in Pandas DataFrame
We can easily show duplicated rows for the entire DataFrame using the duplicated() function. Let’s break it down:
When we invoke the duplicated() method on our DataFrame, we get a boolean Series indicating whether each row is a duplicate of an earlier row.
```python
hr_df.duplicated()
```
Here is the Series we got:
```
0    False
1    False
2     True
3    False
dtype: bool
```
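As a small aside, because the result is a boolean Series, you can count the duplicated rows by summing it (True counts as 1). A minimal sketch, rebuilding the example DataFrame so it runs standalone:

```python
import pandas as pd

# re-create the example DataFrame from the tutorial
hr_df = pd.DataFrame({
    'domain': ['R', 'Python', 'Python', 'R'],
    'office': ['Central', 'North', 'North', 'Central'],
    'candidates': [24, 98, 98, 24],
    'salary_expect': [162, 184, 184, 134]})

# summing the boolean mask counts the duplicated rows
n_dups = hr_df.duplicated().sum()
print(n_dups)  # → 1
```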
Now we can simply filter our DataFrame by passing the boolean series into the brackets:
```python
hr_df[hr_df.duplicated()]
```
The output will be:
|   | domain | office | candidates | salary_expect |
|---|--------|--------|------------|---------------|
| 2 | Python | North  | 98 | 184 |
By default, every occurrence except the first is marked as a duplicate. To return all occurrences, including the first, use the parameter keep=False:

```python
hr_df[hr_df.duplicated(keep=False)]
```
All duplicated rows will be returned:
|   | domain | office | candidates | salary_expect |
|---|--------|--------|------------|---------------|
| 1 | Python | North  | 98 | 184 |
| 2 | Python | North  | 98 | 184 |
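For completeness, keep also accepts 'last', which marks every occurrence except the last one, so the earlier duplicate row is the one returned. A short sketch, rebuilding the example DataFrame so it runs standalone:

```python
import pandas as pd

# re-create the example DataFrame from the tutorial
hr_df = pd.DataFrame({
    'domain': ['R', 'Python', 'Python', 'R'],
    'office': ['Central', 'North', 'North', 'Central'],
    'candidates': [24, 98, 98, 24],
    'salary_expect': [162, 184, 184, 134]})

# keep='last' flags all occurrences except the last one
first_dups = hr_df[hr_df.duplicated(keep='last')]
print(first_dups.index.tolist())  # → [1]
```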
Find all duplicates in a Pandas Series
We can also find duplicates based on specific columns. In this example we’ll look only at duplicated values in the domain column, by passing the column name into duplicated() as shown below:

```python
hr_df[hr_df.duplicated('domain')]
```
Here’s the output:
|   | domain | office | candidates | salary_expect |
|---|--------|--------|------------|---------------|
| 2 | Python | North   | 98 | 184 |
| 3 | R      | Central | 24 | 134 |
Show duplicates in multiple DataFrame columns
In the same fashion, we can pass a list and search for duplicated values across multiple columns:
```python
hr_df[hr_df.duplicated(['office', 'salary_expect'])]
```
Here’s the result:
|   | domain | office | candidates | salary_expect |
|---|--------|--------|------------|---------------|
| 2 | Python | North  | 98 | 184 |
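Since the goal stated at the top is to decide whether to remove duplicated rows, it is worth noting that drop_duplicates() accepts the same subset and keep arguments and removes the flagged rows directly. A quick sketch, rebuilding the example DataFrame so it runs standalone:

```python
import pandas as pd

# re-create the example DataFrame from the tutorial
hr_df = pd.DataFrame({
    'domain': ['R', 'Python', 'Python', 'R'],
    'office': ['Central', 'North', 'North', 'Central'],
    'candidates': [24, 98, 98, 24],
    'salary_expect': [162, 184, 184, 134]})

# drop full-row duplicates, keeping the first occurrence
deduped = hr_df.drop_duplicates()
print(deduped.shape)  # → (3, 4)

# drop based on a subset of columns, mirroring duplicated(['office', 'salary_expect'])
deduped_subset = hr_df.drop_duplicates(subset=['office', 'salary_expect'])
```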
Highlight and color duplicated rows in a DataFrame
The last topic we’d like to cover is how to create a custom function that colors all duplicated rows so you can visualize them more easily.
```python
# subset all duplicated rows (keep=False marks every occurrence)
dup_rows = hr_df[hr_df.duplicated(keep=False)]

# function that determines the background color for each row
def high_dup_rows(row, df):
    # color the row if its index appears in the duplicates subset
    color = 'pink' if row.name in df.index else 'white'
    # return one style string per cell in the row
    return [f"background-color: {color}" for _ in row]
```
Now let’s call the function:
```python
hr_df.style.apply(high_dup_rows, df=dup_rows, axis=1)
```
The duplicated rows will be rendered with a pink background.