Solve valueerror cannot reindex from a duplicate axis in pandas

Cannot reindex DataFrame with duplicated axis

Let’s start by writing some simple Python coder to define a pandas DataFrame. In reality, most probably you will be acquiring your data from an external file, database or API.

import pandas as pd

stamps = ['01-02-23', ['01-02-23','01-02-24'] , '01-02-24', '01-03-24', '02-03-24']
sales_team = ['North', 'South', 'West', 'East', 'South']
revenue = [109.0, 201.0, 156.0, 181.0, 117.0]

rev_df = pd.DataFrame (dict(time = stamps, team = sales_team, revenue = revenue) )

print (rev_df)

We will get the following data set:

timeteamrevenue
001-02-23North109.0
1[01-02-23, 01-02-24]South201.0
201-02-24West156.0
301-03-24East181.0
402-03-24South117.0

As the time column contains a list, we will break down the second row to two different rows using the explode() function.

new_rev_df = rev_df.explode('time')
print(new_rev_df.head())

One feature of explode() is that it replicates indexes. We will get the following data:

timeteamrevenue
001-02-23North109.0
101-02-23South201.0
101-02-24South201.0
201-02-24West156.0
301-03-24East181.0

Trying to re-index the DataFrame so that the time column becomes the index, will fail with a Valueerror exception:

idx = ['time']
new_rev_df.reindex(idx)

The error message will be:

ValueError: cannot reindex on an axis with duplicate labels

I have encountered this error also when invoking the Seaborn library on data containing duplicated indexes.

Fixing the error

There are a couple of ways that can help to circumvent this error messages.

Aggregate the data

We can groupby the data and then save it as a DataFrame. Note that with this option no data is removed from your DataFrame

new_rev_df.groupby(['time','team']).revenue.sum().to_frame()

Remove Duplicated indexes

We can use the pandas loc indexer in order to get rid of any duplicated indexes. Using this option the second duplicated index is removed.

dup_idx = new_rev_df.index.duplicated()
new_rev_df.loc[~dup_idx]

Note: In this tutorial we replicated the problem for cases in which the row index is duplicated. You might as well encounter this issue when working with datasets, typically wide ones, that including duplicated columns.

Related learning

How to merge columns of a Pandas DataFrame object?