Cannot reindex DataFrame with duplicated axis
Let’s start by writing some simple Python coder to define a pandas DataFrame. In reality, most probably you will be acquiring your data from an external file, database or API.
import pandas as pd
stamps = ['01-02-23', ['01-02-23','01-02-24'] , '01-02-24', '01-03-24', '02-03-24']
sales_team = ['North', 'South', 'West', 'East', 'South']
revenue = [109.0, 201.0, 156.0, 181.0, 117.0]
rev_df = pd.DataFrame (dict(time = stamps, team = sales_team, revenue = revenue) )
print (rev_df)
We will get the following data set:
time | team | revenue | |
---|---|---|---|
0 | 01-02-23 | North | 109.0 |
1 | [01-02-23, 01-02-24] | South | 201.0 |
2 | 01-02-24 | West | 156.0 |
3 | 01-03-24 | East | 181.0 |
4 | 02-03-24 | South | 117.0 |
As the time column contains a list, we will break down the second row to two different rows using the explode() function.
new_rev_df = rev_df.explode('time')
print(new_rev_df.head())
One feature of explode() is that it replicates indexes. We will get the following data:
time | team | revenue | |
---|---|---|---|
0 | 01-02-23 | North | 109.0 |
1 | 01-02-23 | South | 201.0 |
1 | 01-02-24 | South | 201.0 |
2 | 01-02-24 | West | 156.0 |
3 | 01-03-24 | East | 181.0 |
Trying to re-index the DataFrame so that the time column becomes the index, will fail with a Valueerror exception:
idx = ['time']
new_rev_df.reindex(idx)
The error message will be:
ValueError: cannot reindex on an axis with duplicate labels
I have encountered this error also when invoking the Seaborn library on data containing duplicated indexes.
Fixing the error
There are a couple of ways that can help to circumvent this error messages.
Aggregate the data
We can groupby the data and then save it as a DataFrame. Note that with this option no data is removed from your DataFrame
new_rev_df.groupby(['time','team']).revenue.sum().to_frame()
Remove Duplicated indexes
We can use the pandas loc indexer in order to get rid of any duplicated indexes. Using this option the second duplicated index is removed.
dup_idx = new_rev_df.index.duplicated()
new_rev_df.loc[~dup_idx]
Note: In this tutorial we replicated the problem for cases in which the row index is duplicated. You might as well encounter this issue when working with datasets, typically wide ones, that including duplicated columns.