Reproducing the Value Error cannot reshape in Pandas
Let’s assume that we have the following Python code that creates a very small DataFrame:
import pandas as pd
month = ['Jan', 'Jan', 'Feb', 'Mar']
office = ['Los Angeles', 'Los Angeles', 'New York', 'New York']
salary = [ 98, 92, 87, 92]
data = dict(month = month, office = office, salary = salary)
test_df = pd.DataFrame(data=data)
Let’s look at our data:
print (test_df.head())
month | office | salary | |
---|---|---|---|
0 | Jan | Los Angeles | 98 |
1 | Jan | Los Angeles | 92 |
2 | Feb | New York | 87 |
3 | Mar | New York | 92 |
To reproduce the error message, we will call the DataFrame pivot() function:
test_df.pivot(values= 'salary', index='office', columns='month')
This will render the following exception error message:
ValueError: Index contains duplicate entries, cannot reshape
Fixing the index contains duplicate entries exception
The root cause for the error, is that if you look carefully, the first and second rows in our DataFrame has the same values in the office and month columns. Those are used as the pivot row and columns index respectively. As the pivot function in Pandas doesn’t (by default) aggregate numeric values it doesn’t know how to “crunch” those two rows.
Solution #1 – create a pandas pivot table
We can use the pivot_table() function instead, and specify an aggregating function. In our case we’ll aggregate the numeric values by their average value.
test_df.pivot_table(values= 'salary', index='office', columns='month', aggfunc='mean', fill_value = 0)
This will return the following pivot table (technically a DataFrame):
month | Feb | Jan | Mar |
---|---|---|---|
office | |||
Los Angeles | 92 | 98 | 0 |
New York | 87 | 0 | 92 |
Related: How to write a pandas pivot table to a new DataFrame?
Solution #2 : Use groupby and unstack
We first aggregate the data as required and the use the unstack() DataFrame function in order to arrange the pivot rows and columns. Last, we replace missing values in our pivot table.
test_df.groupby(['office','month']).mean().unstack().fillna(0)
The result will be similar to the one produced in the previous section.