In this tutorial we’ll learn how to use Python to join columns from multiple Pandas DataFrame objects. We’ll look into several cases:
- Join columns of similar length
- Add columns based on index
- Join DataFrame columns based on conditions
- Add columns with different length
Creating Example DataFrames
As we always do, we’ll start by importing Pandas and creating some data that can help you follow along this example. Here’s how to troubleshoot your Pandas install in case the library can’t be imported.
import pandas as pd
#define DataFrames
sales = pd.DataFrame({'area': ['North', 'East', 'South', 'West'], 'sales': [250,350,500,400]})
quotas = pd.DataFrame({'area': ['North', 'East', 'South', 'West'], 'target': [350,200,550,400], 'num_sales' : [10,15,13,14]})
targets = quotas['target']
Append a single column from one DataFrame to other
We can use the join() DataFrame method to add columns from different DataFrames. We chose to persist the resulting data as a new DataFrame.
sales_vs_targets = sales.join(targets, how='left')
sales_vs_targets.head()
Note that a left outer join is performed by default
Here’s our data:
area | sales | target | |
---|---|---|---|
0 | North | 250 | 350 |
1 | East | 350 | 200 |
2 | South | 500 | 550 |
3 | West | 400 | 400 |
Adding columns based on specific index
In the following case, instead of using the default DataFrame index, we’ll join based on a different index column.
sales_vs_targets = sales.set_index('area').join(quotas.set_index('area')['target'])
The result is slightly different:
sales | target | |
---|---|---|
area | ||
North | 250 | 350 |
East | 350 | 200 |
South | 500 | 550 |
West | 400 | 400 |
Add column based on condition
A more interesting case is that you might need to add columns based on a specific condition. Let’s assume that you want to focus only on areas with high sales targets.
#filter dataframe according to condition
filt = quotas['target'] > 350
quotas_subset = quotas[filt]
#join the dataframes
sales_vs_high_targets = sales.set_index('area').join(quotas_subset.set_index('area')['target'], how='right')
The result will be:
sales | target | |
---|---|---|
area | ||
South | 500 | 550 |
West | 400 | 400 |
Note: the right outer join helped to ensure that only the filtered rows in the quotes dataframe will be joined. In general tweaking the join strategy using the how parameter will help us to deal with dataframes having different length.
Adding multiple columns
In the same fashion we can add multiple DataFrame columns. Instead of passing a single column, we’ll pass a list.
sales_vs_targets_multiple = sales.set_index('area').join(quotas.set_index('area')[['target', 'num_sales']])
sales | target | num_sales | |
---|---|---|---|
area | |||
North | 250 | 350 | 10 |
East | 350 | 200 | 15 |
South | 500 | 550 | 13 |
West | 400 | 400 | 14 |
DataFrames with different length
In reality, you will most probably be joining complex DataFrames having different shapes.
Let’s see a very simple example:
quotas_short = pd.DataFrame({'area': ['North', 'East', 'South'], 'target': [350,200,550]})
sales.set_index('area').join(quotas_short.set_index('area'))
Will render the following dataframe containing empty values.
sales | target | |
---|---|---|
area | ||
North | 250 | 350.0 |
East | 350 | 200.0 |
South | 500 | 550.0 |
West | 400 | NaN |
Whereas specifying a right outer join will show only rows with non empty values.
sales.set_index('area').join(quotas_short.set_index('area'), how='right')
sales | target | |
---|---|---|
area | ||
North | 250 | 350 |
East | 350 | 200 |
South | 500 | 550 |