In today’s quick data Analysis tip we would like to learn how to summarize one or multiple Python DataFrame columns at once.
Creating test data
As we normally do, let’s create a simple test DataFrame using the red_csv() method.
# import required Python libraries
import pandas as pd
'Use the pd.read_csv() method to import a csv file
inter = pd.read_csv('interviews.csv')
print(inter)
Note: If for some reason, you are not able to import the pandas library, you might want to look into this article on how to troubleshoot your Pandas installation.
Here’s our data:
month | first_interview | second_interview | |
---|---|---|---|
1 | November | 85.0 | 78.0 |
2 | August | 87.0 | 82.0 |
3 | March | 85.0 | 84.0 |
4 | August | 77.0 | 80.0 |
5 | June | 81.0 | 87.0 |
Sum all columns in a Pandas DataFrame into new column
If we want to summarize all the columns, then we can simply use the DataFrame sum() method. Note that we passed the following parameters:
- axis: If we want to aggregate the columns, then we’ll use axis=1. For rows we’ll use axis=0.
- numeric_only = we’ll take under consideration only numeric columns.
We’ll aggregate the sum of the columns into a newly created column – named total.
inter['total'] = inter.sum(axis=1, numeric_only= True)
inter
month | first_interview | second_interview | total | |
---|---|---|---|---|
1 | November | 85.0 | 78.0 | 163.0 |
2 | August | 87.0 | 82.0 | 169.0 |
3 | March | 85.0 | 84.0 | 169.0 |
4 | August | 77.0 | 80.0 | 157.0 |
5 | June | 81.0 | 87.0 | 168.0 |
Add all columns using Apply and Lambda
This might not always be practical, but for completeness: if your DataFrame contains only numeric columns you add up all columns by using a simple apply statement and call a lambda function.
inter['total'] = inter.apply(lambda r:r.sum(), axis =1)
Sum DataFrame columns into a Pandas Series
Instead of creating a new column, we’ll receive a Python series:
int_s = inter.sum(axis=1, numeric_only= True)
Sum multiple columns in a Python DataFrame
If we want to go ahead and sum only specific columns, then we can subset the DataFrame by those columns and then summarize the result. We can obviously pass one column, specific ones, all columns except one etc’.
'Define your list of columns
cols = ['first_interview', 'second_interview']
inter ['total_2'] = inter[cols].sum(axis=1)
inter
month | first_interview | second_interview | total | total_2 | |
---|---|---|---|---|---|
1 | November | 85.0 | 78.0 | 163.0 | 163.0 |
2 | August | 87.0 | 82.0 | 169.0 | 169.0 |
3 | March | 85.0 | 84.0 | 169.0 | 169.0 |
4 | August | 77.0 | 80.0 | 157.0 | 157.0 |
5 | June | 81.0 | 87.0 | 168.0 | 168.0 |
DataFrame sum columns based on name (or other condition)
You could potentially only sum columns which name matches a specific string value as well. In this example, we’ll select the column to aggregate based on a specific text pattern. We then pass the selection into the column parameter or a loc indexer on our DataFrame.
col_array = inter.columns.str.contains('first')
inter.loc[:, col_array].sum(axis=1)
Next learning
Now that you are proficient with summing up columns, we suggest that you look into our tutorial on how to sum Pandas rows.