In today’s tutorial well learn how you can easily aggregate data using the Pandas library, specifically we’ll look into use cases in which we use the groupby DataFrame method to split and re-group data followed by the groupby.sum function to aggregate the data accordingly.
Preparations
As always, we’ll start by importing the Pandas library and create a simple DataFrame which we’ll use throughout this example. If you would like to follow along, you can download the dataset from here.
# pandas groupby sum
import pandas as pd
cand = pd.read_csv('candidates'.csv)
cand.head()
Here’s our DataFrame header:
language | month | salary | num_candidates | days_to_hire | |
---|---|---|---|---|---|
1 | Swift | 4 | 123.0 | 79.0 | 60.0 |
2 | Ruby | 4 | 127.0 | 84.0 | 75.0 |
3 | VisualBasic | 5 | 112.0 | 89.0 | 37.0 |
4 | Python | 9 | 115.0 | 83.0 | 66.0 |
5 | Scala | 12 | 111.0 | 78.0 | 73.0 |
Pandas groupby and sum example
Our first case is a simple grouping and sum aggregation by one column. We’ll pass the column name (in our case languages) to the Group by method, then use aggregate as needed using the sum function.
# pandas groupby sum one column (+ column name)
num_candidates = cand.groupby('language').agg(qty = ('num_candidates', 'sum')).astype({'qty':int})
num_candidates.head()
Here’s the output:
Using groupby.sort_values to sum and sort
In the next example we’ll simply sort the values using the sort_values method.
# sort and sum
sorted = cand.groupby('month').agg(candidate_qty = ('num_candidates', 'sum')).sort_values(by='month').astype({int})
sorted.head()
Here’s the output:
Aggregate by and sum multiple columns in Pandas
In the next example we would like to groupby multiple columns. To do that, we’ll pass a list of the relevant columns as shown below
num_candidates_mean_salary_month = cand.groupby(['language','month']).agg(qty = ('num_candidates', 'sum')).sort_values('month')
num_candidates_mean_salary_month.head()
And our result:
Combining mean and mean
In this example we will show how you can make more than aggregation to a groupby. In our case we’ll use the sum and mean functions to aggregate. Note that you can use count, min, max and others as needed.
# sum and mean
num_candidates_mean_salary_month = cand.groupby(['language','month']).agg(qty = ('num_candidates', 'sum'), avg_salary= ('salary', 'mean'))
num_candidates_mean_salary_month.head()
Ignoring NaN values
We can easily drop NaN values instead of filling them according to specific logic. In the code below empty values are removed from the DatFrame with dropna, and then aggregation is performed.
cand1 = cand.dropna()
num_candidates = cand1.groupby('language').agg(qty = ('num_candidates', 'sum'))
num_candidates.head()
Aggregate and sum specific rows
In our last example for today, we’ll go ahead and apply the aggregation only to a subset of our DataFrame. In this case we defined a list of languages for our grouping, subset our DataFrame accordinlgy and only then aggregate the data.
# condition
languages = ['Python', 'Scala', 'Ruby']
filtr = cand['language'].isin(languages)
cand[filtr].groupby('language').agg(qty = ('num_candidates', 'sum')).round()