How to use the Pandas groupby and then sum in Python DataFrames?

In today’s tutorial well learn how you can easily aggregate data using the Pandas library, specifically we’ll look into use cases in which we use the groupby DataFrame method to split and re-group data followed by the groupby.sum function to aggregate the data accordingly.

Preparations

As always, we’ll start by importing the Pandas library and create a simple DataFrame which we’ll use throughout this example. If you would like to follow along, you can download the dataset from here.

# pandas groupby sum

import pandas as pd

cand = pd.read_csv('candidates'.csv)
cand.head()

Here’s our DataFrame header:

languagemonthsalarynum_candidatesdays_to_hire
1Swift4123.079.060.0
2Ruby4127.084.075.0
3VisualBasic5112.089.037.0
4Python9115.083.066.0
5Scala12111.078.073.0

Pandas groupby and sum example

Our first case is a simple grouping and sum aggregation by one column. We’ll pass the column name (in our case languages) to the Group by method, then use aggregate as needed using the sum function.

# pandas groupby sum one column (+ column name)

num_candidates = cand.groupby('language').agg(qty = ('num_candidates', 'sum')).astype({'qty':int})

num_candidates.head()

Here’s the output:

Using groupby.sort_values to sum and sort

In the next example we’ll simply sort the values using the sort_values method.

# sort and sum
sorted = cand.groupby('month').agg(candidate_qty = ('num_candidates', 'sum')).sort_values(by='month').astype({int})
sorted.head()

Here’s the output:

Aggregate by and sum multiple columns in Pandas

In the next example we would like to groupby multiple columns. To do that, we’ll pass a list of the relevant columns as shown below



num_candidates_mean_salary_month = cand.groupby(['language','month']).agg(qty = ('num_candidates', 'sum')).sort_values('month')

num_candidates_mean_salary_month.head()

And our result:

Combining mean and mean

In this example we will show how you can make more than aggregation to a groupby. In our case we’ll use the sum and mean functions to aggregate. Note that you can use count, min, max and others as needed.

# sum and mean
num_candidates_mean_salary_month = cand.groupby(['language','month']).agg(qty = ('num_candidates', 'sum'), avg_salary= ('salary', 'mean'))
num_candidates_mean_salary_month.head()

Ignoring NaN values

We can easily drop NaN values instead of filling them according to specific logic. In the code below empty values are removed from the DatFrame with dropna, and then aggregation is performed.

cand1 = cand.dropna()
num_candidates = cand1.groupby('language').agg(qty = ('num_candidates', 'sum'))
num_candidates.head()

Aggregate and sum specific rows

In our last example for today, we’ll go ahead and apply the aggregation only to a subset of our DataFrame. In this case we defined a list of languages for our grouping, subset our DataFrame accordinlgy and only then aggregate the data.

# condition
languages = ['Python', 'Scala', 'Ruby']
filtr = cand['language'].isin(languages)
cand[filtr].groupby('language').agg(qty = ('num_candidates', 'sum')).round()