In today’s post we would like to show how to use the DataFrame Groupby method in pandas in order to aggregate data by one or multiple column values.
Group pandas DataFrame data by column
To group your pandas DataFrame data by one or multiple specific columns, use the groupby DataFrame method. This method splits your DataFrame rows into groups based on column values, then allows you to aggregate and transform the data as needed, such as calculate a sum or average. Last, it combines the aggregated data into a structure that you are able to further manipulate, analyze or visualize.
Pandas groupby – practical example
Let’s assume that we have a very simple Data set that consists in some dummy HR related information that we’ll be using throughout this tutorial.
import pandas as pd
candidates_df = pd.read_csv('candidates')
Let’s take a look at our DataFrame:
print(candidates_df)
language | month | salary | num_candidates | days_to_hire | |
---|---|---|---|---|---|
1 | PHP | April | 118.0 | 83.0 | 54.0 |
2 | Scala | February | 127.0 | 80.0 | 39.0 |
3 | Go | May | 122.0 | 75.0 | 58.0 |
4 | Kotlin | July | 146.0 | 82.0 | 73.0 |
5 | Haskell | September | 122.0 | 79.0 | 48.0 |
6 | Haskell | February | 130.0 | 90.0 | 51.0 |
7 | Haskell | July | 118.0 | 73.0 | 65.0 |
8 | PHP | November | 116.0 | 77.0 | 54.0 |
9 | Scala | February | 114.0 | 88.0 | 39.0 |
10 | Python | October | 147.0 | 78.0 | 60 |
DataFrame Groupby using one column
We first want to aggregate the number of candidates by month. In order to do so we’ll create a new DataFrame that contains the aggregated value. We’ll also assign the num_candidates name to the newly created aggregating column.
candidates_by_month = candidates_df.groupby('month').agg(num_cand_month = ('num_candidates', 'sum'))
print(candidates_by_month)
Let’s take a look at our newly created DataFrame:
month | num_cand_month |
---|---|
April | 83.0 |
February | 258.0 |
July | 155.0 |
May | 75.0 |
November | 77.0 |
October | 78.0 |
September | 79 |
Pandas DF groupby multiple functions for same column
In this case we would like to show multiple aggregations (in our case min, mean and max) for the same column. Here is the Python code:
# group by - multiple aggregations - same column
candidates_salary_by_month = candidates_df.groupby('month') \
.agg(min_sal = ('salary', 'min'), \
mean_sal = ('salary', 'mean'),
max_sal = ('salary', 'max')).round(2)
print(candidates_salary_by_month)
And the result:
Note: We could as well pass a dictionary containing the column to aggregate and the functions to use. In our case:
candidates_salary_by_month = candidates_df.groupby('month') \
.agg({'salary':['min','mean','max']})
Grouping by multiple columns with multiple aggregations functions
Can you groupby your data set multiple columns in Pandas? You bet!
Here’s an example of multiple aggregations per grouping, each with their specific calculated function: a sum of the aggregating column and an average calculation.
# multiple columns
candidates_month_languages = candidates_df.groupby(['language','month']) \
.agg(num_cand_month = ('num_candidates', 'sum'), avg_sal = ('salary', 'mean'))
print(candidates_month_languages)
And here’s the output:
Note that we have renamed the aggregating columns as needed. As an example, the avg_sal column represents the mean of the salary column aggregated by language and month.
Formatting the groupby DataFrame
We can get rid of the decimal numbers by using a styler as shown below:
#formatting
candidates_salary_by_month = candidates_df.groupby('month').agg(num_cand_month = \
('num_candidates', 'sum'), \
avg_sal = ('salary', 'mean')).style.format('{:.0f}')
print(candidates_salary_by_month)
Here’s the result:
Plotting our groupby multiple columns
If we want to better visualize the data and look for outliers, we can use the Pandas DataFrame.plot() function. Note that in order to create more appealing charts we can use Seaborn; but in this case a simple bar graph will do.
# Re-define the DataFrame by grouping, and then plot
candidates_salary_by_month = candidates_df.groupby('month') \
.agg(min_sal = ('salary', 'min'), \
mean_sal = ('salary', 'mean'),
max_sal = ('salary', 'max')).round(2)
candidates_salary_by_month.plot(kind='bar');
The result looks as following:
Note: If you would like to learn more about simple barplots in Python, look at our bar chart tutorial.
Exporting DataFrame grouped columns to a Python list
A reader asked about how to save the content of the aggregated column into a Python list object. The Series method to_list() is probably the answer.
candidates_salary_by_month['mean_sal'].to_list()