Pandas: How to group a dataframe by one or multiple columns?

In today’s post we would like to provide you the required information for you to successfully use the DataFrame Groupby method in Pandas.

Using GroupBy on a Pandas DataFrame is overall simple: we first need to group the data according to one or more columns ; we’ll then apply some aggregation function / logic, being it mix, max, sum, mean etc’.

We’ll start with a simple Dataset that we’ll be using throughout this tutorial.

import pandas as pd
candidates_df = pd.read_csv('candidates')

Let’s take a look at our DataFrame:

print(candidates_df)
languagemonthsalarynum_candidatesdays_to_hire
1PHPApril118.083.054.0
2ScalaFebruary127.080.039.0
3GoMay122.075.058.0
4KotlinJuly146.082.073.0
5HaskellSeptember122.079.048.0
6HaskellFebruary130.090.051.0
7HaskellJuly118.073.065.0
8PHPNovember116.077.054.0
9ScalaFebruary114.088.039.0
10PythonOctober147.078.060

DataFrame Groupby using one column

We’ll first aggregate the number of candidates by month. In order to do so we’ll create a new DataFrame that contains the aggregated value. We’ll also assign the num_candidates name to the newly created aggregating column.

candidates_by_month = candidates_df.groupby('month').agg(num_cand_month = ('num_candidates', 'sum'))

print(candidates_by_month)

Let’s take a look at our newly created DataFRame:

monthnum_cand_month
April83.0
February258.0
July155.0
May75.0
November77.0
October78.0
September79

Pandas DF groupby multiple functions for same column

In this case we would like to show multiple aggregations (in our case min, mean and max) for the same column. Here is the Python code:

# group by  - multiple aggregations - same column

candidates_salary_by_month = candidates_df.groupby('month') \
                            .agg(min_sal = ('salary', 'min'), \
                                 mean_sal = ('salary', 'mean'),
                                 max_sal = ('salary', 'max')).round(2)

print(candidates_salary_by_month)

And the result:

Note: We could as well pass a dictionary containing the column to aggregate and the functions to use. In our case:

candidates_salary_by_month = candidates_df.groupby('month') \
                            .agg({'salary':['min','mean','max']})

Grouping by multiple columns with multiple aggregations functions

Here’s an example of multiple aggregations per grouping, each with their specific calculated function: a sum of the aggregating column and an average calculation.

# multiple columns
candidates_month_languages = candidates_df.groupby(['language','month']) \
.agg(num_cand_month = ('num_candidates', 'sum'), avg_sal = ('salary', 'mean'))

print(candidates_month_languages)

And here’s the output:

Formatting the groupby DataFrame

We can get rid of the decimal numbers by using a styler as shown below:

#formatting
candidates_salary_by_month =  candidates_df.groupby('month').agg(num_cand_month = \
                                                                ('num_candidates', 'sum'), \
                                                                avg_sal = ('salary', 'mean')).style.format('{:.0f}')

print(candidates_salary_by_month)

Here’s the result:

Plotting our groupby multiple columns

If we want to better visualize the data and look for outliers, we can use the Pandas DataFrame.plot() function. Note that in order to create more appealing charts we can use Seaborn; but in this case a simple bar graph will do.

# Re-define the DataFrame by grouping, and then plot

candidates_salary_by_month = candidates_df.groupby('month') \
                            .agg(min_sal = ('salary', 'min'), \
                                 mean_sal = ('salary', 'mean'),
                                 max_sal = ('salary', 'max')).round(2)

candidates_salary_by_month.plot(kind='bar');

The result looks as following:

Note: If you would like to learn more about simple barplots in Python, look at our bar chart tutorial.

Exporting DataFrame grouped columns to a Python list

A reader asked about how to save the content of the aggregated column into a Python list object. The Series method to_list() is probably the answer.

candidates_salary_by_month['mean_sal'].to_list()