Group by and filter with conditions (SQL where and having) in Pandas

To write a where clause with Pandas, proceed as following:

your_df.query('your_condition').groupby('your_col')

#Example:

interviews_df.query ("language == 'Python'").groupby("language")["salary"].mean()

To write a having clause in Pandas proceed as following:

your_df.groupby('your_col').filter(lambda_function)

#Example:

interviews_df.groupby('language').filter(lambda g: len(g) > 3)

Create example data

We’ll start by creating a sample DataFrame based on an imaginary HR dataset.

import pandas as pd

month = ['December', 'September', 'September', 'December', 'June', 'November']
language = ['Python', 'Python', 'R', 'Python', 'R', 'Python']
salary = [108.0, 143.0, 169.0, 149.0, 139.0, 121.0]
data = dict(month = month, language = language, salary = salary)
interviews_df = pd.DataFrame(data=data)

Let’s look into our data:

interviews_df.head()
monthlanguagesalary
0DecemberPython108.0
1SeptemberPython143.0
2SeptemberR169.0
3DecemberPython149.0
4JuneR139.0

We can easily aggregate the data by language using the snippet below

interviews_df.groupby('language')["salary"].mean().reset_index()
languagesalary
0Python160.25
1R154.00

Aggregate by a specific column value

In our fist example we would like to aggregate the data by the language column, but only take under consideration those rows that includes the ‘Python’ value. This is similar to writing a WHERE clause in SQL. In Pandas we’ll use the DataFrame query() method, that allows to filter only the relevant rows to aggregate and then grouped them.

(interviews_df
 .query ("language == 'Python'")
 .groupby("language")["salary"].mean()
 .reset_index())

This returns the following results:

languagesalary
0Python160.25

Group with multiple conditions

In this example we will use a complex condition to filter out our data, for example we would like to group only records pertaining to the Python language with salary bigger than 130K:

(interviews_df
.query ("language == 'Python' & salary > 130")
.groupby("month")["salary"].mean()
.reset_index())

This will return the following:

monthsalary
0December158.5
1November181.0
2September143.0

Group by count and filter by group size

In this example we will write an SQL having clause in Pandas. We’ll use the filter function and write a lambda function that will filter out languages with less than three observations:

interviews_df.groupby('language').filter(lambda g: len(g) > 3)

This return only the rows related to the Python language related interviews:

monthlanguagesalary
0DecemberPython168.0
1SeptemberPython143.0
3DecemberPython149.0
5NovemberPython181.0

Aggregate and filter by sum of values

Similary we can filter our by sum or average values. In this case we would like to filter out those rows related to languages which aggregated salary was lower than 400K.

interviews_df.groupby('language').filter(lambda g: g['salary'].sum() < 400)