To write a where clause with Pandas, proceed as following:
your_df.query('your_condition').groupby('your_col')
#Example:
interviews_df.query ("language == 'Python'").groupby("language")["salary"].mean()
To write a having clause in Pandas proceed as following:
your_df.groupby('your_col').filter(lambda_function)
#Example:
interviews_df.groupby('language').filter(lambda g: len(g) > 3)
Create example data
We’ll start by creating a sample DataFrame based on an imaginary HR dataset.
import pandas as pd
month = ['December', 'September', 'September', 'December', 'June', 'November']
language = ['Python', 'Python', 'R', 'Python', 'R', 'Python']
salary = [108.0, 143.0, 169.0, 149.0, 139.0, 121.0]
data = dict(month = month, language = language, salary = salary)
interviews_df = pd.DataFrame(data=data)
Let’s look into our data:
interviews_df.head()
month | language | salary | |
---|---|---|---|
0 | December | Python | 108.0 |
1 | September | Python | 143.0 |
2 | September | R | 169.0 |
3 | December | Python | 149.0 |
4 | June | R | 139.0 |
We can easily aggregate the data by language using the snippet below
interviews_df.groupby('language')["salary"].mean().reset_index()
language | salary | |
---|---|---|
0 | Python | 160.25 |
1 | R | 154.00 |
Aggregate by a specific column value
In our fist example we would like to aggregate the data by the language column, but only take under consideration those rows that includes the ‘Python’ value. This is similar to writing a WHERE clause in SQL. In Pandas we’ll use the DataFrame query() method, that allows to filter only the relevant rows to aggregate and then grouped them.
(interviews_df
.query ("language == 'Python'")
.groupby("language")["salary"].mean()
.reset_index())
This returns the following results:
language | salary | |
---|---|---|
0 | Python | 160.25 |
Group with multiple conditions
In this example we will use a complex condition to filter out our data, for example we would like to group only records pertaining to the Python language with salary bigger than 130K:
(interviews_df
.query ("language == 'Python' & salary > 130")
.groupby("month")["salary"].mean()
.reset_index())
This will return the following:
month | salary | |
---|---|---|
0 | December | 158.5 |
1 | November | 181.0 |
2 | September | 143.0 |
Group by count and filter by group size
In this example we will write an SQL having clause in Pandas. We’ll use the filter function and write a lambda function that will filter out languages with less than three observations:
interviews_df.groupby('language').filter(lambda g: len(g) > 3)
This return only the rows related to the Python language related interviews:
month | language | salary | |
---|---|---|---|
0 | December | Python | 168.0 |
1 | September | Python | 143.0 |
3 | December | Python | 149.0 |
5 | November | Python | 181.0 |
Aggregate and filter by sum of values
Similary we can filter our by sum or average values. In this case we would like to filter out those rows related to languages which aggregated salary was lower than 400K.
interviews_df.groupby('language').filter(lambda g: g['salary'].sum() < 400)