In Data Analysis we often aggregate our data and then typically apply specific functions on it. Today we’ll learn how to count values on data that we have previously aggregated using the DataFrame.groupby() Pandas method.
Creating example data
Let’s first import the Python Pandas library and acquire data into our Python development environment:
import pandas as pd
hr = pd.read_csv('interview_data.csv')
hr.info()
Groupby and count in Pandas
We can easily aggregate our dataset and count the number of observations related to each programming language in our dataset.
hr.groupby('language').size()
Note that unlike the count() method, size() counts also occurrences of nan empty values.
Count pandas group by with condition
Let’s now assume that we want to show up only programming languages for which we interviewed more than twice during the year. We will first aggregate the data and then define a new column displaying the values we counted.
# groupby
languages = hr.groupby('language').agg(number_of_months = ('month', 'count'))
# define condition
filt = languages [ 'number_of_months'] > 2
# filter the DataFrame
languages [filt]
Here’s our result:
Aggregating and counting with multiple conditions
In the same manner, we can filter our grouped DataFrame by multiple conditions:
# aggregate data
languages_month_salary = hr.groupby('language').agg(number_of_months = ('month', 'count'), avg_sal = ('salary', 'mean'))
# define a complex condition - boolean or
filt = (languages_month_salary [ 'number_of_months']) > 2 |(languages_month_salary [ 'avg_sal'] > 100)
#filter the DataFrame
languages_month_salary [filt]
Result:
Groupby and count distinct values
In this case, we will first go ahead and aggregate the data, and then count the number of unique distinct values. We will then sort the data in a descending orders. The result in this case is a series.
hr.groupby('language')['month'].nunique().sort_values(ascending=False)
Count specific values
We suggest that you look into our tutorial on how to count unique specific values.