While analyzing data you typically group data using the very handy Pandas groupby method. You can then more easily subset your data , iterate over the groups while applying some transformation logic or just probably visualize your data using pandas, matplotlib or seaborn. However, once your data is grouped as a Series object, you might want to convert it to a DataFrame instead. Once you have a DataFrame, you can take advantage of significant amount of methods that will simplify your data wrangling process.
In this tutorial i would like to show you three simple ways which i typically use to convert group by objects into DataFrames. As a follow up topic – i have added a simple tutorial to demonstrate the use of the count method on grouped data.
Creating example Data
We’ll start by creating data and grouping it using the groupby DataFrame method
import pandas as pd
# define the dataset
language = ['VisualBasic', 'VisualBasic', 'Java', 'Java', 'VisualBasic', 'Java', 'C#', 'VisualBasic']
office = ['Atlanta', 'New York', 'New York', 'Boston', 'Atlanta', 'New York', 'Atlanta', 'New York']
month = ['March', 'May', 'February', 'January', 'March', 'May', 'May', 'May']
salary = [109.0, 110.0, 106.0, 107.0, 110.0, 113.0, 128.0, 131.0]
salaries = dict(month=month, language=language, office=office, salary = salary)
hiring = pd.DataFrame(data=salaries)
# grouping the data
hiring_gp = hiring.groupby(['language','month'])['salary'].mean()
print(hiring_gp)
We now have a Series that contains the grouping results.
# 1 – Convert the Groupby to a DataFrame with to_frame()
The first option to convert the grouped data to a DataFrame is using the Series method to_frame(). Here’s the snippet:
hiring_gp.to_frame()
and the result:
# 2 – Turn Series to DataFrame with unstack
We can use the to_frame() method that we just reviewed in order to convert Series objects to DataFrames.
Another useful method to achieve that is to use the unstack method:
hiring_gp.unstack(level='month')
And here is the result:
# 3 – Group data directly into a DataFrame with groupby.agg
We can obviously aggregate data directly into a DataFrame, using the groupby.agg method:
hiring.groupby(['language','month']).agg (avg_salary= ('salary', 'mean'))
Groupby DataFrame by all columns (or multiple ones)
Another question we typically get is how to groupby DataFrame data by multiple columns (or even all columns). If you are in search for a solution for that look into this post on multiple column grouping in Pandas.
Pandas groupby DataFrame to csv
We can easily write our data to a comma separated value (CSV) file.
hiring_gp.unstack(level='month').to_csv('hiring.csv')
Converting the GroupBy object to Excel
We can easily use the Series to_excel method to convert the grouped data to Excel:
hiring_gp.to_excel
Alternatively we can first create a DataFrame and use a similar to_excel method for DataFrames
import openpyxl
hiring_gp.unstack(level='month').to_excel('hiring.xlsx')
You might want to look into a more comprehensive tutorial on how to export Pandas DataFrame to Excel workbooks.
Count number of elements after aggregating
In Data Analysis we often aggregate our data and then typically apply specific functions on it. Today we’ll learn how to count values on data that we have previously aggregated using the DataFrame.groupby() Pandas method.
Creating example data
Let’s first import the Python Pandas library and acquire data into our Python development environment:
import pandas as pd
hr = pd.read_csv('interview_data.csv')
hr.info()
?
Groupby and count in Pandas
We can easily aggregate our dataset and count the number of observations related to each programming language in our dataset.
hr.groupby('language').size()
Note that unlike the count() method, size() counts also occurrences of nan empty values.
Count pandas group by with condition
Let’s now assume that we want to show up only programming languages for which we interviewed more than twice during the year. We will first aggregate the data and then define a new column displaying the values we counted.
# groupby
languages = hr.groupby('language').agg(number_of_months = ('month', 'count'))
# define condition
filt = languages [ 'number_of_months'] > 2
# filter the DataFrame
languages [filt]
?
Here’s our result:
Aggregating and counting with multiple conditions
In the same manner, we can filter our grouped DataFrame by multiple conditions:
# aggregate data
languages_month_salary = hr.groupby('language').agg(number_of_months = ('month', 'count'), avg_sal = ('salary', 'mean'))
# define a complex condition - boolean or
filt = (languages_month_salary [ 'number_of_months']) > 2 |(languages_month_salary [ 'avg_sal'] > 100)
#filter the DataFrame
languages_month_salary [filt]
Result:
Groupby and count distinct values
In this case, we will first go ahead and aggregate the data, and then count the number of unique distinct values. We will then sort the data in a descending orders. The result in this case is a series.
hr.groupby('language')['month'].nunique().sort_values(ascending=False)
Follow up questions
Is it possible to filter groups before converting to a DataFrame?
Yes you are able to use the filter() GroupBy method to apply a function to determine which groups will be converted to a DataFrame. In this example, we will convert only groups with salary qreater than 100.
def filter_groups(month):
return salary.mean() > 100
hiring_filtered = hiring_gp.filter(filter_groups)
hiring = hiring_filtered .reset_index(drop = True)
Can i assign custom column names when converting a GroupBy?
Yes, use the rename DataFrame method and pass a dictionary as needed.
hiring.rename({mapping_dictionary})