How to turn a Pandas groupby to a dataframe?

While analyzing data you typically group data using the very handy Pandas groupby method. You can then more easily subset your data , iterate over the groups while applying some transformation logic or just probably visualize your data using pandas, matplotlib or seaborn. However, once your data is grouped as a Series object, you might want to convert it to a DataFrame instead. Once you have a DataFrame, you can take advantage of significant amount of methods that will simplify your data wrangling process.

In this tutorial i would like to show you three simple ways which i typically use to convert group by objects into DataFrames. As a follow up topic – i have added a simple tutorial to demonstrate the use of the count method on grouped data.

Creating example Data

We’ll start by creating data and grouping it using the groupby DataFrame method

import pandas as pd

# define the dataset
language = ['VisualBasic', 'VisualBasic', 'Java', 'Java', 'VisualBasic', 'Java', 'C#', 'VisualBasic']
office = ['Atlanta', 'New York', 'New York', 'Boston', 'Atlanta', 'New York', 'Atlanta', 'New York']
month = ['March', 'May', 'February', 'January', 'March', 'May', 'May', 'May']
salary = [109.0, 110.0, 106.0, 107.0, 110.0, 113.0, 128.0, 131.0]
salaries = dict(month=month, language=language, office=office, salary = salary)
hiring = pd.DataFrame(data=salaries)

# grouping the data
hiring_gp = hiring.groupby(['language','month'])['salary'].mean()
print(hiring_gp)

We now have a Series that contains the grouping results.

# 1 – Convert the Groupby to a DataFrame with to_frame()

The first option to convert the grouped data to a DataFrame is using the Series method to_frame(). Here’s the snippet:

hiring_gp.to_frame()

and the result:

# 2 – Turn Series to DataFrame with unstack

We can use the to_frame() method that we just reviewed in order to convert Series objects to DataFrames.

Another useful method to achieve that is to use the unstack method:

hiring_gp.unstack(level='month')

And here is the result:

# 3 – Group data directly into a DataFrame with groupby.agg

We can obviously aggregate data directly into a DataFrame, using the groupby.agg method:

hiring.groupby(['language','month']).agg (avg_salary= ('salary', 'mean'))

Groupby DataFrame by all columns (or multiple ones)

Another question we typically get is how to groupby DataFrame data by multiple columns (or even all columns). If you are in search for a solution for that look into this post on multiple column grouping in Pandas.

Pandas groupby DataFrame to csv

We can easily write our data to a comma separated value (CSV) file.

hiring_gp.unstack(level='month').to_csv('hiring.csv')

Converting the GroupBy object to Excel

We can easily use the Series to_excel method to convert the grouped data to Excel:

hiring_gp.to_excel

Alternatively we can first create a DataFrame and use a similar to_excel method for DataFrames

import openpyxl
hiring_gp.unstack(level='month').to_excel('hiring.xlsx')

You might want to look into a more comprehensive tutorial on how to export Pandas DataFrame to Excel workbooks.

Count number of elements after aggregating

In Data Analysis we often aggregate our data and then typically apply specific functions on it. Today we’ll learn how to count values on data that we have previously aggregated using the DataFrame.groupby() Pandas method.

Creating example data

Let’s first import the Python Pandas library and acquire data into our Python development environment:

import pandas as pd
hr = pd.read_csv('interview_data.csv')

hr.info()
?
This image has an empty alt attribute; its file name is image.png

Groupby and count in Pandas

We can easily aggregate our dataset and count the number of observations related to each programming language in our dataset.

hr.groupby('language').size()

Note that unlike the count() method, size() counts also occurrences of nan empty values.

Count pandas group by with condition

Let’s now assume that we want to show up only programming languages for which we interviewed more than twice during the year. We will first aggregate the data and then define a new column displaying the values we counted.

# groupby
languages = hr.groupby('language').agg(number_of_months = ('month', 'count'))

# define condition
filt = languages [ 'number_of_months'] > 2


# filter the DataFrame

languages [filt]
?

Here’s our result:

This image has an empty alt attribute; its file name is group_by_count_pandas.png

Aggregating and counting with multiple conditions

In the same manner, we can filter our grouped DataFrame by multiple conditions:

# aggregate data
languages_month_salary = hr.groupby('language').agg(number_of_months = ('month', 'count'), avg_sal = ('salary', 'mean'))


# define a complex condition - boolean or
filt = (languages_month_salary [ 'number_of_months']) > 2 |(languages_month_salary [ 'avg_sal'] > 100)

#filter the DataFrame
languages_month_salary [filt]

Result:

This image has an empty alt attribute; its file name is pandas_groupby_multiple_conditions.png

Groupby and count distinct values

In this case, we will first go ahead and aggregate the data, and then count the number of unique distinct values. We will then sort the data in a descending orders. The result in this case is a series.

hr.groupby('language')['month'].nunique().sort_values(ascending=False)

Follow up questions

Is it possible to filter groups before converting to a DataFrame?

Yes you are able to use the filter() GroupBy method to apply a function to determine which groups will be converted to a DataFrame. In this example, we will convert only groups with salary qreater than 100.

def filter_groups(month):
        return salary.mean() > 100

hiring_filtered = hiring_gp.filter(filter_groups)

hiring = hiring_filtered .reset_index(drop = True)

Can i assign custom column names when converting a GroupBy?

Yes, use the rename DataFrame method and pass a dictionary as needed.

hiring.rename({mapping_dictionary})