Create Pandas crosstab with percentages on one or multiple columns

In today’s tutorial we’ll learn how to quickly create and customize crosstabs. As part of our Data Analysis work in Pandas, we typically use crosstabs in order to count , plot and analyze frequencies of occurrences on multiple columns.

Importing our data

For this example we’ll import a simple csv file that is located in the same working directory as our Pythin notebook or script:

# import pandas
import pandas as pd

sal_df = pd.read_csv('hr.csv')

print(sal_df)

Here’s our data:

Create a Simple Pandas crosstab

We’ll call the pd.crosstab function and render a very simple crosstab:

crosstb1 = pd.crosstab(index = sal_df['month'], columns = sal_df['language'])

crosstb1

Note that by default the crosstab displays the occurrence count (not the unique / distinct values) and there is no need to specify the aggfunc = count parameter.

Create a cross tab with percentages

By definition, the crosstab counts the occurrences of each row in the DataFrame. We can do the same using a Pandas Pivot Table. What’s special about the crosstab function, if that as we mentioned before, it also allows to easily computer the relative frequency between values in different columns. We do that by adding the normalize=True parameter.

crosstb2 = pd.crosstab(index = sal_df['month'],  \
                      columns = sal_df['language'] , \
                      normalize=True) *100
crosstb2

Using a simple snippet we computed the relative frequency (percentage) of each observation in our data.

Adding a total column and row to the crosstab

Next, we’ll use the margins and margins_name parameters to add a summary column and row to the table.

crosstb3 = pd.crosstab(index = sal_df['month'],  \
                      columns = sal_df['language'] , \
                      normalize=True, margins = True, \
                      margins_name= "Total") *100
crosstb3

To improve the look and feel of our table, we’ll append percentage signs to each cell in the cross tab. In order to do so, we’ll first convert the crosstab contents to a Python string data type. We do that by using the astype DataFrame method. We then append the % character using a simple lambda function.

crosstb3.astype(str).apply(lambda x:x + '%')

Here is the resulting crosstab:

Pandas crosstabs with multiple columns

Next, we would like to allow an easier drill down into the data by adding another level to our crosstab columns section. Note that instead of passing a single column we are passing a list containing the columns we want displayed.

crosstb4 = pd.crosstab(index = sal_df['month'],  \
                      columns = [sal_df['language'], sal_df['office']] , \
                      normalize=True, margins = True, \
                      margins_name= "Total") *100

crosstb4.astype(str).applymap(lambda x:x + '%')

Additional Learning

How to count values in specific Pandas rows and columns?