Pandas, Python's best-known data analysis library, is more than just a data manipulation tool: it is rich in features that many users overlook. This article covers 41 such features, demonstrating each one with fictitious HR data to highlight its practical use.
Let’s get started. We will organize the capabilities we’re going to discuss into nine separate groups:
- Data Acquisition
- Data Pre-Processing
- Data Transformations
- Data Querying and Filtering
- Data Grouping
- Data Exploration
- Time Series Analysis
- Data Visualization and Chart Styling
- Improving Performance and Scalability
Acquiring the Learning Dataset
Some of the examples in this article make use of a demo dataset. To best follow along, consider importing the sample dataset into your Python data analysis environment.
import pandas as pd
#read data into DataFrame
emp_df = pd.read_csv('emp_df.csv')
# basic preprocessing
emp_df['emp_id'] = emp_df['emp_id'].astype(str)
emp_df['emp_join_date'] = pd.to_datetime(emp_df['emp_join_date'])
You can now take a first look at the dataset you have imported:
emp_df.head(10)
Here’s a snippet from our data:
Data Acquisition
#1 – Acquire SQL data using read_sql
Databases are a very common data source for Pandas analysis. You can connect to SQL databases using the sqlalchemy package and pull data directly into a DataFrame. In this example, we extract an SQLite database table containing employee data.
# create sqlalchemy engine
from sqlalchemy import create_engine
db_engine = create_engine('sqlite:///path/to/emp_database.db')
# create DataFrame
emp_df = pd.read_sql('SELECT * FROM employees', con=db_engine)
#2 – Read multiple Excel tabs
Source data is often split across several worksheets in an Excel file. The following code reads every sheet into its own DataFrame; the frames can then easily be merged into one:
xl_file = pd.ExcelFile('employee_by_group.xlsx')
# Read all the workbook sheets into a dictionary of DataFrames
sheets_dict = pd.read_excel(xl_file, sheet_name=None)
# Create a DataFrame for each sheet
it_emp_df = sheets_dict['IT']
fin_emp_df = sheets_dict['Finance']
#concatenate the dataframes
emp_df = pd.concat([it_emp_df, fin_emp_df])
#3 – Read data from the web
Python offers rich functionality for scraping data from HTML pages. Pandas provides the read_html() function, which lets you read one or more tables from a web page, each into its own DataFrame.
The following code reads a list of tables, each containing the employee list for one department, and concatenates them into a single DataFrame.
# Read tables from a webpage into list of dataframes
emp_by_dpt_list = pd.read_html('path_to_your_hr_intranet_page')
# Concatenate data into one dataframe
emp_df = pd.concat(emp_by_dpt_list, ignore_index=True)
Data Pre-Processing
#4 – Binning Data
Binning is a technique for converting numerical data into categorical data. You can bin by equal-width intervals (using the cut() function) or by quantiles (using qcut()).
Let's see how we bin the employee salary data into four quantile-based groups (quartiles):
emp_df['salary_group'] = pd.qcut(emp_df['emp_salary'], q=4)
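For equal-width bins, pd.cut() works the same way; here is a minimal sketch (the bin count and labels are illustrative assumptions):
# three equal-width salary bands with hypothetical labels
emp_df['salary_band'] = pd.cut(emp_df['emp_salary'], bins=3, labels=['low', 'mid', 'high'])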
#5 – Replace values with map()
Your data preprocessing may include replacing specific column values. The map() method accepts either a function or a dictionary. In this example, we will replace the values in the emp_department column with abbreviations: we first define a Python dictionary and then map the values accordingly. Note that any value not found in the dictionary is replaced with NaN.
# define mapping dictionary
emp_dep_dict = {
'Finance': 'FIN',
'Logistics': 'LOG',
'Sales': 'SLS',
'Marketing': 'MKT',
}
# Use dictionary map to replace department names
emp_df['emp_department'] = emp_df['emp_department'].map(emp_dep_dict)
#6 – Handling Categorical Data
Converting low-cardinality text columns to the category dtype improves memory efficiency and boosts overall query performance. In this example, we will convert the emp_department column from its original type (object or string) into a category type.
emp_df['emp_department'] = emp_df['emp_department'].astype('category')
#7 – Use Categorical Data in Machine Learning
If you are preparing your data for machine learning, for example one-hot encoding, you will need to transform your categorical data into indicator variables. The following code creates a new DataFrame with one binary column per category in your original dataset:
emp_df_encoded = pd.get_dummies(emp_df, columns=['emp_department'])
Data Transformations
#8 – Create new columns using assign()
A clean, chainable way to create new DataFrame columns is the assign() method. It is especially useful when you need to quickly define one or more calculated columns.
Suppose that all employees in our fictional company received a bonus of 10% during 2023. You can create a new column showing your employees’ compensation using the following code:
emp_df = emp_df.assign(emp_comp=emp_df['emp_salary'] * 1.1)
#9 – Explode lists to rows
If one or more of your dataset's columns contains lists of values (or multiple values stored in one cell), you will need to normalize your data before analyzing it. The Pandas explode() method transforms each list element into its own row. Assuming the emp_department column contains lists of values, we can normalize it as follows:
emp_df_normalized = emp_df.explode('emp_department')
#10 – Reshape data with pivot(), stack() and unstack()
As a data analyst, you are probably very familiar with the Excel pivot table. Pandas offers several useful methods to reshape your dataset; a short sketch follows the list.
- pivot() – reshapes data from long to wide format based on column values (use pivot_table() when you also need aggregation).
- stack() – moves one or more column levels into hierarchical row index levels, producing longer, narrower data.
- unstack() – reverses the stack() operation.
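Here is a minimal sketch of all three methods on a small, hypothetical headcount frame (the column names and values are illustrative):
# hypothetical long-format data: one row per department per year
headcount_df = pd.DataFrame({
    'year': [2022, 2022, 2023, 2023],
    'department': ['IT', 'Finance', 'IT', 'Finance'],
    'headcount': [10, 8, 12, 9],
})
# pivot(): long to wide, one column per department
wide_df = headcount_df.pivot(index='year', columns='department', values='headcount')
# stack(): move the department columns into a row index level
stacked = wide_df.stack()
# unstack(): reverse the stack() operation, restoring the wide shape
restored = stacked.unstack()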
Data Querying and Filtering
#11 – Simplify filtering with query()
The df.query() method lets you filter a DataFrame using complex conditions written as a string. Suppose we want to find all IT employees who are over 40 years old. The following code accomplishes that:
it_over_40_emps = emp_df.query('emp_age > 40 & emp_department == "IT"')
print(it_over_40_emps)
This will return the following data set:
#12 – Use variables in a query
We just learned about the query() method. When writing more complex queries, you can assign query values to variables and reference them inside the query string by prefixing their names with @. Here's an example:
# Define variables
age = 40
department = 'IT'
# Use variables in query
it_over_40_emps = emp_df.query('emp_age > @age & emp_department == @department')
print(it_over_40_emps)
#13 – Subset your data with filter()
You can use the filter() method to select a subset of your DataFrame columns based on a text expression. In this example, we select all columns matching the regular expression '.*name$', i.e. columns whose names end with 'name':
emp_df_subset = emp_df.filter(regex='.*name$')
#14 – Select a data range with between()
To select rows where a value falls within a range, use the between() Series method, which includes both bounds by default. For example, we can find the employees whose salary is between 70K and 80K.
high_range_salaries = emp_df[emp_df['emp_salary'].between(70000, 80000)]
print(high_range_salaries)
#15 – Use where() for conditional replacement
Use the where() method to replace values conditionally: it keeps values where the condition is True and substitutes the second argument where it is False. In this example, we give a 10% raise only to employees earning more than 80K.
emp_df['emp_salary'] = (emp_df['emp_salary'] * 1.1).where(emp_df['emp_salary'] > 80000, emp_df['emp_salary'])
Data Grouping
#16 – Filter groups in your dataset
You can use the groupby filter() method to remove groups that do not meet specific criteria from your analysis. In this example, we keep only employees who work in departments where the average salary is over 75K.
emp_df_filtered = emp_df.groupby('emp_department').filter(lambda e: e['emp_salary'].mean() > 75000)
This will return only the employees in the Logistics department:
#17 – Using Custom functions in aggregations
After grouping your data you can apply custom calculations. In this case we would like to find out the gap between the maximum and minimum salaries in each department.
emp_df_delta = emp_df.groupby('emp_department').agg({'emp_salary': lambda e: e.max() - e.min()})
#18 – Using Multiple Aggregations in Grouping
Once data is grouped, we can run several calculations on each group. Suppose that for each department we want the minimum and maximum salary figures and the average employee age:
emp_summary = emp_df.groupby('emp_department').agg({'emp_salary': ['min', 'max'], 'emp_age': 'mean'})
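As a variation, named aggregation produces flat, readable column names instead of a column MultiIndex; a sketch:
# named aggregation: keyword = (column, aggregation function)
emp_summary = emp_df.groupby('emp_department').agg(
    min_salary=('emp_salary', 'min'),
    max_salary=('emp_salary', 'max'),
    avg_age=('emp_age', 'mean'),
)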
#19 – Group by time series
We can also group data by specific timeframes. The following code computes the average age and salary by the year in which each employee joined the company.
emp_df_grouped = emp_df.set_index('emp_join_date')
emp_df_grouped.groupby(pd.Grouper(freq='Y')).mean(numeric_only=True).round()
Data Exploration
#20 – Fetch descriptive statistics data
Use the describe() DataFrame method to quickly find summary statistics for your quantitative data.
emp_df.describe().round(2)
This will return the following stats:
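By default, describe() summarizes only the numeric columns. If you also want counts, unique values, and top categories for the non-numeric columns, pass include='all':
# include non-numeric columns in the summary
emp_df.describe(include='all')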
#21 – Find the correlations in your data
Using the Pandas corr() method, you can easily measure the relationships between the numerical variables in your dataset.
# numeric_only restricts the calculation to numeric columns
emp_corr_matrix = emp_df.corr(numeric_only=True)
print(emp_corr_matrix)
Below is the dataset correlation matrix:
#22 – Find the unique value count
Use the value_counts() method to find the count of unique values in a pandas column/series. The following code returns the number of employees in each department.
num_emp_department = emp_df['emp_department'].value_counts()
print(num_emp_department)
#23 – Define a crosstab for categorical analysis
You can analyze relationships between categorical variables using Pandas' crosstab() function.
emp_department_age_cross = pd.crosstab(emp_df['emp_department'], emp_df['emp_age'])
print(emp_department_age_cross)
Time Series Analysis
#24 – Index your data by datetime
Before starting your time series analysis, you will need to define an appropriate index. This allows you to slice, aggregate, and resample based on time. In our example, we will use the employee join date as the index.
emp_df['emp_join_date'] = pd.to_datetime(emp_df['emp_join_date'])
emp_df.set_index('emp_join_date', inplace=True)
#25 – Resample your data
Resampling changes the time frequency of our data. In this example, we count the number of employees joining the company each year using the resample() method.
# count joiners per year ('Y' is deprecated in favor of 'YE' on pandas >= 2.2)
emp_join_year = emp_df['emp_id'].resample('Y').count()
#26 – Analyze seasonality in your data
Pandas allows you to identify recurring patterns over a specific time period. In our case, we would like to identify any cyclical trend in our employee salary data using an autocorrelation plot.
from pandas.plotting import autocorrelation_plot
autocorrelation_plot(emp_df['emp_salary'])
The following plot will be returned:
#27 – Calculate cumulative statistics
Use the expanding() method for cumulative calculations over time. In the following example, we plot the cumulative number of employees by join date (the DataFrame is already indexed by emp_join_date, so we sort the index first).
cumulative_number_employees = emp_df.sort_index()['emp_id'].expanding().count()
cumulative_number_employees.plot()
Here’s our chart:
#28 – Handling missing data
When analyzing your dataset by time period, you should ensure that missing data won't distort your analysis. Forward filling, which is commonly used during the preprocessing phase, comes in handy here as well. In this example, we propagate each last valid observation forward into the gaps:
# ffill() replaces the deprecated fillna(method='ffill')
emp_df_no_nas = emp_df.ffill()
Data Visualization and Chart Styling
#29 – Customize Plot Styles
You can use the matplotlib library, on which Pandas' plotting API is built, to customize the appearance of your plots with style parameters, making them more visually appealing and informative.
The following code plots a salary trend for our imaginary company by employee join date.
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
emp_df.plot(kind='line', x='emp_join_date', y='emp_comp')
plt.xlabel('Join date')
plt.ylabel('Salary')
plt.title('Salary Trend')
plt.show()
Here’s our chart:
#30 – List your available plot styles
If you are unsure of the available plot styles that you can use in your notebook, you can easily fetch the list using this code:
print(plt.style.available)
This will return a list of the available style names. You can apply a specific style before plotting your chart using the following code:
import matplotlib.pyplot as plt
plt.style.use('grayscale')
#31 – Adding Annotations to your chart
You might want to insert notes into your chart to highlight or explain specific information. We use the annotate() method and pass the note text, the point to annotate (xy), the text position (xytext), and the style of the arrow connecting the text to its point.
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
emp_df.plot(kind='line', x = 'emp_join_date', y = 'emp_comp')
plt.annotate('Lowest Salary', xy=(pd.Timestamp('2020-10-01'), 60000), xytext=(pd.Timestamp('2020-10-01'), 65000), arrowprops=dict(facecolor='green', shrink=0.1))
plt.xlabel('Join date')
plt.ylabel('Salary')
plt.title('Salary Trend')
plt.show()
This will render a text note on top of our chart:
#32 – Adding Gridlines
You can easily improve the readability of your plot by using grid lines.
emp_df['emp_salary'].plot(kind='line')
plt.grid(True)
#33 – Enhance your Visualizations with Seaborn
Seaborn is another Python visualization library that makes it easy to create more sophisticated visualizations on top of Pandas DataFrames.
import seaborn as sns
bar = sns.barplot(data=emp_df, x='emp_department', y='emp_salary')
bar.set_xlabel('Department')
bar.set_ylabel('Salary')
bar.set_title('Average Salary by Department');
Here’s our chart:
#34 – Exporting Plots
You can easily export your Pandas plot to an image file using matplotlib's savefig() function.
plt.savefig('hiring_trends.png')
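Note that in scripts you should call savefig() before plt.show(), since show() can leave an empty figure behind. Here is a slightly fuller sketch with two common options:
emp_df['emp_salary'].plot(kind='line')
# save at higher resolution and trim surrounding whitespace
plt.savefig('hiring_trends.png', dpi=300, bbox_inches='tight')
plt.show()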
#35 – Style your Pandas DataFrames with CSS
You can easily customize the display style of your Pandas DataFrame using Cascading Style Sheets (CSS). This code changes how our DataFrame renders in Jupyter and other notebook environments.
emp_df.style.set_properties(**{'color': 'white', 'background-color': 'green',})
Improving Performance and Scalability
#36 – Understand your DataFrame memory consumption
You can use the memory_usage() DataFrame method to understand the memory consumption footprint of your DataFrame:
print(emp_df.memory_usage(deep=True))
This command will return a Series object showing the memory consumption in bytes. In our case, the DataFrame comprises 50 records, hence the memory usage is quite small.
#37 – Use vectorized operations
One of the key strengths of Pandas is that it lets you apply vectorized operations to entire DataFrame columns instead of iterating with a for loop or a lambda. The result is markedly faster and more concise code.
A simple example of a vectorized operation, in this case applying a 10% salary raise:
emp_df['emp_salary'] = emp_df['emp_salary'] * 1.1
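For comparison, the same raise written with apply() and a lambda produces an identical result but runs Python code row by row, which is typically much slower on large frames:
# slower, element-by-element alternative to the vectorized version above
emp_df['emp_salary'] = emp_df['emp_salary'].apply(lambda s: s * 1.1)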
#38 – Use Categorical data
As we saw in #6, specifying appropriate data types for your DataFrame columns can reduce memory consumption, especially for large datasets. The category dtype is particularly effective at shrinking the memory footprint of low-cardinality columns.
Use the astype method to convert the department column to a category:
emp_df['emp_department'] = emp_df['emp_department'].astype('category')
#39 – Optimize dtypes with infer_objects()
Another nice trick is to let Pandas optimize column dtypes automatically. This is accomplished with the infer_objects() method, which converts object columns to more specific types where possible.
emp_df = emp_df.infer_objects()
We can then check the new Pandas data types assigned to this DataFrame using the following code:
emp_df.dtypes
#40 – Use isin() for filtering values
When filtering your DataFrame against a list of values, use isin() instead of chaining multiple boolean OR conditions. It is more concise and often faster.
departments_lst = ['FIN', 'SLS']
filtered_emp_df = emp_df[emp_df['emp_department'].isin(departments_lst)]
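For comparison, the equivalent filter written as chained boolean conditions is more verbose and harder to extend:
# equivalent, but more verbose than isin()
filtered_emp_df = emp_df[(emp_df['emp_department'] == 'FIN') | (emp_df['emp_department'] == 'SLS')]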
#41 – Use clip()
Sometimes you want to cap outliers in a large set of observations. One little-known method for this is clip(), which limits values to a given range rather than removing rows.
Here is an example that caps employee ages to the range 30 to 55; values outside the range are set to the nearest bound.
emp_df['emp_age'] = emp_df['emp_age'].clip(lower=30, upper=55)
Conclusion
In summary, Pandas is a very powerful data analysis tool with dozens of lesser-known features. By leveraging the tips above, you can streamline your workflow and enhance your data analysis productivity. Feel free to post additional lesser-known Pandas hacks in the comments.