In today’s Data Wrangling tutorial we’ll learn how to use Python in order to subset Pandas DataFrames and select specific columns according to column data and label values. This is useful for many different cases: Cleaning up data by filtering missing or invalid data, Visualizing only subsets of the data, prepare data for export to a file of Database table, pre-process our Data for machine learning and more. Let’s get started.
Filter Pandas DataFrame by row and column
You can subset a pandas DataFrame by row and column values using the brackets notation, the loc indexer or the DataFrame query method. Example:
#1
mask = (my_df['col_name'] == 'value')
my_df[mask]
#2
my_df.loc[mask]
#3
my_df.query("col_name = 'value'")
Create an example dataset
Use the following code to initialize our DataFrame.
import pandas as pd
candidates = {'city': ['Miami', 'Boston', 'Boston','New York', 'Atlanta'],
'month': ['01', '02', '03', '04', '05',],
'num_candidates': [190,82,88, 48, 56]}
hr = pd.DataFrame(candidates)
hr.head()
Here’s our DataFrame:
Subset dataframe column if it contains specific values
The simplest method to subset according to cell value:
filt = (hr['city'] == 'Boston')
hr[filt]
Select column by one or multiple values values in list
The following code will select rows in which the city name matches multiple values specified in the cities list.
# column value in list
cities = ['Boston', 'Atlanta']
hr[hr['city'].isin(cities)]
Here’s the result:
Filter according to complex conditions
In this case we’ll filter out rows in which column values match specific complex conditions. Note that we are able to chain different conditions using the AND (&) or OR (|) boolean gates.
# multiple conditions / greater than
filt = (hr['city'] == 'Boston') | (hr['num_candidates'] > 60)
Using the query() method to subset DataFrame rows
You are able to filter specific rows by using the query function. Here is a simple example that retrieves all rows pertaining to Boston:
hr.query("city == 'Boston'")
Subset using a regex
In this example we’ll subset only columns which label matches a specific expression. We’ll use the filter() method and pass the expression into the like parameter as shown in the example depicted below.
# filter by column label value
hr.filter(like='ity', axis=1)
We can also cast the column values into strings and then go ahead and use the contains() method to filter only columns containing a specific pattern.
filt = hr['city'].str.contains('ton')
hr[filt]
Filter according to the column label
For completeness, below you can find how we can use the loc indexer to subset a DataFrame by column label.
# filter by column label value
hr.loc[:,'city']
Note that you can obviously accomplish the same result by using the following techniques:
Select columns with bracket notation:
hr['city']
or using the iloc indexer, we can write the following expression:
hr.iloc[:,0]
Get DataFrame columns by type
Another use case is to be able to find columns containing specific data types. In this example we will quickly subset only numeric columns:
hr.select_dtypes(include = 'number')