In this short tutorial we’ll learn how to subset a Pandas DataFrame according to simple and complex conditions. This is a core capability that you should master in order to effectively wrangle data. You can think about it as similar functionality to the one provided by the SQL where statement but in Pandas.
Create our example DataFrame
As we typically do, we’ll start by creating a DataFrame that you can use to follow along this example in your own computer. Enter the code below into your Python Data Analysis development environment:
import pandas as pd
month = ['June', 'August', 'February', 'July']
language = ['Java', 'Kotlin', 'PHP', 'Python']
first_interview = (78, 93, 76, 89)
hr_dict = dict(month=month, language=language, interview_1=first_interview)
hr_df = pd.DataFrame(data=hr_dict)
print(hr_df.head())
Here’s our DataFrame:
month | language | interview_1 | |
---|---|---|---|
0 | June | Java | 78 |
1 | August | Kotlin | 93 |
2 | February | PHP | 76 |
3 | July | Python | 89 |
Find rows by single condition
First case will be to filter our DataFrame according to rows containing specific values. Initially we’ll use a simple condition as an example:
# select rows by simple condition
condition = (hr_df['language'] == 'Python')
hr_df[condition]
The following records will be selected:
month | language | interview_1 | |
---|---|---|---|
3 | July | Python | 89 |
Select rows by multiple conditions (with AND and OR)
More interesting is to query our DataFrame according to multiple conditions. We concatenate the conditions using AND (&) and OR (|) boolean operators.
Using the OR operator
# multiple conditions (OR)
condition = (hr_df['language'] == 'Python') | (hr_df['month'] == 'June')
hr_df[condition]
The result will be:
month | language | interview_1 | interview_2 | |
---|---|---|---|---|
0 | June | Java | 78 | 43 |
3 | July | Python | 89 | 61 |
Using the AND operator
# multiple conditions (AND)
condition = (hr_df['language'] == 'Python') & (hr_df['month'] == 'June')
hr_df[condition]
No results will be found.
Selecting rows when column value starts with
Could be that you might want to select row values that matches a specific string pattern. In the example below, we’ll filter only entries in the language column which starts with the letter K.
#select rows column value starts with
condition = (hr_df['language'].str.startswith('K'))
hr_df[condition]
Here’s the result:
month | language | interview_1 | interview_2 | |
---|---|---|---|---|
1 | August | Kotlin | 93 | 68 |
Pick records by conditions in list
In our last example for today, we’ll select one or multiple rows only if they match a specific list of values.
# find rows by condition in list
#define the list of values
lang_lst = ['PHP', 'Python']
#subset the dataframe
hr_df[hr_df['language'].isin(lang_lst)]
month | language | interview_1 | interview_2 | |
---|---|---|---|---|
2 | February | PHP | 76 | 47 |
3 | July | Python | 89 | 61 |
Additional Learning