In today’s tutorial we’ll learn how to select DataFrame rows by specific or multiple conditions. For people new to Pandas but experienced in SQL, we’ll learn how to write where statements aimed at selecting data from our DataFrames.
We’ll look into several cases:
- Filtering rows by column value
- Selecting by multiple boolean conditions
- Selecting only rows containing items in a list
- Using a lambda function to define a filter
Example DataFrame
We’ll start by defining a very simple DataFrame that you can use to follow along with the tutorial.
import pandas as pd
year = [2022, 2021, 2022, 2022, 2020]
area = ['R', 'Python', 'VisualBasic', 'Python', 'JavaScript']
candidates= (76, 79, 78, 88, 83)
interview = dict(year = year, area = area, candidates = candidates)
hr = pd.DataFrame(data=interview)
print(hr)
Here are our DataFrame rows:
year | area | candidates | |
---|---|---|---|
0 | 2022 | R | 76 |
1 | 2021 | Python | 79 |
2 | 2022 | VisualBasic | 78 |
3 | 2022 | Python | 88 |
4 | 2020 | JavaScript | 83 |
#1 Filter rows by column value
This is the simplest example, we have a simple condition to subset our DataFrame with.
col_val = 'Python'
filt1 = (hr['area'] == col_val)
hr[filt1]
Here is the result:
year | area | candidates | |
---|---|---|---|
1 | 2021 | Python | 79 |
3 | 2022 | Python | 88 |
#2 Select DataFrame rows by multiple conditions
A little bit more realistic scenario is that we need to write a complex filter made of several boolean expressions.
# we can also use or ('\') boolean gates
filt2 = (hr['area'] == 'Python') & (hr['candidates'] > 87)
hr[filt2]
year | area | candidates | |
---|---|---|---|
3 | 2022 | Python | 88 |
#3 Subset rows by condition in list
In this next example we’ll search and filter by values in a list.
area_lst = ['R', 'Python']
filt3 = hr['area'].isin(area_lst)
hr[filt3]
The result:
year | area | candidates | |
---|---|---|---|
0 | 2022 | R | 76 |
1 | 2021 | Python | 79 |
3 | 2022 | Python | 88 |
#4 Filtering DataFrames using lambda functions
Lambda functions are very powerful artifacts that we can leverage to select specific DataFrame records. Here’s a very simple example:
# select only rows with more than 80 candidates OR records from 2020
filt4 = lambda x: (hr['candidates'] >80) | (hr['year'] == 2020)
hr[filt4]
And here’s our result:
year | area | candidates | |
---|---|---|---|
3 | 2022 | Python | 88 |
4 | 2020 | JavaScript | 83 |
Using the loc indexer to select specific data
For completeness, we can also filter our data using the loc indexer. This will work for all examples we provided so far in this tutorial. Example:
hr.loc[filt3]