One of the prevalent tasks in data wrangling is filtering data that we have previously imported into one or multiple DataFrames. In today’s tutorial we will learn how to use the loc accessor to filter pandas DataFrame rows and columns.
Example DataFrame
As we typically do, we’ll import the pandas library and create a very simple dataset that you can use in order to follow along.
import pandas as pd
area = ['Javascript', 'Java', 'Java', 'R', 'Python', 'R']
office = ['Shenzhen', 'Bangalore', 'N', 'Oew York', 'Rio de Janeiro', 'Rio de Janeiro']
salary = [151.0, 148.0, 248.0, 181.0, 134.0, 136.0]
interviews = dict(area = area, office = office, salary = salary)
sal_df = pd.DataFrame(data = interviews)
Let’s take a look at our data:
sal_df.head()
area | office | salary | |
---|---|---|---|
0 | Javascript | Shenzhen | 151.0 |
1 | Java | Bangalore | 148.0 |
2 | Java | New York | 248.0 |
3 | R | New York | 181.0 |
4 | Python | Rio de Janeiro | 134.0 |
Using the pandas loc accesor to filter by multiple conditions
The loc accessor allows to filter a Pandas DataFrame by rows and columns labels. The basic syntax is simple:
my_df.loc[row_labels, column_labels]
If we want to filter by multiple criteria, we’ll define a conditional statement – in this case using an AND (&) condition:
filt = (sal_df['area'] == 'Java') & (sal_df['salary'] > 200)
This will return a series of boolean elements that we can pass to the loc accessor:
sal_df.loc[filt]
The result will look as following:
area | office | salary | |
---|---|---|---|
2 | Java | New York | 248.0 |
We can define a complex condition using OR (|):
filt = (sal_df['area'] == 'Java') | (sal_df['salary'] > 200)
sal_df.loc[filt]
Here’s the DataFrame subset:
area | office | salary | |
---|---|---|---|
1 | Java | Bangalore | 148.0 |
2 | Java | New York | 248.0 |
Filter by multiple string values
Let’s now assume that we would like to subset the DataFrame according to multiple string values. In this case we will show rows which contains the values Java or JavaScript. Here’s a simple example of how to do that:
filt = sal_df['area'].str.contains ('Java|Javascript')
sal_df.loc[filt]
Here’s the result:
area | office | salary | |
---|---|---|---|
0 | Javascript | Shenzhen | 151.0 |
1 | Java | Bangalore | 148.0 |
2 | Java | New York | 248.0 |
Subset by multiple values in a list
In this case we would like to pass a list of strings or numbers and then use those as a criteria for sub setting our DataFrame. In this case we will use the isin() Series method is search for specific rows and then filter the DataFRame accordingly.
value_lst = ['Javascript', 'Java']
filt = sal_df['area'].isin(value_lst)
sal_df.loc[filt]
Questions / feedback? Feel free to let us know in the Comments section.