In today’s tutorial we will learn how to filter one or multiple pandas DataFrame and to select specific rows by value and condition.
Example DataFrame
Let’s start by importing the pandas Python library and defining an example DataFrame that you can use to follow along. In your Python dev environment run the following code
sales_dict = {
'area' : ['South', 'North', 'East', 'West'],
'type' : ['B2B', 'B2C', 'B2B', 'B2C'],
'revenue' : [1345, 1987, 5634, 3424]
}
sales_df = pd.DataFrame (sales_dict, columns = {'area': 'area', 'type':'cost_center', 'revenue':'revenue'})
To browse through our DataFrame values, run the following command:
print(sales_df.head())
Here’s our data:
area | type | revenue | |
---|---|---|---|
0 | South | B2B | 1345 |
1 | North | B2C | 1987 |
2 | East | B2B | 5634 |
3 | West | B2C | 3424 |
Filter DataFrame row by index value
In this first example, we’ll use the iloc accesor in order to slice out a single row from our DataFrame by its index.
sales_df.iloc[0]
The output is a Series representing the row values:
area South type B2B revenue 1345 Name: 0, dtype: object
Filter one or multiple rows by value
The next simple example is to filter our data by a single value in a column.
filt = (sales_df ['area'] == 'North')
sales_df[filt]
The output will be:
area | type | revenue | |
---|---|---|---|
1 | North | B2C | 1987 |
Select rows by column value in list
Next example, we would like to find only rows that include specific column values which are elements of a Python list:
filt = (sales_df ['area'].isin ([ 'North', 'South']))
sales_df[filt]
Here’s the output:
area | type | revenue | |
---|---|---|---|
0 | South | B2B | 1345 |
1 | North | B2C | 1987 |
Aside: make sure that you invoke the Series isin method correctly. The following code will return a Type Error because the isin builtin method can’t receive a list as a parameter if not enclosed by parentheses.
# this will render an error:
filt = (sales_df ['area'].isin [ 'North', 'South'])
TypeError: 'method' object is not subscriptable
Filter rows by value range
We can use the between function to define value ranges on numeric data as shown below:
filt = (sales_df ['revenue'].between(2500, 4000))
sales_df[filt]
Here’s the output:
area | type | revenue | |
---|---|---|---|
3 | West | B2C | 3424 |
Get the maximum value row
To get the highest value of a column we use the max() function. Let’s take a look:
filt = (sales_df ['revenue'] == sales_df ['revenue'].max(axis=0))
sales_df[filt]
Here’s the maximum value row:
area | type | revenue | |
---|---|---|---|
2 | East | B2B | 5634 |
Find the minimum value row
Conversely, to find the row containing a column minimal value, we use the min() function.
filt = (sales_df ['revenue'] == sales_df ['revenue'].min(axis=0))
sales_df[filt]
Here’s the minimal value row:
area | type | revenue | |
---|---|---|---|
0 | South | B2B | 1345 |
Select by condition value in any / multiple columns
We will wrap up this tutorial with filtering our DataFrame according to a complex condition that searches for specific values in multiple columns. In this case we created an OR (|) condition. You can obviously concatenate conditions also with an AND (&) boolean gate.
filt = (sales_df['area'].isin ([ 'North', 'South'])) | (sales_df['type'] == 'B2C')
sales_df[filt]
Here’s our output:
area | type | revenue | |
---|---|---|---|
0 | South | B2B | 1345 |
1 | North | B2C | 1987 |
3 | West | B2C | 3424 |