In this tutorial we’ll learn how to validate whether a specific pandas DataFrame row or specific one or multiple column cells contain a specific text pattern.
Example DataFrame
We will get started with importing the pandas library and defining some simple data that you are welcome to use in order to follow along this example.
import pandas as pd
month = ['April', 'March', 'January', 'March', 'May']
language = ['Python', 'R', 'Java', 'Javascript', 'Python']
office = ['LA', 'Rome', 'Madrid', 'New York', 'Toronto']
salary = [177.0, 177.0, 155.0, 139.0, 155.0]
salaries = dict(month=month, language=language, salary = salary)
salary_df = pd.DataFrame(data=salaries)
salary_df.head()
Let’s take a quick look at our data:
month | language | salary | |
---|---|---|---|
0 | April | Python | 177.0 |
1 | March | R | 177.0 |
2 | January | Java | 155.0 |
3 | March | Javascript | 139.0 |
4 | May | Python | 155.0 |
Check if a DataFrame row contains specific text
We’ll start by defining a string pattern and then search our DataFrame rows:
# define value to search for
str_pattern = 'Python'
# row by row check for the string pattern in the row values
filt = (salary_df
.apply(lambda r: r.astype('string').str.contains(str_pattern)
.any(), axis=1))
print(filt)
Note tat in order to search row by row, we first need to convert each DataFrame row to the string type. Also note that we have used axis=1 to search across the columns of each row. We’ll get a Series of boolean values highlighting the DataFrame rows contain the string pattern:
0 True 1 False 2 False 3 False 4 True dtype: bool
We can now filter our DataFrame to highlight the relevant rows:
salary_df[filt]
The result will be:
month | language | salary | |
---|---|---|---|
0 | April | Python | 177.0 |
4 | May | Python | 155.0 |
Verify if a column cell contains a value in list
Let’s start by defining a list of strings we will be searching for:
lst_pattern = ['Python', 'R']
Now filter the our column (Series) accordingly:
filt = salary_df['language'].isin(lst_pattern)
salary_df[filt]
This will be out output:
month | language | salary | |
---|---|---|---|
0 | April | Python | 177.0 |
1 | March | R | 177.0 |
4 | May | Python | 155.0 |
Search for text string in multiple columns
In a similar fashion we can search for sub strings in a subset of our DataFrame.
# define list and DataFrame subset
lst_pattern = ['Python', 'R']
subset = salary_df[['month','language']]
# Define a mask to search for a list value in the DataFrame
filt = subset.apply (lambda r: r.isin(lst_pattern).any(), axis=1)
Now we can subset the DataFrame and show only the relevant rows:
subset[filt]
And the output will be:
month | language | |
---|---|---|
0 | April | Python |
1 | March | R |
4 | May | Python |