In this Data Analysis tutorial we’ll learn how to use Python to search for a specific or multiple strings in a Pandas DataFrame column.
Pandas str contains
In a nutshell we can easily check whether a string is contained in a pandas DataFrame column using the .str.contains() Series function:
df_name['col_name'].str.contains('string').sum()
Read on for several examples of using this capability.
Creating test data
We’ll start by creating a simple DataFrame for this example
import pandas as pd
#Create data
month = ['October', 'November', 'July', 'June', 'February']
language = ['Python', 'Python', 'Haskell', 'JavaScript', 'Go']
salary = (81, 77, 74, 80, 75)
hr_dict = dict(month=month, language=language, salary=salary)
hr_df = pd.DataFrame(data=hr_dict)
print(hr_df)
This will result in the following DataFrame:
month | language | salary | |
---|---|---|---|
0 | October | Python | 81 |
1 | November | Python | 77 |
2 | July | Haskell | 74 |
3 | June | JavaScript | 80 |
4 | February | Go | 75 |
Check if Series contains a string
The simplest example is to check whether a DataFrame column contains a string literal
hr_df['language'].str.contains('Python')
This will return a Series of boolean values.
More useful is to get the count of occurrences matching the string ‘Python’. Note that the this assumes case sensitivity.
hr_df['language'].str.contains('Python').sum()
In this case 2.
Check for Case insensitive strings
In a similar fashion, we’ll add the parameter Case=False to search for occurrences of the string literal, this time being case insensitive:
hr_df['language'].str.contains('python', case=False).sum()
Check is a substring exists in a column with Regex
Another case is that you might want to search for substring matching a Regex pattern:
hr_df['language'].str.contains('\Dt', regex=True).sum()
Search for multiple strings in DataFrame column
We might want to search for several strings. We can use the boolean operators ‘|’ and ‘&’ to define character sequences to be checked for. In this case we search for occurrences of Python or Haskell in our language Series.
hr_df['language'].str.contains('Python|Haskell').sum()
Check for strings contained in a list
In this case we’ll use the Series function isin() to check whether elements in a Python list are contained in our column:
# define a list
lang_lst = ['Python', 'Go']
#count occurrences from the list
hr_df['language'].isin(lang_lst).sum()
The result will be 3.