How to check if a Pandas column contains a string?

In this Data Analysis tutorial we’ll learn how to use Python to search for a specific or multiple strings in a Pandas DataFrame column.

In a nutshell we can easily check whether a string is contained in a column using the .str.contains() Series function:

df_name['col_name'].str.contains('string').sum()

Read on for several examples of using this capability.

Creating test data

We’ll start by creating a simple DataFrame for this example

import pandas as pd

#Create data
month = ['October', 'November', 'July', 'June', 'February']
language = ['Python', 'Python', 'Haskell', 'JavaScript', 'Go']
salary = (81, 77, 74, 80, 75)

hr_dict = dict(month=month, language=language, salary=salary)
hr_df = pd.DataFrame(data=hr_dict)

print(hr_df)

This will result in the following DataFrame:

monthlanguagesalary
0OctoberPython81
1NovemberPython77
2JulyHaskell74
3JuneJavaScript80
4FebruaryGo75

Check if Series contains a string

The simplest example is to check whether a DataFrame column contains a string literal

hr_df['language'].str.contains('Python')

This will return a Series of boolean values.

More useful is to get the count of occurrences matching the string ‘Python’. Note that the this assumes case sensitivity.


hr_df['language'].str.contains('Python').sum()

In this case 2.

Check for Case insensitive strings

In a similar fashion, we’ll add the parameter Case=False to search for occurrences of the string literal, this time being case insensitive:

hr_df['language'].str.contains('python', case=False).sum()

Check is a substring exists in a column with Regex

Another case is that you might want to search for substring matching a Regex pattern:

hr_df['language'].str.contains('\Dt', regex=True).sum()

Search for multiple strings in DataFrame column

We might want to search for several strings. We can use the boolean operators ‘|’ and ‘&’ to define character sequences to be checked for. In this case we search for occurrences of Python or Haskell in our language Series.

hr_df['language'].str.contains('Python|Haskell').sum()

Check for strings contained in a list

In this case we’ll use the Series function isin() to check whether elements in a Python list are contained in our column:

# define a list
lang_lst = ['Python', 'Go']

#count occurrences from the list
hr_df['language'].isin(lang_lst).sum()

The result will be 3.

Additional suggested learning