In today’s tutorial we’ll learn how to find specific single or multiple values across columns of your pandas DataFrame.
We’ll first import the pandas Data Analysis library into your Python development environment (Jupyter, PyCharm or others).
import pandas as pd
We’ll now define a simple pandas DataFrame that you are able to use to follow along this example.
month = ['July', 'October', 'September', 'July', 'November', 'July']
language = ['Java', 'Python', 'Python', 'Javascript', 'R', 'Javascript']
office = ['Istanbul', 'New York', 'Osaka', 'Toronto', 'New York', 'Hong Kong']
salary = [181.0, 203.0, 163.0, 181.0, 121.0, 132.0]
hr_campaign = dict(month = month, language = language, salary = salary)
interviews_data = pd.DataFrame(data=hr_campaign)
Let’s take a look at the data:
interviews_data.head()
month | language | salary | |
---|---|---|---|
0 | July | Java | 181.0 |
1 | October | Python | 203.0 |
2 | September | Python | 163.0 |
3 | July | Javascript | 181.0 |
4 | November | R | 121.0 |
Finding specific value in Pandas DataFrame column
Let’s assume that we would like to find interview data related to Python candidates. We’ll define our search criteria and filter the pandas DataFrame accordingly.
value = 'Python'
mask = interviews_data['language'].str.contains(value)
interviews_data[mask]
Here’s our result:
month | language | salary | |
---|---|---|---|
1 | October | Python | 203.0 |
2 | September | Python | 163.0 |
Search for multiple strings in a column
In the same fashion we can find multiple strings. We’ll first define a Python list containing values to search for and then subset the DataFrame.
value_lst = ['Java', 'Python']
mask = interviews_data['language'].isin(value_lst)
interviews_data[mask]
This also returns a DataFrame:
month | language | salary | |
---|---|---|---|
0 | July | Java | 181.0 |
1 | October | Python | 203.0 |
2 | September | Python | 163.0 |
Same result can be returned by filtering your pandas DataFrame using the loc indexer:
interviews_data.loc[mask]
Filter column by value greater than
Same logic applies for selecting specific rows according to a condition. In the following example we will retrieve rows with average salary expectation higher to the maximal salary threshold defined by the company:
max_salary = 170
mask = interviews_data['salary'] >= max_salary
interviews_data[mask]
Searching for a string and return the index
We can return the index of the relevant rows using the index DataFrame method:
interviews_data[mask].index
This will return the following index:
Int64Index([0, 1, 3], dtype='int64')
Find values based on other column values
In the next use case, we’ll use the query DataFrame method to find the salaries pertaining to the Python candidates:
value = 'Python'
python_salaries = interviews_data.query('language == @value')[['month','salary']]
python_salaries
This will render the following DataFrame subset:
month | salary | |
---|---|---|
1 | October | 203.0 |
2 | September | 163.0 |
Replace values found in column
After finding specific string or numeric values, we can also use the replace DataFrame method to replace values as needed. In this very trivial example we’ll replace the string “Javascript” with the string “JavaScript”.
max_salary = 170
mask = interviews_data['salary'] >= max_salary
high_salaries = interviews_data[mask].replace('Javascript', 'JavaScript')