As part of your cleaning you data and prepare it for analysis, you might need to find and replace data stored in your pandas DataFrame. In this hands-on tutorial, we will use the very helpful replace() method; available for both DataFrames and Series objects to quickly change cell values in pandas.
Define example DataFrame
We will begin by importing the pandas library and building our example dataset.
import pandas as pd
office = ['Istanbul', pd.NA, 'Istanbul', 'New York', 'Toronto', 'Osaka']
month = ['June', 'October', 'June', 'September', 'December', 'March']
language = ['R', 'Javascript', 'R', 'Java', 'Javascript', 'Python']
salary = [136.0, 156.0, 199.0, 164.0, 85.0, pd.NA]
hr = dict(month = month, language =language, salary = salary, office = office)
sal_df = pd.DataFrame(data=hr)
print(sal_df)
Here’s the output:
month | language | salary | office | |
---|---|---|---|---|
0 | June | R | 136.0 | Istanbul |
1 | October | Javascript | 156.0 | <NA> |
2 | June | R | 199.0 | Istanbul |
3 | September | Java | 164.0 | New York |
4 | December | Javascript | 85.0 | Toronto |
5 | March | Python | <NA> | Osaka |
Replacing a value in a pandas column / Series
Let’s assume that we would like to replace all occurrences of a value in a column.
We can easily see the impacted rows by filtering the DataFrame as shown below:
sal_df[sal_df['language'] == 'R']
month | language | salary | office | |
---|---|---|---|---|
0 | June | R | 136.0 | Istanbul |
2 | June | R | 199.0 | Istanbul |
We’ll now replace all occurrences of the value R with the value Python:
sal_df['language'] = sal_df['language'].replace(to_replace = 'R', value = 'Python')
Or alternatively persist your changes by using the inplace=True parameter.
sal_df['language'].replace(to_replace = 'R', value = 'Python', inplace=True)
And the resulting Series will be:
0 Python 1 Javascript 2 Python 3 Java 4 Javascript 5 Python Name: language, dtype: object
Find and Change a sub string in a pandas DataFrame column
In this next recipe, we’ll modify all occurrences of a specific part of a string in a DataFrame column:
# define a string to search for
sub_str = 'Java'
# find the impacted rows
sal_df[sal_df['language'].str.contains(sub_str)]
# Replace the string
sal_df['language'].replace(to_replace= sub_str, value = 'J', regex=True)
The output will be a Series. Note that the parameter regex=True allowed to replace part of a string even if the entire cell value didn’t match.
0 Python 1 Jscript 2 Python 3 J 4 Jscript 5 Python Name: language, dtype: object
Find and replace multiple values in a pandas column
If we want to match and change several values we will pass a list to the replace() method:
value_lst = ['Istanbul','Tokyo', 'Osaka']
sal_df['office'].replace(to_replace= value_lst , value = 'Asia')
Search and change NAN in pandas DataFrames
We can obviously run the replace(0 function on the DataFrame. To replace all empty cell occurrences we can use the code below
Find in which DataFrame columns we have nan values:
sal_df.isna().any()
The result will be a boolean Series:
month False language False salary True office True dtype: bool
It is now clear that we have two columns containing empty nan values. We can replace the nan values by passing a dictionary:
sal_df.fillna({'office' : '', 'salary' : sal_df['salary'].mean()})