As part of your data cleaning work you might need a replace one or multiple substrings in a pandas DataFrame. In today’s tutorial we will learn how to search for specific column values in your DataFrame that meet certain criteria and change them to other strings.
Example data
Let’s start by initializing a DataFrame:
import pandas as pd
month = [1,2,3,3,2]
language = ['R', 'Java', 'Javascript', 'Javascript', 'Python']
office = ['NYC-Midtown', 'Osaka', 'NYC-Midtown', 'Bangkok', 'Toronto']
salary = [144.00, 194.00, 131.00, 131.00, 107.00]
interviews = dict(month = month, office = office, language = language, salary = salary)
hr_df = pd.DataFrame(data=interviews)
hr_df.head()
Here’s our data:
month | office | language | salary | |
---|---|---|---|---|
0 | 1 | NYC-Midtown | R | 144.0 |
1 | 2 | Osaka | Java | 194.0 |
2 | 3 | NYC-Midtown | Javascript | 131.0 |
3 | 3 | Bangkok | Javascript | 131.0 |
4 | 2 | Toronto | Python | 107.0 |
Replace string in pandas column
First case will be to change specific strings in a DataFrame column and replace those with another value.
hr_df['language'] = hr_df['language'].replace(to_replace = 'Javascript', value= 'JS')
hr_df['language']
This will render the following Series / column:
0 R 1 Java 2 JS 3 JS 4 Python Name: language, dtype: object
Change cell value based on a condition or other column value
Next example is to modify values of a column according to specific if condition that is evaluated base on values in other single or multiple columns. Let us assume for example that we would like to apply a 25% raise to the salary offered in our NYC-Midtown office. This would look as following:
hr_df.loc[hr_df['office'] == "NYC-Midtown", 'salary'] = hr_df['salary'] * 1.25
hr_df.head()
month | office | language | salary | |
---|---|---|---|---|
0 | 1 | NYC-Midtown | R | 180.00 |
1 | 2 | Osaka | Java | 194.00 |
2 | 3 | NYC-Midtown | Javascript | 163.75 |
3 | 3 | Bangkok | Javascript | 131.00 |
4 | 2 | Toronto | Python | 107.00 |
Set column values according to a dictionary map
In our next example we would like to modify column values according to a mapping dictionary object that we will pass to the replace() method as shown below. In our case we just mapped the month values to month names.
hr_df.replace (to_replace = {1: 'January', 2 : 'February', 3:'March'})
Our DataFrame will look as following:
month | office | language | salary | |
---|---|---|---|---|
0 | January | NYC-Midtown | R | 180.00 |
1 | February | Osaka | Java | 194.00 |
2 | March | NYC-Midtown | Javascript | 163.75 |
3 | March | Bangkok | Javascript | 131.00 |
4 | February | Toronto | Python | 107.00 |
Update column values based on a list
In this last example for today, we will modify the contents of a DataFrame column based on a Python list. We’ll first define the list object, then drop the existing column. Last step will be to insert a new column based on the list values.
lang_lst = ['Python', 'JavaScript', 'Python', 'R', 'Julia']
hr_df_1 = hr_df.drop('language', axis=1)
hr_df_1.insert(2, column = 'language', value = lang_lst)
hr_df_1
Related learning
How to cast a pandas DataFrame column values to integer type?