How to update values in PySpark DataFrames?

When working with PySpark DataFrames, you might need to update specific values in certain rows and columns. This could be for data cleaning, transformations, or simply to correct errors.

Create a Spark DataFrame

Assume we have a DataFrame that contains employee data with the following columns: id, name, department, and salary.

from pyspark.sql import SparkSession
from pyspark.sql.functions import when

# Initialize Spark session
spark_session = SparkSession.builder.appName('dataframe_update').getOrCreate()

# Define sample data
employees = [("1", "David", "Finance", 83000),
             ("2", "Dean", "IT", 84000),
             ("3", "Harry", "HR", 75000)]

# Create Spark DataFrame
hrdf = spark_session.createDataFrame(employees, ["id", "name", "department", "salary"])

Update values in PySpark columns

We will use the withColumn method, which returns a new DataFrame by adding a column or replacing the existing column that has the same name. The when function lets us apply conditional updates, and chaining otherwise keeps the existing value for rows that do not match the condition.

In the following example, we raise the salary of every employee in the IT department.

# Raise the salary of IT employees by 25%; leave other salaries unchanged
hrdf = hrdf.withColumn("salary",
                       when(hrdf.department == "IT",
                            hrdf.salary * 1.25)
                       .otherwise(hrdf.salary))

# Display the updated DataFrame
hrdf.show()

Replace string in PySpark DataFrame

In our second example, we standardize the naming conventions within the data by making string replacements. Specifically, we replace the department name ‘IT’ with ‘Information Technology’ in our DataFrame. Here’s the code you can use:

from pyspark.sql.functions import col

# Modify the department name; keep all other departments unchanged
hrdf = hrdf.withColumn("department",
                       when(col("department") == "IT",
                            "Information Technology")
                       .otherwise(col("department")))


Can I update null values in a specific column?

Yes. Use fillna, or combine when with isNull, to update null values.

How to add a new column with updated values based on other columns?

Use the withColumn method with a new column name. Then, define the new column’s values using existing columns.