How to check if a PySpark DataFrame or column contains a string or value?

To check whether a column in a Spark DataFrame contains a specific value, you can combine the filter method with a column condition such as == or isin. In this tutorial, I will provide a detailed step-by-step guide for finding one or multiple values in an example employee dataset.

#1 – Import PySpark modules

We first need to import SparkSession and create a Spark session:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SearchValue').getOrCreate()

#2 – Create a PySpark DataFrame

Next we will create a PySpark DataFrame from a Python list. You can also define a PySpark DataFrame from Python dictionary objects.

emp_cols = ["Name", "Department", "Salary"]

employees = [("James", "Sales", 3000),
             ("Michael", "Sales", 4600),
             ("Robert", "IT", 4100),
             ("Maria", "Finance", 3000)]

hr_df = spark.createDataFrame(employees, emp_cols)

#3 – Search for a specific value in the DataFrame

We will first define which value to search for:

search_value = 3000

Then we will check if the value indeed exists in the DataFrame:

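from pyspark.sql.functions import array, array_contains, col
# Cast each column to string and pack the row into an array so mixed types compare safely
row_values = array(*[col(c).cast("string") for c in hr_df.columns])
value_exists_in_df = hr_df.filter(array_contains(row_values, str(search_value))).count() > 0
print(value_exists_in_df)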

This code prints True if the value is found anywhere in the DataFrame and False otherwise.

#4 – Find specific strings in a column

To check for a value within a specific column, we will use the col function from pyspark.sql.functions.

from pyspark.sql.functions import col

We will then define the string to search for and filter the DataFrame column.

search_value = "Sales"
string_in_column = hr_df.filter(col("Department") == search_value).count() > 0
print(string_in_column)

FAQ

Are there faster ways to search for specific values in PySpark?

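Based on my experience, especially with large DataFrames, avoiding a full count can noticeably speed up the search. Python's built-in any cannot be applied to Spark column expressions, so the per-column conditions are combined with the | operator instead, and at most one matching row is fetched. This lets Spark stop at the first occurrence of the string / value rather than counting every match. Here’s a simple example based on the data that we used before:

from functools import reduce
from pyspark.sql.functions import col, lit

search_value = "Sales"
# Check for the value across all columns: OR the per-column comparisons together
# (columns are cast to string so numeric columns compare safely with text)
any_column_matches = reduce(lambda a, b: a | b, [col(c).cast("string") == lit(search_value) for c in hr_df.columns])
# head(1) fetches at most one matching row instead of counting all matches
value_exists_df_any = len(hr_df.filter(any_column_matches).head(1)) > 0

print(value_exists_df_any)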