How to create a PySpark DataFrame from a Python dictionary?

Step 1: Make sure that PySpark is installed

Firstly, ensure that PySpark is installed in your environment. If not, you can install it using the Python package installer (pip) or the Anaconda package manager.

You can check whether PySpark is installed by running the following command (findstr is Windows-specific; on macOS or Linux, use grep instead):

pip list | findstr pyspark

If PySpark is not installed, install it with pip:

pip install pyspark

Note: Attempting to import PySpark without installing the package first will raise a ModuleNotFoundError.
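
As a quick sanity check that the package is importable (assuming python resolves to the environment where you installed it), you can also run:

python -c "import pyspark; print(pyspark.__version__)"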

Step 2: Initialize a Spark session

Before creating a DataFrame, you have to initialize a Spark session. Start by importing SparkSession:

from pyspark.sql import SparkSession
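
The import alone does not start a session; use the standard builder pattern to create (or reuse) one. The app name 'employee_example' below is an arbitrary choice:

spark_session = SparkSession.builder.appName('employee_example').getOrCreate()

The later steps call createDataFrame() on this spark_session object.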

Step 3: Create a Python dictionary

Next, we will define a simple Python dictionary, in our case made up of employee-related data:

employee_data = {
    'employee_id': [11, 12, 13],
    'employee_name': ['Juan Perez', 'Ana Gomez', 'Carlos Ruiz'],
    'department': ['Finance', 'Marketing', 'HR']
}

Step 4: Create a DataFrame

Now, we will construct a DataFrame using the following code:

# Convert the dictionary values to tuples representing the rows
employee_tuples = list(zip(*employee_data.values()))

# Use the dictionary keys as the column names
columns = list(employee_data.keys())

# Create a DataFrame directly using the list of tuples and the column names
employee_df = spark_session.createDataFrame(employee_tuples, schema=columns)

employee_df.show()
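
If everything is wired up correctly, show() should print something like this:

+-----------+-------------+----------+
|employee_id|employee_name|department|
+-----------+-------------+----------+
|         11|   Juan Perez|   Finance|
|         12|    Ana Gomez| Marketing|
|         13|  Carlos Ruiz|        HR|
+-----------+-------------+----------+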

Note:

  • I used the statement list(zip(*employee_data.values())) to convert the dictionary values into a list of tuples, where each tuple represents one row of our PySpark DataFrame.
  • The PySpark createDataFrame() method constructs the DataFrame from the provided data and column names.
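
For reference, printing employee_tuples shows one tuple per row:

print(employee_tuples)
# [(11, 'Juan Perez', 'Finance'), (12, 'Ana Gomez', 'Marketing'), (13, 'Carlos Ruiz', 'HR')]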

FAQ:

Can I create a PySpark DataFrame from nested dictionaries?

Yes, PySpark can process nested dictionaries.

  • First, define a schema that matches the nested structure using the StructType and StructField classes. Note that you will need to import those classes (along with the field types you use, such as StringType) before invoking them:
from pyspark.sql.types import StructType, StructField, StringType
  • Your next step is to convert the nested dictionaries to a list of Row objects containing matching nested Row objects (Row is imported from pyspark.sql).
  • Last, pass the list of Row objects and the defined schema to the createDataFrame() function to create the DataFrame, as shown in the sketch below.
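
A minimal sketch of that approach (the address, city, and country fields are hypothetical, chosen only to illustrate one level of nesting):

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark_session = SparkSession.builder.getOrCreate()

# Schema mirroring the nested structure: 'address' is a struct column
schema = StructType([
    StructField('employee_id', IntegerType()),
    StructField('employee_name', StringType()),
    StructField('address', StructType([
        StructField('city', StringType()),
        StructField('country', StringType()),
    ])),
])

# A nested dictionary converted to a Row containing a nested Row
nested_data = [
    {'employee_id': 11, 'employee_name': 'Juan Perez',
     'address': {'city': 'Bogota', 'country': 'Colombia'}},
]
rows = [
    Row(employee_id=d['employee_id'],
        employee_name=d['employee_name'],
        address=Row(**d['address']))
    for d in nested_data
]

nested_df = spark_session.createDataFrame(rows, schema=schema)
nested_df.show(truncate=False)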