Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
I want to merge two DataFrames in PySpark: df1, which is empty and created from a schema, and df2, which is non-empty and filled from a CSV file (with some conditions applied, of course).
How can I achieve that?
First, ensure that the columns in df2 match the columns in df1's schema: drop any extra columns from df2 and add the missing ones as null. Then you can safely merge df1 and df2, since they will have the same schema.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([...])  # Define your schema here

# Create an empty DataFrame with the schema
df1 = spark.createDataFrame([], schema)

# Load df2 from a CSV file
df2 = spark.read.csv("path_to_csv", header=True, inferSchema=True)

# Align columns in df2 with df1's schema
for field in schema.fields:
    if field.name not in df2.columns:
        # Add missing columns as null in df2
        df2 = df2.withColumn(field.name, lit(None).cast(field.dataType))
    else:
        # Cast to the correct data type if the column exists
        df2 = df2.withColumn(field.name, df2[field.name].cast(field.dataType))

# Drop extra columns from df2 that are not in df1's schema
df2 = df2.select([field.name for field in schema.fields])

# Merge the DataFrames
final_df = df1.unionByName(df2)
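If you are on Spark 3.1 or later, unionByName also accepts an allowMissingColumns flag that fills in null for columns present in one DataFrame but not the other, which replaces the "add missing columns as null" branch of the loop above. This is a minimal sketch of that variant (you still need to drop df2's extra columns first, and cast types yourself if the inferred CSV types differ from the schema):

# Spark 3.1+ only: allowMissingColumns=True adds missing columns as null,
# so only the extra columns need to be dropped before the union
df2_trimmed = df2.select([f.name for f in schema.fields if f.name in df2.columns])
final_df = df1.unionByName(df2_trimmed, allowMissingColumns=True)

Either way, final_df.printSchema() is a quick check that the merged result matches df1's schema.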