Merge DataFrames based on 3 conditions

Sasha Sasha 200 Reputation points
2023-11-24T14:55:09.4433333+00:00

I want to merge two DataFrames in PySpark: df1, which is empty and created from a schema, and df2, which is non-empty and loaded from a CSV file. The merge should follow these conditions:

  1. If both DataFrames have the same number of columns, merge them directly.
  2. If df2 has additional columns, these should be dropped.
  3. If df2 has fewer columns, the missing columns should be filled with null values in the merged DataFrame.

How can I achieve this?

Azure Databricks

Accepted answer
  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2023-11-24T15:03:16.9266667+00:00

    First, ensure that the columns in df2 match the columns in the schema of df1. This means dropping extra columns from df2 and adding missing columns as null.

    Then you can safely merge df1 and df2 as they will now have the same schema.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit
    from pyspark.sql.types import StructType

    # Databricks notebooks already provide `spark`; getOrCreate() reuses it
    spark = SparkSession.builder.getOrCreate()

    schema = StructType([...])  # Define your schema here
    # Create an empty DataFrame with the schema
    df1 = spark.createDataFrame([], schema)
    # Load df2 from a CSV file
    df2 = spark.read.csv("path_to_csv", header=True, inferSchema=True)
    # Align columns in df2 with df1's schema
    for field in schema.fields:
        if field.name not in df2.columns:
            # Add missing columns as null in df2, typed to match the schema
            df2 = df2.withColumn(field.name, lit(None).cast(field.dataType))
        else:
            # Cast to the correct data type if the column exists
            df2 = df2.withColumn(field.name, df2[field.name].cast(field.dataType))
    # Keep only the schema's columns, in schema order (drops extra columns from df2)
    df2 = df2.select([field.name for field in schema.fields])
    # Merge the DataFrames; df1 is empty, so the result is df2's rows in df1's schema
    final_df = df1.unionByName(df2)
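
As a possible variation on the loop above, the alignment can be wrapped in a reusable function that builds all columns in a single select. This is only a sketch: `column_plan` and `align_to_schema` are hypothetical helper names, not part of PySpark, and the comparison is an exact, case-sensitive string match, as in the loop above, so a CSV header such as `ID` would be treated as different from a schema field `id`.

```python
def column_plan(target_cols, source_cols):
    """Classify column names for aligning a source DataFrame to a target schema.

    Returns three lists: (shared, missing, extra).
    shared  -> in both; kept and cast to the target type
    missing -> only in the target; added as typed nulls
    extra   -> only in the source; dropped
    """
    source = set(source_cols)
    target = set(target_cols)
    shared = [c for c in target_cols if c in source]
    missing = [c for c in target_cols if c not in source]
    extra = [c for c in source_cols if c not in target]
    return shared, missing, extra


def align_to_schema(df, schema):
    """Return df with exactly the schema's columns, in schema order (hypothetical helper)."""
    # Imported here so column_plan can be used without a Spark installation.
    from pyspark.sql.functions import col, lit

    present = set(df.columns)
    return df.select([
        # Existing columns are cast; absent ones become nulls of the target type.
        (col(f.name) if f.name in present else lit(None)).cast(f.dataType).alias(f.name)
        for f in schema.fields
    ])
```

With these helpers, the merge would become `final_df = df1.unionByName(align_to_schema(df2, schema))`, and `column_plan(df1.columns, df2.columns)` can be printed beforehand to check which columns will be dropped or null-filled.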
    

1 additional answer

  1. Sasha Sasha 200 Reputation points
    2023-11-24T15:00:28.9266667+00:00

    Test Test Test
