Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
I want to merge two DataFrames in PySpark: df1, which is empty and created from a schema, and df2, which is non-empty and filled from a CSV file (with some conditions applied, of course).
How can I achieve that?
First, ensure that the columns in df2 match the columns in df1's schema: drop any extra columns from df2 and add the missing ones as null. Then you can safely merge df1 and df2, since they will have the same schema.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([...])  # Define your schema here

# Create an empty DataFrame with the schema
df1 = spark.createDataFrame([], schema)

# Load df2 from a CSV file
df2 = spark.read.csv("path_to_csv", header=True, inferSchema=True)

# Align columns in df2 with df1's schema
for field in schema.fields:
    if field.name not in df2.columns:
        # Add missing columns as null in df2
        df2 = df2.withColumn(field.name, lit(None).cast(field.dataType))
    else:
        # Cast to the correct data type if the column exists
        df2 = df2.withColumn(field.name, df2[field.name].cast(field.dataType))

# Drop extra columns from df2 that are not in df1's schema
df2 = df2.select([field.name for field in schema.fields])

# Merge the DataFrames
final_df = df1.unionByName(df2)
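If you are on Spark 3.1 or later, unionByName also accepts an allowMissingColumns flag that fills in null for columns present in one DataFrame but not the other, which replaces the "add missing columns as null" branch of the loop above. This is a minimal sketch of that variant (you still need to drop df2's extra columns first, and cast types yourself if the inferred CSV types differ from the schema):

# Spark 3.1+ only: allowMissingColumns=True adds missing columns as null,
# so only the extra columns need to be dropped before the union
df2_trimmed = df2.select([f.name for f in schema.fields if f.name in df2.columns])
final_df = df1.unionByName(df2_trimmed, allowMissingColumns=True)

Either way, final_df.printSchema() is a quick check that the merged result matches df1's schema.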