Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
How do I use PySpark in a Synapse notebook to upsert data in SQL Server? I am able to read the table with the following code:
df = spark.read.jdbc(url=jdbc_url, table="Dim.table_name", properties=properties)
But I am not sure how to upsert data based on key columns in SQL Server.
You can use SQL Server's MERGE statement. One thing to keep in mind: Spark SQL cannot run MERGE against a table accessed over JDBC (its MERGE INTO only works on sources such as Delta tables), so the usual pattern is to stage the new rows in SQL Server first and then run the MERGE on the SQL Server side:
# (Optional) read the existing data if you want to inspect it first;
# it is not needed for the upsert itself
existing_df = spark.read.jdbc(url=jdbc_url, table="Dim.table_name", properties=properties)

# Prepare your new/updated data
# (this is just an example, replace with your actual new data)
new_data = [("1", "John", "Doe", "30"), ("2", "Jane", "Smith", "25"), ("3", "New", "Person", "35")]
columns = ["id", "first_name", "last_name", "age"]
new_df = spark.createDataFrame(new_data, columns)
# Construct the T-SQL MERGE statement. The staging table is the source and
# the key column (id) drives the match; "Dim.table_name_staging" below is
# just an example name - use whatever staging table you create.
merge_stmt = """
MERGE INTO Dim.table_name AS target
USING Dim.table_name_staging AS source
ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET
        target.first_name = source.first_name,
        target.last_name = source.last_name,
        target.age = source.age
WHEN NOT MATCHED THEN
    INSERT (id, first_name, last_name, age)
    VALUES (source.id, source.first_name, source.last_name, source.age);
"""
# Do not call spark.sql(merge_stmt): the target is a SQL Server table, not a
# table in the Spark catalog, and Spark's own MERGE INTO only supports sources
# such as Delta tables. The statement has to be executed on SQL Server itself
# (see the sketch after the code).
# Write the new data to a staging table in SQL Server. Do NOT write to the
# target table with mode="overwrite" - that would replace its entire contents.
new_df.write \
    .jdbc(url=jdbc_url, table="Dim.table_name_staging", mode="overwrite", properties=properties)
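To actually run merge_stmt from the notebook you need a direct connection to SQL Server, since Spark only reads and writes over JDBC. Below is a minimal sketch using java.sql.DriverManager through the notebook's JVM (via Py4J); it assumes your properties dictionary contains "user" and "password" keys and reuses the same jdbc_url you already use for reads:

# Minimal sketch: open a JDBC connection to SQL Server from the notebook's JVM
# and execute the MERGE as a single statement. Assumes properties has
# "user" and "password" keys - adjust to however you authenticate.
driver_manager = spark.sparkContext._gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, properties["user"], properties["password"])
try:
    stmt = conn.createStatement()
    stmt.executeUpdate(merge_stmt)  # runs the upsert inside SQL Server
    stmt.close()
finally:
    conn.close()

Run this after the staging write above so the source rows are already in place. An alternative is pyodbc, if the ODBC driver is available on your Spark pool, but the DriverManager approach has the advantage of reusing the JDBC driver and credentials you already use for reads and writes.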