Bulk updating Delta tables in Azure Databricks

Dipesh Yogi 0 Reputation points
2023-01-31T14:20:31.2066667+00:00

Hi Team,

I have some data in Delta table with multiple columns and each record has a unique identifier.

I want to update some columns as per the new values coming in for each of these unique records. However updating one record at a time is taking a lot of time.

Is there any way i can perform bulk update on Delta tables for all the unique records and their columns at once?

Thank you.

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,325 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. Bhargava-MSFT 31,196 Reputation points Microsoft Employee
    2023-02-02T21:47:58.69+00:00

    Hello @Dipesh Yogi ,

    Welcome to the MS Q&A platform.

    You can use merge statement to bulk updates on Delta tables in Azure Databricks

    The MERGE statement allows updating the existing records in a Delta table based on a matching condition and inserting new records if no match is found.

    SQL :

    MERGE INTO people10m
    USING people10mupdates
    ON people10m.id = people10mupdates.id
    WHEN MATCHED THEN
      UPDATE SET
        id = people10mupdates.id,
        firstName = people10mupdates.firstName,
        middleName = people10mupdates.middleName,
        lastName = people10mupdates.lastName,
        gender = people10mupdates.gender,
        birthDate = people10mupdates.birthDate,
        ssn = people10mupdates.ssn,
        salary = people10mupdates.salary
    WHEN NOT MATCHED
      THEN INSERT (
        id,
        firstName,
        middleName,
        lastName,
        gender,
        birthDate,
        ssn,
        salary
      )
      VALUES (
        people10mupdates.id,
        people10mupdates.firstName,
        people10mupdates.middleName,
        people10mupdates.lastName,
        people10mupdates.gender,
        people10mupdates.birthDate,
        people10mupdates.ssn,
        people10mupdates.salary
      )
    

    Python:

    from delta.tables import *
    
    deltaTablePeople = DeltaTable.forPath(spark, '/tmp/delta/people-10m')
    deltaTablePeopleUpdates = DeltaTable.forPath(spark, '/tmp/delta/people-10m-updates')
    
    dfUpdates = deltaTablePeopleUpdates.toDF()
    
    deltaTablePeople.alias('people') \
      .merge(
        dfUpdates.alias('updates'),
        'people.id = updates.id'
      ) \
      .whenMatchedUpdate(set =
        {
          "id": "updates.id",
          "firstName": "updates.firstName",
          "middleName": "updates.middleName",
          "lastName": "updates.lastName",
          "gender": "updates.gender",
          "birthDate": "updates.birthDate",
          "ssn": "updates.ssn",
          "salary": "updates.salary"
        }
      ) \
      .whenNotMatchedInsert(values =
        {
          "id": "updates.id",
          "firstName": "updates.firstName",
          "middleName": "updates.middleName",
          "lastName": "updates.lastName",
          "gender": "updates.gender",
          "birthDate": "updates.birthDate",
          "ssn": "updates.ssn",
          "salary": "updates.salary"
        }
      ) \
      .execute()
    

    Reference document: https://learn.microsoft.com/en-us/azure/databricks/delta/merge

    I hope this helps. Please let me know if you have any further questions.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.