How to Convert Delta Parquet Files to a Single Parquet File with the Latest Version of Delta

Richards, Sam (DG-STL-HQ) 151 Reputation points
2023-03-08T21:16:16.0366667+00:00

Hello,

I am looking for some best practices on how to implement a change to our existing incremental load DW in Azure Synapse. We are using Spark notebooks with Delta via PySpark and spark.sql.

Current State:

  1. Get changes from the source dbs (on-prem SQL Server) into raw as Parquet, using the last modified date in the source dbs and a watermark column in our datasource table
  2. Use a Spark notebook to process these changes from raw -> curated (note: our on-prem data does not need any changes to match business needs; that is already complete). A sketch of this upsert pattern follows the list.
  3. We store the output of (2.) in a Delta lake as Parquet files and in a serverless SQL lake database using Delta.
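
For context, the raw -> curated processing in (2.) is essentially a Delta upsert. A minimal sketch of the pattern, assuming a raw Parquet folder, a curated Delta path, and an ActivityKey business key (all names and paths here are placeholders):

    from delta.tables import DeltaTable

    # Placeholder paths -- adjust to your lake layout
    raw_path = "abfss://raw@<storage>.dfs.core.windows.net/DimActivity/"
    curated_path = "abfss://curated@<storage>.dfs.core.windows.net/DimActivity/"

    changes_df = spark.read.parquet(raw_path)
    curated = DeltaTable.forPath(spark, curated_path)

    # Upsert the daily change set into the curated Delta table
    (curated.alias("t")
        .merge(changes_df.alias("s"), "t.ActivityKey = s.ActivityKey")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())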

Desired State:

  1. Take the output of (3.) above and extract a single parquet file that is the latest version of a table in the delta lake

Example:

  1. On-prem data warehouse (Dim Activity) -> get changes on Dim Activity based on the last modified date in the on-prem table and the watermark column in our Azure SQL DB. This is loaded into the raw part of our data lake daily
  2. Use a Spark notebook to process the raw Parquet containing the changes to Dim Activity
  3. Write the updated version of the Dim Activity table to a Delta lake as Parquet files and as a Delta table in the serverless SQL lake database.
  4. (future state) - Add a step in the Spark notebook that will take the latest version of Dim Activity in our Delta lake or Delta table and write it out as a single Parquet file for downstream usage.

Let me know if this is not clear.

Azure Synapse Analytics

Accepted answer
  Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator
    2023-03-20T23:35:33.5133333+00:00

    Hello @Richards, Sam (DG-STL-HQ),

    Welcome to the MS Q&A platform.

    To extract the latest version of a Delta table as a single Parquet file, you can use Apache Spark with the Delta Lake APIs, as follows.

    1. Load the Delta table into a Spark DataFrame (reading a Delta table this way always returns its latest version):

    # Path to the Delta table folder in the lake (placeholder)
    delta_table_path = "/path/to/delta/table"

    df = spark.read.format("delta").load(delta_table_path)
    df.show()

    2. Alternatively, load it through the DeltaTable API (note the import, which the conversion step further below also needs):

    from delta.tables import DeltaTable

    delta_table = DeltaTable.forPath(spark, delta_table_path)
    df = delta_table.toDF()
    df.show()
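
    Either way you get the latest snapshot. If you ever need an older version instead, Delta time travel can read one by version number, and DESCRIBE HISTORY lists the versions available. A short sketch (version 5 is just an example):

    # List the table's versions, then read a specific older one
    spark.sql(f"DESCRIBE HISTORY delta.`{delta_table_path}`").show()
    old_df = spark.read.format("delta").option("versionAsOf", 5).load(delta_table_path)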

    3. There is no need to filter the DataFrame for the latest version yourself: a Delta table has no version column to filter on, and the reads in steps 1 and 2 already return the latest committed version. (The time-travel option above is only for reading older versions.)

    4. Write out the DataFrame as a single Parquet file by coalescing to one partition first:

    # Output folder for the extract (placeholder)
    output_path = "/path/to/output/DimActivity_parquet"

    df.coalesce(1).write.mode("overwrite").parquet(output_path)
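
    Note that Spark still writes output_path as a folder holding one part-*.parquet file (plus _SUCCESS metadata). If downstream consumers need a standalone file with a stable name, you can copy the part file out; in Synapse notebooks the built-in mssparkutils can do this. A sketch, assuming the single-partition write above and a placeholder destination path:

    # Find the single part file Spark produced and copy it to a stable name
    part_file = [f.path for f in mssparkutils.fs.ls(output_path) if f.name.endswith(".parquet")][0]
    mssparkutils.fs.cp(part_file, "/path/to/output/DimActivity.parquet")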

    If you have plain Parquet files (not in Delta Lake format), you can use the Apache Spark Python script below to convert the Parquet folder in place to Delta Lake format.

    %%pyspark
    from delta.tables import DeltaTable

    # Converts the Parquet folder in place by generating a Delta transaction log
    # (path is a placeholder; use the folder that holds your Parquet files)
    deltaTable = DeltaTable.convertToDelta(spark, "parquet.`/path/to/parquet/folder`")
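
    Once converted, the same folder reads back as a regular Delta table (same placeholder path as above):

    df = spark.read.format("delta").load("/path/to/parquet/folder")
    df.show()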
    

    Reference documents:

    https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/synapse-analytics/spark/apache-spark-delta-lake-overview.md

    https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/synapse-analytics/sql/query-delta-lake-format.md

    I hope this helps. Please let us know if you have any further questions.

