How to Convert Delta Parquet Files to a Single Parquet File with the Latest Version of Delta

Richards, Sam (DG-STL-HQ) 151 Reputation points
2023-03-08T21:16:16.0366667+00:00

Hello,

I am looking for some best practices on how to implement a change to our existing incremental load DW in Azure Synapse. We are using Spark notebooks with Delta via PySpark and spark.sql.

Current State:

  1. Get changes from the source dbs (on-prem SQL Server) into raw as Parquet, using the last modified date in the source dbs and a watermark column in our datasource table
  2. Use a Spark notebook to process these changes from raw -> curated (note: our on-prem data does not need any changes to match business needs; that is already complete). A sketch of this upsert pattern follows the list.
  3. We store the output of (2.) in a Delta lake as Parquet files and in a serverless SQL lake database using Delta.
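
For context, the raw -> curated processing in (2.) is essentially a Delta upsert. A minimal sketch of the pattern, assuming a raw Parquet folder, a curated Delta path, and an ActivityKey business key (all names and paths here are placeholders):

    from delta.tables import DeltaTable

    # Placeholder paths -- adjust to your lake layout
    raw_path = "abfss://raw@<storage>.dfs.core.windows.net/DimActivity/"
    curated_path = "abfss://curated@<storage>.dfs.core.windows.net/DimActivity/"

    changes_df = spark.read.parquet(raw_path)
    curated = DeltaTable.forPath(spark, curated_path)

    # Upsert the daily change set into the curated Delta table
    (curated.alias("t")
        .merge(changes_df.alias("s"), "t.ActivityKey = s.ActivityKey")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())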

Desired State:

  1. Take the output of (3.) above and extract a single parquet file that is the latest version of a table in the delta lake

Example:

  1. On-prem data warehouse (Dim Activity) -> get changes on Dim Activity based on the last modified date in the on-prem table and the watermark column in our Azure SQL DB. This is loaded into the raw part of our data lake daily
  2. Use a Spark notebook to process the raw Parquet containing the changes to Dim Activity
  3. Write the updated version of the Dim Activity table to a Delta lake as Parquet files and as a Delta table in the serverless SQL lake database.
  4. (future state) - Add a step in the Spark notebook that will take the latest version of Dim Activity in our Delta lake or Delta table and write it out as a single Parquet file for downstream usage.

Let me know if this is not clear.

Azure Synapse Analytics

Accepted answer
  Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator
    2023-03-20T23:35:33.5133333+00:00

    Hello @Richards, Sam (DG-STL-HQ),

    Welcome to the MS Q&A platform.

    To extract the latest version of a Delta table as a single Parquet file, you can use Apache Spark with the Delta Lake APIs, as follows.

    1. Load the Delta table into a Spark DataFrame (reading a Delta table this way always returns its latest version):

    # Path to the Delta table folder in the lake (placeholder)
    delta_table_path = "/path/to/delta/table"

    df = spark.read.format("delta").load(delta_table_path)
    df.show()

    2. Alternatively, load it through the DeltaTable API (note the import, which the conversion step further below also needs):

    from delta.tables import DeltaTable

    delta_table = DeltaTable.forPath(spark, delta_table_path)
    df = delta_table.toDF()
    df.show()
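
    Either way you get the latest snapshot. If you ever need an older version instead, Delta time travel can read one by version number, and DESCRIBE HISTORY lists the versions available. A short sketch (version 5 is just an example):

    # List the table's versions, then read a specific older one
    spark.sql(f"DESCRIBE HISTORY delta.`{delta_table_path}`").show()
    old_df = spark.read.format("delta").option("versionAsOf", 5).load(delta_table_path)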

    3. There is no need to filter the DataFrame for the latest version yourself: a Delta table has no version column to filter on, and the reads in steps 1 and 2 already return the latest committed version. (The time-travel option above is only for reading older versions.)

    4. Write out the DataFrame as a single Parquet file by coalescing to one partition first:

    # Output folder for the extract (placeholder)
    output_path = "/path/to/output/DimActivity_parquet"

    df.coalesce(1).write.mode("overwrite").parquet(output_path)
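
    Note that Spark still writes output_path as a folder holding one part-*.parquet file (plus _SUCCESS metadata). If downstream consumers need a standalone file with a stable name, you can copy the part file out; in Synapse notebooks the built-in mssparkutils can do this. A sketch, assuming the single-partition write above and a placeholder destination path:

    # Find the single part file Spark produced and copy it to a stable name
    part_file = [f.path for f in mssparkutils.fs.ls(output_path) if f.name.endswith(".parquet")][0]
    mssparkutils.fs.cp(part_file, "/path/to/output/DimActivity.parquet")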

    If you have plain Parquet files (not in Delta Lake format), you can use the Apache Spark Python script below to convert the Parquet folder in place to Delta Lake format.

    %%pyspark
    from delta.tables import DeltaTable

    # Converts the Parquet folder in place by generating a Delta transaction log
    # (path is a placeholder; use the folder that holds your Parquet files)
    deltaTable = DeltaTable.convertToDelta(spark, "parquet.`/path/to/parquet/folder`")
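
    Once converted, the same folder reads back as a regular Delta table (same placeholder path as above):

    df = spark.read.format("delta").load("/path/to/parquet/folder")
    df.show()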
    

    Reference documents:

    https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/synapse-analytics/spark/apache-spark-delta-lake-overview.md

    https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/synapse-analytics/sql/query-delta-lake-format.md

    I hope this helps. Please let us know if you have any further questions.

