Staging of Source Data in Parquet vs SQL DB for DWH with ADF for Delta-Loads

Jörg Lang 120 Reputation points
2024-07-23T06:58:39+00:00

Hi there,
I have a "conceptual" question I would like to discuss.

A new DWH is being built on Azure using ADF and SQL DB for dimensions/facts. Today, the landing zone also runs on SQL DB.

We also load only delta datasets from the source DB (on-prem) with simple watermark logic like
select max(last_modified) from landing_table
which runs quite smoothly.
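
For context, this is roughly the pattern we use today (a minimal sketch; @watermark is just an illustrative pipeline parameter name, while landing_table and last_modified match our actual setup):

```sql
-- Watermark lookup against the SQL DB landing table
-- (runs in an ADF Lookup activity)
SELECT MAX(last_modified) AS watermark
FROM landing_table;

-- Delta extract from the on-prem source,
-- filtered by the watermark from the lookup
SELECT *
FROM source_table
WHERE last_modified > @watermark;
```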

Now the idea has come up to use Azure Data Lake Storage Gen2 in combination with Parquet files to increase our processing speed and reduce the space used in the SQL DB.

  • Does it make sense to switch?
  • How do we get the max(last_modified) from a Parquet file?
  • Is it right that we should create a new Parquet file for each run to ensure that we are not losing any staged data from before? How would we then get the max(last_modified) across all files? (See the sketch after this list.)
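
To make the second and third questions concrete, this is the kind of query we imagine for reading the watermark back out of the lake (a minimal sketch, assuming a Synapse serverless SQL pool can reach the ADLS Gen2 container; the account, container, and folder names are placeholders):

```sql
-- Scan all per-run Parquet files for one staged table and return the
-- highest watermark; the wildcard covers every run's file.
SELECT MAX(last_modified) AS watermark
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/landing/landing_table/*.parquet',
    FORMAT = 'PARQUET'
) AS staged_rows;
```

Would this be the recommended way, or is there a cheaper option than scanning all files on every run?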

Thanks for your thoughts on that.
Regards

Joerg

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.