Staging of Source Data in Parquet vs SQL DB for DWH with ADF for Delta-Loads

Jörg Lang 120 Reputation points
2024-07-23T06:58:39+00:00

Hi there,
I have a "conceptual" question I would like to discuss.

A new DWH is being built on Azure using ADF and SQL DB for dimensions/facts. Today, the landing zone also runs on SQL DB.

We also load only delta datasets from the source DB (on-prem) with simple watermark logic like
select max(last_modified) from landing_table
which runs quite smoothly.
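
For context, this is roughly the pattern we use today (a minimal sketch; @watermark is just an illustrative pipeline parameter name, while landing_table and last_modified match our actual setup):

```sql
-- Watermark lookup against the SQL DB landing table
-- (runs in an ADF Lookup activity)
SELECT MAX(last_modified) AS watermark
FROM landing_table;

-- Delta extract from the on-prem source,
-- filtered by the watermark from the lookup
SELECT *
FROM source_table
WHERE last_modified > @watermark;
```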

Now the idea has come up to use Azure Data Lake Storage Gen2 in combination with Parquet files to increase our processing speed and reduce the space used in the SQL DB.

  • Does it make sense to switch?
  • How do we get the max(last_modified) from a Parquet file?
  • Is it right that we should create a new Parquet file for each run to ensure that we are not losing any staged data from before? How would we then get the max(last_modified) across all files? (See the sketch after this list.)
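
To make the second and third questions concrete, this is the kind of query we imagine for reading the watermark back out of the lake (a minimal sketch, assuming a Synapse serverless SQL pool can reach the ADLS Gen2 container; the account, container, and folder names are placeholders):

```sql
-- Scan all per-run Parquet files for one staged table and return the
-- highest watermark; the wildcard covers every run's file.
SELECT MAX(last_modified) AS watermark
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/landing/landing_table/*.parquet',
    FORMAT = 'PARQUET'
) AS staged_rows;
```

Would this be the recommended way, or is there a cheaper option than scanning all files on every run?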

Thanks for your thoughts on that.
Regards

Joerg

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.