Incremental Load Using COPY INTO or Auto Loader in Synapse

Arbi Nazarian
2022-08-17T19:50:15.917+00:00

Hello Everyone,

This is my first time posting, I am fairly new to ETL and Azure Synapse.

Over the past couple of months I have been delegated the task of developing an ETL process using Azure Synapse. I have learned a ton of great concepts, done a lot of research, and watched all of @MarkKromer-MSFT's videos. However, before starting the ETL project I want to get some advice on how to use Delta Lake files for the entire ETL process. Let me explain:

We get our data from a source that does not implement any type of change tracking. Every day our source files (extracts) are rewritten with the new data.

Not all the tables need to have changes tracked, but for the ones that do, I would like to implement a Hashing technique, track those changes and only update the records that have been changed. This part I understand.

For those that do not need to be tracked I still want to use an incremental load so that the ETL process is faster and more efficient.

In my research I realized that Delta files are the new format that offer all the things we want in our ETL.

My questions, assuming I implement Delta files (bring the files from the source into Bronze and convert them to Delta files in the process), are:

  1. How can I implement the incremental load feature that is offered in Azure and Delta Files?
  2. Do I need to do this in a notebook, or can I do it in the Synapse UI? (Using Autoloader or Copy into?)
  3. Are there any special requirements that I must be aware of?
  4. Can I use partitioning with a year/month/day key in my incremental load? I am not sure whether partitions are meant to be used for incremental loading (our data is not that big, so there is no real need for partitioning, but it would be helpful down the line).

In my research I read many articles saying that most ETL is not fully planned out and that problems occur down the line. This is why I really want to plan it out as well as possible, and exactly why I have not started the process yet.

Our extract files are CSVs on our server (no database, just tables). Please let me know if you need additional information to help me with this process.

Thank you in advance


2 answers

  1. HimanshuSinha-msft, Microsoft Employee Moderator
    2022-08-18T22:31:03.87+00:00

    Hello @Arbi Nazarian,
    Thanks for the question and for using the MS Q&A platform.
    As we understand it, the ask here is how to design an ETL process that captures/implements incremental load; please let us know if that is not accurate.
    In my view, using Delta files to track the changes is the correct approach.

    You asked :

    How can I implement the incremental load feature that is offered in Azure and Delta Files?

    Do I need to do this in a notebook, or can I do it in the Synapse UI? (Using Autoloader or Copy into?)

    You have two options here

    Option #1
    Use a mapping data flow (MDF): MDF is mostly used for transformation. You can read the data from the two files, join them, and then use a function like CRC32 to produce a fingerprint for each row of the incoming file. Do the same on the previous file, compare the fingerprints, and determine the delta.

    Option #2
    You can also use Synapse Analytics itself: it offers Spark pools (which run on Apache Spark), where you can read the files, join the DataFrames, and use a hash function to determine the delta.
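
    A minimal PySpark sketch of Option #2 on a Spark pool, assuming the Silver layer is already a Delta table with the same columns (including the row hash), and that a hypothetical customer_id column is the business key; all paths and column names below are placeholders, not from this thread:

    ```python
    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    # Placeholder paths; "spark" is the session provided by the notebook.
    bronze_path = "abfss://lake@<storage-account>.dfs.core.windows.net/bronze/customers"
    silver_path = "abfss://lake@<storage-account>.dfs.core.windows.net/silver/customers"

    # Today's full extract, already landed in Bronze as CSV.
    incoming = spark.read.option("header", "true").csv(bronze_path)

    # Fingerprint each row: hash the concatenation of all non-key columns.
    non_key_cols = [c for c in incoming.columns if c != "customer_id"]
    incoming = incoming.withColumn("row_hash", F.sha2(F.concat_ws("||", *non_key_cols), 256))

    # MERGE into the Silver Delta table: update rows whose hash changed,
    # insert rows that are new; unchanged rows are left untouched.
    silver = DeltaTable.forPath(spark, silver_path)
    (silver.alias("t")
        .merge(incoming.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll(condition="t.row_hash <> s.row_hash")
        .whenNotMatchedInsertAll()
        .execute())
    ```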

    Are there any special requirements that I must be aware of?

    Since I have not seen the data, I will say that at a high level this should be fine.

    Can I use partitioning and use the year/month/day key partition technique in my incremental load? Not sure if partitions are to be used for incremental loading (our data is not that big so no real need for partitioning, but would be helpful down the line)

    If you have a column in the existing dataset which always gets updated on INSERT and UPDATE, please go for this option.
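
    If such a column exists (a hypothetical last_modified timestamp is used below purely for illustration, and paths are placeholders), the incremental filter could look like this minimal sketch:

    ```python
    from pyspark.sql import functions as F

    # Placeholder paths; "last_modified" stands in for a column that is
    # reliably set on every INSERT and UPDATE at the source.
    bronze_path = "abfss://lake@<storage-account>.dfs.core.windows.net/bronze/orders"
    silver_path = "abfss://lake@<storage-account>.dfs.core.windows.net/silver/orders"

    # Highest watermark value already present in the Silver Delta table.
    last_loaded = (spark.read.format("delta").load(silver_path)
                        .agg(F.max("last_modified")).first()[0])

    incoming = spark.read.option("header", "true").csv(bronze_path)
    if last_loaded is not None:  # on the very first run the table is empty, so load everything
        incoming = incoming.where(F.col("last_modified") > F.lit(last_loaded))

    incoming.write.format("delta").mode("append").save(silver_path)
    ```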

    Also, just curious: how are you planning to handle deleted records at the source?

    Please do let me know if you have any queries.
    Thanks
    Himanshu



  2. Arbi Nazarian
    2022-08-19T17:01:00.093+00:00

    Hello @HimanshuSinha-msft ,

    Thank you for your reply. Yes, I want to implement an ETL solution with incremental load, but I want to use the Delta file capabilities for appending the data that was not there the day before. Where I am confused is the documentation on COPY INTO and Auto Loader:
    (https://learn.microsoft.com/en-us/azure/databricks/delta/delta-ingest).

    I saw that Auto Loader and COPY INTO have this ability. Do they work only at the file level, or do they append the files' contents into the table?

    My plan so far is this:
    1. Our extracts run once a day, but the files (CSVs) are overwritten, so every day I get the same data plus the new data for that day (the source level).

    2. Take the source data and archive it as my Bronze layer in ADLS Gen2, then

    3. Use Auto Loader or COPY INTO to take the files from Bronze to Silver and incrementally update them (see the COPY INTO sketch after this list).

    The COPY INTO SQL command lets you load data from a file location into a Delta table. This is a re-triable and idempotent operation; files in the source location that have already been loaded are skipped.

    4. Use logic to only display the current row (most recent version) in the Gold tables (this depends on the use case, and I know how to do that part).
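
    For step 3, here is a minimal sketch of the COPY INTO command from the linked Databricks documentation (the paths are placeholders; this Delta-table form of COPY INTO, like Auto Loader, is a Databricks feature rather than part of the Synapse UI):

    ```python
    # COPY INTO as described in the linked docs: files already loaded into the
    # target Delta table are recorded and skipped, which is what makes the
    # command idempotent and safe to re-run every day.
    # The Silver Delta table and Bronze folder below are placeholders.
    spark.sql("""
        COPY INTO delta.`abfss://lake@<storage-account>.dfs.core.windows.net/silver/customers`
        FROM 'abfss://lake@<storage-account>.dfs.core.windows.net/bronze/customers'
        FILEFORMAT = CSV
        FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
        COPY_OPTIONS ('mergeSchema' = 'true')
    """)
    ```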

    So my issues are in the Bronze-to-Silver step, where I want the most efficient way to incrementally load only the changed data.

    Ideally I would like our CSV files to go into ADLS Gen2 and then use Auto Loader to load only the new, updated, and deleted records. The watermark method seems to be the way to go; however, I would like to use the capabilities of the Delta file if possible, for less overhead. I saw that Auto Loader has this ability, but I am not sure whether it works only at the file level or whether it appends the files' contents into the table.
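
    For reference, a minimal Auto Loader sketch in the shape described in the linked docs (paths, schema location, and the target table name are placeholders; Auto Loader is a Databricks feature and is not available on a Synapse Spark pool):

    ```python
    # Auto Loader ("cloudFiles") incrementally discovers new files under the
    # Bronze folder and appends their rows to a Delta table; the checkpoint
    # records which files were already processed, so re-runs do not reload them.
    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.schemaLocation", "abfss://lake@<storage-account>.dfs.core.windows.net/_schemas/customers")
        .load("abfss://lake@<storage-account>.dfs.core.windows.net/bronze/customers")
        .writeStream
        .option("checkpointLocation", "abfss://lake@<storage-account>.dfs.core.windows.net/_checkpoints/customers")
        .trigger(availableNow=True)   # process whatever is there now, then stop
        .toTable("silver_customers"))
    ```

    As far as I understand, both Auto Loader and COPY INTO track files rather than rows: they append the contents of newly arrived files into the target table and skip files that have already been loaded, but they do not compare or deduplicate individual rows by themselves.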

    (To explain more about the Bronze layer: since we get the same file every day with a new modified date in the metadata, I can't use the modified date directly. However, I can start archiving the daily extracts and use the modified date there to bring in the new data. Do you think that could work?)

    If you have a column in the existing dataset which always gets updated on INSERT and UPDATE, please go for this option.

    I would have to set up the insert and update rules, but the tables I want to partition have a date column that I can use to partition them by. I am not sure whether partitioning and incremental load can help each other, since the data gets partitioned automatically when using Parquet.
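
    A minimal sketch of writing the Silver table as Delta partitioned by a date key (the order_date column and the paths are placeholders); later incremental loads that filter on these columns only touch the affected partitions:

    ```python
    from pyspark.sql import functions as F

    # Derive year/month/day partition columns from a hypothetical date column
    # and write the Delta table partitioned by them.
    df = (spark.read.option("header", "true")
              .csv("abfss://lake@<storage-account>.dfs.core.windows.net/bronze/orders")
              .withColumn("order_date", F.to_date("order_date"))
              .withColumn("year", F.year("order_date"))
              .withColumn("month", F.month("order_date"))
              .withColumn("day", F.dayofmonth("order_date")))

    (df.write.format("delta")
         .mode("overwrite")
         .partitionBy("year", "month", "day")
         .save("abfss://lake@<storage-account>.dfs.core.windows.net/silver/orders"))
    ```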

    Let me know if you have any more questions about the specifics, in case I did not explain it well. Thanks

