Choosing an Approach for Incremental Loading with Watermark in Azure Data Factory: Efficiency and Cost Considerations

Lucas Medina 20 Reputation points
2024-01-13T01:55:43.14+00:00

Hi all, I'm working on implementing an Azure Data Factory pipeline for incremental data loading using a watermark table approach. I have identified two different approaches but am unsure which one is considered the best practice in terms of cost-effectiveness and efficiency.

Approach One: This approach is detailed in a YouTube video here: https://www.youtube.com/watch?v=t1kWzdAP3kk

Approach Two: The second approach is outlined in a Stack Overflow answer here: How to Perform Incremental Historical Data Load from SQL On-Premise to Azure Blob Storage with Dynamic Folder Structure?

My Question: Which of these approaches is generally considered best practice for incremental loading in Azure Data Factory when considering both cost-effectiveness and efficiency? Are there any significant pros or cons to each method that I should be aware of? Any insights or experiences with either of these methods would be greatly appreciated.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Harishga 4,875 Reputation points Microsoft Vendor
    2024-01-16T08:34:57.13+00:00

    Hi Lucas Medina,

    Welcome to the Microsoft Q&A platform, and thanks for posting your question here.

    Approach One, as detailed in the YouTube video, uses a Lookup activity to retrieve the maximum value of the watermark column from the destination table, then a Copy activity to copy only the rows from the source table whose watermark value is newer. It is simple to implement, but scanning the destination table for the maximum watermark on every run can become slow and costly as that table grows, especially when it is large and heavily partitioned.
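
    A minimal sketch of the two queries this pattern needs, assuming a hypothetical LastModifiedTime watermark column, tables named dbo.SourceTable and dbo.DestinationTable, and a Lookup activity named LookupWatermark (none of these names come from the video):

    ```sql
    -- Query for the Lookup activity: fetch the current high-water mark
    -- from the destination table.
    SELECT MAX(LastModifiedTime) AS WatermarkValue
    FROM dbo.DestinationTable;

    -- Source query for the Copy activity. The @{...} expression is ADF
    -- dynamic content that injects the Lookup activity's output at run time.
    SELECT *
    FROM dbo.SourceTable
    WHERE LastModifiedTime > '@{activity('LookupWatermark').output.firstRow.WatermarkValue}';
    ```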

    Approach Two, as outlined in the Stack Overflow answer, uses a stored procedure to retrieve the new or updated rows from the source table and a Copy activity to write them to the destination. Because the filtering logic lives in the stored procedure, you get finer control over data retrieval, which can make it more efficient and cost-effective for large tables with many partitions. It does require more setup, however: you need to create the stored procedure and configure a linked service (backed by a self-hosted integration runtime) to reach the on-premises SQL Server.
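
    As a rough illustration of what such a stored procedure might look like (all object names here are hypothetical, not taken from the linked answer):

    ```sql
    -- Hypothetical procedure on the on-premises SQL Server. In the Copy
    -- activity, set the source to "Stored procedure" and bind @LastWatermark
    -- to the previous watermark value via dynamic content.
    CREATE PROCEDURE dbo.usp_GetChangedRows
        @LastWatermark DATETIME2
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Return only rows changed since the last successful load.
        SELECT Id, Col1, Col2, LastModifiedTime
        FROM dbo.SourceTable
        WHERE LastModifiedTime > @LastWatermark;
    END;
    ```

    Keeping the filter inside a procedure also means you can later add joins, soft-delete handling, or folder-path logic without editing the pipeline itself.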

    In terms of cost-effectiveness and efficiency, Approach Two tends to be the better option for large, heavily partitioned tables, because the stored procedure keeps the filtering work on the database side. For smaller tables, or tables with fewer partitions, Approach One is usually sufficient and quicker to set up.
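
    Whichever approach you choose, the pipeline normally ends by persisting the new high-water mark so that the next run starts where this one finished. A minimal sketch, assuming a dedicated watermark table and a Stored Procedure activity that runs after the Copy activity succeeds (again, names are illustrative):

    ```sql
    -- Hypothetical watermark table: one row per source table being loaded.
    CREATE TABLE dbo.WatermarkTable (
        TableName      SYSNAME   NOT NULL PRIMARY KEY,
        WatermarkValue DATETIME2 NOT NULL
    );

    -- Called by a Stored Procedure activity after a successful copy.
    CREATE PROCEDURE dbo.usp_UpdateWatermark
        @TableName    SYSNAME,
        @NewWatermark DATETIME2
    AS
    BEGIN
        UPDATE dbo.WatermarkTable
        SET WatermarkValue = @NewWatermark
        WHERE TableName = @TableName;
    END;
    ```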

    It's also worth noting that there are other approaches, or variations on these two, that may suit a specific scenario better (for example, using SQL Server Change Tracking as the change-detection mechanism instead of a watermark column). Consider the specific requirements and constraints of your project when deciding on the best approach for incremental loading in Azure Data Factory.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And, if you have any further query, do let us know.