Data Factory pipeline that only appends new and updated data

2JK 241 Reputation points
2021-11-24T19:49:21.727+00:00

I currently have a pipeline in Data Factory that copies several tables from an Orcale database to an Azure SQL DB. The pipeline runs once per day. What is currently being done is at each run, the database and the tables under it in the Azure SQL DB, are all dropped first, then the Copy Data tool in the pipeline runs and basically creates a new database and tables.

What I want is, on each run, only new and updated data are 'copied' to the SQL DB. I don't want to drop anything, just update what was updated and append new data.

How can I best implement this?

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
6,306 questions
{count} votes

1 answer

Sort by: Most helpful
  1. svijay-MSFT 5,111 Reputation points Microsoft Employee
    2021-11-25T14:36:47.763+00:00

    Hello @2JK ,

    Thanks for the question and using MS Q&A platform.

    My understanding is that you'd like to copy only the new/updated data from multiple tables of your Oracle and copy it to the destination SQL.

    From my research there is no out of box approach to do this. I shall share a high level approach that could meet your requirement.

    Step 1

    You would need to add a column to all the tables subjected - that keeps track of the last_modified date.

    Everytime an item is added / updated - the last_modified gets updated with the current time.

    **Step 2 **

    You'd need an additional table that would track the last pipeline run for every table

    Step 3

    Pipeline activity - You'd look up the Step 2 table - ​for the last pipeline run activity for each table

    Filter the rows that has last modified date greater than the last pipeline run - This would be your source.

    These filtered rows would be copied to the corresponding SQL table.

    There is step by step detailed documentation for a similar requirement for tables in SQL DB for the above steps outlined : https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-multiple-tables-powershell

    From my research , oracle DB also don't have any inbuilt feature to track the new/updated rows.

    Discussion here: https://community.oracle.com/tech/developers/discussion/2565781/how-to-query-only-updated-or-new-rows-in-oracle

    The approach in this case as well would be similar.

    Hope this Helps