Azure Data Factory Full Load (Change Data Capture)

Singh, Gurprateek 25 Reputation points
2023-08-07T13:12:44.73+00:00


Hi,

I am using ADF to acquire incremental changes from the source based on an updated_at attribute.

It performs a full load on the first run and incremental loads thereafter.

If I want to perform a full load again in the future, how could that be done?

Could we change the value of updated_at stored at the ADF level, which it references when acquiring data on the next iteration?


2 answers

  1. Amira Bedhiafi 31,391 Reputation points
    2023-08-07T14:00:24.3566667+00:00

    Azure Data Factory (ADF) allows you to perform incremental data loading by capturing changes from a source system. Typically, this is done by tracking a specific column, such as an updated_at attribute, that indicates when a record was last changed.

    To perform a full load again in the future, you'd essentially need to reset the point from which the incremental load resumes. Since you didn't provide much detail about your setup, I'll cover the common options:

    If you're using a watermark table or a system variable to track the last updated value, you could manually reset this value to a point in time before your data (e.g., '1970-01-01'). The next time your pipeline runs, it will see that it needs to load all changes since that early date and will, therefore, perform a full load.
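
    For example, if you follow the common watermark-table pattern, a small script like the sketch below could push the stored value back before the next run. This is only a sketch: the connection details and the watermarktable / TableName / WatermarkValue names are placeholders mirroring the usual ADF delta-copy layout, not anything from your pipeline.

    ```python
    import pyodbc

    # Placeholder connection details -- substitute your own server, database, and credentials.
    CONN_STR = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<your-server>.database.windows.net;"
        "DATABASE=<your-database>;"
        "UID=<user>;PWD=<password>"
    )

    def reset_watermark(table_name: str, reset_to: str = "1970-01-01") -> None:
        """Push the stored watermark back so the next run reloads everything."""
        conn = pyodbc.connect(CONN_STR)
        try:
            cursor = conn.cursor()
            # 'watermarktable' with TableName/WatermarkValue columns is an assumed
            # layout, following the common ADF delta-copy tutorial pattern.
            cursor.execute(
                "UPDATE watermarktable SET WatermarkValue = ? WHERE TableName = ?",
                reset_to,
                table_name,
            )
            conn.commit()
        finally:
            conn.close()

    reset_watermark("dbo.SourceTable")
    ```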

    You might control the incremental load through pipeline parameters. If the updated_at value is passed in as a parameter, you could expose a way to manually set it to an earlier date, triggering a full load on that run; see the sketch below.
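
    As a sketch of that approach, assuming your pipeline exposes a parameter carrying the watermark (hypothetically named lastUpdatedAt here), you could start a run with that parameter pushed back using the azure-mgmt-datafactory SDK. All resource names below are placeholders:

    ```python
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    # All names below are placeholders for your own subscription, factory, and pipeline.
    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    FACTORY_NAME = "<data-factory-name>"
    PIPELINE_NAME = "<incremental-pipeline>"

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    # Override the parameter that normally carries the last updated_at value;
    # pushing it back to an early date makes this run load everything.
    # 'lastUpdatedAt' is an assumed parameter name -- use whatever your pipeline defines.
    run = adf_client.pipelines.create_run(
        RESOURCE_GROUP,
        FACTORY_NAME,
        PIPELINE_NAME,
        parameters={"lastUpdatedAt": "1970-01-01T00:00:00Z"},
    )
    print(f"Started pipeline run {run.run_id}")
    ```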

    You might build your pipeline in such a way that you can manually trigger a full load through an additional trigger or manual process. This could involve a separate pipeline or an alteration to the existing one that allows you to bypass the incremental load process.
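
    The bypass logic itself is simple; the sketch below expresses it in Python for clarity, with an assumed full_load flag deciding whether the updated_at filter is applied. In ADF you would implement the same branching with an If Condition activity or a dynamic source query:

    ```python
    def build_source_query(full_load: bool, last_updated_at: str,
                           table: str = "dbo.SourceTable") -> str:
        """Sketch of the bypass: an assumed full_load flag decides whether the
        incremental filter on updated_at is applied at all."""
        if full_load:
            # Full load: ignore the watermark entirely.
            return f"SELECT * FROM {table}"
        # Incremental load: only rows changed since the stored watermark.
        return f"SELECT * FROM {table} WHERE updated_at > '{last_updated_at}'"

    print(build_source_query(full_load=True, last_updated_at="2023-08-01T00:00:00Z"))
    print(build_source_query(full_load=False, last_updated_at="2023-08-01T00:00:00Z"))
    ```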

    If the trigger is configured for incremental windows (e.g., a tumbling window trigger), you might need to change its configuration to ensure it pulls the full dataset.

    If all else fails, you could recreate the pipeline or use a different pipeline specifically for full loads. Depending on your design, this could be a last-resort option if the pipeline is not easily reconfigurable.


  2. ShaikMaheer-MSFT 38,521 Reputation points Microsoft Employee
    2023-08-18T06:57:26.4166667+00:00

    Hi Singh, Gurprateek,

    Thank you for posting query in Microsoft Q&A Platform.

    Here you are using the updated_at column as the incremental load key, so CDC takes values from that column, compares them against the sink, and uses the result to decide what to load next. If you delete all data from the sink and rerun, I believe it should perform a full load. Kindly try this with some dummy data first and check. Also, you can consider having a separate pipeline with a copy activity for full-load runs.
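
    If you want to test that suggestion, a minimal sketch for emptying the sink, assuming it is an Azure SQL table (connection details and the table name are placeholders), could be:

    ```python
    import pyodbc

    # Placeholder connection string and sink table name -- adjust to your environment.
    SINK_CONN_STR = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<sink-server>.database.windows.net;"
        "DATABASE=<sink-database>;"
        "UID=<user>;PWD=<password>"
    )

    def empty_sink(table: str = "dbo.SinkTable") -> None:
        """Remove all rows from the sink so the next run starts from scratch."""
        conn = pyodbc.connect(SINK_CONN_STR)
        try:
            conn.cursor().execute(f"TRUNCATE TABLE {table}")
            conn.commit()
        finally:
            conn.close()

    empty_sink()
    ```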

    Hope this helps. Please let me know if you have any further queries.


    Please consider hitting the Accept Answer button. Accepted answers help the community as well.

