How to implement checkpoints in my existing design to avoid duplicates in the sink in ADF

Sarag 61 Reputation points
2024-03-22T12:19:32.9666667+00:00

Hi,

In ADF, one of our data flows failed due to a transient error in the middle of execution. The data flow performs insert, update, and delete operations based on an SCD2 scenario. To handle the transient failures we applied the retry option, but by the time of the second attempt some data had already been loaded into the sink table by the first attempt, which causes duplicate records in the table during the second run for the insert scenario. Is there any option to restart the job from the point of failure during the retry?

I have not implemented CDC at the source level, which seems to be why I do not see any option in the sink settings for checkpoints to avoid duplicate data.

 

 Please suggest.

Thanks in advance.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. phemanth 5,735 Reputation points Microsoft Vendor
    2024-03-25T08:48:30.8766667+00:00

    @Sarag Thanks for reaching out to Microsoft Q&A.

    Retrying the entire data flow in ADF with an SCD2 scenario can lead to duplicate inserts due to data already being processed in the first attempt. Here are a few approaches to tackle this challenge without CDC:

    1. Leverage Data Flow Retry with Integration Runtime State Persistence:

    • ADF Retry: Utilize the built-in retry functionality for the data flow activity within your pipeline. This allows you to define the number of retries and the wait interval between attempts.
    • Integration Runtime State Persistence: Enable state persistence on your Integration Runtime. This stores the data flow's execution state between retries. However, state persistence only works for certain data flow types (like mapping data flows) and might not be available for all scenarios.

    Here's a caveat: While state persistence helps with some transformations, it might not guarantee complete avoidance of duplicates for SCD2 inserts in all cases, especially if the failure happened after some inserts.
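    For reference, the retry settings live in the activity's policy in the pipeline JSON (they can also be set on the activity's General tab in ADF Studio). A minimal fragment is sketched below; the activity and data flow names are placeholders, and the retry count/interval should be tuned to the transient errors you are seeing. Note that this only re-runs the activity; it does not by itself make the inserts idempotent.

    ```json
    {
      "name": "Run SCD2 load",
      "type": "ExecuteDataFlow",
      "typeProperties": {
        "dataflow": {
          "referenceName": "df_scd2_load",
          "type": "DataFlowReference"
        }
      },
      "policy": {
        "timeout": "0.12:00:00",
        "retry": 2,
        "retryIntervalInSeconds": 300
      }
    }
    ```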

    2. Implement Custom Retry Logic with a High Water Mark:

    • High Water Mark Table: Create a separate table to store the "high water mark" of the data flow execution. This table would have a single column that keeps track of the last successfully processed record identifier.
    • Custom Logic in Data Flow: In your data flow, before performing the insert operation, query the high water mark table and filter the incoming data to exclude records processed before the last successful run. Update the high water mark table after a successful insert batch.

    This approach offers more control over retry behavior and avoids duplicate inserts for SCD2 scenarios.
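    As a rough illustration of the high water mark idea (assuming an Azure SQL / SQL Server sink; all table, column, and process names below are placeholders you would replace with your own):

    ```sql
    -- One row per process, tracking the last key committed by the SCD2 load.
    CREATE TABLE dbo.etl_high_water_mark
    (
        process_name   VARCHAR(100) NOT NULL PRIMARY KEY,
        last_loaded_id BIGINT       NOT NULL,
        updated_at     DATETIME2    NOT NULL DEFAULT SYSUTCDATETIME()
    );

    INSERT INTO dbo.etl_high_water_mark (process_name, last_loaded_id)
    VALUES ('scd2_customer_load', 0);   -- seed once

    -- Source query (or a pre-insert filter in the data flow): only pick up rows
    -- beyond the last committed watermark, so a retry skips anything the failed
    -- attempt already inserted.
    SELECT s.*
    FROM   dbo.source_customer AS s
    WHERE  s.customer_id > (SELECT last_loaded_id
                            FROM   dbo.etl_high_water_mark
                            WHERE  process_name = 'scd2_customer_load');

    -- After the batch commits successfully, advance the watermark.
    UPDATE dbo.etl_high_water_mark
    SET    last_loaded_id = (SELECT MAX(customer_id) FROM dbo.source_customer),
           updated_at     = SYSUTCDATETIME()
    WHERE  process_name = 'scd2_customer_load';
    ```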

    3. Explore Alternative Data Integration Solutions:

    • Change Data Capture (CDC): If feasible, consider implementing CDC at the source level. This allows ADF to capture only the changes in the source data, eliminating the need for full data transfers and reducing the risk of duplicates during retries.
    • Other Data Integration Tools: Some data integration tools offer more granular control over retries and checkpointing mechanisms specifically designed for SCD scenarios. Evaluate if switching to a different tool might be a viable option for your specific needs.
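
    If the source happens to be SQL Server or Azure SQL where native change data capture is available, enabling it (the CDC bullet above) is only a couple of statements. A minimal sketch, with a placeholder table name and assuming you have the required permissions:

    ```sql
    -- Enable CDC at the database level.
    EXEC sys.sp_cdc_enable_db;

    -- Enable CDC for the source table the data flow reads from.
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'source_customer',  -- placeholder table name
         @role_name     = NULL;                -- no gating role
    ```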

    Choosing the right approach depends on your specific data flow type, the nature of transient errors, and the complexity of your SCD2 logic. Consider the trade-offs between development effort, performance overhead, and the level of control you require over retry behavior.

    Hope this helps. Do let us know if you have any further queries.

    1 person found this answer helpful.

0 additional answers
