How to implement checkpoints in my existing design to avoid duplicates in the sink in ADF

Sarag 61 Reputation points
2024-03-22T12:19:32.9666667+00:00

Hi,

In ADF, one of our data flows failed due to a transient error in the middle of execution. The data flow performs insert, update, and delete operations based on an SCD2 scenario. To handle the transient failures we applied the retry option, but by the time of the second attempt some data had already been loaded into the sink table by the first attempt, which causes duplicate records in the table during the second run for the insert scenario. Is there any option to restart the job from the point of failure during the retry?

I have not implemented CDC at the source level, which seems to be why I do not see any option in the sink settings for checkpoints to avoid duplicate data.

 

 Please suggest.

Thanks in advance.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. phemanth 5,735 Reputation points Microsoft Vendor
    2024-03-25T08:48:30.8766667+00:00

    @Sarag Thanks for reaching out to Microsoft Q&A.

    Retrying the entire data flow in ADF with an SCD2 scenario can lead to duplicate inserts due to data already being processed in the first attempt. Here are a few approaches to tackle this challenge without CDC:

    1. Leverage Data Flow Retry with Integration Runtime State Persistence:

    • ADF Retry: Utilize the built-in retry functionality for the data flow activity within your pipeline. This allows you to define the number of retries and the wait interval between attempts.
    • Integration Runtime State Persistence: Enable state persistence on your Integration Runtime. This stores the data flow's execution state between retries. However, state persistence only works for certain data flow types (like mapping data flows) and might not be available for all scenarios.

    Here's a caveat: While state persistence helps with some transformations, it might not guarantee complete avoidance of duplicates for SCD2 inserts in all cases, especially if the failure happened after some inserts.
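    For reference, the retry settings live in the activity's policy in the pipeline JSON (they can also be set on the activity's General tab in ADF Studio). A minimal fragment is sketched below; the activity and data flow names are placeholders, and the retry count/interval should be tuned to the transient errors you are seeing. Note that this only re-runs the activity; it does not by itself make the inserts idempotent.

    ```json
    {
      "name": "Run SCD2 load",
      "type": "ExecuteDataFlow",
      "typeProperties": {
        "dataflow": {
          "referenceName": "df_scd2_load",
          "type": "DataFlowReference"
        }
      },
      "policy": {
        "timeout": "0.12:00:00",
        "retry": 2,
        "retryIntervalInSeconds": 300
      }
    }
    ```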

    2. Implement Custom Retry Logic with a High Water Mark:

    • High Water Mark Table: Create a separate table to store the "high water mark" of the data flow execution. This table would have a single column that keeps track of the last successfully processed record identifier.
    • Custom Logic in Data Flow: In your data flow, before performing the insert operation, query the high water mark table and filter the incoming data to exclude records processed before the last successful run. Update the high water mark table after a successful insert batch.

    This approach offers more control over retry behavior and avoids duplicate inserts for SCD2 scenarios.
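    As a rough illustration of the high water mark idea (assuming an Azure SQL / SQL Server sink; all table, column, and process names below are placeholders you would replace with your own):

    ```sql
    -- One row per process, tracking the last key committed by the SCD2 load.
    CREATE TABLE dbo.etl_high_water_mark
    (
        process_name   VARCHAR(100) NOT NULL PRIMARY KEY,
        last_loaded_id BIGINT       NOT NULL,
        updated_at     DATETIME2    NOT NULL DEFAULT SYSUTCDATETIME()
    );

    INSERT INTO dbo.etl_high_water_mark (process_name, last_loaded_id)
    VALUES ('scd2_customer_load', 0);   -- seed once

    -- Source query (or a pre-insert filter in the data flow): only pick up rows
    -- beyond the last committed watermark, so a retry skips anything the failed
    -- attempt already inserted.
    SELECT s.*
    FROM   dbo.source_customer AS s
    WHERE  s.customer_id > (SELECT last_loaded_id
                            FROM   dbo.etl_high_water_mark
                            WHERE  process_name = 'scd2_customer_load');

    -- After the batch commits successfully, advance the watermark.
    UPDATE dbo.etl_high_water_mark
    SET    last_loaded_id = (SELECT MAX(customer_id) FROM dbo.source_customer),
           updated_at     = SYSUTCDATETIME()
    WHERE  process_name = 'scd2_customer_load';
    ```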

    3. Explore Alternative Data Integration Solutions:

    • Change Data Capture (CDC): If feasible, consider implementing CDC at the source level. This allows ADF to capture only the changes in the source data, eliminating the need for full data transfers and reducing the risk of duplicates during retries.
    • Other Data Integration Tools: Some data integration tools offer more granular control over retries and checkpointing mechanisms specifically designed for SCD scenarios. Evaluate if switching to a different tool might be a viable option for your specific needs.
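
    If the source happens to be SQL Server or Azure SQL where native change data capture is available, enabling it (the CDC bullet above) is only a couple of statements. A minimal sketch, with a placeholder table name and assuming you have the required permissions:

    ```sql
    -- Enable CDC at the database level.
    EXEC sys.sp_cdc_enable_db;

    -- Enable CDC for the source table the data flow reads from.
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'source_customer',  -- placeholder table name
         @role_name     = NULL;                -- no gating role
    ```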

    Choosing the right approach depends on your specific data flow type, the nature of transient errors, and the complexity of your SCD2 logic. Consider the trade-offs between development effort, performance overhead, and the level of control you require over retry behavior.

    Hope this helps. Do let us know if you have any further queries.

    1 person found this answer helpful.

0 additional answers
