Azure Data Factory Copy activity retry - how to avoid duplicate data being pulled when the activity fails and retries

JB
2025-05-09T18:08:01.98+00:00

We have an ADF pipeline with a Copy activity that pulls data for a table from a PostgreSQL source server to ADLS and saves the data as Parquet files. We have the retry count on the Copy activity set to 2 and the timeout set to 4 hours. When the extract times out, or the Copy activity fails for some other reason after some files have already been written to the target, the retry writes the same data to the ADLS location again, resulting in duplicate data. The next step loads ALL the Parquet files from the target into our Snowflake database, which leaves that table with duplicate data.

Is there a way to set the Copy activity to not pull all data again to the ADLS target?

Or do we need to program our own retry pipeline so that we can first remove what was partially written to the ADLS target before re-extracting everything?

Or is there a way to set our load-to-Snowflake step to pull only the full set of files from the most recent successful Copy activity run?

In ADLS the Parquet files are written as data_guid_guid_filenumber (see attachment). One of the GUIDs is the runID; the second GUID must be auto-generated by ADF. So I could potentially get the runID of the successful run and filter our load-to-Snowflake (from ADLS) step to files with that runID, BUT if there is an easier way, or something I am missing in the Copy activity setup to avoid duplicate files on auto-retry, I would be thankful to learn about it!

Thank you!

[Attachment: screenshot of Parquet file names in the ADLS target folder]

Azure Data Factory

Accepted answer
  1. Chandra Boorla, Microsoft External Staff Moderator
    2025-05-09T18:32:35.1233333+00:00

    @JB

    It looks like you are facing a common issue with Azure Data Factory (ADF): retries on a Copy activity can leave duplicate data behind when the activity fails or times out after some files have already been written.

    Here are a few approaches to help you manage this situation:

    Clean Up Partial Files Before Retrying - Add a cleanup step before the Copy activity (for example, a Delete activity, or an Azure Function that calls the Data Lake API) to delete any partially written files for the current runID in ADLS before starting the Copy again. This ensures that each retry starts from a clean state. Note that the Copy activity's built-in retry cannot run such a step for you, so this fits best with the custom retry pattern described further below.
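    If you script that cleanup yourself, a minimal sketch using the ADLS Gen2 Python SDK could look like the following. The account URL, filesystem, and folder names are placeholders (assumptions, not your actual setup), and it assumes the ID of the failed run is passed in, for example as a pipeline parameter:

```python
# Sketch only: delete Parquet files left behind by a failed Copy activity run
# so the next attempt starts from a clean state. All names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"  # placeholder
FILESYSTEM = "raw"                     # ADLS filesystem/container (assumption)
TARGET_FOLDER = "postgres/my_table"    # Copy activity sink folder (assumption)


def cleanup_partial_files(run_id: str) -> int:
    """Delete files in the target folder whose names contain the failed run's ID."""
    service = DataLakeServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
    fs = service.get_file_system_client(FILESYSTEM)
    deleted = 0
    for path in fs.get_paths(path=TARGET_FOLDER):
        # File names look like data_<runId>_<guid>_<n>, so matching on run_id
        # removes only the files written by that run.
        if not path.is_directory and run_id in path.name:
            fs.delete_file(path.name)
            deleted += 1
    return deleted
```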

    Enable Resume Functionality in the Copy Activity - If your copy scenario supports resume from the last failed run, enable it so the activity continues from the point of failure instead of re-copying all the data, which avoids duplicates without extra configuration. Be aware that resume is only available for certain file-to-file copy scenarios, so check whether it applies to a PostgreSQL-to-ADLS copy.

    Filter Files by RunID in the Snowflake Load Step - Since your Parquet filenames include the runID, you can use a Get Metadata activity in ADF to list the files in the target folder, extract the runID of the latest successful run, and load only the files with that runID into Snowflake. That way, even if duplicate files exist in ADLS, only the latest complete set is loaded. A rough sketch of the Snowflake side is shown below.
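    One way to restrict the load to a single run on the Snowflake side is a COPY INTO with a PATTERN filter on the file names. This is only a rough sketch; the stage, table, and connection details are assumptions rather than your actual objects:

```python
# Sketch only: copy just the Parquet files whose names contain the runId of the
# last successful Copy activity run. Stage, table, and credentials are placeholders.
import snowflake.connector


def load_run(run_id: str) -> None:
    conn = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        warehouse="<warehouse>", database="<database>", schema="<schema>",
    )
    try:
        # PATTERN is a regex applied to the staged file names, so only files
        # containing this run's ID are loaded.
        conn.cursor().execute(
            f"""
            COPY INTO my_table
            FROM @adls_stage/postgres/my_table/
            PATTERN = '.*{run_id}.*'
            FILE_FORMAT = (TYPE = PARQUET)
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
            """
        )
    finally:
        conn.close()
```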

    Use Custom Retry Logic - Instead of relying on the Copy activity's built-in retry, design your own retry mechanism in the pipeline (a sketch of the control flow follows this list):

    • Add a cleanup step that deletes the partial files if the Copy fails.
    • Re-run the Copy activity against a clean target.
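    The control flow of such a custom retry looks roughly like this (illustrative Python only; in ADF you would build the same shape with an Until loop, or a failure path, around a Delete activity and the Copy activity):

```python
# Illustrative pattern only: wipe the partially written output before each retry
# so every attempt starts from a clean state.
import time

MAX_ATTEMPTS = 3


def copy_with_cleanup(extract, cleanup, run_id: str) -> None:
    """extract() stands in for the Copy activity; cleanup(run_id) removes partial output."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            extract()                  # the Copy activity equivalent
            return                     # success: keep the files that were written
        except Exception:
            cleanup(run_id)            # remove whatever was partially written
            if attempt == MAX_ATTEMPTS:
                raise                  # give up after the final attempt
            time.sleep(60 * attempt)   # simple back-off before the next attempt
```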

    Review Fault Tolerance Settings - Check the fault tolerance settings on your Copy activity. You can configure it to skip incompatible rows (if that is acceptable for your data) and log them for later review.

    Reference: Fault tolerance of copy activity in Azure Data Factory and Synapse Analytics pipelines

    Conclusion - By combining the Copy activity's resume functionality (where your scenario supports it) with filtering on the runID of the successful run, you can ensure that only the necessary, non-duplicate data is loaded into your Snowflake database. Additionally, implementing custom retry logic that cleans up partial files before re-extracting further protects data integrity in the pipeline.

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

