It looks like you are facing a common issue with Azure Data Factory (ADF): when a Copy activity fails or times out partway through, the built-in retry re-copies the data and can leave duplicate files behind.
Here are a few approaches to help you manage this situation:
Clean Up Partial Files on Retry - Add a cleanup step before the Copy activity (for example, a Delete activity, an Azure Function, or a call to the Data Lake API) that removes any partially written files for the current runID in ADLS before the Copy starts again. This ensures that each retry starts from a clean state; a sketch of the cleanup logic is shown below.
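Here is a minimal Python sketch of that cleanup step, assuming the azure-storage-file-datalake and azure-identity packages and that the Parquet file names embed the pipeline run ID. The storage account URL, container, and folder names are placeholders for your own values.

```python
# Delete partially written files for a given run before retrying the Copy activity.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"  # placeholder
CONTAINER = "raw"          # placeholder file system / container
TARGET_FOLDER = "sales"    # placeholder folder the Copy activity writes to


def delete_partial_files(run_id: str) -> int:
    """Remove any files left over from a previous, failed attempt of this run."""
    service = DataLakeServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
    fs = service.get_file_system_client(CONTAINER)

    deleted = 0
    for path in fs.get_paths(path=TARGET_FOLDER):
        # Only delete files whose name carries the current run ID.
        if not path.is_directory and run_id in path.name:
            fs.delete_file(path.name)
            deleted += 1
    return deleted


if __name__ == "__main__":
    # The run ID would normally be passed in from the pipeline, e.g. @pipeline().RunId.
    print(delete_partial_files("example-run-id"))
```

You could call this from an Azure Function invoked by the pipeline, or replace it entirely with an ADF Delete activity if a simple wildcard match on the folder is sufficient.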
Use the Copy Activity's Resume Capability - For binary copies between file-based stores, the Copy activity can resume from the last failure point when you rerun the pipeline from the failed activity, so data that was already copied is not copied again. If your source and sink combination supports this, it avoids duplicates without custom logic.
Filter Files by RunID in Snowflake Load Step - Since your Parquet filenames include the runID, you can use a Get Metadata activity in ADF to list files in the target folder, extract the latest runID, and load only files with that runID into Snowflake. This way, even if duplicate files exist in the folder, only the latest complete set is loaded; a sketch of the filtered load is shown below.
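As a rough sketch of the filtered load, assuming the snowflake-connector-python package, an external stage already pointing at the ADLS folder, and file names that embed the run ID. The stage, table, and connection parameters are placeholders for your own objects.

```python
# Load only the Parquet files that belong to the latest successful run.
import snowflake.connector

LATEST_RUN_ID = "example-run-id"  # e.g. passed from the ADF pipeline (@pipeline().RunId)

copy_sql = f"""
COPY INTO analytics.sales            -- placeholder target table
FROM @adls_stage/sales/              -- placeholder external stage over the ADLS folder
PATTERN = '.*{LATEST_RUN_ID}.*\\.parquet'
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
"""

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="analytics", schema="public",
)
try:
    cur = conn.cursor()
    cur.execute(copy_sql)
    print(cur.fetchall())  # per-file load status returned by COPY INTO
finally:
    conn.close()
```

The PATTERN regex restricts the load to files whose path contains the latest runID, so stale files from failed attempts are simply ignored rather than deduplicated after the fact.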
Use Custom Retry Logic - Instead of relying on the Copy activity's built-in retry, design a custom retry mechanism in your pipeline (see the sketch after this list):
- Add a "Cleanup" step to delete partial files if the Copy fails.
- Retry the Copy activity in a clean state.
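The following Python sketch shows the shape of that pattern. The cleanup and copy callables are hypothetical placeholders for your own steps (for example, the ADLS deletion above and a call that triggers the extraction via the ADF REST API or SDK).

```python
# Custom retry: each attempt first cleans up partial output, then re-runs the copy.
import time


def copy_with_cleanup_retries(cleanup, run_copy, run_id: str,
                              max_attempts: int = 3, backoff_seconds: int = 60) -> None:
    for attempt in range(1, max_attempts + 1):
        cleanup(run_id)          # start each attempt from a clean state
        try:
            run_copy(run_id)     # re-extract the full data set
            return
        except Exception as err:
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({err}); retrying in {backoff_seconds}s")
            time.sleep(backoff_seconds)
```

In ADF itself the same pattern can be built with an Until activity wrapping the cleanup and Copy steps, with a variable tracking success.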
Review Fault Tolerance Settings - Check the fault tolerance settings in your Copy activity. Configure it to skip incompatible rows or error files (if that is acceptable for your data) and log them for further review.
Reference: Fault tolerance of copy activity in Azure Data Factory and Synapse Analytics pipelines
Conclusion - By combining the Copy activity's resume capability with runID-based filtering at load time, you can ensure that only the files from the successful run are loaded into your Snowflake database. Additionally, custom retry logic that cleans up partial files before re-extraction further protects the data integrity of your pipeline.
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Thank you.