Hi Violet Zeng,
Thanks for reaching out to Microsoft Q&A.
Consider the following strategies, if you have not already tried them all:
- Although you mentioned that the DIU and parallel copy options may not directly help, increasing the DIUs can still improve performance to some extent. DIUs determine the amount of compute power allocated to your copy activity, so a higher DIU setting can speed up processing (see the copy activity JSON sketch after this list).
- Implement a staging strategy where you first copy data to intermediate storage (such as Azure Blob Storage) before merging it into the final destination. This can help optimize performance, especially for large datasets; the sketch after this list shows the `enableStaging` settings.
- Instead of copying all 20 TB in a single pipeline run, break the process into smaller batches (for example, by S3 prefix or date partition). This makes the process more manageable and also helps with parallelism; see the ForEach sketch after this list.
- Ensure that your Azure Integration Runtime is located in the same region as your source data store (Amazon S3) to minimize latency and maximize throughput. This can be crucial for large data transfers.
- Ensure that the data transfer between S3 and ADLS Gen2 is optimized, for example by using compressed formats (Parquet with Snappy compression is a common default) if you are not doing so already.
- Adjust the `parallelCopies` property in your copy activity settings. This lets you specify the number of concurrent copies, which can help optimize throughput. Be cautious, though: too many parallel copies can degrade performance if the source or sink cannot handle the load. This setting also appears in the JSON sketch after this list.
- Databricks/Synapse: For large-scale data processing, consider Azure Databricks or Azure Synapse. These platforms are designed for big data workloads and can efficiently handle large data volumes. You can use Spark to read, merge, and write Parquet files in parallel; a minimal PySpark sketch follows this list.
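As a rough illustration, here is a minimal copy activity JSON sketch showing where `dataIntegrationUnits`, `parallelCopies`, and the staging settings live. The dataset and linked service names (`S3ParquetDataset`, `AdlsGen2ParquetDataset`, `AzureBlobStagingLS`) and the numeric values are placeholders to tune against your workload, not recommendations:

```json
{
  "name": "CopyS3ToAdls",
  "type": "Copy",
  "inputs": [ { "referenceName": "S3ParquetDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "AdlsGen2ParquetDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "ParquetSource",
      "storeSettings": { "type": "AmazonS3ReadSettings", "recursive": true }
    },
    "sink": {
      "type": "ParquetSink",
      "storeSettings": { "type": "AzureBlobFSWriteSettings" }
    },
    "dataIntegrationUnits": 128,
    "parallelCopies": 16,
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": { "referenceName": "AzureBlobStagingLS", "type": "LinkedServiceReference" },
      "path": "staging-container/adf-staging"
    }
  }
}
```

Whether staging actually pays off depends on your scenario, so it is worth validating these settings against a representative subset of the 20 TB before the full run.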
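For the batching suggestion, one common pattern is a ForEach activity that fans a parameterized copy out over a list of S3 prefixes. This is a sketch under assumptions: the `s3Prefixes` pipeline parameter and the wildcard layout are hypothetical and need to match your actual folder structure.

```json
{
  "name": "ForEachS3Prefix",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 10,
    "items": { "value": "@pipeline().parameters.s3Prefixes", "type": "Expression" },
    "activities": [
      {
        "name": "CopyOnePrefix",
        "type": "Copy",
        "inputs": [ { "referenceName": "S3ParquetDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "AdlsGen2ParquetDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": {
            "type": "ParquetSource",
            "storeSettings": {
              "type": "AmazonS3ReadSettings",
              "wildcardFolderPath": { "value": "@item()", "type": "Expression" },
              "wildcardFileName": "*.parquet"
            }
          },
          "sink": { "type": "ParquetSink" }
        }
      }
    ]
  }
}
```

Here `batchCount` caps how many copies run concurrently, which also gives you a lever to throttle the load on the source.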
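If you go the Databricks/Synapse route, a minimal PySpark sketch of the read/merge/write step could look like the following. The S3 and ABFS paths and the partition count are placeholders, and it assumes credentials for both stores are already configured on the cluster:

```python
from pyspark.sql import SparkSession

# Assumes access to S3 and ADLS Gen2 is already configured
# (e.g., instance profile / service principal); paths are placeholders.
spark = SparkSession.builder.appName("merge-s3-parquet-to-adls").getOrCreate()

# Read all Parquet files under the S3 prefix in parallel.
df = spark.read.parquet("s3a://source-bucket/path/to/parquet/")

# Optional: repartition to control output file count and write parallelism.
df = df.repartition(400)

# Write the merged dataset to ADLS Gen2.
df.write.mode("overwrite").parquet(
    "abfss://container@youraccount.dfs.core.windows.net/merged/"
)
```

Repartitioning before the write controls the output file count; for 20 TB you generally want a moderate number of large files rather than many small ones.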
For more tuning guidance, see the copy activity performance troubleshooting guide: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-troubleshooting
Please 'Upvote' (Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.