How to improve Copy Data runtime on Azure Data Factory

Violet Zeng 30 Reputation points
2024-07-26T21:03:14.7266667+00:00

Hi,

I'm copying data from Amazon S3 to Azure Data Lake Storage Gen2. The source contains multiple Parquet files with the same schema, and the target is a single Parquet file that should contain the merged information from my source, so I set the copy behavior to "Merge files".

I tested my pipeline with a Copy Data activity and it took almost 40 minutes to copy 9 files totaling about 10 GB. However, for my actual project, I need to copy more than 20 TB of data. Is there any way to improve the processing speed? Or should I use another solution given the large data size?

I have looked through the link below on increasing DIUs and parallel copy, but since I'm merging multiple source files into a single sink file, do these options even apply to my scenario?

Thank you!


1 answer

  1. Vinodh247 13,801 Reputation points
    2024-07-27T10:48:53.28+00:00

    Hi Violet Zeng,

    Thanks for reaching out to Microsoft Q&A.

    Consider the following strategies if you have not already tried them all.

    • Although you mentioned that DIU and parallel copy options may not directly help, increasing the DIUs can still improve performance to some extent. DIUs determine the amount of compute power allocated to your copy activity, so higher DIUs can speed up processing.
    • Implement a staging strategy where you first copy the data to an intermediate store (such as Azure Blob Storage) before merging it into the final destination. This can help optimize performance, especially for large datasets.
    • Instead of copying all 20 TB in a single pipeline run, break the process into smaller batches (for example, by S3 prefix or date). This makes the process more manageable and also helps with parallelism; see the first sketch after this list.
    • Ensure that your Azure integration runtime is located in the same region as your source data store (S3) to minimize latency and maximize throughput. This can be crucial for large data transfers.
    • Ensure that data transfer between S3 and ADLS Gen2 is optimized, possibly using compressed formats if not already.
    • Adjust the parallelCopies property in your copy activity settings. This lets you specify the number of concurrent copies, which can help optimize throughput. Be cautious, though: too many parallel copies might degrade performance if the source or sink cannot handle the load.
    • Databricks/Synapse: For large-scale data processing, consider using Azure Databricks or Azure Synapse. These platforms are designed for big data and can efficiently handle large data volumes. You can use Spark to read, merge, and write Parquet files in parallel; see the second sketch after this list.
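
    For the batching point above, here is a minimal Python sketch of one way to split the S3 source into batches; the bucket and prefix names are placeholders, and it assumes boto3 is installed with credentials configured. Each batch could then drive a separate, parameterized Copy activity run (for example, via a ForEach).

    ```python
    import boto3

    # List every object under the source prefix (paginated, since S3 returns
    # at most 1,000 keys per response).
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    keys = []
    for page in paginator.paginate(Bucket="my-source-bucket", Prefix="landing/"):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    # Group the keys into fixed-size batches; tune batch_size so each copy run
    # stays at a manageable data volume.
    batch_size = 100
    batches = [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]
    print(f"{len(keys)} objects split into {len(batches)} batches")
    ```

    For the Databricks/Synapse point, here is a minimal PySpark sketch of the read-merge-write pattern; the S3 and ADLS Gen2 paths are placeholders, and it assumes the cluster already has access configured for both stores.

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

    # Read all source Parquet files (same schema) under the prefix in one pass.
    df = spark.read.parquet("s3a://my-source-bucket/landing/")

    # coalesce(1) forces a single output file, matching the "merge" requirement.
    # For 20+ TB, consider dropping it and letting Spark write many files in
    # parallel, which scales much better.
    (df.coalesce(1)
       .write
       .mode("overwrite")
       .parquet("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/merged/"))
    ```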

    https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-troubleshooting

    Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.
