Optimizing Copy Data Activity Time in Azure Data Factory

Axel Culqui 5 Reputation points
2025-05-30T16:57:59.8366667+00:00

Hi, I want to know if there is a way to reduce the time of the Copy data activity in Azure Data Factory. In the image below, you can see the size of the source and the sink, as well as the time it is taking. This process decompresses a ZIP file and writes the extracted files to Data Lake Storage. [image]


1 answer

  1. Chandra Boorla 14,060 Reputation points Microsoft External Staff Moderator
    2025-05-30T17:48:36.48+00:00

    @Axel Culqui

    Thanks for sharing the details and the screenshot. Based on the information provided, it looks like your pipeline is copying a ~3.7 GB ZIP file from an SFTP source and decompressing it into over 120,000 files totaling ~28 GB in Azure Data Lake Storage Gen2. The entire process is taking approximately 3.5 hours, which is understandably longer than expected.

    Here are a few recommendations to help reduce the copy duration:

    Increase Parallel Copies - Currently, only 1 parallel copy is used. Increasing the "Degree of copy parallelism" setting on the Copy activity's Settings tab to 4 or more can significantly speed up the file writes to ADLS.

    Use a More Powerful Integration Runtime (IR) - You're using 4 DIUs, which may be limiting throughput. Consider increasing the DIUs or using a more powerful Azure IR or a dedicated Self-hosted IR, especially if the workload is compute-intensive.
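
    For reference, both of these knobs are also exposed as documented properties in the copy activity's JSON definition ("parallelCopies" and "dataIntegrationUnits" under typeProperties). The fragment below mirrors that shape as a Python dict purely for illustration; the values are examples, not a sizing recommendation for your workload.

    ```python
    # Illustrative only: the two copy activity tuning properties, shown as a
    # Python dict in the shape of the activity's typeProperties JSON.
    copy_activity_type_properties = {
        "parallelCopies": 8,         # "Degree of copy parallelism" in the UI
        "dataIntegrationUnits": 16,  # DIUs allocated to the Azure IR
        # "source": {...} and "sink": {...} omitted for brevity
    }
    ```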

    Pre-Decompress the ZIP File (If Possible) - Decompression inside the Copy Activity can be time-consuming. If feasible, consider unzipping the file before the copy process using an Azure Function, Logic App, or Databricks notebook.
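
    If you take the pre-decompression route, the sketch below shows the general idea in Python: stream the members of an already-downloaded ZIP into ADLS Gen2 with the azure-storage-file-datalake SDK. The account URL, container, and paths are placeholders, and for 120,000+ files you would want to parallelize the uploads (for example with a thread pool) rather than loop sequentially as shown.

    ```python
    # Minimal sketch, not the exact pipeline: unzip an already-downloaded
    # archive and upload each member to ADLS Gen2. Account URL, container,
    # and paths are placeholders.
    import zipfile

    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"  # placeholder
    FILE_SYSTEM = "<container>"                                     # placeholder
    TARGET_DIR = "raw/unzipped"                                     # placeholder
    LOCAL_ZIP = "/tmp/source.zip"                                   # placeholder

    service = DataLakeServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
    fs = service.get_file_system_client(FILE_SYSTEM)

    with zipfile.ZipFile(LOCAL_ZIP) as archive:
        for entry in archive.namelist():
            if entry.endswith("/"):  # skip directory entries
                continue
            file_client = fs.get_file_client(f"{TARGET_DIR}/{entry}")
            # read the member into memory (fine for small files) and upload it
            file_client.upload_data(archive.read(entry), overwrite=True)
    ```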

    Split Large ZIP into Smaller Ones - If you're able to control the ZIP file generation, splitting it into smaller ZIPs with ~5,000–10,000 files each can help parallelize the process and reduce latency.
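
    If you control how the archive is produced, a repackaging step like the one below (plain Python zipfile; file names and batch size are assumptions) turns one large ZIP into several smaller ones, which separate copy runs, for example inside a ForEach activity, can then process in parallel.

    ```python
    # Minimal sketch: repackage one large ZIP into smaller ZIPs of roughly
    # 5,000 entries each. File names and batch size are assumptions.
    import zipfile

    SOURCE_ZIP = "big_archive.zip"  # placeholder
    BATCH_SIZE = 5000               # ~5,000-10,000 files per chunk

    with zipfile.ZipFile(SOURCE_ZIP) as src:
        entries = [e for e in src.namelist() if not e.endswith("/")]
        for chunk_no, start in enumerate(range(0, len(entries), BATCH_SIZE)):
            out_name = f"big_archive_part{chunk_no:03d}.zip"
            with zipfile.ZipFile(out_name, "w", zipfile.ZIP_DEFLATED) as dst:
                for entry in entries[start:start + BATCH_SIZE]:
                    # copy the member's bytes into the smaller archive
                    dst.writestr(entry, src.read(entry))
    ```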

    Minimize Small File Writes - Writing 120,000+ small files can be slow due to metadata and file system overhead. Consider batching or consolidating smaller files if your downstream processing supports it.
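
    Whether consolidation is safe depends on the file format. For newline-delimited text such as headerless CSV or JSON Lines, a simple batching pass along these lines (paths and target size are assumptions) can turn 120,000+ tiny files into a few hundred larger ones before they land in the lake.

    ```python
    # Minimal sketch: merge many small newline-delimited files into ~256 MB
    # chunks. Paths and sizes are assumptions; do not use this for formats
    # that cannot simply be concatenated.
    from pathlib import Path

    SOURCE_DIR = Path("extracted")     # placeholder: folder of small files
    OUTPUT_DIR = Path("consolidated")  # placeholder: folder for merged files
    TARGET_SIZE = 256 * 1024 * 1024    # aim for ~256 MB per output file

    OUTPUT_DIR.mkdir(exist_ok=True)
    part, written = 0, 0
    out = (OUTPUT_DIR / f"part-{part:05d}.csv").open("wb")

    for small in sorted(SOURCE_DIR.rglob("*.csv")):
        data = small.read_bytes()
        if data and not data.endswith(b"\n"):
            data += b"\n"              # keep records on separate lines
        out.write(data)
        written += len(data)
        if written >= TARGET_SIZE:     # roll over to the next output file
            out.close()
            part, written = part + 1, 0
            out = (OUTPUT_DIR / f"part-{part:05d}.csv").open("wb")

    out.close()
    ```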

    Monitor Throughput - The current throughput is ~298 KB/s, which is relatively low. Verifying the network speed and source SFTP performance might reveal bottlenecks outside ADF as well.
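
    As a quick sanity check (assuming the ~298 KB/s figure is measured against the 3.7 GB compressed source), transferring the ZIP alone already accounts for roughly the entire 3.5-hour run, which points at the SFTP read path rather than the ADLS write path as the main bottleneck:

    ```python
    # Back-of-envelope check: how long does 3.7 GB take at ~298 KB/s?
    source_gb = 3.7
    throughput_kb_per_s = 298

    seconds = source_gb * 1024 * 1024 / throughput_kb_per_s
    print(f"~{seconds / 3600:.1f} hours")  # ~3.6 hours, close to the observed 3.5 h
    ```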

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

    1 person found this answer helpful.
