ADF copy activity waiting for source to return data

Jacky 41 Reputation points
2023-03-27T06:13:46.7233333+00:00

Hello,

When my copy activity pipeline runs, the source query takes a very long time to return data. Can anyone advise on how to improve the performance of the run?

Thank you



Accepted answer
  1. Bhargava-MSFT 30,576 Reputation points Microsoft Employee
    2023-03-27T21:36:10.15+00:00

    Hello Jacky,

    Welcome to the MS Q&A platform.

    Several factors can impact the performance of your copy activity pipeline, such as the size of your data, network bandwidth, and the resources of your self-hosted integration runtime.

    Since you are using a cloud-based data source, it is recommended to use an Azure IR in the same region as your source data store, or as close to it as possible.

    Here are a few other things you can consider to improve the copy activity performance.

    • Check the performance of your SHIR: If you use a self-hosted integration runtime (SHIR), make sure the machine running it has enough CPU and memory to handle the workload, and install it on a machine close to the source and sink data stores to minimize network latency.
    • Optimize your source database: You can improve the performance of your query by optimizing your source database. This may include creating indexes on the tables you are querying, tuning the query to avoid unnecessary joins or subqueries, and using the appropriate data types.
    • Partition large copies: If the data size you want to copy is large, you can adjust your business logic to partition the data further using the slicing mechanism in Data Factory. Then schedule the Copy activity to run more frequently so that each run copies less data.
    • Check network bandwidth: Ensure that your bandwidth is sufficient to handle the data you are copying.
    • Parallel copy: You can set parallel copy (parallelCopies property in the JSON definition of the Copy activity, or Degree of parallelism setting in the Settings tab of the Copy activity properties in the user interface) on copy activity to indicate the parallelism that you want the copy activity to use. You can think of this property as the maximum number of threads within the copy activity that read from your source or write to your sink data stores in parallel.
    • Use staging: Enable staged copy so that data is stored temporarily in an interim store such as Azure Data Lake Storage Gen2 (ADLS Gen2) before it is loaded into the final destination.
    • Use binary format: If you copy large amounts of data, consider using a binary format such as ORC or Parquet. These formats can compress data and reduce the amount of data transferred during the pipeline run.
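As a rough illustration of the slicing, parallel copy, and staging suggestions above, here is a minimal Python sketch that builds the JSON for one Copy activity per daily slice. It assumes an Azure SQL source and a Parquet sink, which may not match your setup. The table `dbo.Orders`, the column `OrderDate`, the linked service `StagingStorageLinkedService`, and the staging path are all hypothetical placeholders. The result is only a fragment of an activity definition (inputs, outputs, and policy are omitted), so treat it as a starting point and not a complete pipeline.

```python
import json
from datetime import date, timedelta


def daily_windows(start: date, end: date):
    """Yield (window_start, window_end) pairs covering [start, end) one day at a time."""
    current = start
    while current < end:
        nxt = min(current + timedelta(days=1), end)
        yield current, nxt
        current = nxt


def build_copy_activity(window_start: date, window_end: date,
                        parallel_copies: int = 4, use_staging: bool = True) -> dict:
    """Build a partial Copy activity definition as a plain dict."""
    # Hypothetical table and column: replace with your own filtered query.
    query = (
        "SELECT * FROM dbo.Orders "
        f"WHERE OrderDate >= '{window_start.isoformat()}' "
        f"AND OrderDate < '{window_end.isoformat()}'"
    )
    type_properties = {
        # Assumed source and sink types; use the ones matching your data stores.
        "source": {"type": "AzureSqlSource", "sqlReaderQuery": query},
        "sink": {"type": "ParquetSink"},
        # Upper bound on parallel reads/writes inside this copy activity.
        "parallelCopies": parallel_copies,
    }
    if use_staging:
        type_properties["enableStaging"] = True
        type_properties["stagingSettings"] = {
            "linkedServiceName": {
                "referenceName": "StagingStorageLinkedService",  # hypothetical name
                "type": "LinkedServiceReference",
            },
            "path": "staging-container/copy-staging",  # hypothetical path
        }
    return {
        "name": f"CopyOrders_{window_start:%Y%m%d}",
        "type": "Copy",
        "typeProperties": type_properties,
    }


if __name__ == "__main__":
    # One activity definition per daily slice.
    for ws, we in daily_windows(date(2023, 3, 1), date(2023, 3, 3)):
        print(json.dumps(build_copy_activity(ws, we), indent=2))
```

Filtering each run to a narrow window keeps the source query small, which is usually the first thing to try when the source is slow to return data. The value 4 for `parallelCopies` is arbitrary; tune it against your own baseline.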

    Additionally, you can establish a baseline, test against representative data samples, and monitor copy activity performance to tune the performance further.
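To make the baseline idea concrete, here is a small Python sketch that summarizes one copy run from its output JSON. The field names (`dataRead`, `copyDuration`, `executionDetails`, `detailedDurations`) follow my understanding of the copy activity output, so please verify them against the output of one of your own runs. The sample values are invented purely for illustration.

```python
def summarize_copy_run(output: dict) -> dict:
    """Return throughput and the slowest stage for one copy activity run."""
    seconds = output["copyDuration"]
    mb_read = output["dataRead"] / (1024 * 1024)
    # Assumed shape: per-stage durations in seconds for the first execution detail.
    stages = output["executionDetails"][0]["detailedDurations"]
    slowest = max(stages, key=stages.get)
    return {
        "throughput_mb_per_s": round(mb_read / seconds, 2) if seconds else None,
        "slowest_stage": slowest,
        "slowest_stage_seconds": stages[slowest],
    }


# Invented sample output, not taken from a real run.
sample_output = {
    "dataRead": 524288000,  # 500 MiB in bytes
    "copyDuration": 250,    # seconds
    "executionDetails": [
        {
            "detailedDurations": {
                "queuingDuration": 5,
                "timeToFirstByte": 200,
                "transferDuration": 45,
            }
        }
    ],
}

if __name__ == "__main__":
    print(summarize_copy_run(sample_output))
```

If most of the time is spent before the first byte arrives, as in this invented sample, the source query is the likely bottleneck, and optimizing or partitioning it should help more than raising parallelism.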

    This document has performance tuning tips and guidance on troubleshooting copy activity performance issues.

    Other reference documents:

    Copy activity Performance Tuning Steps:

    https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features

    I hope this helps. Please let me know if you have any further questions.

    If this answers your question, please consider accepting it by selecting Accept answer and up-voting it, as this helps others in the community who are looking for answers to similar questions.

