Hello Jacky,
Welcome to the MS Q&A platform.
Several factors can impact the performance of your copy activity pipeline, such as the size of your data, network bandwidth, and the resources of your self-hosted integration runtime.
Since you are using a cloud-based data source, we recommend using an Azure IR in the same region as your source data store, or as close to it as possible.
Here are a few other things you can consider to improve the copy activity performance.
- Check the performance of your SHIR: Make sure the machine running the SHIR has enough resources, such as CPU and memory, to handle the workload. Also make sure the SHIR is installed on a machine close to the source and sink data stores to minimize network latency.
- Optimize your source database: You can improve the performance of your query by optimizing your source database. This may include creating indexes on the tables you are querying, tuning the query to avoid unnecessary joins or subqueries, and using the appropriate data types.
- Partition the data: If the data you want to copy is large, you can adjust your business logic to partition it further using the slicing mechanism in Data Factory. Then schedule the Copy activity to run more frequently so that each run copies less data.
- Check network bandwidth: Ensure that your bandwidth is sufficient to handle the data you are copying.
- Parallel copy: You can set parallel copy (the parallelCopies property in the JSON definition of the Copy activity, or the Degree of parallelism setting in the Settings tab of the Copy activity properties in the user interface) on the copy activity to indicate the parallelism that you want it to use. You can think of this property as the maximum number of threads within the copy activity that read from your source or write to your sink data stores in parallel.
- Use staging in the destination Azure Data Lake Storage Gen2 (ADLS Gen2) to store data temporarily before loading it into the final destination.
- Use a binary format: If you copy large amounts of data, consider using a binary format such as ORC or Parquet. These formats can compress data and reduce the amount of data transferred during the pipeline run.
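To make the parallel copy and staging settings more concrete, here is a minimal Python sketch that builds the relevant fragment of a Copy activity JSON definition. The property names parallelCopies, enableStaging, and stagingSettings are Copy activity properties. The function name build_copy_activity, the activity name, the linked service name StagingAdlsGen2, the path staging/copy, and the source and sink types are placeholders I have assumed for illustration, so replace them with your own. The inputs and outputs sections are left out, so this is a fragment and not a complete activity.

```python
import json


def build_copy_activity(parallel_copies, staging_linked_service=None, staging_path=None):
    """Return a partial Copy activity definition as a plain dict.

    Hypothetical helper for illustration only; it is not part of any SDK.
    """
    type_properties = {
        # Source and sink types are assumed placeholders; use the ones
        # that match your own data stores.
        "source": {"type": "AzureSqlSource"},
        "sink": {"type": "ParquetSink"},
        # Upper bound on parallel reads/writes within this copy activity.
        "parallelCopies": parallel_copies,
    }

    # Add staging only when a staging linked service is supplied.
    if staging_linked_service is not None:
        type_properties["enableStaging"] = True
        type_properties["stagingSettings"] = {
            "linkedServiceName": {
                "referenceName": staging_linked_service,  # assumed name
                "type": "LinkedServiceReference",
            },
            "path": staging_path,  # assumed container/folder path
        }

    return {
        "name": "CopyWithParallelism",  # assumed activity name
        "type": "Copy",
        "typeProperties": type_properties,
    }


activity = build_copy_activity(4, "StagingAdlsGen2", "staging/copy")
print(json.dumps(activity, indent=2))
```

The value 4 is only an example. A reasonable approach is to start low, raise it step by step, and compare the throughput of each test run before settling on a number.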
Additionally, you can establish a baseline, test against representative data samples, and monitor copy activity performance to tune the performance further.
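As a rough illustration of what comparing against a baseline could look like, the sketch below computes throughput from the bytes copied and the duration of a run, then compares a later run with the baseline. The function names, the 10% tolerance, and the sample numbers are all assumptions made for this example. In practice you would take the bytes copied and the duration from the monitoring details of your own copy activity runs.

```python
def throughput_mib_per_s(bytes_copied, duration_seconds):
    """Average throughput of one copy run in MiB per second."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return bytes_copied / (1024 * 1024) / duration_seconds


def compare_to_baseline(baseline, current, tolerance=0.10):
    """Return (relative_change, regressed) for current vs. baseline.

    regressed is True when throughput dropped by more than `tolerance`
    (an assumed threshold of 10% by default).
    """
    change = (current - baseline) / baseline
    return change, change < -tolerance


# Made-up sample numbers: the same 10 GiB copied in two different runs.
ten_gib = 10 * 1024 ** 3
baseline = throughput_mib_per_s(ten_gib, 400)  # baseline run took 400 s
current = throughput_mib_per_s(ten_gib, 500)   # later run took 500 s

change, regressed = compare_to_baseline(baseline, current)
print(f"baseline={baseline:.2f} MiB/s current={current:.2f} MiB/s")
print(f"change={change:+.1%} regressed={regressed}")
```

Running the same comparison after each tuning change (for example, a different parallel copy value) shows whether the change actually helped on representative data.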
This document has performance tuning tips and guidance on troubleshooting copy activity performance issues.
Other reference documents:
Copy activity Performance Tuning Steps:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features
I hope this helps. Please let me know if you have any further questions.
If this answers your question, please consider accepting it by selecting Accept answer and up-voting, as this helps the community find answers to similar questions.