Azure Data Factory - sink and settings configuration in Copy data activity to copy 300 million rows from one SQL table to another

Prakhar Parimal 26 Reputation points
2021-12-08T05:48:27.023+00:00

Hi,

I have a requirement to copy a table from one SQL DB to another. The source table has 300 million rows, which add up to about 61 GB of data. With the default configuration of the Copy data activity in ADF, it takes around 13-14 hours to complete the copy.

How can I determine the values of the following fields for better performance?

  1. Write batch size
  2. Data integration unit
  3. Degree of copy parallelism

Or is there any better way in ADF to achieve this?

Thanks

Azure SQL Database
Azure Data Factory

Accepted answer
    MartinJaffer-MSFT 26,086 Reputation points
    2021-12-09T17:22:54.573+00:00

    Hello @Prakhar Parimal and welcome to Microsoft Q&A. Looks like you have done some reading already and found the main 'levers to pull'. Generally, the more Data Integration Units and the higher the degree of copy parallelism, the better. However, like anything in life, too much is not a good thing, as the database has a limit on concurrent connections. When values are not specified, Data Factory tries to find optimal settings.

    There is a factor you have left out: the number of data partitions in your table(s). The Data Integration Units and degree of copy parallelism are apportioned across the data partitions. See the excerpts below.

    Copy from partition-option-enabled data stores (including Azure Database for PostgreSQL, Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SQL Server, and Teradata): 2-256 when writing to a folder, and 2-4 when writing to one single file. Note that each source data partition can use up to 4 DIUs.

    When copying data into Azure SQL Database or Azure Cosmos DB, default parallel copy also depends on the sink tier (number of DTUs/RUs).
    When copying data from partition-option-enabled data store (including Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Amazon RDS for Oracle, Netezza, SAP HANA, SAP Open Hub, SAP Table, SQL Server, Amazon RDS for SQL Server and Teradata), default parallel copy is 4.
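
    To make these settings concrete, here is a minimal sketch of where they sit in a Copy activity definition. The dataset names and the partition column are placeholders for illustration, and the numeric values are only starting points to test against your source and sink, not recommendations:

        {
            "name": "CopySqlToSql",
            "type": "Copy",
            "inputs": [ { "referenceName": "SourceSqlTableDataset", "type": "DatasetReference" } ],
            "outputs": [ { "referenceName": "SinkSqlTableDataset", "type": "DatasetReference" } ],
            "typeProperties": {
                "source": {
                    "type": "AzureSqlSource",
                    "partitionOption": "DynamicRange",
                    "partitionSettings": { "partitionColumnName": "Id" }
                },
                "sink": { "type": "AzureSqlSink" },
                "dataIntegrationUnits": 32,
                "parallelCopies": 8
            }
        }

    If your table has physical partitions, the "PhysicalPartitionsOfTable" option lets the service split the read along them instead of computing ranges over a column.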

    The Write batch size is something I have less experience with; for SQL sinks it is expressed as the number of rows inserted per batch.
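
    If you do want to set it explicitly, it sits on the sink object of the Copy activity sketch above. The value below is purely illustrative; test a few sizes against your sink's DTU tier, since larger batches mean fewer round trips but more memory and log pressure per insert:

        "sink": {
            "type": "AzureSqlSink",
            "writeBatchSize": 100000,
            "writeBatchTimeout": "00:30:00"
        }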

    I recommend you take a look at the Copy activity monitoring to see where the bottlenecks may be happening.

    Recommended articles:
    copy-activity-performance
    copy-activity-performance-features
    copy-activity-performance-troubleshooting
    Azure SQL specific performance troubleshooting

