Parallelizing an ADF copy activity from a Synapse view to a Synapse table

Question

Hi,

Is it possible to parallelize a Data Factory copy activity that reads data from a Synapse Analytics view and writes it to a Synapse table?

Thanks

Accepted Answer

In addition to the above answer
Hi @pmscorca

Welcome to Microsoft Q&A platform and thanks for posting your question here.

When you run a Data Factory copy activity, it reads data from a source and writes it to a destination. By default, the copy activity runs a single query to read data from the source and write it to the destination. However, when dealing with large amounts of data, this can be slow and inefficient.

Parallelizing a Data Factory Copy Activity in Azure Synapse Analytics

To improve the performance of a Data Factory copy activity, you can parallelize it. This means that the copy activity runs multiple queries in parallel to read data from the source and write it to the destination. The Azure Synapse Analytics connector in the copy activity provides built-in data partitioning to copy data in parallel.

Enabling Partitioned Copy in Azure Synapse Analytics

To enable partitioned copy in Azure Synapse Analytics, select a partition option on the copy activity source (the "partitionOption" property in the activity JSON). The "parallelCopies" setting on the copy activity then controls the degree of parallelism, which determines how many parallel queries are generated and run against the Azure Synapse Analytics source to load data by partitions.

Example Scenario

Let's say you have a large table in Azure Synapse Analytics with 1 million rows of data. You want to copy this data to another table in Azure Synapse Analytics using a copy activity. With a partition option selected on the source, you can set the "parallelCopies" setting to 4. This means that the copy activity will generate and run 4 parallel queries against the Azure Synapse Analytics source to load data by partitions.
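As a rough sketch of this scenario, the snippet below builds a copy activity fragment as a Python dictionary and prints it as JSON. The activity name, the view name "dbo.vw_Sales", and the column "SaleId" are hypothetical. The property names ("SqlDWSource", "SqlDWSink", "sqlReaderQuery", "partitionOption", "partitionSettings", "parallelCopies") are written as I recall them from the connector documentation, so please check them against the current docs before relying on them. Dataset references and linked services are left out.

```python
import json


def build_copy_activity(view_name, partition_column, parallel_copies=4):
    """Return a dict shaped like a copy activity definition.

    This is an illustrative fragment only; it is not validated
    against the Data Factory service.
    """
    return {
        "name": "CopySynapseViewToTable",  # hypothetical activity name
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "SqlDWSource",
                # Assumed placeholder that the service replaces with
                # a per-partition range condition.
                "sqlReaderQuery": (
                    f"SELECT * FROM {view_name} "
                    "WHERE ?AdfDynamicRangePartitionCondition"
                ),
                "partitionOption": "DynamicRange",
                "partitionSettings": {
                    "partitionColumnName": partition_column,
                },
            },
            "sink": {"type": "SqlDWSink"},
            # Upper bound on how many partition queries run at once.
            "parallelCopies": parallel_copies,
        },
    }


activity = build_copy_activity("dbo.vw_Sales", "SaleId")
print(json.dumps(activity, indent=2))
```

The point of the sketch is that two separate settings are involved: the partition option on the source decides how the data is split, and "parallelCopies" on the activity decides how many of those splits are read at the same time.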

Retrieving Data by Partitions

Each query will retrieve a portion of the data from the Azure Synapse Analytics source, based on your specified partition option and settings.

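For example, you can partition the data by the physical partitions of the source table, or by dynamic ranges of values in a specified integer or date/datetime column. Because a view has no physical partitions of its own, the dynamic range option is likely the one that applies when the source is a view.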
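To illustrate the idea of range-based partitioning, here is a small Python sketch that splits an integer key range into contiguous slices and builds one query per slice. It is only a conceptual illustration: the way the service actually computes partition boundaries is internal and may differ, and the helper names, the view "dbo.vw_Sales", and the column "SaleId" are hypothetical.

```python
def split_range(lower, upper, parts):
    """Split the inclusive integer range [lower, upper] into at most
    `parts` contiguous, non-overlapping sub-ranges of near-equal size."""
    if parts < 1 or upper < lower:
        raise ValueError("need parts >= 1 and upper >= lower")
    total = upper - lower + 1
    parts = min(parts, total)  # never create empty slices
    base, extra = divmod(total, parts)
    ranges = []
    start = lower
    for i in range(parts):
        # The first `extra` slices take one additional value each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def partition_queries(view_name, column, lower, upper, parts):
    """Build one SELECT per slice; each query reads a disjoint key range."""
    return [
        f"SELECT * FROM {view_name} WHERE {column} BETWEEN {lo} AND {hi}"
        for lo, hi in split_range(lower, upper, parts)
    ]


for query in partition_queries("dbo.vw_Sales", "SaleId", 1, 1_000_000, 4):
    print(query)
```

Each generated query covers a disjoint slice of the key range, which is what allows the slices to be read in parallel without overlap. An evenly distributed partition column matters here, since skewed values would leave some queries with far more rows than others.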

Optimizing the Copy Activity

It's important to note that the optimal value for the "parallelCopies" property depends on the size of the data, the available resources, and the network bandwidth. Additionally, the copy activity can be further optimized by configuring the batch size and timeout settings, as well as the source and sink settings.

Hope this helps. Do let us know if you have any further queries.

If this answers your query, please click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.

Answer

Based on the documentation:

You can set the parallelCopies property to indicate the parallelism you want the copy activity to use. Think of this property as the maximum number of threads within the copy activity. The threads operate in parallel. The threads either read from your source, or write to your sink data stores.
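As a loose analogy for "maximum number of threads", the Python sketch below runs a set of stand-in partition tasks through a thread pool with a fixed cap and records the highest number of tasks that were active at once. This is not how the service is implemented; the function names and the sleep used in place of real read/write work are purely illustrative.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_cap(num_partitions, max_threads):
    """Process `num_partitions` dummy tasks with at most `max_threads`
    worker threads and return the peak concurrency observed."""
    lock = threading.Lock()
    active = 0
    peak = 0

    def copy_partition(_partition_id):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for reading a slice and writing it
        with lock:
            active -= 1

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Consume the iterator so every task finishes before returning.
        list(pool.map(copy_partition, range(num_partitions)))
    return peak


print(run_with_cap(num_partitions=10, max_threads=4))
print(run_with_cap(num_partitions=2, max_threads=4))
```

The cap is an upper bound rather than a guarantee: with fewer partitions than threads, the observed concurrency cannot exceed the number of partitions, which matches the "maximum" wording in the documentation quote.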
