Share via

Performance issue with Copy Data activity in Synapse pipeline vs Talend ETL

Michele Puopolo 0 Reputation points
2025-10-10T10:15:27.27+00:00

Hi everyone,

I'm currently rewriting a Talend ETL process using Azure Synapse, specifically for the data extraction part from an application database.

  • Source: Azure Database for MySQL instance containing 226 identical databases.
  • Destination: Azure Data Lake Storage Gen2, Parquet format.
  • Queries: I need to run 21 relatively light queries across all 226 databases, generating around 1,550,000 rows in total.
  • The queries are executed sequentially, but I configured a parallelism of 50 (max value) across the databases.

Since Synapse cannot connect directly to Azure MySQL, I'm using two Self-hosted Integration Runtimes deployed on StandardF32sv2 VMs. I've monitored the VMs and found no RAM issues, only occasional CPU spikes.

The original Talend version completes the extraction in 10 minutes, while the Synapse pipeline takes 1 hour and 30 minutes, despite all the optimizations I could apply.

Most of the time is spent in the Copy Data activity (I’ve attached screenshots showing the settings and execution times).

2025-10-10_12h07_17

2025-10-10_12h07_30

2025-10-10_12h12_21


Questions:

  1. Is this kind of workflow (many fast queries across multiple databases) feasible with Synapse, or is it simply not the right tool for this use case?
  2. Would switching to Azure Integration Runtime improve performance compared to Self-hosted IR?

Thanks in advance for any insights or suggestions!

Azure Synapse Analytics
Azure Synapse Analytics

An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

0 comments No comments

1 answer

Sort by: Most helpful
  1. Smaran Thoomu 35,375 Reputation points Microsoft External Staff Moderator
    2025-10-10T13:31:00.7666667+00:00

    Hi Michele Puopolo
    Thanks for sharing all the details and screenshots - that helps a lot.

    From what I see, most of the delay is happening in the queue phase of your Copy activity, even though the Self-hosted IR VM has plenty of free CPU and memory. This usually points to a concurrency limit on the IR - by default, only a certain number of copy jobs can run in parallel per node, even if system resources are available. You can review or increase this under your IR configuration (Integration Runtime → Nodes → Edit → Concurrent jobs).

    Also, since you’re running many small queries, the Copy activity overhead in Synapse/ADF can add up - each Copy run has some startup and orchestration time. Tools like Talend handle many quick queries more efficiently because they stream them within the same process.

    You could try:

    Combining several queries per Copy activity (or using a stored procedure).

    Checking whether using an Azure IR gives better results (if your MySQL is public).

    Keeping the Self-hosted

    …IR only if your MySQL is private or connected through a VNet.

    Hope this helps point you in the right direction!


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

    Was this answer helpful?

    0 comments No comments

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.