Slow Data Pipeline Performance - ADF Data Flow to Azure SQL Database

Murlidhar Patil 1 Reputation point
2024-05-20T11:51:35.8133333+00:00

I'm having some performance issues with an Azure Data Factory (ADF) data flow pipeline. The pipeline is designed to move data from a Parquet file and insert/update it into an Azure SQL database table. The data volume is moderate, with batches of 50,000 records out of a total of 10 million.

The strange thing is that the pipeline processes data quickly up until the insert/update sink stage (the InsertSink or UpdateSink activity). Once it reaches that point, however, the process stalls for a very long time (sometimes 1 hour, sometimes 12 hours or more).

I've checked the ADF monitoring and it shows the data flow stage itself is taking the longest. There are no errors in the data flow or sink activity, and the batch size seems reasonable.

I'd really appreciate help identifying the bottleneck that slows down the insert/update into the SQL database. Any suggestions for optimizing the pipeline to speed up this stage would be welcome.

The destination table (an Azure SQL table) is not wide; it has just 10 columns. However, the volume of data in that table is huge: around 1.1 billion rows.

I have also scaled the Azure SQL server up to 32 cores, which I think should be more than enough, and the ADF data flow is running on a 32-core integration runtime.

Thanks in advance for your assistance!

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Harishga 4,485 Reputation points Microsoft Vendor
    2024-05-20T15:24:10.8+00:00

    Hi @Murlidhar Patil
    Welcome to the Microsoft Q&A platform, and thanks for posting your question here.

    When optimizing the performance of an Azure Data Factory (ADF) data flow pipeline, especially during the insert/update sink stage into an Azure SQL database table that already holds a large volume of data, there are several methods to consider. Here is a summary of each, with its advantages and when it is suitable for this scenario:

    Example: A data flow pipeline moves data from a Parquet file into an Azure SQL database table with 1.1 billion rows. Despite scaling the Azure SQL server up to 32 cores and running the ADF data flow on a 32-core integration runtime, the insert/update sink stage is slow.

    Performance optimization methods, their advantages, and their suitability:

    Data Integration Units (DIUs):

    Pros: Can increase parallelism and throughput.

    Suitability: Adjusting DIUs can be beneficial in scenarios where the default parallelism is not sufficient for the data volume.

    Self-hosted Integration Runtime Scalability:

    Pros: Allows for more granular control over resources.

    Suitability: Suitable for scenarios where cloud integration runtime does not meet performance needs.

    Parallel Copy:

    Pros: Can significantly reduce the time required for data movement.

    Suitability: Useful when the bottleneck is data movement rather than compute resources.

    Bulk Insert Techniques:

    Pros: Minimizes roundtrips and log writes, maximizing throughput.

    Suitability: Highly suitable for large batch inserts like in the given scenario.

    Incremental Data Loading:

    Pros: Reduces data movement by only processing delta changes.

    Suitability: Not suitable for the initial load, but worth considering for subsequent loads (see the watermark sketch after this list).

    Source Partitioning:

    Pros: Improves parallel processing capabilities.

    Suitability: Suitable for large datasets to improve performance during data movement.
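
    For the incremental-loading option above, here is a minimal watermark sketch. It is illustrative only: the dbo.WatermarkTable, the ModifiedDate column, the dbo.SourceStaging table, and the connection details are all assumptions you would replace with your own schema; in ADF the delta filter would normally be pushed into the source query or a data flow filter.

    ```python
    # Minimal watermark sketch for incremental loading (illustrative only).
    # dbo.WatermarkTable, dbo.SourceStaging, and ModifiedDate are hypothetical names.
    import pyodbc

    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=tcp:<your-server>.database.windows.net,1433;"
        "Database=<your-db>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
    )
    cur = conn.cursor()

    # 1. Read the last processed watermark for the target table.
    cur.execute(
        "SELECT LastModifiedDate FROM dbo.WatermarkTable WHERE TableName = ?",
        "TargetTable",
    )
    last_watermark = cur.fetchone()[0]

    # 2. Select only rows changed since the watermark (the delta).
    delta_rows = cur.execute(
        "SELECT * FROM dbo.SourceStaging WHERE ModifiedDate > ?", last_watermark
    ).fetchall()

    # 3. After the delta has been loaded successfully, advance the watermark.
    cur.execute(
        "UPDATE dbo.WatermarkTable SET LastModifiedDate = SYSUTCDATETIME() "
        "WHERE TableName = ?",
        "TargetTable",
    )
    conn.commit()
    ```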

    Recommended Method for the Scenario:

    Given the large volume of data and the structure of the destination table, leveraging bulk insert techniques would be the most suitable method. This approach uploads the incoming data to the Azure SQL Database first and then applies all the INSERT/UPDATE operations in a single batch, minimizing roundtrips and log writes and thus maximizing throughput.
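
    As a rough illustration of this staging-plus-batch-upsert pattern (not the exact ADF sink configuration), here is a minimal pyodbc sketch that bulk-loads a batch into a staging table and applies one MERGE. Table names, columns, and connection details are placeholders.

    ```python
    # Minimal staging + MERGE sketch (illustrative; all names are placeholders).
    # The idea: load the batch in bulk, then apply inserts/updates in one
    # set-based statement instead of row-by-row upserts against the 1.1B-row table.
    import pyodbc

    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=tcp:<your-server>.database.windows.net,1433;"
        "Database=<your-db>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
    )
    cur = conn.cursor()
    cur.fast_executemany = True  # send the whole batch in bulk, minimizing roundtrips

    # 1. Bulk-load the 50,000-row batch into an empty staging table.
    rows = [(1, "value-a"), (2, "value-b")]  # stand-in for the Parquet batch
    cur.executemany("INSERT INTO dbo.Staging_Target (Id, Col1) VALUES (?, ?)", rows)

    # 2. Apply all inserts/updates in one statement against the large table.
    cur.execute("""
        MERGE dbo.TargetTable AS t
        USING dbo.Staging_Target AS s
            ON t.Id = s.Id
        WHEN MATCHED THEN
            UPDATE SET t.Col1 = s.Col1
        WHEN NOT MATCHED THEN
            INSERT (Id, Col1) VALUES (s.Id, s.Col1);
    """)

    # 3. Clear the staging table for the next batch.
    cur.execute("TRUNCATE TABLE dbo.Staging_Target")
    conn.commit()
    ```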

    In addition, it is recommended to review the partitioning strategy and ensure that the data is partitioned effectively to maximize parallel processing during data movement.
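
    Partitioning in a mapping data flow is configured on the Optimize tab of the source and sink rather than in code, but as a rough sketch of what "partitioned effectively" can mean, the snippet below derives evenly sized key-range boundaries from the incoming batch so parallel sink tasks stay balanced. The file path and the "Id" key column are assumptions.

    ```python
    # Rough illustration of picking even key-range boundaries for partitioned writes.
    # The Parquet path and the "Id" key column are assumptions for this sketch.
    import pandas as pd

    df = pd.read_parquet("batch_50000.parquet")  # the incoming 50,000-row batch
    num_partitions = 8  # e.g. match the data flow partition count

    # Quantile-based boundaries give partitions with roughly equal row counts,
    # which keeps parallel write tasks balanced instead of skewed.
    boundaries = df["Id"].quantile(
        [i / num_partitions for i in range(1, num_partitions)]
    ).tolist()
    print("Key-range boundaries:", boundaries)
    ```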

    Each method has its trade-offs, and the best approach depends on the specific requirements and constraints of your data flow pipeline. It’s essential to monitor the performance after applying these optimizations and adjust the strategy as needed.

    Reference:
    https://techcommunity.microsoft.com/t5/azure-data-factory-blog/performance-tuning-adf-data-flow-sources-and-sinks/ba-p/1781804

    https://devblogs.microsoft.com/azure-sql/optimize-azure-sql-upsert-scenarios/

    https://medium.com/@rahulgosavi.94/optimizing-performance-in-azure-data-factory-pipelines-best-practices-and-strategies-b607116e2d46

    https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click "Accept Answer" and "Yes" for "Was this answer helpful".