ADF Data flow gets timed out while upserting data to Azure SQL DB

Rajas Thakur 61 Reputation points
2024-02-01T23:51:06.9666667+00:00

Hello,

We have a data flow which reads data from source table (Azure SQL DB), does transformations using views and then it is supposed to upsert the data in the destination table (Azure SQL DB). Our data volume has recently become 2X and I believe as a result the data flow is stuck in execution status.

Prior row count - 18 million. Current row count - 36 million.

Following are the solutions that I have tried but haven't worked -

  • Increased vCore count (In the elastic pool which the database is part of) from 6 to 8.
  • Increased core count in the Integration Runtime from 8(+8) to 32(+16). We are currently on General Compute type.
  • Increased partitions in the Optimize data flow options from 20 to 32. Also tried with different methods - current partitioning , set partitioning (round robin, source).

We also have sufficient space in the database to handle increased loads. The data flow shows succeeded in the monitor tab but the activity never gets completed and instead times out after the defined timeout interval of 4 hrs. Attached a screenshot of the scenario. The temp table seen below does have all the data but it is unsuccessful to upsert it to the destination table.

User's image

Has anyone faced similar issues and know what could be a possible solution to tackle this?

Azure SQL Database
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
10,196 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. AnnuKumari-MSFT 32,161 Reputation points Microsoft Employee
    2024-02-05T08:01:35.82+00:00

    Hi Rajas Thakur ,

    Thankyou for using Microsoft Q&A platform and thanks for posting your query.

    Based on your description, it seems like your data flow is stuck in execution status and the volume has doubled after the last run while upserting the data in the destination table.

    Please let me know if that is not the case.

    It would be great if you could share the full dataflow configurations with screenshot of derive col, alter row and sink transformation settings.

    I suspect that the issue of duplicate data is occurring because of 'Allow insert' option being turned on in the sink settings. Even if you have enabled 'Allow upsert' , 'Allow insert' would be given preference if it's not disabled which is by default selection in the dataflow. Kindly validate the same.

    Regarding data flow execution getting stuck, you could try to follow the below points:

    1. Check the data flow execution plan. You can use the "Execution Plan" tab in the data flow designer to view the execution plan for your data flow. The execution plan shows the order of operations and the estimated data size for each operation. You can use this information to identify any bottlenecks or performance issues in your data flow.
    2. Optimize your data flow. Based on the execution plan, you can try optimizing your data flow by adjusting the partitioning, changing the order of operations, or using different transformations. For example, you can try using the "Aggregate" transformation to reduce the number of rows before doing the upsert operation.
    3. Increase the timeout interval. If your data flow is timing out after 4 hours, you can try increasing the timeout interval to a higher value. You can do this in the data flow settings under "Optimize".
    4. Use a staging table. Instead of doing the upsert operation directly on the destination table, you can try using a staging table to store the transformed data first. You can then use a stored procedure or SQL script to do the upsert operation from the staging table to the destination table. This can help reduce the load on the destination table and improve performance.

    I hope this helps. Please accept the answer by clicking on Accept answer button. Thankyou