How to Reduce Data Flow Runtime

Mahesh Madhusanka 176 Reputation points
2021-01-31T07:11:02.083+00:00

Hi Team,

Currently we have setup Pipeline (Attached below) Including copy activity and 2 of data flows for a capture Incremental data from remote source database and loaded(Append to incremental ) data to Azure SQL Server DB, According to our required task working properly, However we have a problem as a for a Dataflows take a much more times such as a 1st dataflow - 10 min and 2nd dataflow 10 min but copy activity tale a below 8 seconds time, Could you pelase check this and advise on this to reduce dataflow run time?

  1. copy activity - Capture the incremental data and loaded into Azure DB incremental Table
  2. Data Flow - 1 - Copy bulk DB Backup to Another Data table (Sama Database another table - only the copy source table to Sink table )
  3. Data Flow - 2 - merge Data Between incremental data table and 2nd stage staging database and loaded to Bulk DB.

Ex -

62108-image.png

1.copy Activity
62236-image.png

262168-image.png. Dataflow 1-

  1. Dataflow 2-62160-image.png
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
9,698 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Nasreen Akter 10,761 Reputation points
    2021-01-31T16:38:45.683+00:00

    Hi @Mahesh Madhusanka ,

    Thank you for asking this question.

    1. According to the description, I see that you have 1 copy activity, and 2 DataFlows, all are executing in sequence. As the DataFlow activities you need to execute in sequence, ADF will spins up separate Spark clusters for each DataFlow . Each cluster will takes 3-5 minutes start-time. You can reduce the runtime by setting a TTL in the Azure IR configuration. ADF will reuse the compute resources resulting in a faster cluster start up time. Details would be found here, https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance
    2. I see you have a 1#-DataFlow without any transformation, just source and sink in it. You can reduce the cluster start-up time + execution time by using CopyActivity instead.
    3. 2#DataFlow does not make much sense to me (I could be wrong, as it's hard to tell what's the requirement was just by seeing the boxes), If the requirement was simply to UPSERT latest records from two tables i.e., incremental-DataTable and 2nd-StageStagingTable into the Bulk-DB, then I would do the followings: In the DataFlow, I would avoid the Lookup activity specially with "Match multiple rows" option, as it will create second set of same columns (i.e., 90+90 = 180) + multiple rows. What I would do something like: Source-incremental-DataTable --> DerivedColumns1
      Source-2nd-StageStagingTable --> DerivedColumns2 --> Union (stream: DerivedColumns1) --> Filter1 (I see you have filter in your DF, you can have Filter1 here, or after Filter2)-->Window Activity (Over: PK, Sort: StatusDateTime/CreateDateTime DESC, WindowColumns: Rank = rowNumber() --> Filter2 (Filter on: Rank = 1) -->AlterRow --> Sink (Settings: upsert)

    Hope this helps! Thanks! :)

    3 people found this answer helpful.
    0 comments No comments