How to Reduce Data Flow Runtime

Mahesh Madhusanka 216 Reputation points
2021-01-31T07:11:02.083+00:00

Hi Team,

Currently we have set up a pipeline (attached below) that includes a copy activity and 2 data flows to capture incremental data from a remote source database and load (append) it into an Azure SQL Database. It works properly for our required task, however we have a problem: the data flows take much more time, about 10 min for the 1st data flow and 10 min for the 2nd data flow, while the copy activity takes under 8 seconds. Could you please check this and advise how to reduce the data flow run time?

  1. Copy activity - captures the incremental data and loads it into the Azure DB incremental table
  2. Data Flow 1 - copies the bulk DB backup table to another data table (same database, another table; only copies the source table to the sink table)
  3. Data Flow 2 - merges data between the incremental data table and the 2nd-stage staging table and loads it into the bulk DB

Ex -

Pipeline - 62108-image.png

  1. Copy activity - 62236-image.png
  2. Data Flow 1 - 62168-image.png
  3. Data Flow 2 - 62160-image.png

1 answer

  1. Nasreen Akter 10,791 Reputation points
    2021-01-31T16:38:45.683+00:00

    Hi @Mahesh Madhusanka ,

    Thank you for asking this question.

    1. According to the description, you have 1 copy activity and 2 Data Flows, all executing in sequence. Because the Data Flow activities run in sequence, ADF spins up a separate Spark cluster for each Data Flow, and each cluster takes 3-5 minutes to start. You can reduce the runtime by setting a TTL in the Azure IR configuration; ADF will then reuse the compute resources, resulting in a faster cluster start-up time (see the sketch after this list). Details can be found here: https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance
    2. I see your 1st Data Flow has no transformation in it, just a source and a sink. You can eliminate the cluster start-up time plus the execution time by using a Copy activity instead.
    3. The 2nd Data Flow does not make much sense to me (I could be wrong, as it's hard to tell what the requirement is just by seeing the boxes). If the requirement is simply to UPSERT the latest records from the two tables, i.e. the incremental data table and the 2nd-stage staging table, into the bulk DB, then I would do the following. In the Data Flow, I would avoid the Lookup transformation, especially with the "Match multiple rows" option, as it creates a second set of the same columns (i.e. 90 + 90 = 180) plus multiple rows. Instead, I would do something like this (see the PySpark sketch below this list for the equivalent logic): Source (incremental data table) --> DerivedColumns1
      Source (2nd-stage staging table) --> DerivedColumns2 --> Union (stream: DerivedColumns1) --> Filter1 (I see you already have a filter in your data flow; you can place Filter1 here, or after Filter2) --> Window (Over: PK; Sort: StatusDateTime/CreateDateTime DESC; Window columns: Rank = rowNumber()) --> Filter2 (filter on: Rank == 1) --> AlterRow --> Sink (settings: upsert)
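
    As a minimal sketch of point 1: the TTL lives on the Azure Integration Runtime's data flow compute settings. You can set it in ADF Studio under the IR's data flow runtime settings, or programmatically, for example with the azure-mgmt-datafactory Python SDK as below. The subscription, resource group, factory and IR names are placeholders, and the compute type / core count should match what your Data Flows already use.

```python
# Sketch: configure a data flow TTL on an Azure Integration Runtime so that
# sequential Data Flow activities can reuse the same warm Spark cluster instead
# of paying the 3-5 minute start-up cost twice. Angle-bracket values are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

ir = IntegrationRuntimeResource(
    properties=ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            data_flow_properties=IntegrationRuntimeDataFlowProperties(
                compute_type="General",   # match the compute type your Data Flows use
                core_count=8,             # match the core count your Data Flows use
                time_to_live=10,          # minutes the Spark cluster stays warm after a run
            )
        )
    )
)

client.integration_runtimes.create_or_update(
    "<resource-group>", "<data-factory-name>", "DataFlowIR", ir
)
```

    Both Data Flow activities then need to point at this Azure IR in their activity settings for the cluster reuse to kick in.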

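    To illustrate the logic in point 3 (Mapping Data Flows execute on Spark under the hood), here is a rough PySpark sketch of the union + window + Rank = 1 pattern. It is only meant to show the intent, not something you paste into ADF; the table and column names ("pk", "StatusDateTime") are assumptions based on the description above.

```python
# Sketch of the equivalent logic: union both sources, keep the latest row per key,
# then upsert. Table and column names here are assumptions, not your actual schema.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

incremental = spark.table("incremental_data_table")   # stream behind DerivedColumns1
staging = spark.table("second_stage_staging_table")   # stream behind DerivedColumns2

latest_per_key = (
    incremental.unionByName(staging)                   # Union transformation
    .withColumn(
        "Rank",
        row_number().over(                             # Window: Rank = rowNumber()
            Window.partitionBy("pk").orderBy(col("StatusDateTime").desc())
        ),
    )
    .filter(col("Rank") == 1)                          # Filter2: Rank == 1
    .drop("Rank")
)
# latest_per_key now has one row per PK (the most recent one); in the Data Flow this
# is what flows into AlterRow and then into the Sink with the upsert option enabled.
```

    In the Data Flow itself this corresponds to Union --> Window --> Filter --> AlterRow --> Sink (upsert), with no Lookup needed.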
    Hope this helps! Thanks! :)

    3 people found this answer helpful.