Azure Data Factory Activity wait times extending

Question

We have a set of Azure pipelines that have been running for the past two years without incident. Processing in ADF is orchestration of stored procedures in Azure SQL PaaS and data copies between SQL PaaS databases, all for Data Warehousing.

In the past week, the processing times for the have increased dramatically, 3 to 5 times, although the amount of data and processing that we are looking to do has only increased marginally. We have also implemented changes in our process to accommodate this however the issue is only getting worse. We are seeing significant increased in the time between activities being initiated and the time the activities are taking, although our internal logging in SQL PaaS is showing no increase in time taken for the actual stored procedures to run. This is not isolated to one region as we are processing in Europe as well the US. It seems like there are changes in Azure Data Factory that have / is causing this.

Is there any explanation as to why this is happening and what we can do about it. ADF seems to be very much a black box that cannot be opened when you get to this level.

Thanks

Answer

@HarithaMaddi-MSFT , thanks for replying to my question. Data Factory name is CognitionBITADFProd

The issue that we have been facing is best described below. The first table shows a snapshot of our processing when things were going well, as you can see the start time plus the duration of the task was about equal to the start time of the next task, within a couple of seconds.

3 - Report Database Generic Copy 8/31/20, 12:30:27 PM 8/31/20, 12:34:27 PM 00:03:59
Triggered by Pipeline 36bbcd60-1557-4738-bdd6-006767e8266e

Activities
Copy 8/31/20, 12:34:12 PM 00:00:13 Succeeded DefaultIntegrationRuntime (West Europe) 858ba56c-292a-4b04-a603-090907de8616
Copy 8/31/20, 12:33:44 PM 00:00:25 Succeeded DefaultIntegrationRuntime (West Europe) 7a1ad81c-96c1-4066-96a9-6b2cb5e1efb9
Copy 8/31/20, 12:33:27 PM 00:00:16 Succeeded DefaultIntegrationRuntime (West Europe) 1e601994-4e1f-4a22-9e1f-f94682fd626b
Copy 8/31/20, 12:33:12 PM 00:00:15 Succeeded DefaultIntegrationRuntime (West Europe) 36995e84-5f38-4c32-87a2-b5e361d95781
Copy 8/31/20, 12:32:55 PM 00:00:16 Succeeded DefaultIntegrationRuntime (West Europe) 2745f357-5ea7-4f34-a8be-dec1df9da27d
Copy 8/31/20, 12:32:39 PM 00:00:15 Succeeded DefaultIntegrationRuntime (West Europe) b8576ed4-e60e-4309-8388-e481731a9f21

However over the last week this has changed dramatically... as you can see from the second table the start time plus the duration is nowhere near the start time of the next activity. We have not changed anything yet there is now a dramatic change in both the actual duration of the tasks and (what we are calling) the wait time, this has increased to 15 plus seconds.

3 - Report Database Generic Copy 9/22/20, 12:54:53 PM 9/22/20, 1:33:33 PM 00:38:39
Triggered by Pipeline16f1aea2-495b-4a08-bb7e-5d13525eb7e5

Activities
Copy 9/22/20, 1:32:12 PM 00:00:34 Succeeded DefaultIntegrationRuntime (West Europe) 821bdca0-d884-483d-b6bf-3561fd33fe56
Copy 9/22/20, 1:27:05 PM 00:00:24 Succeeded DefaultIntegrationRuntime (West Europe) 5ae32159-f66b-4ece-9844-53158c6f11df
Copy 9/22/20, 1:26:23 PM 00:00:27 Succeeded DefaultIntegrationRuntime (West Europe) d28dac67-b14a-4eed-8046-0266f26ba5e3
Copy 9/22/20, 1:25:44 PM 00:00:29 Succeeded DefaultIntegrationRuntime (West Europe) 7357f2a1-5c3e-4492-821e-3a4e15fd38a4
Copy 9/22/20, 1:24:58 PM 00:00:30 Succeeded DefaultIntegrationRuntime (West Europe) 3827836e-26b9-4e71-a0af-a1ff24ef366c
Copy 9/22/20, 1:24:12 PM 00:00:30 Succeeded DefaultIntegrationRuntime (West Europe) d3cee068-53f7-4bd5-9f58-fed05b756aeb
Copy 9/22/20, 1:23:25 PM 00:00:34 Succeeded DefaultIntegrationRuntime (West Europe) 1aed4c1f-e967-4d76-8407-edd50a90450b
Copy 9/22/20, 1:22:31 PM 00:00:31 Succeeded DefaultIntegrationRuntime (West Europe) fe383cba-df4e-40fe-9634-7c925c950ced

The issue that we have is that over the last seven days a process that used to take 7-10 minutes is now taking 70 minutes plus. The table below is taken from our internal logging, but illustrates the significant increase in processing times.

Date Total Time Between Processes (Mins)
2020-09-22 75
2020-09-21 72
2020-09-20 30
2020-09-19 35
2020-09-18 38
2020-09-17 22
2020-09-16 15
2020-09-15 15
2020-09-14 11
2020-09-13 8

Our solution is mission critical for a number of our customers and we are trying to process their data during a specific window. With the blow out of the timings this has become a significant issue for us. What we need to understand is why this is happening and what (if anything) there is that we can do about it.

Thanks and appreciate all the help and insight that you can bring to bear.

C.

Answer

Thanks @Clive Trott for your patience!

I discussed with Product team, they investigated and below are details:

Reg customer's pipeline run 77c93656-5187-4d89-b43f-5902ae643415, Looks like there is 4-6s delay between when the previous iteration copy activity succeeded and when the next iteration copy activity was queued.

TIMESTAMP activityRunId activityName status activityType
2020-09-22 00:55:22.8482270 602b2a62-d5b0-405f-ad05-356f845ec555 Generic Data Copy Queued Copy
2020-09-22 00:55:35.1795670 602b2a62-d5b0-405f-ad05-356f845ec555 Generic Data Copy InProgress Copy
2020-09-22 00:55:43.1507126 602b2a62-d5b0-405f-ad05-356f845ec555 Generic Data Copy Succeeded Copy
2020-09-22 00:55:47.6402610 e1bc25ec-9c2d-4a2f-82a5-4404dcc2b558 Generic Data Copy Queued Copy
2020-09-22 00:56:01.8795267 e1bc25ec-9c2d-4a2f-82a5-4404dcc2b558 Generic Data Copy InProgress Copy
2020-09-22 00:56:11.4281916 e1bc25ec-9c2d-4a2f-82a5-4404dcc2b558 Generic Data Copy Succeeded Copy
2020-09-22 00:56:17.3097033 2be31b75-6d3e-42bd-b9d8-aa1fc51ca1dc Generic Data Copy Queued Copy
2020-09-22 00:56:31.0017671 2be31b75-6d3e-42bd-b9d8-aa1fc51ca1dc Generic Data Copy InProgress Copy
2020-09-22 00:56:42.2490113 2be31b75-6d3e-42bd-b9d8-aa1fc51ca1dc Generic Data Copy Succeeded Copy

The backend service (region: WestUS2) was at 100% CPU during the time when the pipeline run was executing. The service is network heavy and usually runs out of network sockets which leads to thread starvation and CPU overload. We did scale out the service in WestUS2 later in day on 9/22 PST and customer should not observe these delays anymore.

I also wanted to call out the activity run duration only accounts for actual "execution" time of the activity. Customer's expectation that "gap between start times of consecutive activity runs should be exactly activity run duration of first activity run" is wrong. We have queue on the activity run because there could be capacity constraints on downstream data movement/or even customer compute resources/IRs. (Similarly, we also have a queue on pipeline runs since there is a limit of 100 concurrent runs for a single pipeline.) We surface the breakdown of the activity run duration in the activity run payload.

Please recheck and confirm if you think the issue still persists with ADF. If further investigation is needed, kindly send mail to AzCommunity[at]microsoft[dot]com with subject: "Attn: Haritha - Azure Data Factory Activity wait times extending" and we will help you move forward.

Share via

Azure Data Factory Activity wait times extending

2 answers