ADF performance troubleshooting

Question

For very basic activities (eg lookup) I'm finding large differences in performance between one execution of a pipeline and the next.

For example the following sometimes runs in 3 seconds and other times runs in 20 seconds.

What can I do to investigate these types of performance inconsistencies? In this case the lookup is running a stored procedure in an azure-hosted sql database via the "AutoResolveIntegrationRuntime". When I monitor xevents in the database, the query executes instantaneously. The performance problems must be happening in ADF or in the integration runtime itself (autoresolve runtime).

Essentially I'm asking where I get profiling information. Preferrably this type of profiling would be available both in the real ADF environment and in the debug environment.

Side question. Is the "auto resolve integration runtime" shared with lots of other customers? Is there a chance that my stuff is getting queued up behind a dozen other customers in the US east region? If so , then is there a way to monitor how busy that integration runtime is at any given moment? Is there a service tier where I don't share my "auto resolve" with other customers?

Accepted Answer

I still don't have a formal document that says so, but Microsoft engineers are standing by their undocumented "SLA" that says any interaction with a stored procedure in SQL may take up to four minutes. This includes an Azure SQL resource in the same azure region.

So, if a stored procedure should execute in 1 second, the ADF IR is reasonably permitted to spend an extra 4 minutes going to sleep, or doing whatever else it feels like doing. In practice I've noticed that the IR can often spend an additional 10 or 20 seconds doing nothing at all. Microsoft claims there are no configuration options (even in the self-hosted IR) to adjust that time that is being wasted.

If you see an IR going to sleep for longer than 4 minutes per activity, then that justifies opening a support ticket to troubleshoot performance.

Here is the complete response from the support engineer (Tyler, Big Data Support Engineer):

As stated from the Product Team, any stored procedure activity that performs within 4 minutes has met SLA within ADF--this SLA covers the overhead of ADF communicating with SQL server. With that said, your current performance within ADF is normal.

Answer

Hi @DavidBeavon-2754,

To answer your question, no, each Azure Data Factory has a dedicated Azure IR assigned.

You can monitor your Integration Runtime to get a bunch of properties as well as the status of your IR :

Monitoring Integration Runtime in ADF

Hope this helps.

Answer

I'm wrapping up a support case with Microsoft related to the performance of ADF. To be more specific , this is related to the performance of a sequence of interactions with Azure SQL. We find that ADF introduces a ton of overhead when performing a sequence of activities. For example, 100 individual activities could add ~10-15 mins of unexpected overhead. The additional delays are taking place within the internal workings of the ADF IR.

In my phone call with Microsoft, it was explained that the overhead in the ADF IR is typically attributed to "task pickup time". The architecture consists of an asynchronous queue that needs to be polled in order for the IR to pick up each individual task.

Unfortunately for a large pipeline, this "task pickup time" can be a substantial percentage of the overall execution. In the example given by @MartinJaffer-MSFT
above, the task pickup time added 10 mins on top of the "real" work - that was only supposed to take one second.

Ideally there would be a way to configure/tune the "task pickup time" - especially for an on-prem self-hosted IR. However Microsoft said there is no supported mechanism for configuring the IR (either on -prem or in azure).

Barring the ability to configure/tune the IR, it would be nice if ADF would at least give us some additional visibility or metrics. ADF should indicate how much of the overall time is spent on "task pickup". If a pipeline takes an extremely long time, customers should be able to determine whether the problem is in their own code, or if the problem is an unavoidable consequence of using ADF.

As it turns out, there is supposedly an "SLA" for ADF when it interacts with SQL. I still have to find a formal reference for this, but I'm told that ADF is supposed to be allowed to contribute an additional four minutes of its own overhead! Ie. according to the SLA, it’s acceptable for ADF to introduce up to four minutes of its own overhead, for a every interaction with SQL - even when a stored proc can be otherwise executed in one second. To me that seems like an extremely lenient SLA and, if it came to that, I'm sure most customers would be very unhappy with a four minute delay for each SQL activity. Even a 5 second delay for every nested pipeline can be a problem, when it accumulates within a sequence of other activities.

Answer

@ChiragMishraMSFT-1092

Thanks for your response. I had looked in the azure portal and I believe that I found visuals for everything that was mentioned in that powershell interface you pointed me to. (The CPU visual was a bit hard to find. I think it is what you see in the azure portal outside of the authoring interface).

I guess the follow-up question is whether the dedicated Azure IR is running on dedicated compute (perhaps a dedicated VM)? If it is not on dedicated compute then it may as well be a shared IR.

How do you recommend that I dig into the reason for a 20 second activity on an ADF IR? The azure SQL server xevents indicate that only 10 ms of that was spent interacting with SQL! Is there a way to get profiling information? When this runs on -prem (self-hosted) there are log files and application log messages in the event log. I'm not sure how to find the corresponding information for the "auto-resolve" version of the ADR IR.

Answer

@ChiragMishraMSFT-1092

Was wondering if you would provide any further clues regarding our performance issues in ADF. We are using ADF for lots of small activities that should execute with sub-second responses. But for some unexplained reason we are watching some of them sit in queue for very long periods of time. Sometimes a sub-second SQL activity will sit in the runtime for a much longer period of time than we care to wait (eg. ~20 seconds instead of ~1 second).

This is fine if a given pipeline was just doing one or two activities, but in many cases we are doing one or two dozen activities. It may be that 12 1 seconds is a reasonable amount of time to wait for a pipeline ... but when that unexpectedly grows to 12 20 seconds then it becomes quite a problem, and can negatively impact developer productivity.

In order to continue troubleshooting, I am happy to run a self-hosted runtime if that would help to isolate the issue. I can dig into the event logs, or run process monitor, or even attach a debugger. It is clear from the event log that, in many cases, there are long and unnecessary delays during the execution of a pipeline activity. The delays seem to be self-imposed by the runtime service itself, and must be due to the way that its workload is queued up and executed. If there were some config value to adjust a polling interval, or some other type of governor, then I'd really like to hear about it. I haven't been able to find anything like that yet on my own...

The problem is quite reproducible. You just need to create a sequence of activities that interact with SQL for a simple request (eg. a sproc that takes ~10 ms or less). If you stack up a lot of these ADF activities in sequence you will see that they execute in an unpredictable amount of time that ranges from one to twenty seconds apiece. This is reproducible both in a self-hosted runtime and in an azure-hosted runtimes.

Is it possible that there was a recent regression in the runtime that is causing it to perform worse than in the past? When we first start building these pipelines they seemed to run faster. Any help would be very much appreciated. It would enhance my productivity if I could repeatedly run these pipelines without spending so much time waiting for them to finish.

Share via

ADF performance troubleshooting

5 additional answers

Your answer