As part of the performance benchmarking to decide the right scaling configuration for our Function Apps on the Elastic Premium Plan, we carried out a test to process approximately 10,000 files. These files are placed in a specific container in an Azure Blob Storage account (General Purpose v2 with ZRS). An Azure Function with a blob trigger on that container processes those files and routes them accordingly.
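For context, the function is shaped roughly like the sketch below (a minimal illustration only; the names `ProcessIncomingFile`, `incoming-files`, `SourceStorage` and `IFileRouter` are placeholders, not our actual implementation):

```csharp
// Minimal sketch of the blob-triggered function under test (in-process model, .NET 6).
// All names here are hypothetical placeholders.
using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public class ProcessIncomingFile
{
    private readonly IFileRouter _router; // hypothetical routing service injected via DI

    public ProcessIncomingFile(IFileRouter router) => _router = router;

    [FunctionName("ProcessIncomingFile")]
    public async Task Run(
        [BlobTrigger("incoming-files/{name}", Connection = "SourceStorage")] Stream blob,
        string name,
        ILogger log)
    {
        log.LogInformation("Processing blob {Name} ({Length} bytes)", name, blob.Length);
        await _router.RouteAsync(name, blob); // route the file to its destination
    }
}

public interface IFileRouter
{
    Task RouteAsync(string name, Stream content);
}
```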
The scale-out configuration for that Function App is set to 3 always-ready instances, and the Windows Elastic Premium Plan (EP2) is configured with Availability Zone support and scaling between 3 and 5 nodes (screenshot below for reference):
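The same configuration can roughly be expressed with the Azure CLI (a sketch for reference only; the plan, app, and resource group names are placeholders, and we actually applied these settings through the portal):

```bash
# Hypothetical names; the actual configuration was done in the Azure portal.
az functionapp plan create \
  --name <plan-name> \
  --resource-group <resource-group> \
  --location <region> \
  --sku EP2 \
  --min-instances 3 \
  --max-burst 5 \
  --zone-redundant

# The always-ready instance count is a site-level setting on the Function App itself.
az resource update \
  --resource-type Microsoft.Web/sites \
  --resource-group <resource-group> \
  --name <function-app-name>/config/web \
  --set properties.minimumElasticInstanceCount=3
```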

After the file processing completed, and in order to confirm that the Function App scaled out and in as expected per the configuration, we executed the following query in Azure Application Insights:
```kusto
requests
| where timestamp > ago(1h) and operation_Name == "REDACTED"
| summarize count() by cloud_RoleInstance, bin(timestamp, 1m)
| render timechart
```
The results were a bit unusual: of the 3 always-ready instances, only two were processing requests, and almost all of the requests were handled by a single instance:

And here is a summary of how many files each cloud role instance processed in total:

Looking at the CPU utilization of the cloud role instances, we noticed that the instance that handled almost all of the load was running very high on CPU, touching 90%+ utilization, while the other cloud role instances were running at 10% or lower CPU utilization. See the screenshot below:
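For completeness, the per-instance CPU comparison can also be pulled from Application Insights with a query along these lines (a sketch only; the screenshot above was taken from the portal, and the exact counter name may differ depending on how the counters are emitted):

```kusto
performanceCounters
| where timestamp > ago(1h)
| where name in ("% Processor Time", "% Processor Time Normalized")
| summarize avg(value) by cloud_RoleInstance, bin(timestamp, 1m)
| render timechart
```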

Moreover, looking at the function execution performance, we can see that there was latency in executions of as much as 5.670 seconds:
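The latency figures above come from the portal's Performance view; an equivalent look at per-instance durations over the requests table would be roughly as follows (a sketch, reusing the same redacted operation name):

```kusto
requests
| where timestamp > ago(1h) and operation_Name == "REDACTED"
| summarize count(), avg(duration), percentiles(duration, 50, 95, 99) by cloud_RoleInstance
```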

Please note that these executions happened in the 11-minute window between 02:34 PM and 02:45 PM, and the Elastic Premium Plan did scale out and in as expected.

This raises a couple of questions for us:
- Why did one specific instance end up picking up almost all of the load, i.e., processing almost all of the files uploaded to the blob container, even while it was running significantly high on CPU, when the other cloud role instances were running fairly light and still not processing files from the blob trigger?
- Why were only 2 cloud role instances processing requests most of the time, when the scale-out configuration is set to 3 always-ready instances with no effective scale-out limit, i.e., it can go up to 5 (the limit set on the App Service Plan to avoid any bill shock during this initial benchmarking exercise)?
The functions are coded in C#, .NET 6.
Please note that this specific behaviour is seen with blob-triggered functions only. We have seen a fairly balanced distribution of load for Service Bus triggered and HTTP triggered functions on the same App Service Plan. We are not sure if some kind of zonal affinity is coming into play here (the storage account on which the blob trigger is set up is ZRS, and the App Service Plan is Windows EP2 with zone redundancy too). But what we cannot understand is why such an imbalanced load distribution affects only this blob-triggered function and not the Service Bus or HTTP triggered functions on the same App Service Plan.
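For reference, our understanding (which may be incomplete) is that the blob trigger pulls work off an internal control queue, so per-instance concurrency is influenced by the host.json queues settings, and the newer 5.x storage extension also exposes a blobs-specific setting. The values below are illustrative only, not our actual configuration:

```json
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 16,
      "newBatchThreshold": 8
    },
    "blobs": {
      "maxDegreeOfParallelism": 4
    }
  }
}
```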
Would appreciate any insights on this behaviour, so that we can adjust our configuration and implementation to make full use of all the available instances in our App Service Plan.
Appreciate the detailed response.
I didn't mention it in my answer above, but I share your concern and have some theories as to why it isn't balancing the load properly. Before making concrete recommendations (besides testing the event-based trigger) on how to change your implementation, I would need to dig into the source some.
Hopefully someone from the product group will jump in and provide a ready explanation and an "easy" fix to correct it.