Why is almost all the load for a blob-triggered Azure Function handled by one of the three always ready instances in a Windows Elastic Premium Plan?

ZQadir 190 Reputation points
2023-03-01T18:29:48.6433333+00:00

As part of the performance benchmarking to decide the right scaling configuration for our Function Apps on the Elastic Premium Plan, we ran a test that processed approximately 10,000 files. The files were placed in a specific container in an Azure Blob Storage account (General Purpose v2 with ZRS). An Azure Function with a blob trigger on that container processes the files and routes them accordingly.

The scale-out configuration for that Function App is set to 3 always ready instances, and the Windows Elastic Premium Plan (EP2) is configured with Availability Zone support and scaling between 3 and 5 nodes (screenshot below for reference):

Scale Out Configuration

After the file processing completed, and to confirm that the function app was scaling out and in as configured, we ran the following query in Azure Application Insights:

requests
| where timestamp > ago(1h) and operation_Name == "REDACTED"
| summarize count() by cloud_RoleInstance, bin(timestamp, 1m)
| render timechart

The results were unusual: of the 3 always ready instances, only two were processing requests, and almost all of the requests were handled by a single instance:

Distribution of Load Across Cloud Role Instances

And here is the summary of which cloud role instance processed how many files in total:

Distribution of Load Across Cloud Role Instances - Summary

Looking at the CPU utilization of the cloud role instances, we noticed that the instance that handled almost all the load was running very high on CPU, touching 90%+ utilization, while the other instances were running at 10% or lower. See the screenshot below:

CPU Utilization

Moreover, looking at the function execution performance, we can see that there was latency in executions of as much as 5.670 seconds:

Function Execution Performance

Please note that these executions happened in the 11-minute window between 02:34 PM and 02:45 PM, and the Elastic Premium Plan did scale out and scale in as expected.

Elastic Premium Plan Scale Out

This raises a couple of questions for us:

  1. Why did one specific instance end up picking up almost all the load, i.e., processing almost all the files uploaded to the blob container, even while it was running significantly high on CPU, when the other cloud role instances were running fairly light and still not processing files from the blob trigger?
  2. Why were only 2 cloud role instances processing requests most of the time, when the scale-out configuration is set to 3 always ready instances with no scale-out limit, i.e., it can go up to 5 (the limit set on the App Service Plan to avoid any bill shock during this initial benchmarking exercise)?

The functions are coded in C#, .NET 6.

Please note that this behaviour is seen in blob-triggered functions only. We have seen a fairly balanced distribution of load for Service Bus-triggered and HTTP-triggered functions on the same App Service Plan. We are not sure whether some kind of zonal affinity is coming into play here (the storage account on which the blob trigger is set up is ZRS, and the App Service Plan is Windows EP2 with zone redundancy too). What we cannot explain is why such an imbalanced load distribution affects only this blob-triggered function and not the Service Bus or HTTP-triggered functions on the same App Service Plan.

Would appreciate any insights on this behaviour, so that we can adjust our configuration and implementation to make full use of all the available instances in our App Service Plan.

Azure Functions

Accepted answer
  1. KristinTodorovTekExperts-2118 75 Reputation points
    2023-05-15T11:32:58.5966667+00:00

    Hello Zeeshan,

    The blob trigger behavior you witnessed is expected. A single function app instance (the lease owner) regularly reads the storage logs to see if there are new blobs to trigger on; when it finds blobs, it inserts items into a storage queue. All of the function app instances then process messages from that queue. There is, however, backoff logic to prevent excessive polling of the storage queue when it's empty, which reduces the cost to the customer of frequent storage I/O. This backoff caps at 60 seconds. What's happening is that all of the function app instances have backed off to polling the storage queue every 60s. When there are messages in the storage queue (i.e. blobs to process), the first instance to finish its 60s wait starts processing messages from the storage queue and processes all of the blobs. The single instance is handling all of the 2500 blobs. Each time one of the other instances polls the queue (every 60s), it finds nothing in the queue because the one instance has drained it, so it continues to back off for 60s. If a customer has a steady stream of blobs to process, they will see all of the instances start picking up messages from the storage queue and processing blobs.
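The backoff effect described in this answer can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the Functions host's actual algorithm: the 60-second cap comes from the answer, while the batch size, per-batch processing time, minimum backoff, wake-up offsets, and message counts are hypothetical values chosen for illustration.

```python
from collections import Counter

TICKS_PER_SECOND = 10                 # one tick = 0.1 s of simulated time
MAX_BACKOFF = 60 * TICKS_PER_SECOND   # 60 s cap, as stated in the answer
MIN_BACKOFF = 1                       # hypothetical minimum polling interval
BATCH_SIZE = 16                       # hypothetical messages taken per poll
BATCH_TICKS = 1                       # hypothetical time to process one batch


def simulate(arrivals, first_polls, horizon):
    """Count messages handled per instance.

    arrivals:    dict mapping tick -> number of messages enqueued at that tick
    first_polls: tick of each instance's next poll (all start fully backed off)
    horizon:     number of ticks to simulate
    """
    queue = 0
    next_poll = list(first_polls)
    backoff = [MAX_BACKOFF] * len(next_poll)
    handled = Counter()
    for tick in range(horizon):
        queue += arrivals.get(tick, 0)
        for i, due in enumerate(next_poll):
            if due != tick:
                continue
            taken = min(BATCH_SIZE, queue)
            queue -= taken
            if taken:
                # Found work: reset the backoff and poll again after the batch.
                handled[i] += taken
                backoff[i] = MIN_BACKOFF
                next_poll[i] = tick + BATCH_TICKS
            else:
                # Empty queue: double the wait, up to the 60 s cap.
                backoff[i] = min(backoff[i] * 2, MAX_BACKOFF)
                next_poll[i] = tick + backoff[i]
    return handled


# Three idle instances whose 60 s waits end 5 s, 25 s and 45 s after the upload.
wake_ups = [50, 250, 450]

# Burst: 500 messages arrive at once, then nothing.
burst = simulate({0: 500}, wake_ups, horizon=1000)

# Steady stream: 40 messages arrive every tick for two minutes.
steady = simulate({t: 40 for t in range(1200)}, wake_ups, horizon=1200)

print("burst :", dict(burst))   # burst : {0: 500}
print("steady:", dict(steady))
```

In the burst case the first instance to wake drains the queue before the others poll, so they find it empty and back off again. In the steady case each instance finds messages when it wakes, resets its backoff, and keeps taking a share of the work.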


1 additional answer

  1. TP 78,976 Reputation points
    2023-03-02T18:43:34.87+00:00

    Hi,

    Is your blob trigger event-based, polling (default), Event Grid, manual queue, or something else? Did you have all 10,000 blobs sitting in the storage account before starting the function, did you dump them in all at once with the function already up and running, did you gradually add xx blobs per second, or something else?

    From your test description, it seems your usage may be (or will in the future be) considered "high scale", so you should not be using polling. With that in mind, it is possible that if you switched to event-based, for example, the issue you are seeing would go away. If you are using polling, I would suggest switching and repeating the test.

    Please see article below if you haven't already:

    Azure Blob storage trigger for Azure Functions

    https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob-trigger?tabs=in-process&pivots=programming-language-csharp

    Thanks.

    -TP