Azure Durable Functions Fan-Out slow performance

Zava 25 Reputation points
2023-03-29T21:21:40.1266667+00:00

Hello!

I've developed an ETL service using Azure Durable Functions, which involves the following components:

  • A time-triggered function (Starter)
  • The main orchestrator
  • The ETL sub-orchestrator
  • Activity functions for reading, transforming, and writing data
  • A durable entity for storing additional data.
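
For context, the durable entity is conceptually something like the sketch below (a simplified, illustrative version; the entity name, operation names, and state shape are not my exact implementation):

  const df = require("durable-functions");

  // Illustrative durable entity that accumulates additional data during the run,
  // returns it on request, and can be reset when resources are released.
  module.exports = df.entity(function (context) {
      const state = context.df.getState(() => []);
      switch (context.df.operationName) {
          case "add":
              state.push(context.df.getInput());
              context.df.setState(state);
              break;
          case "get":
              context.df.return(state);
              break;
          case "reset":
              context.df.setState([]);
              break;
      }
  });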

Here's how the workflow works:

  • The time-triggered function starts the main orchestrator.
  • The main orchestrator calls an activity function that returns the number of items to be processed (70,000 JSON files).
  • The main orchestrator arranges the ETL-Orchestrators into groups, with each group containing 50 ETL-Orchestrators.
  • The main orchestrator awaits each group of ETL-Orchestrators sequentially (using Task.all); a fuller sketch follows after this list:
     while (index < groupedEtlOrchestrators.length) {
         yield context.df.Task.all(groupedEtlOrchestrators[index]);
         index += 1;
     }
  • The ETL-Orchestrator:
    • Starts the reader activity (reads 1,000 JSON files)
    • Starts the transformer activity (transforms 1,000 JSON files)
    • Starts the writer activity (sequentially inserts/updates 1,000 JSON files in Cosmos DB)
    • Starts 2 durable activities in parallel that deduplicate the additional data and store it in the durable entity state.
  • After all sub-orchestrators are done
    • The main orchestrator starts 2 durable activities that handle 10,000 JSON files (these activities take all of the data).
    • The main orchestrator starts 10 activity functions, with each function inserting/updating 1,000 JSON files.
    • The main orchestrator starts 1 activity function for changing Cosmos DB throughput.
    • The main orchestrator starts 2 durable entities and 1 activity function in parallel to release resources used by durable activities and close database connections.
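
To make the structure concrete, here is a simplified sketch of the two orchestrators (the function names such as GetItemCount, EtlOrchestrator, ReaderActivity, TransformerActivity, and WriterActivity, and the exact grouping code, are illustrative rather than copied from my real code):

   const df = require("durable-functions");

   // Main orchestrator (simplified): fan out ETL sub-orchestrators in groups of 50
   // and wait for each group with Task.all before starting the next one.
   module.exports = df.orchestrator(function* (context) {
       // Hypothetical activity returning how many JSON files must be processed.
       const totalFiles = yield context.df.callActivity("GetItemCount");
       const batchSize = 1000;   // files handled by one ETL sub-orchestrator
       const groupSize = 50;     // sub-orchestrators awaited together

       // Build the inputs for every sub-orchestrator and split them into groups.
       const inputs = [];
       for (let offset = 0; offset < totalFiles; offset += batchSize) {
           inputs.push({ offset, batchSize });
       }
       const groups = [];
       for (let i = 0; i < inputs.length; i += groupSize) {
           groups.push(inputs.slice(i, i + groupSize));
       }

       // Fan out/fan in one group at a time.
       for (const group of groups) {
           const tasks = group.map((input) =>
               context.df.callSubOrchestrator("EtlOrchestrator", input));
           yield context.df.Task.all(tasks);
       }
   });

   // ETL sub-orchestrator (simplified, defined in its own function folder):
   // read, transform, and write one batch of files. Each activity's output
   // flows back through the orchestrator before being passed to the next step.
   module.exports = df.orchestrator(function* (context) {
       const batch = context.df.getInput();
       const rawFiles = yield context.df.callActivity("ReaderActivity", batch);
       const transformed = yield context.df.callActivity("TransformerActivity", rawFiles);
       yield context.df.callActivity("WriterActivity", transformed);
   });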

Here's what happens when the function is executed:

  • The first group of ETL Orchestrators (10 sub-orchestrators) is executed quickly and completes in 2 minutes (including the reading/transforming/writing and temp-storage activities).
  • For the other groups, there are significant delays between executions, and each group takes longer than the previous one. It looks as if many activities are queued on an instance that is still busy, while the other instances are not being utilized.
  • While I have 10 instances, only 8 are used for the first group, and the number of used instances decreases significantly for the subsequent groups.
  • After all sub-orchestrators are done, it takes a long time to start the next activities from the main orchestrator. At this point, the number of instances is reduced to 1-2.

When I start the service with only 1 group of 10 ETL-Orchestrators, it completes in 2 minutes. However, when I have 7 groups, it takes much longer than 7 * 2 minutes to finish.

Configuration:

  • Plan: Premium - 3.5GB memory for each instance
  • Instances: 10
  • Host.json
    • Because one activity function in the ETL Orchestrator is CPU-intensive while processing its data, I believe it is better to limit the number of concurrent activity functions that can run on the same instance. Should I reduce maxConcurrentActivityFunctions to 2? (I need at least 2 because 2 durable entities run in parallel.) Should I also reduce maxConcurrentOrchestratorFunctions to 1? I don't think that would help, though, because the orchestrator is effectively "done" once it starts a new activity.
  "extensions": {
    "durableTask": {
      "tracing": {
        "traceInputsAndOutputs": false,
        "traceReplayEvents": false
      },
      "maxConcurrentActivityFunctions": 5,
      "maxConcurrentOrchestratorFunctions": 5
    }
  }
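
In other words, if reducing both values is the right call, the durableTask section would end up looking something like this (the values 2 and 1 are just the candidates I am considering, not settings I have verified):

  "extensions": {
    "durableTask": {
      "tracing": {
        "traceInputsAndOutputs": false,
        "traceReplayEvents": false
      },
      "maxConcurrentActivityFunctions": 2,
      "maxConcurrentOrchestratorFunctions": 1
    }
  }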

Although the expectation was to maintain similar levels of CPU usage, number of instances used, and number of requests throughout the execution, I observed the following:

  • The number of requests decreased.
  • The number of instances used decreased.
  • The CPU usage decreased.


I've checked the logs and I found these:

  • 3 x Singleton lock renewal failed for blob 'myfunc/host' with error code 409: LeaseIdMismatchWithLeaseOperation. The last successful renewal completed at 2023-03-29T19:33:29.954Z (11695 milliseconds ago) with a duration of 10884 milliseconds. The lease period was 15000 milliseconds. (I have only one Function App in the service plan.)
  • 18 x Possible thread pool starvation detected.
  • 125 x [HostMonitor] Host CPU threshold exceeded (91 >= 80)

This is a significant problem because I increase the Cosmos DB throughput for each execution. Due to the delays and the activities sitting idle in the queue, the total run time has increased from 30 minutes to 90 minutes. Consequently, Cosmos DB costs increase even though the provisioned Request Units (RU/s) are not being utilized.

Azure Functions

1 answer

  1. David Justo 5 Reputation points Microsoft Employee
    2023-04-06T20:59:22.3066667+00:00

    Hi @Zava,

    Thanks for the detailed description. In this case, I think there might be a few things at play here. My first instinct is to rule out whether your orchestrator is suffering from a History blow up problem: all inputs and outputs to Durable APIs are serialized into a History data structure, which is loaded on every orchestration replay. If you're dealing with thousands of JSON files, I wonder if your History is becoming rather large due to multiple large inputs and outputs to APIs, which would slow down orchestrator replay. This would also explain why you're seeing your orchestrator become progressively slower after each ETL group completes: each group grows the history size.

    We can do a simple, but perhaps extreme, test to validate this theory. Can you refactor your code so that the orchestrator never receives this large JSON payload? The trick is to load and store the data in the Activities themselves, and only communicate lightweight identifiers of that data to the orchestrator, which the next Activity can then use to fetch its payload from the DB. If this completely removes the problem, then we know we've identified the right problem. From there, we can try to determine a middle-of-the-road solution.
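
    As a concrete illustration, the refactored ETL sub-orchestrator could look roughly like the sketch below (the activity names and the idea of using blob names as the lightweight identifiers are placeholders for whatever storage and naming you choose; the point is only that the orchestrator sees small identifiers while the activities exchange the heavy payloads through external storage):

      const df = require("durable-functions");

      // ETL sub-orchestrator after the refactoring: only small identifiers
      // (e.g. blob names) ever pass through the orchestrator and its History.
      module.exports = df.orchestrator(function* (context) {
          const batch = context.df.getInput();   // e.g. { offset, batchSize }

          // ReaderActivity loads the 1,000 JSON files itself, writes the raw
          // payload to a staging store (blob/Cosmos), and returns only a key.
          const rawBlobName = yield context.df.callActivity("ReaderActivity", batch);

          // TransformerActivity fetches the raw payload by key, transforms it,
          // stores the result, and again returns only a key.
          const transformedBlobName = yield context.df.callActivity("TransformerActivity", rawBlobName);

          // WriterActivity fetches the transformed payload by key and writes it to Cosmos DB.
          yield context.df.callActivity("WriterActivity", transformedBlobName);
      });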

