
Orchestrations get stuck awaiting completions of activities that are launched in parallel

Paul D'hertoghe 20 Reputation points
2026-03-18T16:50:25.8966667+00:00

The bottom line is: I want to implement an orchestration semaphore using a Durable Entity that limits the number of concurrent calls to a certain resource provider (in a real-world function app this will limit the number of concurrent LLM calls).

In the demo app I attached, I added a REST function that can be called with a POST request (e.g. http://localhost:7238/api/start_orchestration/123456), which starts an orchestration of type MainOrchestrator. That orchestration starts a number of page-processing activities in parallel (batched to a certain limit); once they have all completed, two orchestrations of type SubOrchestrator are started. These in turn start a number of text-processing activities in parallel (again batched to a certain limit).
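The batched fan-out looks roughly like this (a minimal sketch, not the exact demo code; the batch size, the PageResult type, and the ProcessPage activity name are placeholders):

    // Batched fan-out: run at most one batch of parallel activities at a time.
    // PageResult, "ProcessPage", and the batch size of 10 are placeholders.
    var results = new List<PageResult>();
    foreach (var batch in pages.Chunk(10))
    {
        var tasks = batch.Select(p => context.CallActivityAsync<PageResult>("ProcessPage", p));
        results.AddRange(await Task.WhenAll(tasks));
    }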
Each activity, i.e. the page-processing and the text-processing activities, is guarded by the orchestration semaphore (cf. the helper class GlobalLlmLimiterSemaphore, which limits the number of concurrent activities to 100).
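The guard follows roughly this shape (a simplified sketch; exact operation and event names differ from the demo): the orchestration signals the entity and then waits for an external event, which the entity raises back through an injected DurableTaskClient.

    // Simplified sketch of the guard pattern (names are placeholders).
    var entityId = new EntityInstanceId(nameof(OrchestratorSemaphore), "global-llm-limiter");

    // One-way signal: ask the entity for a slot and tell it which instance to notify.
    await context.Entities.SignalEntityAsync(entityId, "Acquire", context.InstanceId);

    // Block until the entity raises the event via DurableTaskClient.RaiseEventAsync.
    await context.WaitForExternalEvent<object>("SlotAcquired");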
The problem is this: I often notice, in local dev and on Azure, that the orchestrations get stuck awaiting the completions of activities, while all activities DO complete and there are no exceptions of any kind.
It seems that, at some point, the completion of an activity no longer triggers the orchestrator function to replay.
Can someone explain what I am doing wrong, or is this a bug?

The demo solution can be downloaded from https://www.dropbox.com/scl/fi/0l9abvmnec4r5i4bc3e3s/OrchestrationSemaphore.zip?rlkey=hsi4xq9wu0z3jr8aadyf47328&dl=0
I have also attached a screenshot from the Visual Studio Code Durable Functions extension that shows one SubOrchestrator completing while another did not; since that SubOrchestrator never completes, the MainOrchestrator never completes either.

Azure Functions

3 answers

  1. Paul D'hertoghe 20 Reputation points
    2026-03-25T15:55:09.3533333+00:00

    Root Cause

    In your OrchestratorSemaphore entity, you inject DurableTaskClient via DI and call client.RaiseEventAsync(instanceId, eventName) from within the entity to send events back to orchestrations. This is not a supported communication pattern.

    TaskEntityContext intentionally exposes only two outbound operations:

    • SignalEntity — send one-way messages to other entities
    • ScheduleNewOrchestration — start new orchestrations

There is no RaiseEvent capability on the entity context by design. Calling DurableTaskClient.RaiseEventAsync from inside an entity bypasses the durable entities framework — it's an external gRPC call that goes outside the entity's transactional execution model. This means:

    1. The call is not transactional — if the entity operation fails and its state rolls back, the RaiseEventAsync has already been sent and cannot be reverted.
    2. The NotFound error you observed (No instance with ID '...' was found) comes from the backend when trying to deliver the event through the external API. While the orchestration does exist, the external gRPC RaiseEvent API path may have different visibility constraints than internal framework messaging, particularly for sub-orchestrations.

    Replace the SignalEntity + WaitForExternalEvent + RaiseEventAsync pattern with CallEntityAsync<T>, which is the built-in two-way (request-response) communication from orchestrations to entities:

    // In the orchestration — replaces SignalEntity + WaitForExternalEvent
    var entityId = new EntityInstanceId("LlmConcurrencyLimiter", "global-llm-limiter");

    // Poll until a slot is available
    while (true)
    {
        bool acquired = await context.Entities.CallEntityAsync<bool>(entityId, "TryAcquire");
        if (acquired) break;
        await context.CreateTimer(context.CurrentUtcDateTime.AddSeconds(2), CancellationToken.None);
    }

    try
    {
        return await context.CallActivityAsync<T>(activityName, input, options);
    }
    finally
    {
        await context.Entities.SignalEntityAsync(entityId, "Release");
    }

    // In the entity — much simpler, no DurableTaskClient needed
    public class LlmConcurrencyLimiter : TaskEntity<SemaphoreState>
    {
        public bool TryAcquire()
        {
            if (this.State.Current < this.State.MaxConcurrent)
            {
                this.State.Current++;
                return true; // returned directly to the calling orchestration
            }
            return false;
        }

        public void Release()
        {
            if (this.State.Current > 0)
                this.State.Current--;
        }
    }

    With this approach:

    • The return value from CallEntityAsync<bool> is the communication back to the orchestration — no external events needed
    • Entity operations run serially (single-threaded by design), so no race conditions on the counter
    • No need for PendingAcks, ResendIfNoAck, LeaseVersions, or any of the other compensating mechanisms
    • No DurableTaskClient injection in the entity

    Key quote from the docs:

    Only orchestrations can call entities and get a response, which can be a return value or an exception. Client functions that use the client binding can only signal entities.



  2. Pravallika KV 13,135 Reputation points Microsoft External Staff Moderator
    2026-03-18T18:21:54.2933333+00:00

    Hi @Paul D'hertoghe ,

    Thanks for reaching out to Microsoft Q&A.

    In most cases this behavior points to one of a few common culprits in Durable Functions:

    • Control-queue backpressure or poison messages
    • A transient bug or deadlock in the Functions host or extension
    • Mismatches or mis-registrations in the Task Hub/worker registration
    • Eventual-consistency delays in Azure Storage message dequeueing

    Here’s a checklist you can work through:

    1. Restart your Function App
      • Sometimes a deadlock in the Durable extension or the Functions host can block progress. A quick restart often “unsticks” orchestrations.
    2. Inspect your Azure Storage control queues
      • Use the Storage Explorer or the KQL query in the Durable Functions troubleshooting guide to check for growing queues or poison messages (a minimal sketch for checking queue backlog programmatically follows after this checklist).
      • If you find poison messages in a control-queue partition, purge them so the orchestration can continue.
    3. Review Application Insights / Trace logs
      • Look for “Dequeuing” or “Checkpoint” warnings/errors for your orchestration instance ID.
      • Filter on the orchestration’s control-queue partition ID to spot failures in dequeuing messages.
    4. Confirm your Durable Task SDK registrations
      • Make sure every orchestrator and activity function in your demo app is actually registered with the worker. A missing registration can block replay.
    5. Upgrade or pin your Durable Functions extension version
      • Some versions have known bugs around queue handling or replay. Check the version you’re running and consider moving to the latest stable release.
    6. Switch your App Service to 64-bit
      • If you hit memory limits, orchestrations can stall in “Pending” or “Running”. Switching to a 64-bit process gives the host more headroom.
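
    As referenced in step 2, here is a minimal sketch for checking control-queue backlog with the Azure.Storage.Queues SDK. It assumes the Azure Storage backend, the default four control-queue partitions, and a task hub named TestHubName (control-queue names use the lowercased task hub name); adjust these to your app:

    using Azure.Storage.Queues;

    // Peek the approximate depth of each Durable Task control queue.
    // The task hub name and partition count (4 is the default) are assumptions.
    var connectionString = Environment.GetEnvironmentVariable("AzureWebJobsStorage");
    for (int i = 0; i < 4; i++)
    {
        var queue = new QueueClient(connectionString, $"testhubname-control-0{i}");
        if (!await queue.ExistsAsync()) continue;

        var props = await queue.GetPropertiesAsync();
        Console.WriteLine($"{queue.Name}: ~{props.Value.ApproximateMessagesCount} pending messages");
    }

    A steadily growing count on one partition while the others drain is the signature of an unhealthy control-queue partition.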

    References:

    1. Durable Functions troubleshooting guide – Orchestration doesn’t complete or is stuck in the Running state
    2. Troubleshoot Durable Task SDKs – Orchestration issues

    Hope this helps!


    If the resolution was helpful, kindly take a moment to click Yes on “Was this answer helpful?”. And if you have any further query, do let us know.


  3. Q&A Assist
    2026-03-18T16:50:47.7+00:00
    AI-generated content. This answer was created with AI from Q&A Assist.

    In Durable Functions, an orchestration that “gets stuck” even though its activities have completed is almost always a symptom of the instance remaining in the Running state because its completion messages are not being processed.

    From the Durable Functions troubleshooting guidance, an orchestration that remains in Running typically means it is still waiting on a task to complete (activity, durable timer, or external event). If logs or the Function App UI show that all activities finished successfully but the orchestration does not advance, the orchestration is considered a “stuck orchestration.”

    Recommended checks and mitigations:

    1. Restart the Function App
      • Restarting the Function App can clear transient bugs or deadlocks in the app or Durable Functions extension that prevent the orchestrator from picking up completion messages and replaying.
    2. Inspect Azure Storage control queues
      • Durable Functions (Azure Storage provider) uses internal control queues to drive orchestration progress.
      • Check whether any control queue is growing continuously or has messages that are not being dequeued.
      • A single unhealthy control queue partition can cause some orchestrations (for example, one of the SubOrchestrator instances) to stop making progress while others complete.
      • If the issue appears isolated to a specific control queue/VM instance, scaling up or down the Function App to move off the unhealthy instance can help.
    3. Use Application Insights queries for the affected control queue
      • Use the Application Insights KQL query from the Azure Storage messaging troubleshooting section to filter by the control queue partition ID and look for errors or warnings related to that partition.
      • This can reveal issues where completion messages are not being processed even though activities finished.
    4. Check for versioning / breaking changes
      • If the orchestration or entity code (for the semaphore) has changed while instances are in-flight, breaking changes can cause stuck orchestrations.
      • Review the Durable Functions versioning guidance to ensure no incompatible changes were deployed while orchestrations were running.
    5. Consider orchestration design constraints
      • Orchestrator functions must be deterministic and should not perform non-deterministic or blocking work directly (see the sketch below for the replay-safe equivalents).
      • All resource-limiting logic (like a semaphore) that depends on external state must be implemented using Durable Entities or activities in a way that preserves determinism in the orchestrator. Any violation can manifest as replay issues or stuck instances.
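
    A minimal sketch of the replay-safe equivalents mentioned in point 5 (the orchestrator name comes from the question; the pattern itself is standard Durable Functions guidance):

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Azure.Functions.Worker;
    using Microsoft.DurableTask;

    [Function(nameof(MainOrchestrator))]
    public static async Task RunAsync([OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // Wrong: these change on every replay and break determinism.
        // var now = DateTime.UtcNow;
        // var id  = Guid.NewGuid();
        // await Task.Delay(TimeSpan.FromSeconds(2));

        // Right: replay-safe equivalents provided by the context.
        var now = context.CurrentUtcDateTime;
        var id = context.NewGuid();
        await context.CreateTimer(context.CurrentUtcDateTime.AddSeconds(2), CancellationToken.None);
    }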

    If, after these checks, orchestrations still intermittently get stuck while activities complete, treat it as a stuck-orchestration issue per the above guidance: restart the app, verify control queues and storage health, and validate that no in-flight instances were broken by code changes.


