Azure Durable Functions

Vempadapu Akhil Kumar 5

Azure Durable Function orchestration occasionally remains in Running state even though the activity function completes successfully and output artifacts are generated and stored. No exceptions are logged. What Durable diagnostics, traces, or storage queues should we investigate to determine where the orchestration is getting stuck?

Answer 1

Rakesh Mishra 9,680 Microsoft External Staff Moderator

Hello @Vempadapu Akhil Kumar ,

Thank you for reaching out on the Microsoft Q&A forum!

When a Durable Functions orchestrator remains in the Running state even after the underlying Activity function has successfully completed (and generated its artifacts), the issue almost always lies within the orchestrator's state machine execution, replay behavior, or the underlying storage provider queues.

Here are the most common reasons and steps to resolve this:

1. Orchestrator Code Constraints (Non-Deterministic Behavior): The most common reason an orchestrator gets stuck after an activity completes is a violation of orchestrator code constraints. Because orchestrators replay their execution from the beginning to rebuild state, they must be strictly deterministic.

Check for Blocking Calls: If you have Thread.Sleep(), synchronous I/O, or blocking network calls after the activity completes, the orchestrator thread can hang or crash silently.
Quote from Official Documentation: "Orchestrator code must never block. For example, it must not use Thread.Sleep or equivalent APIs. To delay execution, use the CreateTimer method of the orchestration trigger binding." (Source: Durable Functions Code Constraints)
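To make the replay constraint concrete, here is a minimal Python sketch. StubContext and first_action are hypothetical stand-ins written for this illustration, not part of the Durable Functions SDK; only current_utc_datetime and create_timer mirror names that the real orchestration context exposes. The sketch compares a deadline computed from the wall clock with one computed from the context's recorded time.

```python
from datetime import datetime, timedelta, timezone


class StubContext:
    """Hypothetical stand-in for the orchestration context (not the SDK class)."""

    def __init__(self, recorded_start):
        # The real context returns the same recorded value on every replay.
        self.current_utc_datetime = recorded_start

    def create_timer(self, fire_at):
        # The real call schedules a durable timer; here we only record the request.
        return ("timer", fire_at)


def non_deterministic_orchestrator(context):
    # Wall-clock time differs on each replay, so the scheduled action differs too.
    deadline = datetime.now(timezone.utc) + timedelta(minutes=15)
    yield context.create_timer(deadline)


def deterministic_orchestrator(context):
    # The context clock is replay-safe, so every replay schedules the same timer.
    deadline = context.current_utc_datetime + timedelta(minutes=15)
    yield context.create_timer(deadline)


def first_action(orchestrator, context):
    """Run the orchestrator up to its first yield, as one replay would."""
    return next(orchestrator(context))
```

Calling first_action twice with the same StubContext plays the role of two replays: the deterministic version requests an identical timer each time, while the wall-clock version requests a different one.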

2. Storage Provider Queue Issues (Invisible/Stuck Messages): Durable Functions uses storage queues (by default, Azure Storage) to drive execution. When the activity finishes, it places a message back into the orchestrator's control queue.

Check your Azure Storage Account (the one configured in AzureWebJobsStorage). Look at the queues named [taskhubname]-control-xx.
If you see messages piling up or moving to a poison queue, the orchestrator is failing to process the activity completion event. This is often due to an unhandled exception thrown in the orchestrator immediately after the await call.
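As a small aid when browsing the storage account, the queue names for a task hub can be listed with a helper like the one below. This is a sketch under stated assumptions: the function name is hypothetical, the partition count defaults to 4 (the documented default for the Azure Storage provider, adjustable in host.json), and the task hub name is lowercased because Azure Storage queue names are lowercase.

```python
def task_hub_queue_names(task_hub: str, partition_count: int = 4) -> list[str]:
    """Return the control and work-item queue names expected for a task hub.

    Assumes the default Azure Storage provider naming:
    <taskhub>-control-NN for orchestrator messages and
    <taskhub>-workitems for activity messages.
    """
    hub = task_hub.lower()
    control_queues = [f"{hub}-control-{index:02d}" for index in range(partition_count)]
    return control_queues + [f"{hub}-workitems"]
```

Activity completion messages return through one of the control queues, so those are the ones to watch when an orchestration does not advance after its activity finishes.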

3. Application Insights / Kusto Queries: To pinpoint exactly where the orchestrator stops, check your Application Insights logs. Run this Kusto query to trace the exact lifecycle of your specific instance:

traces
| where customDimensions.prop__instanceId == "<Your-Instance-ID>"
| order by timestamp asc

Look for an ActivityCompleted trace. If the trace immediately following it is an error, or if there are no subsequent traces, your orchestrator code is failing to progress past the await statement.

Next Steps for You:

Review the code immediately following the await call for your Activity function. Look for any DateTime.Now, Guid.NewGuid(), or synchronous/blocking code.
Wrap the orchestrator code in a try-catch block and log the exception with a replay-safe logger (in C#, context.CreateReplaySafeLogger(log) wraps the ILogger passed into the function) to see if an error is occurring during the replay.

Please let me know in comments if reviewing the code constraints or the Application Insights logs reveals the culprit.

Note: This response is drafted with the help of AI systems.

Vempadapu Akhil Kumar 5 Reputation points

2026-06-15T15:28:29.46+00:00
Thanks for the response.

Our Durable Function flow is very small and simple:

HTTP starter receives the request.

Orchestrator starts.

Orchestrator calls one activity function.

Activity function extracts text, runs extraction, uploads output artifacts to Blob Storage, and returns a small JSON response.

Orchestrator only waits for the activity result and returns it.

The orchestrator output is also very small. We are not returning large files from the orchestrator. The actual artifacts are uploaded to Blob Storage, and the response contains only small metadata/output.

Our main requirement is this:

If the orchestration remains in Running state for more than 15 minutes, we want it to automatically timeout and the Durable runtime status should become either Failed or Terminated.

We have already added functionTimeout in host.json as 15 minutes, but the orchestration can still remain in Running state instead of moving to Failed or Terminated.

Could you please confirm whether functionTimeout applies to the Durable orchestration lifecycle itself, or only to the individual function execution/activity execution?

Also, what is the recommended way to make a Durable orchestration automatically fail or terminate if it exceeds 15 minutes?

Do we need to implement this timeout manually using Durable timers inside the orchestrator, or is there any host/runtime setting that can enforce this at the orchestration instance level?

Note: This response is drafted with the help of AI systems.
Rakesh Mishra 9,680 Reputation points Microsoft External Staff Moderator

2026-06-15T16:15:57.4733333+00:00
Hello @Vempadapu Akhil Kumar, thanks for providing the additional context. Your flow is indeed straightforward, and understanding how timeouts work in Durable Functions is crucial here.

To answer your questions directly:

1. Does functionTimeout apply to the Durable orchestration lifecycle itself? No, it does not. The functionTimeout setting in your host.json applies only to a single, continuous execution block of a function.

For an Activity function, it means that specific execution cannot exceed 15 minutes.

For an Orchestrator function, it means a single "replay" (from waking up to hitting the next await) cannot exceed 15 minutes. However, it does not apply to the overall end-to-end wall-clock time of the orchestration instance. Because Durable Functions are designed for long-running workflows, the orchestration can remain in the Running state for days or months as it waits for activities to finish, regardless of the functionTimeout setting.

2. Do we need to implement this timeout manually, or is there a host/runtime setting?

You must implement this manually inside the orchestrator using Durable Timers. There is no built-in host.json or runtime setting that will automatically terminate the entire orchestration instance based on end-to-end wall-clock time.

The recommended approach to achieve this is called the Timeout Pattern. You create a Durable Timer for 15 minutes and run it concurrently with your Activity function. You then race them against each other using Task.WhenAny. Whichever completes first determines the outcome.

Reference: Use durable timers for timeouts

Below are the steps and the code required to enforce a strict 15-minute timeout on your orchestration, ensuring it transitions to a Failed state if the activity takes too long or hangs.

Step 1: Update the Orchestrator Code

You will use a CancellationTokenSource linked to a Durable Timer. This allows you to cancel the timer if the activity finishes early (saving you from ghost timers remaining in your storage queues).

Here is the C# implementation:

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

[FunctionName("Orchestrator_WithTimeout")]
public static async Task<string> RunOrchestrator(
    [OrchestrationTrigger] IDurableOrchestrationContext context,
    ILogger log)
{
    // 1. Define the maximum allowed duration for the orchestration (15 minutes)
    DateTime deadline = context.CurrentUtcDateTime.AddMinutes(15);

    // 2. Create a CancellationTokenSource to cancel the timer if the activity succeeds early
    using (var cts = new CancellationTokenSource())
    {
        // 3. Start the Durable Timer task
        Task timeoutTask = context.CreateTimer(deadline, cts.Token);

        // 4. Start your Activity task (do NOT await it directly yet)
        Task<string> activityTask = context.CallActivityAsync<string>("ExtractText_Activity", null);

        // 5. Race the two tasks against each other
        Task winner = await Task.WhenAny(activityTask, timeoutTask);

        if (winner == activityTask)
        {
            // SUCCESS: The activity finished before the 15 minutes were up!
            // Cancel the timer so it doesn't fire later.
            cts.Cancel();

            // Await the activity task to get the result
            string result = await activityTask;
            return result;
        }
        else
        {
            // TIMEOUT: The 15-minute timer fired before the activity finished.
            // Throwing an exception here moves the orchestration to the "Failed" state.
            throw new TimeoutException("The orchestration exceeded the 15-minute time limit.");

            // Note: If you prefer the status to be "Completed" but return a timeout message,
            // you can just return a custom JSON object instead of throwing an exception.
        }
    }
}
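For apps written in Python, the same race can be sketched roughly as follows. This is a minimal sketch, not a drop-in implementation: the function name and the TIMEOUT constant are placeholders, the activity name is carried over from the C# sample, and registration (for example with @app.orchestration_trigger(context_name="context") in the decorator-based model) is left out so the body depends only on the context object passed in.

```python
from datetime import timedelta

# Assumed limit for this sketch; adjust to the required end-to-end timeout.
TIMEOUT = timedelta(minutes=15)


def orchestrator_with_timeout(context):
    """Race one activity against a durable timer and fail on timeout."""
    # Use the context clock, not datetime.now(), so replays compute the same deadline.
    deadline = context.current_utc_datetime + TIMEOUT

    # Schedule both tasks without yielding on either one individually.
    activity_task = context.call_activity("ExtractText_Activity", context.get_input())
    timeout_task = context.create_timer(deadline)

    # Resume as soon as either the activity or the timer completes.
    winner = yield context.task_any([activity_task, timeout_task])

    if winner == activity_task:
        # Cancel the pending timer so it does not keep the instance waiting.
        timeout_task.cancel()
        return activity_task.result

    # The timer won: raising here is intended to end the instance as Failed.
    raise TimeoutError("The orchestration exceeded the 15-minute time limit.")
```

Note that the timer only lets the orchestrator stop waiting. It does not stop an activity that is already executing, so the activity may still run to completion and upload its artifacts after the orchestration has failed.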

Step 2: Test the Happy Path (Success)

Trigger the Orchestrator via your HTTP starter.

If your ExtractText_Activity finishes in, for example, 2 minutes, Task.WhenAny will see that activityTask won the race.

The cancellation token is triggered, clearing the 15-minute timer from the backend queue.

The orchestrator returns your small JSON payload, and the instance status becomes Completed.

Step 3: Test the Timeout Path (Failure)

Trigger the Orchestrator again.

Artificially delay your ExtractText_Activity (e.g., using Thread.Sleep inside the activity, or simulate a hang).

The orchestrator will safely sleep. After roughly 15 minutes, the Durable Task runtime wakes the orchestrator up.

Task.WhenAny sees that timeoutTask won the race.

The TimeoutException is thrown, instantly marking the orchestration instance status as Failed in your Durable Task hub.

By implementing this pattern, you take full control of the orchestration lifecycle and guarantee it will not hang indefinitely in the Running state.

Please try the above and let me know whether it works or if you have any further questions.
Vempadapu Akhil Kumar 5 Reputation points

2026-06-15T17:45:52.5933333+00:00
Thanks for the explanation.

We deployed the Durable Timer timeout pattern in our Python Durable Function app.

After deployment, we can see the timeout exception in Log Stream, so the timeout logic is getting triggered. However, when we check the orchestration status using the statusQueryGetUri, the runtimeStatus still remains Running.

So the current behavior is:

Orchestration starts successfully.

Activity execution either takes longer than the timeout or orchestration reaches the timeout path.

Timeout exception is visible in Function App Log Stream.

But the Durable orchestration instance status does not transition to Failed.

statusQueryGetUri continues to show runtimeStatus: Running.

This is the part that is not working as expected.

Our expectation was that once the orchestrator raises an exception after the Durable Timer wins, the Durable runtime should mark the orchestration instance as Failed.

Could you please confirm why the exception is visible in logs but the Durable instance status still remains Running?

Is there anything specific in Python Durable Functions where exceptions inside the orchestrator may be logged but not persisted to the Durable history/status table?

Also, are there any known issues or configuration settings related to the Python worker, Durable extension, task hub storage queues, or history table that could prevent the runtime status from being updated from Running to Failed?

Below is the code for your reference. We have implemented this timeout pattern in Python Durable Functions.

We have also configured functionTimeout as 20 minutes in host.json. We understand from your previous response that this does not enforce an end-to-end orchestration timeout, but we are sharing it for completeness. Even with this setting and the Durable Timer pattern, the timeout exception is visible in Log Stream, but statusQueryGetUri still shows runtimeStatus: Running.

After deploying this code, the timeout exception is visible in Function App Log Stream. However, the orchestration status returned by statusQueryGetUri still remains Running instead of changing to Failed.

Could you please confirm why the exception is logged but the Durable instance status is not being updated?

{
  "version": "2.0",
  "functionTimeout": "00:20:00",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[4.*, 5.0.0)"
  }
}

from datetime import timedelta

@app.orchestration_trigger(context_name="context")
def MyOrchestrator(context):
    try:
        input_data = context.get_input()
        if isinstance(input_data, dict) and "items" in input_data:
            items = input_data["items"]
        elif isinstance(input_data, dict):
            items = [input_data]
        else:
            raise ValueError("Unexpected input format received.")

        activity_tasks = [
            context.call_activity("MyActivity", item)
            for item in items
        ]
        all_activities_task = context.task_all(activity_tasks)

        timeout_limit_minutes = 15
        expiration_time = (
            context.current_utc_datetime
            + timedelta(minutes=timeout_limit_minutes)
        )
        timeout_task = context.create_timer(expiration_time)

        winner = yield context.task_any([
            all_activities_task,
            timeout_task
        ])

        if winner == timeout_task:
            raise Exception("TimeoutExceeded: Activity took too long to complete.")

        timeout_task.cancel()
        return all_activities_task.result
    except Exception as e:
        raise RuntimeError(f"Orchestrator failed: {str(e)}")
Rakesh Mishra 9,680 Reputation points Microsoft External Staff Moderator

2026-06-15T18:32:02.58+00:00
Hello @Vempadapu Akhil Kumar ,

Thank you for sharing the code snippet. The behavior you are observing, where the exception is visible in the logs but the orchestration remains stuck in the Running state, is a known architectural quirk related to how the Azure Functions Python worker handles exceptions and orchestrator state updates.

Why the Instance Remains "Running"

In Durable Functions, there is a strict distinction between an Orchestration Failure (a logical failure in your workflow) and a Function Worker Crash (an infrastructure-level exception).

Worker Crash vs. Graceful Failure: When your code catches the timeout exception and re-raises it as a top-level RuntimeError, the Python language worker may fail to gracefully serialize this exception state back to the Durable Task Framework.

Message Abandonment: Because the host receives an un-serializable crash from the Python worker (rather than a clean orchestration failure payload), it assumes this was a transient infrastructure issue.

The Replay Loop: The host abandons the orchestrator control message and puts it back on the queue to retry. This cycle repeats until the message hits the poison queue. During this entire cycle, the instance status remains Running because a terminal state (Failed) was never successfully committed to the underlying Azure Storage History table.

Possible Fix

To ensure the orchestration status correctly transitions to Failed, you must let the Durable Functions SDK properly package the exception.

Avoid wrapping your orchestrator logic in a blanket try-except that raises a RuntimeError. When you simply raise a standard Exception without re-wrapping it, the Durable Python SDK can successfully intercept it, serialize the error, and communicate to the host that the workflow should transition to the Failed state. Please try and let me know if it works.
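If the handler exists only to log the failure, one option is to keep it but re-raise the original exception unchanged. The sketch below is an illustration under assumptions: the function name is a placeholder, registration with the orchestration trigger is omitted, and is_replaying is assumed to be available on the context so that the log line is written once rather than on every replay.

```python
import logging


def orchestrator_body(context):
    """Log failures without changing the exception the runtime receives."""
    try:
        result = yield context.call_activity("MyActivity", context.get_input())
        return result
    except Exception:
        # Skip logging during replays to avoid duplicate log entries.
        if not context.is_replaying:
            logging.exception("Orchestrator failed")
        # A bare raise preserves the original exception type and message,
        # unlike `raise RuntimeError(...)`, which replaces them.
        raise
```

Compared with wrapping everything in RuntimeError, the failure details recorded for the instance then reflect the exception that was actually thrown.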
Vempadapu Akhil Kumar 5 Reputation points

2026-06-15T19:02:28.77+00:00
Thank you for the clarification.

We tested the change by removing the blanket try-except block from the orchestrator and allowing the original exception to propagate directly.

The behavior improved, but it is still not fully as expected.

For testing, we configured the Durable Timer timeout as 3 minutes.

Observed behavior:

The orchestration started normally.

After 3 minutes, the timeout exception was logged in Function App Log Stream.

At that point, when we checked statusQueryGetUri, the runtimeStatus was still showing Running.

Without sending the request again from our side, the same file/activity processing appeared to start again automatically.

After the timeout happened again the second time, the statusQueryGetUri finally showed runtimeStatus: Failed.

So it looks like the first timeout exception did not immediately commit the terminal Failed state. Instead, the orchestration/control message appears to have been retried/replayed, and only on a later attempt did the status become Failed.

Could you please help us understand this behavior?

Specifically:

Is it expected for the orchestrator to retry/replay after a timeout exception before the Failed status is committed?

Why would statusQueryGetUri continue to show Running after the first timeout exception is logged?

Could this be related to control queue message abandonment/retry behavior?

Is there any Durable Functions setting to prevent repeated execution/reprocessing after the timeout exception?

For Python Durable Functions, what is the recommended way to guarantee that once the timer wins, the instance immediately transitions to Failed or Terminated and does not re-run the same work?

Below is the simplified orchestrator code we are using:

from datetime import timedelta

@app.orchestration_trigger(context_name="context")
def MyOrchestrator(context):
    input_data = context.get_input()
    if isinstance(input_data, dict) and "items" in input_data:
        input_items = input_data["items"]
    elif isinstance(input_data, dict):
        input_items = [input_data]
    else:
        raise ValueError("Unexpected input format received.")

    tasks = [
        context.call_activity("MyActivity", item)
        for item in input_items
    ]
    all_activities_task = context.task_all(tasks)

    timeout_limit_minutes = 3
    expiration_time = (
        context.current_utc_datetime
        + timedelta(minutes=timeout_limit_minutes)
    )
    timeout_task = context.create_timer(expiration_time)

    winner = yield context.task_any([
        all_activities_task,
        timeout_task
    ])

    if winner == timeout_task:
        raise Exception("TimeoutExceeded: The activity took too long to complete.")

    timeout_task.cancel()
    return all_activities_task.result
