The running time of Durable Functions orchestrator seems to be unstable in some cases

Question

The running time of Durable Functions orchestrator seems to be unstable in some cases

PandaAwAke 0

When I was trying to learn, use, and test Azure Durable Functions, I've found that in some specific cases, the Orchestrator's running time will intermittently experience anomalies.

Specifically, the situations are as follows:

Notes

The language is Python 3.11.
The results of all tests here are from the execution on Azure Function Apps (i.e., in the cloud) rather than locally. The results of local tests differ significantly from those of cloud-based tests.
In the subsequent section, there will be three activities: add1, add2, and add3. Their logic is very simple, each adding 1, 2, and 3 to the input number, respectively. For example, the code for add1 is as follows:

@myApp.activity_trigger(input_name="number")
def add1(number: int):
    return number + 1

Anomaly 1: Using sub-orchestrations with different orchestrator input size

The following code is my first test. Its logic is as follows: the parent orchestrator input_orchestrator_sub1 runs 30 loop iterations, each time calling a sub-orchestrator sub_orchestrator_total, which in turn sequentially calls 3 activities add1, add2 and add3. This results in a total of 90 activity calls. Additionally, the parent orchestrator will pass its own input to the sub-orchestrator.

@myApp.orchestration_trigger(context_name="context")
def input_orchestrator_sub1(context: df.DurableOrchestrationContext):
    args: dict = context.get_input()
    if args is None:
        args = {}

    number = 0

    start_time = context.current_utc_datetime.timestamp()

    for round in range(30):
        sub_args = {
            'number': number,
            'round': round,
        }
        number = yield context.call_sub_orchestrator('sub_orchestrator_total', {**args, **sub_args})
        context.set_custom_status('Round: {}'.format(round))
    
    end_time = context.current_utc_datetime.timestamp()
    context.set_custom_status('Time used: {} seconds'.format(start_time, end_time, end_time - start_time))
    
    return number

@myApp.orchestration_trigger(context_name="context")
def sub_orchestrator_total(context: df.DurableOrchestrationContext):
    args: dict = context.get_input()
    number: int = args.get('number', 0)
    round: int = args.get('round', 0)

    number = yield context.call_activity('add1', number)
    number = yield context.call_activity('add2', number)
    number = yield context.call_activity('add3', number)
    return number

Next, I generated input parameter payloads of different sizes to invoke the parent orchestrator (which will also be passed to the sub-orchestrator) to test the runtime performance of the Orchestrator under different payload sizes. Example:

def _create_load(kb: int):
    return {
        "large_data": "x" * (kb * 1024)
    }

@myApp.route(route="orchestrators/128kb/{functionName}")
@myApp.durable_client_input(client_name="client")
async def http_start_128kb(req: func.HttpRequest, client: df.DurableOrchestrationClient):
    function_name = req.route_params.get('functionName')
    instance_id = await client.start_new(orchestration_function_name=function_name, client_input=_create_load(kb=128))
    response = client.create_check_status_response(req, instance_id)
    return response

I tested the runtime performance under multiple rounds with payload sizes of 0, 64KB, 128KB, and 256KB. For the first three payloads, no anomalous situations occurred, and the Orchestrator's runtime increased moderately with the payload size, which met my expectations. However, with a 256KB payload, the Orchestrator's runtime suddenly became very long and exhibited unstable behavior. The results of my tests are shown in the following table:

PayloadNumberTime0117.1347214.9246315.9231416.5848515.277264KB119.15052219.264081319.422669418.443175519.096956128KB122.514228221.698232321.113562422.499068523.042327256KB1587.7134782403.6056313634.085384360.601282530.717049I don't know what happened, and I can't explain this phenomenon. I looked up the trace related to the first anomaly, and the images are as follows:

enter image description here

Anomaly 2: Using continue_as_new with task_any

Similar runtime anomalies occurred in another scenario. The test code is as follows:

@myApp.orchestration_trigger(context_name="context")
def parallel_orchestrator_can(context: df.DurableOrchestrationContext):
    args: dict = context.get_input()
    if args is None:
        args = {}
    number: int = args.get('number', 0)
    round: int = args.get('round', 0)
    start_time: float = args.get('startTime', context.current_utc_datetime.timestamp())

    while round < 30:
        tasks = []
        tasks.append(context.call_activity('add1', number))
        tasks.append(context.call_activity('add2', number))
        tasks.append(context.call_activity('add3', number))

        results = yield context.task_all(tasks)
        number = max(results)

        round += 1

        args['number'] = number
        args['round'] = round
        args['startTime'] = start_time
        context.continue_as_new(args)
        return
    
    end_time = context.current_utc_datetime.timestamp()
    context.set_custom_status('Time used: {} seconds'.format(start_time, end_time, end_time - start_time))

    return number

The code segment is similar to the previous one, still consisting of 30 iterations, with 3 concurrent Activity calls in each iteration (using task_any). It uses continue_as_new to restart itself at the end of each iteration, resulting in a total of 90 Activity calls.

I tested other functionally equivalent Orchestrators (which also trigger 30 concurrent calls to 3 Activities), including:

Not using continue_as_new
Not using continue_as_new and moving the concurrent calls to 3 Activities into a sub-orchestrator

The final test results for the three scenarios showed that only when using continue_as_new did the runtime exhibit anomalies. The specific timings are as follows:

CaseNumberTimecontinue_as_new111.067338250.0692923112.86382547.53638057.036399640.051117769.4385908101.01024597.725570108.211219No continue_as_new17.73971427.05780437.44994447.66764657.053254Use sub-orchestrator, No continue_as_new111.157281211.351533311.242825410.810705510.567263 enter image description here

enter image description here

In above 2 tests, I did not find any logs with a severity level of warning or higher in the traces. All "LogLevel"s are "Information". Everything appeared to run normally, except for the abnormal runtime performance.

Is this normal for the Durable Functions platform? Could this be a performance bug?

Share via

The running time of Durable Functions orchestrator seems to be unstable in some cases

Notes

Anomaly 1: Using sub-orchestrations with different orchestrator input size

Anomaly 2: Using continue_as_new with task_any

Your answer