Fix task any bug #538

andystaples · 2025-04-07T19:48:01Z

There are currently two bugs in task_any.

Resolves #536

First Bug

Certain orchestrations will throw the following error:

[2025-04-07T19:23:54.346Z] cfb77c3e2ec645f0b7d1d42ccd291f34: Function 'sample_orchestrator (Orchestrator)' failed with an error. Reason: DurableTask.Core.Exceptions.OrchestrationFailureException
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcOrchestrationShim.ScheduleDurableTaskEvents(OrchestrationInvocationResult result) in /_/src/WebJobs.Extensions.DurableTask/Listener/OutOfProcOrchestrationShim.cs:line 88
[2025-04-07T19:23:54.348Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcOrchestrationShim.HandleDurableTaskReplay(OrchestrationInvocationResult executionJson) in /_/src/WebJobs.Extensions.DurableTask/Listener/OutOfProcOrchestrationShim.cs:line 65
[2025-04-07T19:23:54.349Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.TaskOrchestrationShim.TraceAndReplay(Object result, Exception ex) in /_/src/WebJobs.Extensions.DurableTask/Listener/TaskOrchestrationShim.cs:line 246
[2025-04-07T19:23:54.350Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.TaskOrchestrationShim.InvokeUserCodeAndHandleResults(RegisteredFunctionInfo orchestratorInfo, OrchestrationContext innerContext) in /_/src/WebJobs.Extensions.DurableTask/Listener/TaskOrchestrationShim.cs:line 183. IsReplay: False. State: Failed. RuntimeStatus: Failed. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.13.7. SequenceNumber: 150. TaskEventId: -1
[2025-04-07T19:23:54.352Z] Executed 'Functions.sample_orchestrator' (Failed, Id=875e647d-cf7e-478c-bfa4-dcf318642afe, Duration=24ms)
[2025-04-07T19:23:54.353Z] System.Private.CoreLib: Exception while executing function: Functions.sample_orchestrator. Microsoft.Azure.WebJobs.Extensions.DurableTask: Orchestrator function 'sample_orchestrator' failed: 'AtomicTask' object has no attribute 'append'.

This error is thrown in DurableOrchestrationContext.py, in _add_to_open_tasks, when certain criteria are met:

* The user yields back a WhenAnyTask
* The WhenAnyTask contains an AtomicTask that was yielded back in a previous WhenAnyTask, but was not scheduled to run at that time.

This can happen if:

* A WhenAnyTask ran once with a set of AtomicTasks, and scheduled them all
* More than one of these AtomicTasks completed before the orchestrator was scheduled for replay
* These completed AtomicTasks were yielded back in a new WhenAnyTask along with a new AtomicTask that has not been scheduled/completed.
* The orchestrator yields a WhenAnyTask containing only the not-yet-scheduled AtomicTask

Example orchestrator - assume that the result of sample_suborchestrator is just the input task_id:

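import azure.durable_functions as df

# Assumed setup: the original snippet uses `bp` without showing its definition.
bp = df.Blueprint()


@bp.orchestration_trigger(context_name="context")
def sample_orchestrator(context: df.DurableOrchestrationContext):
    # Create three sub-orchestration tasks, keyed by their generated job IDs.
    tasks = {}
    for i in range(3):
        task_id = yield context.call_activity("generate_job_id", str(i))
        tasks[task_id] = context.call_sub_orchestrator("sample_suborchestrator", task_id, task_id)
    added_task = False
    while len(tasks) > 0:
        # Wait for any outstanding task; its result is the task_id it was given.
        completed_task = yield context.task_any(list(tasks.values()))
        task_result = completed_task.result
        if task_result in tasks:
            tasks.pop(task_result)
        if not added_task:
            # Add one new, not-yet-scheduled task after the first completion.
            task_id = yield context.call_activity("generate_job_id", "4")
            tasks[task_id] = context.call_sub_orchestrator("sample_suborchestrator", task_id, task_id)
            added_task = True
    return "Success"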

When these conditions are met, the second WhenAnyTask immediately returns the result of the already-completed subtask without scheduling the new AtomicTask. However, the new AtomicTask has already been registered in the orchestration context's open task list. The next time this task is yielded back, the context tries to register it again, hits some old ReplaySchema.V1 logic that calls .append() on the previously registered AtomicTask, and throws.

This PR adds a check before this .append() call to make sure that the value in the open task list is not already the incoming task itself. I have tested that, with this check in place, the orchestrations complete as expected.
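To make the failure mode concrete, here is a minimal sketch of the guard. It is a simplified model, not the code in DurableOrchestrationContext.py: ToyAtomicTask, ToyRegistry, add_unguarded and add_guarded are hypothetical names, and the model assumes only what is described above (an open task table keyed by task id, and a legacy branch that calls .append() on the stored value).

```python
class ToyAtomicTask:
    """Hypothetical stand-in for an AtomicTask; it only carries an id."""

    def __init__(self):
        self.id = None


class ToyRegistry:
    """Hypothetical stand-in for the context's open task bookkeeping."""

    def __init__(self):
        self.open_tasks = {}
        self.sequence_number = 0

    def _register_new(self, task):
        # First registration: assign the next id and store the task itself.
        task.id = self.sequence_number
        self.sequence_number += 1
        self.open_tasks[task.id] = task

    def add_unguarded(self, task):
        if task.id is None:
            self._register_new(task)
        else:
            # Legacy-style branch: assumes the stored value is a list.
            self.open_tasks[task.id].append(task)

    def add_guarded(self, task):
        if task.id is None:
            self._register_new(task)
        elif self.open_tasks[task.id] is not task:
            # Only append when the slot does not already hold this exact task.
            self.open_tasks[task.id].append(task)


registry = ToyRegistry()
task = ToyAtomicTask()
registry.add_unguarded(task)          # first yield: registered, id assigned
try:
    registry.add_unguarded(task)      # second yield: .append() on a task
except AttributeError as err:
    print(err)

safe_registry = ToyRegistry()
safe_task = ToyAtomicTask()
safe_registry.add_guarded(safe_task)
safe_registry.add_guarded(safe_task)  # second yield is now a no-op
```

In this toy model the unguarded path fails with the same kind of AttributeError seen in the log above, while the guarded path leaves the table unchanged on the second registration.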

Second Bug

In this same case, when a WhenAnyTask is called with a task that has already completed, and a new AtomicTask that has not yet been scheduled, there is an unexpected interaction. When this happens, the TaskOrchestrationExecutor will register all of the subtasks for the new WhenAnyTask internally, but then it checks if any of the subtasks have received a result. If they have, the Executor assumes this WhenAnyTask was already scheduled and replays it with the completed task as the result. This results in the new AtomicTask never getting scheduled with the WebJobs extension and so it will never execute. We can call this the "limbo" task.

This is fine until the user creates a new AtomicTask and schedules it. When the result of this final task is received, the WebJobs extension and the Python extension disagree on task indexing because of the limbo task, and the Python extension assigns the result of the final task to the limbo task. It then waits for a result for the final task, but as far as WebJobs is concerned that result has already been sent, so the orchestration becomes stuck in the "Running" state.
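The index drift can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual bookkeeping of either extension: it assumes the Python side advances its counter whenever a task is registered, the WebJobs side advances its counter only when a task is actually scheduled, and results are matched by id. The function result_owner and the task names are made up for illustration.

```python
def result_owner(schedule_limbo_task: bool) -> str:
    """Return the Python-side task that receives the final task's result."""
    python_seq = 0     # Python side: advances on registration
    extension_seq = 0  # WebJobs side: advances on scheduling
    python_ids = {}
    extension_ids = {}

    # The limbo task is always registered, but only sometimes scheduled.
    # The final task is both registered and scheduled.
    for name, scheduled in [("limbo", schedule_limbo_task), ("final", True)]:
        python_ids[name] = python_seq
        python_seq += 1
        if scheduled:
            extension_ids[name] = extension_seq
            extension_seq += 1

    # The extension reports the final task's result under its own id;
    # the Python side looks up whichever task it registered under that id.
    id_to_python_task = {task_id: name for name, task_id in python_ids.items()}
    return id_to_python_task[extension_ids["final"]]


print(result_owner(schedule_limbo_task=False))  # result lands on the limbo task
print(result_owner(schedule_limbo_task=True))   # result lands on the final task
```

When the limbo task is never scheduled, the two counters drift apart by one and the final task's result is delivered to the wrong task, which matches the stuck "Running" behavior described above.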

This PR addresses this problem as well by ensuring that all Tasks are scheduled before resuming replay.
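As a rough sketch of the idea (not the TaskOrchestrationExecutor implementation), the ordering can be modeled as "schedule every unscheduled child first, then look for an already-completed child". ToyTask and yield_when_any are hypothetical names, and the actions list stands in for whatever is sent to the WebJobs extension.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ToyTask:
    """Hypothetical stand-in for a child of a WhenAnyTask."""
    name: str
    result: Optional[Any] = None
    is_scheduled: bool = False


def yield_when_any(children: List[ToyTask], actions: List[str]) -> Optional[ToyTask]:
    # Step 1: schedule every child that has not been scheduled yet,
    # even if a sibling already has a result.
    for child in children:
        if not child.is_scheduled:
            actions.append(child.name)
            child.is_scheduled = True

    # Step 2: only now short-circuit on an already-completed child.
    for child in children:
        if child.result is not None:
            return child
    return None


done = ToyTask("job-0", result="job-0", is_scheduled=True)
new = ToyTask("job-4")
actions: List[str] = []
winner = yield_when_any([done, new], actions)
print(winner.name, actions)
```

With this ordering the completed task still wins the WhenAnyTask, but the new task is also handed to the scheduler, so no limbo task is left behind.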

azure/durable_functions/models/DurableOrchestrationContext.py

* Fix long timer bug
* Also fix task_any scheduling issue

andystaples added 3 commits April 7, 2025 13:32

Fix long timer bug

ece8cbf

Enhance check ensuring replaycontext.v1 works

86585f0

Even more specific check

8ca7dc4

andystaples requested review from bachuv, cgillum and nytian April 7, 2025 20:03

Also fix task_any scheduling issue

d0e5926

davidmrdavid reviewed Apr 17, 2025

azure/durable_functions/models/DurableOrchestrationContext.py (resolved)

Add comment describing code path logic

51fdd6d

cgillum approved these changes Apr 18, 2025

andystaples marked this pull request as ready for review April 21, 2025 22:37

Merge branch 'dev' into andystaples/fix-task_any-bug

b0b226f

andystaples merged commit 9dd2654 into dev Apr 21, 2025
2 checks passed

andystaples deleted the andystaples/fix-task_any-bug branch April 21, 2025 22:38

greenie-msft pushed a commit to greenie-msft/azure-functions-durable-python that referenced this pull request Sep 25, 2025

Fix task_any bug (Azure#538)

bcaef1a

* Fix long timer bug
* Also fix task_any scheduling issue

4 participants
