@andystaples andystaples commented Apr 7, 2025

There are currently two bugs in task_any.

Resolves #536

First Bug

Certain orchestrations will throw the following error:

[2025-04-07T19:23:54.346Z] cfb77c3e2ec645f0b7d1d42ccd291f34: Function 'sample_orchestrator (Orchestrator)' failed with an error. Reason: DurableTask.Core.Exceptions.OrchestrationFailureException
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcOrchestrationShim.ScheduleDurableTaskEvents(OrchestrationInvocationResult result) in /_/src/WebJobs.Extensions.DurableTask/Listener/OutOfProcOrchestrationShim.cs:line 88
[2025-04-07T19:23:54.348Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcOrchestrationShim.HandleDurableTaskReplay(OrchestrationInvocationResult executionJson) in /_/src/WebJobs.Extensions.DurableTask/Listener/OutOfProcOrchestrationShim.cs:line 65
[2025-04-07T19:23:54.349Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.TaskOrchestrationShim.TraceAndReplay(Object result, Exception ex) in /_/src/WebJobs.Extensions.DurableTask/Listener/TaskOrchestrationShim.cs:line 246
[2025-04-07T19:23:54.350Z]    at Microsoft.Azure.WebJobs.Extensions.DurableTask.TaskOrchestrationShim.InvokeUserCodeAndHandleResults(RegisteredFunctionInfo orchestratorInfo, OrchestrationContext innerContext) in /_/src/WebJobs.Extensions.DurableTask/Listener/TaskOrchestrationShim.cs:line 183. IsReplay: False. State: Failed. RuntimeStatus: Failed. HubName: TestHubName. AppName: . SlotName: . ExtensionVersion: 2.13.7. SequenceNumber: 150. TaskEventId: -1
[2025-04-07T19:23:54.352Z] Executed 'Functions.sample_orchestrator' (Failed, Id=875e647d-cf7e-478c-bfa4-dcf318642afe, Duration=24ms)
[2025-04-07T19:23:54.353Z] System.Private.CoreLib: Exception while executing function: Functions.sample_orchestrator. Microsoft.Azure.WebJobs.Extensions.DurableTask: Orchestrator function 'sample_orchestrator' failed: 'AtomicTask' object has no attribute 'append'.

This error is thrown in DurableOrchestrationContext.py, in _add_to_open_tasks, when certain criteria are met:

  1. The user yields back a WhenAnyTask
  2. The WhenAnyTask contains an AtomicTask that was yielded back in a previous WhenAnyTask, but was not scheduled to run at that time.

This can happen if:

  1. A WhenAnyTask runs once with a set of AtomicTasks and schedules them all
  2. More than one of these AtomicTasks completes before the orchestrator is scheduled for replay
  3. The completed AtomicTasks are yielded back in a new WhenAnyTask along with a new AtomicTask that has not been scheduled or completed
  4. The orchestrator yields a WhenAnyTask containing only the not-yet-scheduled AtomicTask

Example orchestrator - assume that the result of sample_suborchestrator is just the input task_id:

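import azure.durable_functions as df

# Assumption: bp is a Durable Functions blueprint registered with the app elsewhere.
bp = df.Blueprint()

@bp.orchestration_trigger(context_name="context")
def sample_orchestrator(context: df.DurableOrchestrationContext):
    # Map each generated job id to its pending sub-orchestration task.
    tasks = {}
    for i in range(3):
        task_id = yield context.call_activity("generate_job_id", str(i))
        tasks[task_id] = context.call_sub_orchestrator("sample_suborchestrator", task_id, task_id)
    added_task = False
    while len(tasks) > 0:
        # Wait for any pending task; its result is the task_id it was given.
        completed_task = yield context.task_any(list(tasks.values()))
        task_result = completed_task.result
        if task_result in tasks:
            tasks.pop(task_result)
        if not added_task:
            # Add one new task after the first completion; this triggers the bug.
            task_id = yield context.call_activity("generate_job_id", "4")
            tasks[task_id] = context.call_sub_orchestrator("sample_suborchestrator", task_id, task_id)
            added_task = True
    return "Success"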

When these conditions are met, the second WhenAnyTask immediately returns the result of the already-completed subtask without scheduling the new AtomicTask. The new AtomicTask has nevertheless been registered in the orchestration context's open task list. The next time this task is yielded back, the context tries to register it again, hits some old ReplaySchema.V1 logic that calls .append() on the previously registered AtomicTask, and throws.

This PR adds a check before this .append() call to make sure that the value in the open task list is not already the incoming task. I have tested that, with this check, the orchestrations complete as expected.
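
The guard described above can be illustrated with a minimal, self-contained sketch. This is not the library's actual code: FakeAtomicTask, OpenTasks and register_open_task are hypothetical names, and the assumption that an open-task entry is either a single task or a list of tasks is made only for illustration.

```python
from typing import Dict, List, Union


class FakeAtomicTask:
    """Hypothetical stand-in for an AtomicTask; it only carries an id."""

    def __init__(self, task_id: int) -> None:
        self.id = task_id


# Assumption: an entry is either a single task or a list of tasks with that id.
OpenTasks = Dict[int, Union[FakeAtomicTask, List[FakeAtomicTask]]]


def register_open_task(open_tasks: OpenTasks, task: FakeAtomicTask) -> None:
    """Register `task`, skipping the append when it is already the stored entry."""
    existing = open_tasks.get(task.id)
    if existing is None:
        # First time this id is seen: store the task itself.
        open_tasks[task.id] = task
    elif existing is task:
        # The same task object was yielded again: nothing to append.
        return
    else:
        # Assumed list-valued entry: add the task to the list.
        existing.append(task)
```

Without the identity check, registering the same task twice would reach the .append() call on a task object, which is the "'AtomicTask' object has no attribute 'append'" failure shown in the log.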

Second Bug

In this same case, when a WhenAnyTask is yielded with a task that has already completed and a new AtomicTask that has not yet been scheduled, there is an unexpected interaction. The TaskOrchestrationExecutor registers all of the subtasks for the new WhenAnyTask internally, then checks whether any of the subtasks has received a result. If one has, the Executor assumes this WhenAnyTask was already scheduled and replays it with the completed task as the result. As a result, the new AtomicTask is never scheduled with the WebJobs extension, so it never executes. We can call this the "limbo" task.

This is fine until the user creates a new AtomicTask and schedules it. When the result of this final task is received, the WebJobs extension and the Python extension disagree on task indexing because of the limbo task, and the Python extension assigns the result of the final task to the limbo task. It then waits for a result for the final task, but as far as WebJobs is concerned that result has already been sent, so the orchestration becomes stuck in the "Running" state.

This PR addresses this problem as well by ensuring that all Tasks are scheduled before resuming replay.
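
The indexing mismatch can be illustrated with a toy model. It assumes, purely for illustration, that both sides number tasks sequentially: the Python side numbers every task it registers, while the host numbers only the tasks that were actually scheduled. The function name deliver_result and the task names are hypothetical, and the real extensions' bookkeeping is more involved than this.

```python
from typing import List, Optional


def deliver_result(registered: List[str], scheduled: List[str], finished: str) -> Optional[str]:
    """Return the task that the Python side credits with `finished`'s result.

    `registered` lists tasks in the order the Python side registered them.
    `scheduled` lists tasks in the order the host actually scheduled them.
    """
    # Host side: ids are assigned only to scheduled tasks.
    host_ids = {name: index for index, name in enumerate(scheduled)}
    # Python side: ids are assigned to every registered task.
    python_tasks = {index: name for index, name in enumerate(registered)}
    # The host reports the result under its own id; Python looks that id up.
    return python_tasks.get(host_ids[finished])


registered = ["a", "b", "c", "limbo", "final"]

# Bug: the limbo task was registered but never scheduled, so ids drift apart.
buggy = deliver_result(registered, ["a", "b", "c", "final"], "final")

# Fix: every registered task is also scheduled, so ids line up.
fixed = deliver_result(registered, registered, "final")
```

In this model, buggy is "limbo" and fixed is "final", which mirrors the stuck orchestration and the effect of scheduling all tasks before resuming replay.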

@andystaples andystaples requested review from bachuv, cgillum and nytian April 7, 2025 20:03
@andystaples andystaples marked this pull request as ready for review April 21, 2025 22:37
@andystaples andystaples merged commit 9dd2654 into dev Apr 21, 2025
2 checks passed
@andystaples andystaples deleted the andystaples/fix-task_any-bug branch April 21, 2025 22:38
greenie-msft pushed a commit to greenie-msft/azure-functions-durable-python that referenced this pull request Sep 25, 2025
* Fix long timer bug
* Also fix task_any scheduling issue
Development

Successfully merging this pull request may close these issues.

Non-linear (fan-out) workflows trigger "AtomicTask object has no attribute 'append'" due to duplicate task ID handling

3 participants