Fix the race condition for continue-as-new with extended sessions enabled #1303
Pull request overview
Fixes a race condition in the Azure Storage backend when extended sessions are enabled and an orchestration performs ContinueAsNew, where activity responses can arrive before the new execution ID is checkpointed, causing a stuck orchestration.
Changes:
- Moves `session.UpdateRuntimeState(runtimeState)` earlier in `CompleteTaskOrchestrationWorkItemAsync` so the in-memory session reflects the new execution ID before outbound messages are committed.
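The reordering can be sketched as follows (Python pseudocode mirroring the described flow; everything except the `UpdateRuntimeState` call and its placement is illustrative, not the actual DurableTask API):

```python
# Minimal sketch of the commit order inside CompleteTaskOrchestrationWorkItemAsync
# after this change. All names other than update_runtime_state are illustrative.

class ExtendedSession:
    """Stands in for the cached session used when extended sessions are enabled."""
    def __init__(self, execution_id):
        self.execution_id = execution_id

    def update_runtime_state(self, runtime_state):
        # After ContinueAsNew, the runtime state carries the NEW execution ID.
        self.execution_id = runtime_state["execution_id"]

def complete_work_item(session, runtime_state, outbound_messages, control_queue):
    # The fix: refresh the in-memory session BEFORE committing outbound
    # messages, so a fast activity reply addressed to the new execution ID
    # is evaluated against up-to-date in-memory state.
    session.update_runtime_state(runtime_state)
    control_queue.extend(outbound_messages)

session = ExtendedSession("exec-old")
queue = []
complete_work_item(session, {"execution_id": "exec-new"},
                   [{"event": "TaskScheduled"}], queue)
print(session.execution_id)  # → exec-new
```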
```csharp
// update the runtime state and execution id stored in the session
session.UpdateRuntimeState(runtimeState);
```
This change fixes a subtle race in checkpointing order for ContinueAsNew + extended sessions, but there is no regression test added to ensure the out-of-order `TaskCompleted` scenario is handled (message abandoned/retried rather than deleted, which would leave the instance stuck). Consider adding an AzureStorage end-to-end test that enables extended sessions and forces `trackingStore.UpdateStateAsync` to be delayed while an activity completion message is delivered for the new execution ID, asserting that the orchestration still completes and no control-queue message is lost.
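The suggested regression test could be approximated, outside the real Azure Storage backend, by a toy unit test that "delays" the checkpoint and delivers the activity reply early. This is only a sketch: `checkpoint_and_deliver` and every identifier inside it are hypothetical stand-ins, not DurableTask APIs.

```python
import unittest

class ContinueAsNewRaceTest(unittest.TestCase):
    # Toy stand-in for the suggested end-to-end test: the storage checkpoint
    # is "delayed" (never written here) while a TaskCompleted message for the
    # new execution ID is delivered. All names are illustrative.

    def checkpoint_and_deliver(self, update_session_first):
        session = {"execution_id": "exec-old"}   # in-memory extended session
        queue = []                               # control queue

        def commit_outbound():
            queue.append({"event": "TaskCompleted",
                          "execution_id": "exec-new"})

        if update_session_first:                 # ordering after the fix
            session["execution_id"] = "exec-new"
            commit_outbound()
        else:                                    # ordering before the fix
            commit_outbound()

        msg = queue.pop()
        # The message survives only if the in-memory session already
        # reflects the execution ID the message is addressed to.
        return session["execution_id"] == msg["execution_id"]

    def test_fixed_order_keeps_the_message(self):
        self.assertTrue(self.checkpoint_and_deliver(update_session_first=True))

    def test_old_order_loses_the_message(self):
        self.assertFalse(self.checkpoint_and_deliver(update_session_first=False))
```

Run with `python -m unittest` to execute both cases.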
Currently we have a subtle race condition that can occur if a user has extended sessions enabled and an orchestration that attempts to continue-as-new. The flow is as follows:

1. `TaskOrchestrationDispatcher` calls `CompleteTaskOrchestrationWorkItemAsync`, which commits a `TaskScheduled` event to start a new Activity.
2. The Activity completes quickly and sends a `TaskCompleted` event back to the orchestration, all before `CompleteTaskOrchestrationWorkItemAsync` has updated the orchestration's state in storage to reflect the new execution ID.
3. A call to `LockNextTaskOrchestrationWorkItemAsync` is made which retrieves the `TaskCompleted` event. The `TaskCompleted` event is addressed to the new execution ID, but since the orchestration's state has not yet been updated in storage, there is no record for that execution ID. The call to determine out-of-order messages should detect that this is potentially an "out of order" `TaskCompleted` message, since the instance does "not yet exist". However, `IsOutOfOrderMessage` uses the in-memory state of the session. The in-memory state (which exists, since extended sessions are enabled) holds the information for the old execution ID, so it does not detect that the instance "does not yet exist". It thinks the message is valid, and proceeds.
4. Later in the `LockNextTaskOrchestrationWorkItemAsync` method, when we attempt to retrieve information about this orchestration instance with the new execution ID from storage, we find none, and fail at this point. We delete the `TaskCompleted` event, which leaves the orchestration permanently stuck in a running state.

The core of the issue is that the in-memory session state is not updated to reflect the new execution ID before outbound messages are committed, which prevents the `IsOutOfOrderMessage` logic from functioning correctly. This PR moves the placement of the `session.UpdateRuntimeState` call to be before outbound messages are committed to fix this issue.

Resolves #1302
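The failure mode and the fix can be sketched with a minimal simulation. Python is used for illustration only; `dispatch` and the internals of `is_out_of_order` are assumptions based on the description above, not the actual implementation.

```python
# Toy reproduction of the race and the fix. All names are illustrative;
# only the ordering of the session update vs. message delivery mirrors
# the actual change.

def is_out_of_order(message, session_state):
    # Mirrors the described IsOutOfOrderMessage behaviour: a message is only
    # flagged as out of order when NO session state exists for the instance.
    return session_state is None

def dispatch(message, session_state, storage):
    """Returns what happens to an incoming TaskCompleted message."""
    if is_out_of_order(message, session_state):
        return "abandoned"   # retried later, once the checkpoint lands
    if session_state["execution_id"] == message["execution_id"]:
        return "processed"   # session already reflects the new execution ID
    # Session state looked valid but targets a different execution: fall
    # back to storage, find nothing, and (before the fix) delete the message.
    if storage.get(message["execution_id"]) is None:
        return "deleted"     # orchestration permanently stuck

msg = {"event": "TaskCompleted", "execution_id": "exec-new"}
storage = {}  # checkpoint for exec-new not written yet

# Before the fix: messages committed while the session holds the old ID.
print(dispatch(msg, {"execution_id": "exec-old"}, storage))  # → deleted

# After the fix: session.UpdateRuntimeState ran first, so the in-memory
# session already carries the new execution ID.
print(dispatch(msg, {"execution_id": "exec-new"}, storage))  # → processed
```

Note that with no in-memory session at all (extended sessions disabled), the early message is correctly abandoned and retried, which is why the bug only surfaces when extended sessions are enabled.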