Fix the race condition for continue-as-new with extended sessions enabled #1303

Open

sophiatev wants to merge 1 commit into main from
stevosyan/fix-continue-as-new-with-extended-sessions-race-condition

Conversation

@sophiatev
Contributor

There is currently a subtle race condition that can occur when a user has extended sessions enabled and an orchestration attempts to continue-as-new. The flow is as follows:

  1. An orchestration continues-as-new with a new execution ID, and the TaskOrchestrationDispatcher calls CompleteTaskOrchestrationWorkItemAsync.
  2. In the completion call, outbound messages are committed. Say one of these is a TaskScheduled event to start a new Activity.
  3. The Activity completes and sends a TaskCompleted event back to the orchestration, all before the CompleteTaskOrchestrationWorkItemAsync has updated the orchestration's state in storage to reflect the new execution ID.
  4. A call to LockNextTaskOrchestrationWorkItemAsync retrieves the TaskCompleted event. The event is addressed to the new execution ID, but because the orchestration's state has not yet been updated in storage, there is no record for that execution ID. The out-of-order check should therefore flag this TaskCompleted message as potentially "out of order", since the instance does "not yet exist". However, IsOutOfOrderMessage consults the in-memory state of the session, and because extended sessions are enabled, that state exists and still holds the information for the old execution ID. The check therefore fails to detect that the instance "does not yet exist", treats the message as valid, and proceeds.
  5. Later in LockNextTaskOrchestrationWorkItemAsync, when we attempt to retrieve state for this orchestration instance under the new execution ID from storage, none is found, and the call fails. The TaskCompleted event is deleted, leaving the orchestration permanently stuck in the Running state.
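The stale-cache problem in step 4 can be modeled with a small Python sketch (all names here are hypothetical toy stand-ins; the real logic lives in IsOutOfOrderMessage in the C# Azure Storage backend):

```python
# Model: storage has no record for the new execution ID yet, but the
# extended-session cache still holds the old execution ID.
storage = {"old-exec-id": "Running"}              # durable state store
cached_session = {"execution_id": "old-exec-id"}  # in-memory session

def is_out_of_order(message_exec_id, session, store):
    # Buggy check modeled on the description above: when an in-memory
    # session exists, it is trusted and the message is accepted, even
    # though the store has no record for message_exec_id.
    if session is not None:
        return False  # the bug: stale cache masks the missing record
    return message_exec_id not in store

# A TaskCompleted message for the NEW execution ID races in early.
print(is_out_of_order("new-exec-id", cached_session, storage))  # False
# Without the cached session, the check would correctly flag it:
print(is_out_of_order("new-exec-id", None, storage))            # True
```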

The core of the issue is that the in-memory session state is not updated to reflect the new execution ID before outbound messages are committed, which prevents the IsOutOfOrderMessage logic from functioning correctly. This PR moves the session.UpdateRuntimeState call so that it runs before outbound messages are committed.
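The reordering can be illustrated with a small Python sketch (hypothetical names; the actual fix reorders calls inside CompleteTaskOrchestrationWorkItemAsync):

```python
order = []
session = {"execution_id": "old-exec-id"}

def update_runtime_state(sess, new_exec_id):
    # Corresponds to session.UpdateRuntimeState: refresh the in-memory
    # session so it reflects the new execution ID.
    sess["execution_id"] = new_exec_id
    order.append("update-session")

def commit_outbound_messages():
    # Once this runs, a TaskScheduled message is visible and an activity
    # may complete and respond at any time.
    order.append("commit-outbound")

# Fixed ordering: update the session BEFORE committing outbound messages,
# so any racing TaskCompleted response is checked against the new ID.
update_runtime_state(session, "new-exec-id")
commit_outbound_messages()

print(order)                    # ['update-session', 'commit-outbound']
print(session["execution_id"])  # new-exec-id
```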

Resolves #1302

Copilot AI review requested due to automatic review settings February 23, 2026 21:58
Contributor

Copilot AI left a comment


Pull request overview

Fixes a race condition in the Azure Storage backend when extended sessions are enabled and an orchestration performs ContinueAsNew, where activity responses can arrive before the new execution ID is checkpointed, causing a stuck orchestration.

Changes:

  • Moves session.UpdateRuntimeState(runtimeState) earlier in CompleteTaskOrchestrationWorkItemAsync so the in-memory session reflects the new execution ID before outbound messages are committed.


Comment on lines +1210 to +1212
// update the runtime state and execution id stored in the session
session.UpdateRuntimeState(runtimeState);


Copilot AI Feb 23, 2026


This change fixes a subtle race in the checkpointing order for ContinueAsNew with extended sessions, but no regression test is added to verify that the out-of-order TaskCompleted scenario is handled (i.e., the message is abandoned and retried rather than deleted, which would leave the instance stuck). Consider adding an AzureStorage end-to-end test that enables extended sessions and delays trackingStore.UpdateStateAsync while an activity completion message is delivered for the new execution ID, asserting that the orchestration still completes and no control-queue message is lost.
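The invariant such a test would assert can be modeled roughly as follows (a Python toy, not the DurableTask test API): a completion message addressed to an execution ID with no stored state should be abandoned for redelivery, never deleted.

```python
control_queue = ["TaskCompleted:new-exec-id"]
storage = {}  # simulates a delayed trackingStore.UpdateStateAsync

def handle_message(msg, store, queue):
    exec_id = msg.split(":", 1)[1]
    if exec_id not in store:
        # Desired behavior: abandon so the queue redelivers the message
        # after the new execution ID has been checkpointed.
        return "abandoned"
    queue.remove(msg)
    return "processed"

outcome = handle_message(control_queue[0], storage, control_queue)
print(outcome)             # abandoned
print(len(control_queue))  # 1 -> the control-queue message is not lost
```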



Development

Successfully merging this pull request may close these issues.

Orchestration is stuck in the Running state
