
Workflow /output endpoint can return spurious 404 during partition reconfiguration #4513

@tillrohrmann

Description

Summary

The workflow /output endpoint (GET /restate/workflow/{service}/{key}/output) can return a spurious HTTP 404 "invocation not found" for a workflow that has already completed. This occurs during partition reconfiguration when the request is routed to a node whose partition processor has not yet replayed the log entries containing the invocation.

This is not just a test instability — it is a correctness issue that can affect production clusters during partition rebalancing.

Observed Failure

Root Cause Analysis

The code path

The /output endpoint uses GetInvocationOutputResponseMode::ReplyIfNotReady, which is the correct mode for non-blocking output queries (returning HTTP 470 "not ready" for in-flight invocations, as opposed to /attach which blocks). However, the ReplyIfNotReady path performs a bare point-read against the local partition store with no leadership check:

  1. crates/ingress-http/src/handler/workflow.rs:109 — calls dispatcher.get_invocation_output()
  2. crates/ingress-http/src/rpc_request_dispatcher.rs:130 — wraps in retry loop (execute_rpc(is_idempotent=true, ...))
  3. crates/core/src/worker_api/partition_processor_rpc_client.rs:340 — resolves partition → target node, sends RPC with ReplyIfNotReady
  4. crates/core/src/partitions.rs:42-54 — get_node_by_partition() prefers the known leader, falls back to any alive node
  5. crates/worker/src/partition/rpc/get_invocation_output.rs:136-141 — ReplyIfNotReady mode does a direct point-read from self.storage
  6. crates/worker/src/partition/rpc/get_invocation_output.rs:70-73 — reads invocation_status; if Free (no record) → returns NotFound
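The end of this path can be modeled with a small sketch. All names below (InvocationStatus, OutputResponse, reply_if_not_ready) are illustrative stand-ins for the real Restate types, not the actual API — the point is only the shape of the final step: the answer comes straight from whatever the local store says, with no check that this processor is the partition leader.

```rust
// Hypothetical model of the ReplyIfNotReady point-read (illustrative names).
#[derive(Debug, PartialEq)]
enum InvocationStatus {
    Free,              // no record for this invocation in the local store
    Invoked,           // in flight
    Completed(String), // finished, output available
}

#[derive(Debug, PartialEq)]
enum OutputResponse {
    NotFound,       // maps to HTTP 404
    NotReady,       // maps to HTTP 470
    Output(String), // maps to HTTP 200
}

// The bare point-read: answers from the local store as-is,
// with no leadership check before replying.
fn reply_if_not_ready(local_status: InvocationStatus) -> OutputResponse {
    match local_status {
        InvocationStatus::Free => OutputResponse::NotFound,
        InvocationStatus::Invoked => OutputResponse::NotReady,
        InvocationStatus::Completed(out) => OutputResponse::Output(out),
    }
}

fn main() {
    // A follower that has not replayed the completion yet sees Free and
    // answers NotFound, even though the workflow completed on the leader.
    assert_eq!(reply_if_not_ready(InvocationStatus::Free), OutputResponse::NotFound);
    assert_eq!(
        reply_if_not_ready(InvocationStatus::Completed("done".into())),
        OutputResponse::Output("done".into())
    );
}
```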

The problem: stale reads from non-leader partition processors

The ReplyIfNotReady mode is needed — it provides the correct non-blocking semantics for /output. The problem is that the point-read it performs has no guard against serving stale data:

  1. No leadership check: The point-read is served by whichever partition processor receives the RPC, including followers whose local store may be arbitrarily behind. (get_invocation_output.rs:136-141)

  2. NotFound is Ok, not Err: The response Ok(NotFound) passes through the ingress retry layer without being retried, because retries only fire on Err. (partition_processor_rpc_client.rs:350, rpc_request_dispatcher.rs:58-73)

This means a point-read on a follower PP that hasn't replayed the invocation yet will return Free → NotFound → HTTP 404, and this response is treated as authoritative and returned to the client without retry.
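A minimal sketch of that retry semantics, with hypothetical names (execute_rpc, RpcResponse, RpcError are illustrative, not the real signatures): retries fire only on Err, so an Ok(NotFound) from a stale follower is handed back to the client verbatim.

```rust
// Illustrative model of the ingress retry layer (hypothetical names).
#[derive(Debug, PartialEq, Clone)]
enum RpcResponse {
    NotFound,
    Output(String),
}

#[derive(Debug, PartialEq)]
enum RpcError {
    NotLeader,
}

fn execute_rpc<F>(mut attempt: F, max_retries: usize) -> Result<RpcResponse, RpcError>
where
    F: FnMut() -> Result<RpcResponse, RpcError>,
{
    let mut last = attempt();
    for _ in 0..max_retries {
        match last {
            // Any Ok short-circuits the loop: NotFound is never retried.
            Ok(resp) => return Ok(resp),
            Err(_) => last = attempt(),
        }
    }
    last
}

fn main() {
    // A stale follower answers Ok(NotFound) on the first attempt;
    // the retry loop hands it straight back as an HTTP 404.
    assert_eq!(execute_rpc(|| Ok(RpcResponse::NotFound), 3), Ok(RpcResponse::NotFound));

    // Err(NotLeader), by contrast, is retried until an Ok arrives.
    let mut calls = 0;
    let result = execute_rpc(
        || {
            calls += 1;
            if calls < 3 {
                Err(RpcError::NotLeader)
            } else {
                Ok(RpcResponse::Output("out".into()))
            }
        },
        5,
    );
    assert_eq!(result, Ok(RpcResponse::Output("out".into())));
}
```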

Contrast this with the BlockWhenNotReady mode (used by /attach): it also does an optimistic point-read first, but only short-circuits on Output(...) (invocation completed — safe, since you can't go back from completed to non-existent). For all other results (NotFound, NotReady, errors), it falls through to handle_rpc_proposal_command, which requires leadership and returns NotLeader on followers — triggering the ingress retry loop.
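That asymmetry can be sketched as follows — again with illustrative names (PointRead, Action, block_when_not_ready are not the real types): only a completed Output is trusted from the optimistic read, because completion is irreversible; every other result proves nothing on a possibly-stale store and must go through the leadership-checked path.

```rust
// Illustrative contrast: how BlockWhenNotReady treats the optimistic read.
#[derive(Debug, PartialEq)]
enum PointRead {
    Output(String),
    NotFound,
    NotReady,
}

#[derive(Debug, PartialEq)]
enum Action {
    ShortCircuit(String), // answer directly from the local store
    ProposeViaLeaderPath, // requires leadership; followers answer NotLeader
}

fn block_when_not_ready(read: PointRead) -> Action {
    match read {
        // Safe: an invocation cannot go from completed back to non-existent.
        PointRead::Output(out) => Action::ShortCircuit(out),
        // NotFound / NotReady on a possibly-stale store proves nothing:
        // fall through to the leadership-checked proposal command.
        _ => Action::ProposeViaLeaderPath,
    }
}

fn main() {
    assert_eq!(block_when_not_ready(PointRead::NotFound), Action::ProposeViaLeaderPath);
    assert_eq!(block_when_not_ready(PointRead::NotReady), Action::ProposeViaLeaderPath);
    assert_eq!(
        block_when_not_ready(PointRead::Output("done".into())),
        Action::ShortCircuit("done".into())
    );
}
```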

The interleaving

In the observed failure (3-node cluster, 4 partitions, replication factor 1):

T0: Workflow BlockAndWaitWorkflow/run submitted to partition P (leader: N2)
T1: Workflow executes and completes on N2
T2: /attach returns 200 ✅, first /output returns 200 ✅ (served from N2)
T3: Scheduler reconfigures partition P: {N2} → {N3}
T4: Second /output request arrives at N1's ingress. get_node_by_partition(P) → no leader known (N3 hasn't announced leadership yet) → falls back to first_alive_node() → picks N3
T5: N3's PP for partition P serves the point-read. N3's PP is still a follower and has not replayed the log entries containing the completed workflow. get_invocation_status() returns Free → NotFound → HTTP 404
T6: N3 obtains the leader epoch and starts its leadership campaign — too late

Why the partition moved

The scheduler uses a consistent hash (xxh3) seeded by (partition_id, node_id) to determine partition placement, but only considers nodes that are currently alive. N3 joined the alive set ~200ms after the initial placement, causing the scheduler to recompute the ideal placement and immediately reconfigure partition P from N2 to N3.
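A rough sketch of that placement scheme, under stated assumptions: the function names (score, place) are invented, and std's DefaultHasher stands in for xxh3 so the sketch needs no external crate. Each alive node gets a score from (partition_id, node_id), and the highest-scoring node hosts the partition — so when a node joins the alive set, the recomputed ideal placement can move a partition.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the xxh3-based score: hash (partition_id, node_id).
// (DefaultHasher substituted for xxh3 purely for illustration.)
fn score(partition_id: u16, node_id: u32) -> u64 {
    let mut h = DefaultHasher::new();
    (partition_id, node_id).hash(&mut h);
    h.finish()
}

// Rendezvous-style placement over the currently-alive nodes only.
fn place(partition_id: u16, alive_nodes: &[u32]) -> Option<u32> {
    alive_nodes.iter().copied().max_by_key(|&n| score(partition_id, n))
}

fn main() {
    // Placement is deterministic for a fixed alive set…
    assert_eq!(place(0, &[1, 2]), place(0, &[1, 2]));
    // …but when node 3 joins the alive set, the placement is recomputed
    // over the new set and may land on the newcomer.
    let placement = place(0, &[1, 2, 3]).unwrap();
    assert!([1, 2, 3].contains(&placement));
}
```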

Suggested Fix

Gate the ReplyIfNotReady point-read on leadership. If the PP is not leader, return PartitionProcessorRpcError::NotLeader instead of serving a potentially stale read. This error is already handled by the ingress retry loop (is_idempotent=true, retries every 50ms), so the request will be retried and eventually reach the leader.

This preserves the non-blocking /output semantics (ReplyIfNotReady still returns NotReady / HTTP 470 for in-flight invocations) while ensuring reads are always served from the authoritative leader.

Changes needed

  • crates/worker/src/partition/rpc/get_invocation_output.rs — In the ReplyIfNotReady branch, check self.proposer.is_leader() before serving the point-read. If not leader, reply with Err(NotLeader(partition_id)).
  • crates/worker/src/partition/rpc/mod.rs — Add is_leader() and partition_id() to the Actuator trait (with impl on LeadershipState which already has both).
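The gate itself can be sketched as follows. Everything here is hypothetical (PartitionProcessor, the local_store encoding, the function name) except the behavior the fix asks for: a non-leader refuses the point-read with NotLeader, and the leader keeps the existing non-blocking answers.

```rust
// Sketch of the suggested leadership gate (hypothetical names and fields).
#[derive(Debug, PartialEq)]
enum OutputResponse {
    NotFound,       // HTTP 404
    NotReady,       // HTTP 470
    Output(String), // HTTP 200
}

#[derive(Debug, PartialEq)]
enum RpcError {
    NotLeader(u16), // carries the partition id
}

struct PartitionProcessor {
    partition_id: u16,
    is_leader: bool,
    // None = Free (no record), Some(None) = in flight, Some(Some(o)) = completed
    local_store: Option<Option<String>>,
}

fn gated_reply_if_not_ready(pp: &PartitionProcessor) -> Result<OutputResponse, RpcError> {
    if !pp.is_leader {
        // Refuse rather than risk a stale read; the ingress retry loop
        // already handles NotLeader (is_idempotent=true) and retries
        // until the request reaches the leader.
        return Err(RpcError::NotLeader(pp.partition_id));
    }
    Ok(match &pp.local_store {
        None => OutputResponse::NotFound,
        Some(None) => OutputResponse::NotReady,
        Some(Some(out)) => OutputResponse::Output(out.clone()),
    })
}

fn main() {
    // A follower that hasn't replayed the completion no longer answers 404:
    let follower = PartitionProcessor { partition_id: 7, is_leader: false, local_store: None };
    assert_eq!(gated_reply_if_not_ready(&follower), Err(RpcError::NotLeader(7)));

    // The leader still gives non-blocking answers, including NotReady (HTTP 470).
    let leader = PartitionProcessor { partition_id: 7, is_leader: true, local_store: Some(None) };
    assert_eq!(gated_reply_if_not_ready(&leader), Ok(OutputResponse::NotReady));
}
```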
