## Summary
The workflow `/output` endpoint (`GET /restate/workflow/{service}/{key}/output`) can return a spurious HTTP 404 "invocation not found" for a workflow that has already completed. This occurs during partition reconfiguration, when the request is routed to a node whose partition processor has not yet replayed the log entries containing the invocation.
This is not merely test flakiness; it is a correctness issue that can affect production clusters during partition rebalancing.
## Observed Failure
- Test run: https://github.com/restatedev/restate/actions/runs/23308947715/job/67792723984
- Test: `threeNodes => WorkflowAPI => Set and resolve durable promise leads to completion`
- Error: `IngressException: [GET .../output][Status: 404] invocation not found`
## Root Cause Analysis
### The code path
The `/output` endpoint uses `GetInvocationOutputResponseMode::ReplyIfNotReady`, which is the correct mode for non-blocking output queries (it returns HTTP 470 "not ready" for in-flight invocations, as opposed to `/attach`, which blocks). However, the `ReplyIfNotReady` path performs a bare point-read against the local partition store with no leadership check:
- `crates/ingress-http/src/handler/workflow.rs:109` — calls `dispatcher.get_invocation_output()`
- `crates/ingress-http/src/rpc_request_dispatcher.rs:130` — wraps it in a retry loop (`execute_rpc(is_idempotent=true, ...)`)
- `crates/core/src/worker_api/partition_processor_rpc_client.rs:340` — resolves partition → target node, sends the RPC with `ReplyIfNotReady`
- `crates/core/src/partitions.rs:42-54` — `get_node_by_partition()` prefers the known leader, falls back to any alive node
- `crates/worker/src/partition/rpc/get_invocation_output.rs:136-141` — `ReplyIfNotReady` mode does a direct point-read from `self.storage`
- `crates/worker/src/partition/rpc/get_invocation_output.rs:70-73` — reads `invocation_status`; if `Free` (no record) → returns `NotFound`
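The end of this chain can be distilled into a few lines. The sketch below uses hypothetical, simplified types (`InvocationStatus`, `OutputResponse` are assumptions that only loosely mirror the real crates, not Restate's actual API) to show why a lagging follower produces the 404:

```rust
#[derive(Debug, PartialEq)]
enum InvocationStatus {
    Free,           // no record in the local partition store
    Completed(u64), // completed, with some output handle
}

#[derive(Debug, PartialEq)]
enum OutputResponse {
    NotFound,
    Output(u64),
}

// Today's behaviour: a bare point-read against whatever local store the RPC
// lands on. There is no leadership check, so a follower that has not yet
// replayed the log answers `Free`, which becomes Ok(NotFound) -- and the
// ingress retry loop only fires on Err, so Ok(NotFound) is final.
fn reply_if_not_ready(local_store: &InvocationStatus) -> Result<OutputResponse, ()> {
    match local_store {
        InvocationStatus::Free => Ok(OutputResponse::NotFound), // stale 404
        InvocationStatus::Completed(out) => Ok(OutputResponse::Output(*out)),
    }
}

fn main() {
    // The leader has replayed the completion; a lagging follower has not.
    let leader = InvocationStatus::Completed(42);
    let follower = InvocationStatus::Free;
    assert_eq!(reply_if_not_ready(&leader), Ok(OutputResponse::Output(42)));
    // Same invocation, same moment: the follower answers NotFound -> HTTP 404.
    assert_eq!(reply_if_not_ready(&follower), Ok(OutputResponse::NotFound));
}
```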
### The problem: stale reads from non-leader partition processors
The `ReplyIfNotReady` mode itself is needed — it provides the correct non-blocking semantics for `/output`. The problem is that the point-read it performs has no guard against serving stale data:

1. **No leadership check**: the point-read is served by whichever partition processor receives the RPC, including followers whose local store may be arbitrarily behind (`get_invocation_output.rs:136-141`).
2. **`NotFound` is `Ok`, not `Err`**: the response `Ok(NotFound)` passes through the ingress retry layer without being retried, because retries only fire on `Err` (`partition_processor_rpc_client.rs:350`, `rpc_request_dispatcher.rs:58-73`).
This means a point-read on a follower PP that hasn't replayed the invocation yet will return `Free` → `NotFound` → HTTP 404, and this response is treated as authoritative and returned to the client without retry.
Contrast this with the `BlockWhenNotReady` mode (used by `/attach`): it also does an optimistic point-read first, but only short-circuits on `Output(...)` (invocation completed — safe, since an invocation cannot go back from completed to non-existent). For all other results (`NotFound`, `NotReady`, errors), it falls through to `handle_rpc_proposal_command`, which requires leadership and returns `NotLeader` on followers — triggering the ingress retry loop.
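The `BlockWhenNotReady` decision logic can be sketched as follows (simplified types and function shape are assumptions for illustration, not the real Restate signatures):

```rust
#[derive(Debug, PartialEq)]
enum ReadResult {
    NotFound,
    NotReady,
    Output(u64),
}

#[derive(Debug, PartialEq)]
enum RpcError {
    NotLeader,
}

// The optimistic point-read only short-circuits on Output(..): a completed
// invocation cannot "un-complete", so that read is safe even on a stale
// follower. Every other result falls through to the proposal path, which
// requires leadership and fails with NotLeader on followers -- triggering
// the ingress retry loop instead of returning a stale answer.
fn block_when_not_ready(local_read: ReadResult, is_leader: bool) -> Result<ReadResult, RpcError> {
    if let ReadResult::Output(v) = local_read {
        return Ok(ReadResult::Output(v)); // safe short-circuit
    }
    if !is_leader {
        return Err(RpcError::NotLeader); // follower: force a retry elsewhere
    }
    Ok(local_read) // leader: the read is authoritative
}

fn main() {
    // A follower that has not replayed the completion reads NotFound locally,
    // but instead of a spurious 404 the caller gets NotLeader and retries.
    assert_eq!(block_when_not_ready(ReadResult::NotFound, false), Err(RpcError::NotLeader));
    // A completed result is safe to return from anywhere.
    assert_eq!(block_when_not_ready(ReadResult::Output(7), false), Ok(ReadResult::Output(7)));
    // On the leader, NotReady passes through (the /attach path then blocks).
    assert_eq!(block_when_not_ready(ReadResult::NotReady, true), Ok(ReadResult::NotReady));
}
```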
### The interleaving
In the observed failure (3-node cluster, 4 partitions, replication factor 1):
| Time | Event |
|---|---|
| T0 | Workflow `BlockAndWaitWorkflow/run` submitted to partition P (leader: N2) |
| T1 | Workflow executes and completes on N2 |
| T2 | `/attach` returns 200 ✅, first `/output` returns 200 ✅ (served from N2) |
| T3 | Scheduler reconfigures partition P: {N2} → {N3} |
| T4 | Second `/output` request arrives at N1's ingress. `get_node_by_partition(P)` → no leader known (N3 hasn't announced leadership yet) → falls back to `first_alive_node()` → picks N3 |
| T5 | N3's PP for partition P serves the point-read. It is still a follower and has not replayed the log entries containing the completed workflow. `get_invocation_status()` returns `Free` → `NotFound` → HTTP 404 |
| T6 | N3 gets the leader epoch and starts its leadership campaign — too late |
### Why the partition moved
The scheduler uses a consistent hash (xxh3) seeded by (partition_id, node_id) to determine partition placement, but only considers nodes that are currently alive. N3 joined the alive set ~200ms after the initial placement, causing the scheduler to recompute the ideal placement and immediately reconfigure partition P from N2 to N3.
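The placement mechanism above can be sketched with a rendezvous-style hash seeded by `(partition_id, node_id)` and restricted to alive nodes. `DefaultHasher` stands in for xxh3 here (an assumption for the sketch); the point is only that the ideal placement is a pure function of the alive set, so a node joining late can change the answer and make the scheduler reconfigure the partition:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Score each (partition, node) pair deterministically.
fn score(partition_id: u16, node_id: u32) -> u64 {
    let mut h = DefaultHasher::new();
    (partition_id, node_id).hash(&mut h);
    h.finish()
}

// Ideal placement: the alive node with the highest score for this partition.
fn place(partition_id: u16, alive: &[u32]) -> Option<u32> {
    alive.iter().copied().max_by_key(|&n| score(partition_id, n))
}

fn main() {
    let before = place(2, &[1, 2]);   // N3 not yet in the alive set
    let after = place(2, &[1, 2, 3]); // N3 joins ~200ms later
    // If N3 scores highest for this partition, the ideal placement moves and
    // the partition is reconfigured, even though nothing else changed.
    println!("before={before:?} after={after:?}");
}
```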
## Suggested Fix
Gate the `ReplyIfNotReady` point-read on leadership. If the PP is not the leader, return `PartitionProcessorRpcError::NotLeader` instead of serving a potentially stale read. This error is already handled by the ingress retry loop (`is_idempotent=true`, retries every 50ms), so the request will be retried and eventually reach the leader.
This preserves the non-blocking `/output` semantics (`ReplyIfNotReady` still returns `NotReady` / HTTP 470 for in-flight invocations) while ensuring reads are always served from the authoritative leader.
### Changes needed
- `crates/worker/src/partition/rpc/get_invocation_output.rs` — in the `ReplyIfNotReady` branch, check `self.proposer.is_leader()` before serving the point-read. If not leader, reply with `Err(NotLeader(partition_id))`.
- `crates/worker/src/partition/rpc/mod.rs` — add `is_leader()` and `partition_id()` to the `Actuator` trait (with the impl on `LeadershipState`, which already has both).
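A minimal sketch of the gated branch, under the assumption that the `Actuator` trait grows the two methods above (the trait shape, `FakePp` test double, and error enum here are hypothetical, not the actual Restate code):

```rust
#[derive(Clone, Debug, PartialEq)]
enum Response {
    NotFound,
    NotReady,
    Output(u64),
}

#[derive(Debug, PartialEq)]
enum RpcError {
    NotLeader(u16), // partition id
}

// Hypothetical shape of the extended Actuator trait.
trait Actuator {
    fn is_leader(&self) -> bool;
    fn partition_id(&self) -> u16;
    fn point_read(&self) -> Response;
}

// Gated ReplyIfNotReady: followers no longer serve the point-read. They reply
// NotLeader, which the ingress retry loop (is_idempotent=true, ~50ms backoff)
// already handles, so the request eventually reaches the leader. In-flight
// invocations still get the non-blocking NotReady / HTTP 470 answer.
fn reply_if_not_ready<A: Actuator>(pp: &A) -> Result<Response, RpcError> {
    if !pp.is_leader() {
        return Err(RpcError::NotLeader(pp.partition_id()));
    }
    Ok(pp.point_read())
}

// Test double standing in for a partition processor.
struct FakePp {
    leader: bool,
    status: Response,
}

impl Actuator for FakePp {
    fn is_leader(&self) -> bool { self.leader }
    fn partition_id(&self) -> u16 { 2 }
    fn point_read(&self) -> Response { self.status.clone() }
}

fn main() {
    // A stale follower now yields a retryable error instead of a spurious 404.
    let follower = FakePp { leader: false, status: Response::NotFound };
    assert_eq!(reply_if_not_ready(&follower), Err(RpcError::NotLeader(2)));
    // The leader still serves the completed output directly.
    let leader = FakePp { leader: true, status: Response::Output(42) };
    assert_eq!(reply_if_not_ready(&leader), Ok(Response::Output(42)));
}
```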