## Summary
The workflow `/output` endpoint (`GET /restate/workflow/{service}/{key}/output`) can return a spurious HTTP 404 "invocation not found" for a workflow that has already completed. This occurs during partition reconfiguration, when the request is routed to a node whose partition processor has not yet replayed the log entries containing the invocation.
This is not merely test flakiness; it is a correctness issue that can affect production clusters during partition rebalancing.
## Observed Failure
- Test run: https://github.com/restatedev/restate/actions/runs/23308947715/job/67792723984
- Test: `threeNodes => WorkflowAPI => Set and resolve durable promise leads to completion`
- Error: `IngressException: [GET .../output][Status: 404] invocation not found`
## Root Cause Analysis
### The code path
The `/output` endpoint uses `GetInvocationOutputResponseMode::ReplyIfNotReady`, which is the correct mode for non-blocking output queries (it returns HTTP 470 "not ready" for in-flight invocations, as opposed to `/attach`, which blocks). However, the `ReplyIfNotReady` path performs a bare point-read against the local partition store with no leadership check:
- `crates/ingress-http/src/handler/workflow.rs:109` — calls `dispatcher.get_invocation_output()`
- `crates/ingress-http/src/rpc_request_dispatcher.rs:130` — wraps it in a retry loop (`execute_rpc(is_idempotent=true, ...)`)
- `crates/core/src/worker_api/partition_processor_rpc_client.rs:340` — resolves partition → target node, sends the RPC with `ReplyIfNotReady`
- `crates/core/src/partitions.rs:42-54` — `get_node_by_partition()` prefers the known leader, falls back to any alive node
- `crates/worker/src/partition/rpc/get_invocation_output.rs:136-141` — `ReplyIfNotReady` mode does a direct point-read from `self.storage`
- `crates/worker/src/partition/rpc/get_invocation_output.rs:70-73` — reads `invocation_status`; if `Free` (no record) → returns `NotFound`
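The end of this chain can be distilled into a few lines. The sketch below uses hypothetical, simplified types (`InvocationStatus`, `OutputResponse` are assumptions that only loosely mirror the real crates, not Restate's actual API) to show why a lagging follower produces the 404:

```rust
#[derive(Debug, PartialEq)]
enum InvocationStatus {
    Free,           // no record in the local partition store
    Completed(u64), // completed, with some output handle
}

#[derive(Debug, PartialEq)]
enum OutputResponse {
    NotFound,
    Output(u64),
}

// Today's behaviour: a bare point-read against whatever local store the RPC
// lands on. There is no leadership check, so a follower that has not yet
// replayed the log answers `Free`, which becomes Ok(NotFound) -- and the
// ingress retry loop only fires on Err, so Ok(NotFound) is final.
fn reply_if_not_ready(local_store: &InvocationStatus) -> Result<OutputResponse, ()> {
    match local_store {
        InvocationStatus::Free => Ok(OutputResponse::NotFound), // stale 404
        InvocationStatus::Completed(out) => Ok(OutputResponse::Output(*out)),
    }
}

fn main() {
    // The leader has replayed the completion; a lagging follower has not.
    let leader = InvocationStatus::Completed(42);
    let follower = InvocationStatus::Free;
    assert_eq!(reply_if_not_ready(&leader), Ok(OutputResponse::Output(42)));
    // Same invocation, same moment: the follower answers NotFound -> HTTP 404.
    assert_eq!(reply_if_not_ready(&follower), Ok(OutputResponse::NotFound));
}
```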
### The problem: stale reads from non-leader partition processors
The `ReplyIfNotReady` mode itself is needed — it provides the correct non-blocking semantics for `/output`. The problem is that the point-read it performs has no guard against serving stale data:

1. **No leadership check**: the point-read is served by whichever partition processor receives the RPC, including followers whose local store may be arbitrarily behind (`get_invocation_output.rs:136-141`).
2. **`NotFound` is `Ok`, not `Err`**: the response `Ok(NotFound)` passes through the ingress retry layer without being retried, because retries only fire on `Err` (`partition_processor_rpc_client.rs:350`, `rpc_request_dispatcher.rs:58-73`).
This means a point-read on a follower PP that hasn't replayed the invocation yet will return `Free` → `NotFound` → HTTP 404, and this response is treated as authoritative and returned to the client without retry.
Contrast this with the `BlockWhenNotReady` mode (used by `/attach`): it also does an optimistic point-read first, but only short-circuits on `Output(...)` (invocation completed — safe, since an invocation cannot go back from completed to non-existent). For all other results (`NotFound`, `NotReady`, errors), it falls through to `handle_rpc_proposal_command`, which requires leadership and returns `NotLeader` on followers — triggering the ingress retry loop.
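The `BlockWhenNotReady` decision logic can be sketched as follows (simplified types and function shape are assumptions for illustration, not the real Restate signatures):

```rust
#[derive(Debug, PartialEq)]
enum ReadResult {
    NotFound,
    NotReady,
    Output(u64),
}

#[derive(Debug, PartialEq)]
enum RpcError {
    NotLeader,
}

// The optimistic point-read only short-circuits on Output(..): a completed
// invocation cannot "un-complete", so that read is safe even on a stale
// follower. Every other result falls through to the proposal path, which
// requires leadership and fails with NotLeader on followers -- triggering
// the ingress retry loop instead of returning a stale answer.
fn block_when_not_ready(local_read: ReadResult, is_leader: bool) -> Result<ReadResult, RpcError> {
    if let ReadResult::Output(v) = local_read {
        return Ok(ReadResult::Output(v)); // safe short-circuit
    }
    if !is_leader {
        return Err(RpcError::NotLeader); // follower: force a retry elsewhere
    }
    Ok(local_read) // leader: the read is authoritative
}

fn main() {
    // A follower that has not replayed the completion reads NotFound locally,
    // but instead of a spurious 404 the caller gets NotLeader and retries.
    assert_eq!(block_when_not_ready(ReadResult::NotFound, false), Err(RpcError::NotLeader));
    // A completed result is safe to return from anywhere.
    assert_eq!(block_when_not_ready(ReadResult::Output(7), false), Ok(ReadResult::Output(7)));
    // On the leader, NotReady passes through (the /attach path then blocks).
    assert_eq!(block_when_not_ready(ReadResult::NotReady, true), Ok(ReadResult::NotReady));
}
```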
### The interleaving
In the observed failure (3-node cluster, 4 partitions, replication factor 1):
| Time | Event |
|---|---|
| T0 | Workflow `BlockAndWaitWorkflow/run` submitted to partition P (leader: N2) |
| T1 | Workflow executes and completes on N2 |
| T2 | `/attach` returns 200 ✅, first `/output` returns 200 ✅ (served from N2) |
| T3 | Scheduler reconfigures partition P: {N2} → {N3} |
| T4 | Second `/output` request arrives at N1's ingress. `get_node_by_partition(P)` → no leader known (N3 hasn't announced leadership yet) → falls back to `first_alive_node()` → picks N3 |
| T5 | N3's PP for partition P serves the point-read. It is still a follower and has not replayed the log entries containing the completed workflow. `get_invocation_status()` returns `Free` → `NotFound` → HTTP 404 |
| T6 | N3 gets the leader epoch and starts its leadership campaign — too late |
### Why the partition moved
The scheduler uses a consistent hash (xxh3) seeded by (partition_id, node_id) to determine partition placement, but only considers nodes that are currently alive. N3 joined the alive set ~200ms after the initial placement, causing the scheduler to recompute the ideal placement and immediately reconfigure partition P from N2 to N3.
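The placement mechanism above can be sketched with a rendezvous-style hash seeded by `(partition_id, node_id)` and restricted to alive nodes. `DefaultHasher` stands in for xxh3 here (an assumption for the sketch); the point is only that the ideal placement is a pure function of the alive set, so a node joining late can change the answer and make the scheduler reconfigure the partition:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Score each (partition, node) pair deterministically.
fn score(partition_id: u16, node_id: u32) -> u64 {
    let mut h = DefaultHasher::new();
    (partition_id, node_id).hash(&mut h);
    h.finish()
}

// Ideal placement: the alive node with the highest score for this partition.
fn place(partition_id: u16, alive: &[u32]) -> Option<u32> {
    alive.iter().copied().max_by_key(|&n| score(partition_id, n))
}

fn main() {
    let before = place(2, &[1, 2]);   // N3 not yet in the alive set
    let after = place(2, &[1, 2, 3]); // N3 joins ~200ms later
    // If N3 scores highest for this partition, the ideal placement moves and
    // the partition is reconfigured, even though nothing else changed.
    println!("before={before:?} after={after:?}");
}
```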
## Suggested Fix
Gate the `ReplyIfNotReady` point-read on leadership. If the PP is not the leader, return `PartitionProcessorRpcError::NotLeader` instead of serving a potentially stale read. This error is already handled by the ingress retry loop (`is_idempotent=true`, retries every 50ms), so the request will be retried and eventually reach the leader.
This preserves the non-blocking `/output` semantics (`ReplyIfNotReady` still returns `NotReady` / HTTP 470 for in-flight invocations) while ensuring reads are always served from the authoritative leader.
### Changes needed
- `crates/worker/src/partition/rpc/get_invocation_output.rs` — in the `ReplyIfNotReady` branch, check `self.proposer.is_leader()` before serving the point-read. If not leader, reply with `Err(NotLeader(partition_id))`.
- `crates/worker/src/partition/rpc/mod.rs` — add `is_leader()` and `partition_id()` to the `Actuator` trait (with the impl on `LeadershipState`, which already has both).
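A minimal sketch of the gated branch, under the assumption that the `Actuator` trait grows the two methods above (the trait shape, `FakePp` test double, and error enum here are hypothetical, not the actual Restate code):

```rust
#[derive(Clone, Debug, PartialEq)]
enum Response {
    NotFound,
    NotReady,
    Output(u64),
}

#[derive(Debug, PartialEq)]
enum RpcError {
    NotLeader(u16), // partition id
}

// Hypothetical shape of the extended Actuator trait.
trait Actuator {
    fn is_leader(&self) -> bool;
    fn partition_id(&self) -> u16;
    fn point_read(&self) -> Response;
}

// Gated ReplyIfNotReady: followers no longer serve the point-read. They reply
// NotLeader, which the ingress retry loop (is_idempotent=true, ~50ms backoff)
// already handles, so the request eventually reaches the leader. In-flight
// invocations still get the non-blocking NotReady / HTTP 470 answer.
fn reply_if_not_ready<A: Actuator>(pp: &A) -> Result<Response, RpcError> {
    if !pp.is_leader() {
        return Err(RpcError::NotLeader(pp.partition_id()));
    }
    Ok(pp.point_read())
}

// Test double standing in for a partition processor.
struct FakePp {
    leader: bool,
    status: Response,
}

impl Actuator for FakePp {
    fn is_leader(&self) -> bool { self.leader }
    fn partition_id(&self) -> u16 { 2 }
    fn point_read(&self) -> Response { self.status.clone() }
}

fn main() {
    // A stale follower now yields a retryable error instead of a spurious 404.
    let follower = FakePp { leader: false, status: Response::NotFound };
    assert_eq!(reply_if_not_ready(&follower), Err(RpcError::NotLeader(2)));
    // The leader still serves the completed output directly.
    let leader = FakePp { leader: true, status: Response::Output(42) };
    assert_eq!(reply_if_not_ready(&leader), Ok(Response::Output(42)));
}
```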