agents-server restart: runtime fails to recover cleanly on reconnection

## Summary

When the `agents-server` is restarted while a connected runtime (`agents-runtime`) is still running, the runtime appears to "bug out" rather than transparently reconnecting. The runtime side has no retry/reconnect logic for its HTTP calls to the server, and several recovery code paths on the server side rely on Electric shape-stream cursors / handles that may not be valid post-restart.

This issue is filed from a code review pass — it documents specific failure modes in the reconnect path that are visible in the source today, rather than a single confirmed reproduction. Anyone hitting this in practice: please attach logs from both the runtime and the server during the restart window.

## Symptom (reported)

After bouncing `agents-server`, the runtime that was already attached doesn't fully recover — wakes don't fire, entity updates appear stuck, or HTTP calls into the server start failing without recovering.

## Code-level evidence

### 1. Runtime has no retry/reconnect on HTTP calls

`packages/agents-runtime/src/runtime-server-client.ts:138-140` is the only request path; it's a plain `fetch()` with no retry, no backoff, and no idempotency-aware re-issue:

```ts
const request = (path: string, init?: RequestInit): Promise<Response> => {
  return track(fetchImpl(`${config.baseUrl}${path}`, init))
}
```

Any call in flight while the server is down (`spawnEntity`, `sendEntityMessage`, `registerWake`, `upsertCronSchedule`, etc.) fails with a network error and is surfaced to the caller. Type registration (`packages/agents-runtime/src/create-handler.ts:355-472`) is also only invoked once at runtime startup — there's no mechanism to re-register if the server later forgets us.

In practice this is mostly fine, because most server-side state *is* persisted (entity types, subscription webhooks, wake registrations, scheduled tasks all live in Postgres). But the runtime never finds out the server came back, so any in-flight request during the restart window is lost forever from the caller's perspective.
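
To make the gap concrete, here is a minimal sketch of the kind of wrapper the client lacks. `retryWithBackoff`, `backoffDelayMs`, and `RetryOptions` are hypothetical names, not existing code in `agents-runtime`, and the sketch assumes the wrapped operation is safe to re-issue (idempotent).

```typescript
// Hypothetical helper; nothing like this exists in runtime-server-client.ts today.
interface RetryOptions {
  maxAttempts: number // total attempts, including the first one
  baseDelayMs: number // delay before the second attempt
  maxDelayMs: number // upper bound on any single delay
  sleep?: (ms: number) => Promise<void> // injectable so tests need not wait
}

// Exponential backoff without jitter: base * 2^(attempt - 1), capped at maxDelayMs.
function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt - 1))
}

// Re-runs `op` until it resolves or maxAttempts is exhausted,
// then rethrows the last error. Only use with idempotent operations.
async function retryWithBackoff<T>(op: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)))
  let lastError: unknown
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await op()
    } catch (err) {
      lastError = err
      if (attempt === opts.maxAttempts) break
      await sleep(backoffDelayMs(attempt, opts.baseDelayMs, opts.maxDelayMs))
    }
  }
  throw lastError
}
```

Note that `fetch()` rejects only on network-level failures; an HTTP 5xx still resolves. A real integration would have to decide which response statuses count as retryable and throw for those inside `op`.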

### 2. Entity bridge resumes from a possibly-stale shape handle

`packages/agents-server/src/entity-bridge-manager.ts:153-160` (in `start()`):

```ts
if (this.initialShapeHandle && this.initialShapeOffset) {
  const initialOffset = parseElectricOffset(this.initialShapeOffset)
  if (initialOffset) {
    this.startLiveStream(initialOffset, this.initialShapeHandle)
    return
  }
}
await this.resync(`startup`)
```

On restart we try to resume an Electric shape subscription using the last persisted `shape_handle` + `shape_offset`. If Electric has rotated/compacted the shape since then, the handle is invalid. The error path:

- `createShapeStream` `onError` at `entity-bridge-manager.ts:302-311` only logs a warning and returns `{}` — it does not trigger a resync.
- `must-refetch` is handled correctly at `entity-bridge-manager.ts:332-342` (clears cursor, rescans).
- The subscription-error callback at `entity-bridge-manager.ts:373-388` does call `requestResync(\`subscription-error\`)`.

So whether we recover depends on which of those paths Electric drives us down. A silent staleness in the live stream (no `must-refetch`, no subscription error, just no new messages) would leave the bridge as a zombie — in-memory but not receiving updates. Wake conditions on entity changes would then never fire.
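
One way to picture a guard against the zombie case is a silence watchdog, sketched below. `createStalenessWatchdog` is a hypothetical helper, not something in `entity-bridge-manager.ts`. It also rests on an unverified assumption: that a healthy live stream produces some activity (data or control messages) at least once per `maxSilenceMs`. If a quiet stream is normal, silence alone does not prove staleness.

```typescript
// Hypothetical helper; the bridge has no such watchdog today.
interface StalenessWatchdog {
  recordActivity(): void // call whenever the stream delivers any message
  isStale(): boolean // true once the silence exceeds maxSilenceMs
}

function createStalenessWatchdog(
  maxSilenceMs: number,
  now: () => number = Date.now, // injectable clock for tests
): StalenessWatchdog {
  let lastActivityMs = now()
  return {
    recordActivity(): void {
      lastActivityMs = now()
    },
    isStale(): boolean {
      return now() - lastActivityMs > maxSilenceMs
    },
  }
}
```

A bridge could call `recordActivity()` from its message handler and check `isStale()` on a timer, requesting a resync when it returns `true`. Whether `requestResync` is the right hook for that is a guess from reading the source, not a verified design.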

### 3. Wake registry recovery is best-effort

`packages/agents-server/src/wake-registry.ts:238-278` (`recoverSync`) handles shape-stream errors by stopping, reloading registrations from Postgres, and re-subscribing. This is solid for the *catastrophic error* case, but the same caveat as above applies: a silent live-stream staleness would not be caught.

### 4. Wake delivery during the restart window

Wake delivery happens in two steps:

- Wakes are appended to the subscriber's durable stream (`electric-agents-manager.ts:946-952`) — durable, OK.
- Subscribers are notified via `subscription_webhooks` lookup at `/_electric/webhook-forward/<id>` (`server.ts:1478-1520`).

If durable-streams tries to deliver a webhook *while* agents-server is restarting, success depends on durable-streams' retry behaviour. Worth confirming this doesn't drop events.

### 5. What does work

For balance — these are all explicitly covered:

- Scheduled / delayed sends survive restart: `packages/agents-server/test/scheduler-integration.test.ts:117` (`delayed_send survives server restart and lands exactly once`).
- Re-registering an entity type after restart updates it: `scheduler-integration.test.ts:158`.
- Entity bridge handles `must-refetch` from Electric: `entity-bridge-manager.test.ts:362`.
- Wake registrations and webhook subscriptions are persisted in Postgres and rebuilt on startup (`server.ts:405-409`).

There is no test exercising the *runtime* side of restart — i.e. a runtime that's already attached when `agents-server` bounces.

## Suggested repro

`packages/agents-server/docker-compose.dev.yml` provides Postgres + Electric. To exercise the runtime path:

1. Start the dev stack (Postgres + Electric + durable-streams + agents-server).
2. Start a runtime (e.g. via `packages/agents`) pointing at the agents-server. Register an entity type that has a wake on entity changes from another tag-filtered entity.
3. Trigger something from the runtime (spawn entity, register wake) and confirm it works.
4. Restart **only** the `agents-server` process (leave Electric, Postgres, durable-streams, and the runtime running).
5. From the runtime, retry the same operations, and additionally trigger entity changes that should fire wakes. Observe:
   - Does any HTTP call from the runtime fail without recovering?
   - Do wakes still fire after a few seconds?
   - Do entity bridge updates still propagate?
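
To turn the observations in step 5 into pass/fail checks instead of eyeballing logs, a small polling helper is enough. `waitFor` below is a hypothetical utility, and the `check` callback (for example, "has the expected wake arrived?") is left to whoever runs the repro.

```typescript
// Hypothetical repro utility: polls `check` until it returns true or the
// timeout elapses. Resolves to true on success, false on timeout.
async function waitFor(
  check: () => Promise<boolean> | boolean,
  timeoutMs: number,
  intervalMs: number = 100,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs
  while (true) {
    if (await check()) return true
    if (Date.now() >= deadline) return false
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs))
  }
}
```

After restarting the server, something like `await waitFor(() => wakeWasObserved(), 10_000)` would give a clear signal, where `wakeWasObserved` is a placeholder for whatever the runtime under test exposes.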

## Ideas for the fix

- Add retry-with-backoff to `runtime-server-client.ts` for idempotent operations (and handle non-idempotent ones explicitly).
- Add a periodic health-probe / heartbeat from runtime → server so the runtime can re-register types if the server returns a "you don't know me" response.
- For entity bridges, add a periodic liveness check (e.g. compare last-seen offset to Electric's current head) to catch silent staleness, not just explicit errors.
- Add an integration test mirroring `delayed_send survives server restart` but covering a live runtime: keep the runtime up across the restart and assert wakes still fire and HTTP calls succeed.
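
As a rough illustration of the heartbeat idea, a single probe cycle could look like the sketch below. Everything here is hypothetical: the server exposes no "do you know me" endpoint today, and `probe` and `reRegister` stand in for calls that would have to be designed.

```typescript
// Hypothetical types and helper; no such probe exists on the server today.
type ProbeResult = 'known' | 'unknown' | 'unreachable'

interface HeartbeatDeps {
  probe: () => Promise<ProbeResult> // asks the server whether it knows this runtime's types
  reRegister: () => Promise<void> // re-runs the registration normally done once at startup
}

// One heartbeat cycle: re-register only when the server is reachable
// but reports that it does not know this runtime.
async function heartbeatOnce(
  deps: HeartbeatDeps,
): Promise<'ok' | 're-registered' | 'unreachable'> {
  let result: ProbeResult
  try {
    result = await deps.probe()
  } catch {
    // A thrown probe (e.g. a network error) is treated as "server down".
    result = 'unreachable'
  }
  if (result === 'unreachable') return 'unreachable'
  if (result === 'unknown') {
    await deps.reRegister()
    return 're-registered'
  }
  return 'ok'
}
```

Running this on an interval would let the runtime notice both a server that is down and one that came back without its registrations, at the cost of one extra request per cycle.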

## Notes

This issue is opened from static analysis. The reported symptom matches the code paths above, but I have not traced it to a concrete failure log. Anyone with logs from a live recurrence: please attach.
