agents-server restart: runtime fails to recover cleanly on reconnection #4197

@icehaunter

Summary

When the agents-server is restarted while a connected runtime (agents-runtime) is still running, the runtime appears to "bug out" rather than transparently reconnecting. The runtime side has no retry/reconnect logic for its HTTP calls to the server, and several recovery code paths on the server side rely on Electric shape-stream cursors / handles that may not be valid post-restart.

This issue is filed from a code review pass — it documents specific failure modes in the reconnect path that are visible in the source today, rather than a single confirmed reproduction. Anyone hitting this in practice: please attach logs from both the runtime and the server during the restart window.

Symptom (reported)

After bouncing agents-server, the runtime that was already attached doesn't fully recover — wakes don't fire, entity updates appear stuck, or HTTP calls into the server start failing without recovering.

Code-level evidence

1. Runtime has no retry/reconnect on HTTP calls

packages/agents-runtime/src/runtime-server-client.ts:138-140 is the only request path; it's a plain fetch() with no retry, no backoff, and no idempotency-aware re-issue:

const request = (path: string, init?: RequestInit): Promise<Response> => {
  return track(fetchImpl(`${config.baseUrl}${path}`, init))
}

Any call in flight while the server is down (spawnEntity, sendEntityMessage, registerWake, upsertCronSchedule, etc.) fails with a network error and is surfaced to the caller. Type registration (packages/agents-runtime/src/create-handler.ts:355-472) is also only invoked once at runtime startup — there's no mechanism to re-register if the server later forgets us.

In practice this is mostly fine, because most server-side state is persisted (entity types, subscription webhooks, wake registrations, scheduled tasks all live in Postgres). But the runtime never finds out the server came back and nothing re-issues a failed request, so any request in flight during the restart window is lost from the caller's perspective.
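A minimal sketch of what retry-with-backoff around the request path could look like. Everything here is hypothetical: `retryWithBackoff`, `RetryOptions`, and the choice of retryable status codes are illustrations, not existing code in agents-runtime. It should only wrap operations that are safe to re-issue.

```typescript
// Hypothetical helper, not part of agents-runtime today.
interface RetryOptions {
  maxAttempts: number // total attempts, including the first one
  baseDelayMs: number
  maxDelayMs: number
  sleep?: (ms: number) => Promise<void> // injectable so tests can avoid real timers
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

// Exponential backoff capped at maxDelayMs: base, 2*base, 4*base, ...
function backoffDelay(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** attempt)
}

// Re-runs `op` when it throws (e.g. connection refused while the server is
// down) or when `isRetryable` flags its result (e.g. a 503 from a proxy).
// On the final attempt the result is returned, or the error rethrown, as-is.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  isRetryable: (result: T) => boolean,
  opts: RetryOptions
): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep
  for (let attempt = 0; ; attempt++) {
    const isLast = attempt >= opts.maxAttempts - 1
    try {
      const result = await op()
      if (isLast || !isRetryable(result)) return result
    } catch (err) {
      if (isLast) throw err
    }
    await sleep(backoffDelay(attempt, opts))
  }
}

// Possible use inside the existing request helper (assumed shape):
//   retryWithBackoff(
//     () => fetchImpl(`${config.baseUrl}${path}`, init),
//     (res) => res.status >= 502 && res.status <= 504,
//     { maxAttempts: 5, baseDelayMs: 200, maxDelayMs: 5_000 }
//   )
```

Non-idempotent calls such as spawnEntity would need an idempotency key or an explicit opt-out before being wrapped like this, otherwise a retry after a lost response could apply the operation twice.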

2. Entity bridge resumes from a possibly-stale shape handle

packages/agents-server/src/entity-bridge-manager.ts:153-160 (in start()):

if (this.initialShapeHandle && this.initialShapeOffset) {
  const initialOffset = parseElectricOffset(this.initialShapeOffset)
  if (initialOffset) {
    this.startLiveStream(initialOffset, this.initialShapeHandle)
    return
  }
}
await this.resync(`startup`)

On restart we try to resume an Electric shape subscription using the last persisted shape_handle + shape_offset. If Electric has rotated/compacted the shape since then, the handle is invalid. The error path:

  • createShapeStream onError at entity-bridge-manager.ts:302-311 only logs a warning and returns {} — it does not trigger a resync.
  • must-refetch is handled correctly at entity-bridge-manager.ts:332-342 (clears cursor, rescans).
  • The subscription-error callback at entity-bridge-manager.ts:373-388 does call requestResync(`subscription-error`).

So whether we recover depends on which of those paths Electric drives us down. A silent staleness in the live stream (no must-refetch, no subscription error, just no new messages) would leave the bridge as a zombie — in-memory but not receiving updates. Wake conditions on entity changes would then never fire.
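One way to catch the zombie case is a silence watchdog. The sketch below is hypothetical (`StalenessWatchdog` does not exist in the codebase) and assumes the live stream delivers some message, data or control, at a regular cadence even when nothing changes; if that assumption does not hold for Electric, an idle bridge would trigger spurious resyncs and the threshold would need tuning.

```typescript
// Hypothetical watchdog: flags a stream that has been silent for too long.
class StalenessWatchdog {
  private lastSeenMs: number
  private readonly maxSilenceMs: number
  private readonly onStale: (silentForMs: number) => void
  private readonly now: () => number

  constructor(
    maxSilenceMs: number,
    onStale: (silentForMs: number) => void,
    now: () => number = Date.now // injectable clock for tests
  ) {
    this.maxSilenceMs = maxSilenceMs
    this.onStale = onStale
    this.now = now
    this.lastSeenMs = now()
  }

  // Call from the stream's message callback for every batch received.
  touch(): void {
    this.lastSeenMs = this.now()
  }

  // Call periodically (e.g. from setInterval). Returns true if it fired.
  check(): boolean {
    const silentForMs = this.now() - this.lastSeenMs
    if (silentForMs <= this.maxSilenceMs) return false
    // Reset first so one stall produces one callback per silence window.
    this.touch()
    this.onStale(silentForMs)
    return true
  }
}

// Possible wiring in the bridge (assumed shape, reason string is made up):
//   const watchdog = new StalenessWatchdog(60_000, () =>
//     this.requestResync(`liveness-timeout`))
//   setInterval(() => watchdog.check(), 10_000)
```

Comparing the last-seen offset against Electric's current head would be a stronger check than elapsed time alone, but it needs an extra request per probe.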

3. Wake registry recovery is best-effort

packages/agents-server/src/wake-registry.ts:238-278 (recoverSync) handles shape-stream errors by stopping, reloading registrations from Postgres, and re-subscribing. This is solid for the catastrophic error case, but the same caveat as above applies: a silent live-stream staleness would not be caught.

4. Wake delivery during the restart window

Wake delivery is split across two steps:

  • The wake is appended to the subscriber's durable stream (electric-agents-manager.ts:946-952). This part is durable, OK.
  • The subscriber is notified via a subscription_webhooks lookup at /_electric/webhook-forward/<id> (server.ts:1478-1520).

If durable-streams tries to deliver a webhook while agents-server is restarting, success depends on durable-streams' retry behaviour. Worth confirming this doesn't drop events.

5. What does work

For balance — these are all explicitly covered:

  • Scheduled / delayed sends survive restart: packages/agents-server/test/scheduler-integration.test.ts:117 (delayed_send survives server restart and lands exactly once).
  • Re-registering an entity type after restart updates it: scheduler-integration.test.ts:158.
  • Entity bridge handles must-refetch from Electric: entity-bridge-manager.test.ts:362.
  • Wake registrations and webhook subscriptions are persisted in Postgres and rebuilt on startup (server.ts:405-409).

There is no test exercising the runtime side of restart — i.e. a runtime that's already attached when agents-server bounces.

Suggested repro

packages/agents-server/docker-compose.dev.yml provides Postgres + Electric. To exercise the runtime path:

  1. Start the dev stack (Postgres + Electric + durable-streams + agents-server).
  2. Start a runtime (e.g. via packages/agents) pointing at the agents-server. Register an entity type that has a wake on entity changes from another tag-filtered entity.
  3. Trigger something from the runtime (spawn entity, register wake) and confirm it works.
  4. Restart only the agents-server process (leave Electric, Postgres, durable-streams, and the runtime running).
  5. From the runtime, retry the same operations, and additionally trigger entity changes that should fire wakes. Observe:
    • Does any HTTP call from the runtime fail without recovering?
    • Do wakes still fire after a few seconds?
    • Do entity bridge updates still propagate?

Ideas for the fix

  • Add retry-with-backoff to runtime-server-client.ts for idempotent operations (and handle non-idempotent ones explicitly).
  • Add a periodic health-probe / heartbeat from runtime → server so the runtime can re-register types if the server returns a "you don't know me" response.
  • For entity bridges, add a periodic liveness check (e.g. compare last-seen offset to Electric's current head) to catch silent staleness, not just explicit errors.
  • Add an integration test mirroring delayed_send survives server restart but covering a live runtime: keep the runtime up across the restart and assert wakes still fire and HTTP calls succeed.
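The heartbeat idea above could be sketched roughly as follows. This is an assumption-heavy illustration: `heartbeatOnce`, `startHeartbeat`, and the `ProbeResult` values are hypothetical, and the server does not currently expose a "do you know me" probe, so `probe` is left as an injected function.

```typescript
// Hypothetical heartbeat loop for the runtime side.
type ProbeResult = "known" | "unknown" | "unreachable"

interface HeartbeatDeps {
  // Asks the server whether it still knows this runtime's types (assumed API).
  probe: () => Promise<ProbeResult>
  // Re-runs the type registration that today only happens at startup.
  registerTypes: () => Promise<void>
}

// One heartbeat tick: re-register only when the server is reachable
// but reports that it does not know us.
async function heartbeatOnce(deps: HeartbeatDeps): Promise<ProbeResult> {
  let result: ProbeResult
  try {
    result = await deps.probe()
  } catch {
    // A network error means the server is down or restarting; try again later.
    result = "unreachable"
  }
  if (result === "unknown") await deps.registerTypes()
  return result
}

// Runs heartbeatOnce on an interval, skipping a tick if the previous one
// is still in flight. Returns a function that stops the loop.
function startHeartbeat(deps: HeartbeatDeps, intervalMs: number): () => void {
  let running = false
  const timer = setInterval(() => {
    if (running) return
    running = true
    heartbeatOnce(deps)
      .catch(() => {}) // registration failures are retried on the next tick
      .finally(() => { running = false })
  }, intervalMs)
  return () => clearInterval(timer)
}
```

Because entity types are persisted in Postgres, the "unknown" branch may rarely trigger in practice; the heartbeat is still useful as the signal that the server is back, which the runtime currently never receives.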

Notes

This issue is opened from static analysis — symptom matches but I have not bisected to a concrete failure log. Anyone with logs from a live recurrence: please attach.
