Client WebSocket connections stop working after runner restart #3040

@vkartaviy

Description

When a runner process restarts, client WebSocket connections become stuck in a broken state. Clients remain connected
to the engine gateway but cannot send/receive messages because the tunnel to the runner is broken. Clients only
reconnect if the page is manually refreshed.

Environment

  • rivet-kit version: 2.0.8
  • rivet-engine version: rivet-dev/engine:local-20250926-165615
  • Driver: Engine driver
  • Platform: Docker (engine + postgres) + Node.js (runner) + React (client)

Steps to Reproduce

  1. Start the rivet-engine (e.g., in Docker)
  2. Start a runner with an actor definition (e.g., the counter example)
  3. Connect a client (React app) to the actor
  4. Verify the client is connected and can call actions
  5. Restart the runner process (Ctrl+C, then restart)
  6. Observe: the client still shows "Connected", but actions fail silently (no errors are thrown and no events are received)

Expected Behavior

When the runner restarts:

  1. Engine should detect runner disconnection
  2. Engine gateway should close all client WebSocket connections for actors on that runner
  3. Clients should auto-reconnect (rivetkit already has this logic)
  4. New tunnels should be established to the restarted runner
  5. Actions and events should work normally

Actual Behavior

When the runner restarts:

  1. ✅ Engine detects runner disconnection
  2. ✅ Actor workflows receive Lost signal and reschedule actors
  3. ✅ Runner receives CommandStartActor again on reconnect
  4. ✅ Actor state (including persisted data) is restored correctly
  5. ✅ Actor connections are restored from persisted data (lines 720-739 in instance.ts)
  6. ❌ Client WebSocket connections are NOT closed by the engine gateway
  7. ❌ Client remains connected to gateway with broken tunnel to runner
  8. ❌ Actions sent by client go nowhere (no error, no response)
  9. ❌ Events broadcast by actor are not received by client
  10. ✅ Manual page refresh creates new WebSocket connection and everything works again

Root Cause Analysis

Client-Side (rivetkit)

The client's WebSocket to the engine gateway never closes when the runner restarts, so the reconnection logic never triggers.
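Until the gateway closes stale connections, one possible client-side workaround is a staleness watchdog: record the time of the last message seen from the actor and, after too long a silence, close the connection so the existing reconnection logic runs. The sketch below is a minimal illustration only. `StalenessWatchdog`, its methods, and the `onStale` callback are hypothetical names, not part of rivetkit, and it assumes the application has some periodic traffic (for example a ping action) to measure silence against.

```typescript
// Hypothetical helper, not part of rivetkit.
// Time is passed in explicitly so the logic can be exercised without real timers.
class StalenessWatchdog {
  private lastSeenMs: number;
  private readonly maxSilenceMs: number;
  private readonly onStale: () => void;
  private fired = false;

  constructor(maxSilenceMs: number, nowMs: number, onStale: () => void) {
    this.maxSilenceMs = maxSilenceMs;
    this.lastSeenMs = nowMs;
    this.onStale = onStale;
  }

  // Call whenever any message (event, action response, ping reply) arrives.
  markAlive(nowMs: number): void {
    this.lastSeenMs = nowMs;
    this.fired = false;
  }

  // Call periodically. Invokes onStale once per silent period and
  // returns whether the connection is currently considered stale.
  check(nowMs: number): boolean {
    const stale = nowMs - this.lastSeenMs > this.maxSilenceMs;
    if (stale && !this.fired) {
      this.fired = true;
      this.onStale();
    }
    return stale;
  }
}
```

In an application, `check(Date.now())` would typically run from a `setInterval`, and `onStale` would close or recreate the connection using whatever the client API provides. This only masks the problem on the client; the gateway would still hold a broken tunnel.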

Engine-Side (rivet-engine)

When a runner disconnects (runner.rs:228-277), the engine:

  1. Calls fetch_remaining_actors to get actors on that runner
  2. Sends Lost signal to each actor workflow
  3. Actor workflows reschedule and send new CommandStartActor to the runner

However, there is no code that closes the client WebSocket connections, and the gateway has no mechanism to detect that the connections for those actors need to be reset.
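The missing piece could look roughly like the following: the gateway keeps an index from runner to the client sockets tunnelled through it, and closes them all when the runner is lost. This is a TypeScript model of the idea only, not rivet-engine code (the engine is written in Rust). `TunnelRegistry`, `ClientSocket`, and `onRunnerLost` are hypothetical names. Close code 1012 is the IANA-registered "Service Restart" code; using it here is an assumption, not something the engine is known to do.

```typescript
// Hypothetical model of a gateway-side registry; no name here comes from rivet-engine.
interface ClientSocket {
  close(code: number, reason: string): void;
}

class TunnelRegistry {
  // runnerId -> client sockets whose tunnel terminates at that runner
  private readonly byRunner = new Map<string, Set<ClientSocket>>();

  // Record that a client socket is tunnelled to the given runner.
  track(runnerId: string, socket: ClientSocket): void {
    let sockets = this.byRunner.get(runnerId);
    if (!sockets) {
      sockets = new Set<ClientSocket>();
      this.byRunner.set(runnerId, sockets);
    }
    sockets.add(socket);
  }

  // Forget a socket, e.g. after the client closed it normally.
  untrack(runnerId: string, socket: ClientSocket): void {
    const sockets = this.byRunner.get(runnerId);
    if (!sockets) return;
    sockets.delete(socket);
    if (sockets.size === 0) this.byRunner.delete(runnerId);
  }

  // Would be called from the same place that sends the Lost signal.
  // Closes every client socket for the runner and returns how many were closed.
  onRunnerLost(runnerId: string): number {
    const sockets = this.byRunner.get(runnerId);
    if (!sockets) return 0;
    this.byRunner.delete(runnerId);
    sockets.forEach((socket) => socket.close(1012, "runner restarted"));
    return sockets.size;
  }
}
```

With a close frame delivered, the client's existing auto-reconnect would establish a fresh tunnel to the restarted runner, which matches the expected behavior listed earlier.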

The Broken State

Before restart:

  Client WS → Engine Gateway → Tunnel (UPS) → Runner A → Actor ✓

After restart:

  Client WS → Engine Gateway → [Broken Tunnel] ❌
                                ↓
                            Runner A (restarted) → Actor (restored)

The client WebSocket is still connected to the gateway, but the gateway's tunnel to the runner is stale/broken.

Browser Behavior

  • WebSocket shows "OPEN" state in DevTools Network tab
  • Actions fail silently (no errors in the console, even with try/catch)
  • Events are not received
  • counter.connection remains truthy (shows "✓ Connected")
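Because a failed action never rejects, try/catch has nothing to catch. One way to at least surface the failure while debugging is to race each action call against a timer. The helper below is a generic sketch: `withTimeout` is a hypothetical name, not a rivetkit API, and it assumes the action call returns a promise.

```typescript
// Hypothetical helper, not part of rivetkit.
// Rejects if the wrapped promise has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms} ms`));
    }, ms);
    promise.then(
      (value) => {
        clearTimeout(timer); // settled in time: cancel the timer
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

Wrapping an action call, for example `withTimeout(someActionCall(), 5000, "increment")` where `someActionCall` stands in for the real call, turns the silent hang into a rejected promise that try/catch can see. It does not cancel the underlying request or repair the connection.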
