Description
When a runner process restarts, client WebSocket connections become stuck in a broken state. Clients remain connected
to the engine gateway but cannot send/receive messages because the tunnel to the runner is broken. Clients only
reconnect if the page is manually refreshed.
Environment
- rivet-kit version: 2.0.8
- rivet-engine version: rivet-dev/engine:local-20250926-165615
- Driver: Engine driver
- Platform: Docker (engine + postgres) + Node.js (runner) + React (client)
Steps to Reproduce
- Start the rivet-engine (e.g., in Docker)
- Start a runner with an actor definition (the counter example works)
- Connect a client (React app) to the actor
- Verify the client is connected and can call actions
- Restart the runner process (Ctrl+C, then restart)
- Observe: Client still shows "Connected" but actions fail silently - no errors thrown, no events received
Expected Behavior
When the runner restarts:
- Engine should detect runner disconnection
- Engine gateway should close all client WebSocket connections for actors on that runner
- Clients should auto-reconnect (rivetkit already has this logic)
- New tunnels should be established to the restarted runner
- Actions and events should work normally
Actual Behavior
When the runner restarts:
- ✅ Engine detects runner disconnection
- ✅ Actor workflows receive Lost signal and reschedule actors
- ✅ Runner receives CommandStartActor again on reconnect
- ✅ Actor state (including persisted data) is restored correctly
- ✅ Actor connections are restored from persisted data (lines 720-739 in instance.ts)
- ❌ Client WebSocket connections are NOT closed by the engine gateway
- ❌ Client remains connected to gateway with broken tunnel to runner
- ❌ Actions sent by client go nowhere (no error, no response)
- ❌ Events broadcast by actor are not received by client
- ✅ Manual page refresh creates new WebSocket connection and everything works again
Root Cause Analysis
Client-Side (rivetkit)
The client's WebSocket to the engine gateway never closes when the runner restarts, so the reconnection logic never triggers.
Engine-Side (rivet-engine)
When a runner disconnects (runner.rs:228-277), the engine:
- Calls `fetch_remaining_actors` to get actors on that runner
- Sends `Lost` signal to each actor workflow
- Actor workflows reschedule and send new `CommandStartActor` to the runner
But there's no code to close client WebSocket connections. The gateway has no mechanism to detect that actor connections need to be reset.
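To make the missing step concrete, here is a minimal sketch of the bookkeeping the gateway would need. It is written in TypeScript purely for illustration and is not rivet-engine code: `GatewayConnectionRegistry`, `ClientSocket`, and `onRunnerDisconnected` are hypothetical names. It assumes the gateway can associate each client WebSocket with the runner and actor it is tunneled to. Close code 1012 is registered as "Service Restart"; any code the client treats as retryable would serve the same purpose.

```typescript
// Hypothetical sketch: none of these names exist in rivet-engine.
interface ClientSocket {
  close(code: number, reason: string): void;
}

class GatewayConnectionRegistry {
  // runnerId -> actorId -> client sockets currently tunneled to that actor
  private readonly byRunner = new Map<string, Map<string, Set<ClientSocket>>>();

  // Record a client socket when its tunnel to an actor is established.
  register(runnerId: string, actorId: string, socket: ClientSocket): void {
    let actors = this.byRunner.get(runnerId);
    if (!actors) {
      actors = new Map<string, Set<ClientSocket>>();
      this.byRunner.set(runnerId, actors);
    }
    let sockets = actors.get(actorId);
    if (!sockets) {
      sockets = new Set<ClientSocket>();
      actors.set(actorId, sockets);
    }
    sockets.add(socket);
  }

  // Would be called from the runner-disconnect path. Closes every client
  // socket whose tunnel pointed at the lost runner and returns how many
  // sockets were closed.
  onRunnerDisconnected(runnerId: string): number {
    const actors = this.byRunner.get(runnerId);
    if (!actors) return 0;
    this.byRunner.delete(runnerId);

    let closed = 0;
    actors.forEach((sockets) => {
      sockets.forEach((socket) => {
        socket.close(1012, "runner restarted"); // 1012 = Service Restart
        closed++;
      });
    });
    return closed;
  }
}
```

With something along these lines, the client would see a real close event, and the reconnection logic that rivetkit already has would get a chance to run.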
The Broken State
Before restart:
Client WS → Engine Gateway → Tunnel (UPS) → Runner A → Actor ✓
After restart:
Client WS → Engine Gateway → [Broken Tunnel] ❌
↓
Runner A (restarted) → Actor (restored)
The client WebSocket is still connected to the gateway, but the gateway's tunnel to the runner is stale/broken.
Browser Behavior
- WebSocket shows "OPEN" state in DevTools Network tab
- Actions silently fail (nothing is thrown, so a surrounding try/catch logs no error)
- Events are not received
- counter.connection remains truthy (shows "✓ Connected")
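Because the socket stays "OPEN" and nothing throws, the only client-visible symptom is a request that never gets a response. As a stopgap until the gateway closes stale connections, an application could track outstanding requests itself. The sketch below is an assumption-laden illustration, not part of rivetkit: `PendingRequestWatchdog` and its methods are hypothetical, and the caller supplies request IDs and timestamps.

```typescript
// Hypothetical client-side helper; not a rivetkit API.
class PendingRequestWatchdog {
  // request id -> time the request was sent, in milliseconds
  private readonly pending = new Map<number, number>();
  private readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Call when an action is sent over the connection.
  markSent(id: number, nowMs: number): void {
    this.pending.set(id, nowMs);
  }

  // Call when the matching response (or error) arrives.
  markAnswered(id: number): void {
    this.pending.delete(id);
  }

  // True when at least one request has waited longer than timeoutMs,
  // which in this bug's scenario suggests the tunnel is broken.
  isStale(nowMs: number): boolean {
    let stale = false;
    this.pending.forEach((sentAt) => {
      if (nowMs - sentAt > this.timeoutMs) stale = true;
    });
    return stale;
  }
}
```

An application could poll `isStale(Date.now())` on a timer and, when it returns true, tear down and recreate the connection rather than waiting for a close event that never arrives. This only masks the symptom; the underlying fix still belongs in the gateway.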