Add WebSocket Reconnection with Exponential Backoff and Signaling Heartbeat #69
Merged
Workers now survive transient signaling server disconnections. On WebSocket close, the worker retries with exponential backoff (1s, 2s, 4s... capping at 5 min) and re-registers with its current status (busy/available). Peer ID is stable across reconnections. Mesh connections are left untouched.

- Refactor run_worker() into one-time setup + reconnection loop
- Add --max-reconnect-time CLI option (e.g. '30m', '2h')
- Add open_timeout=20 to websockets.connect to prevent hangs
- Check timeout before sleep, not after
- 27 tests covering all reconnection scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
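The retry schedule described in this commit can be sketched as follows; `connect_and_run` and the parameter names are illustrative stand-ins, not the PR's actual identifiers:

```python
import asyncio

async def reconnect_loop(connect_and_run, max_reconnect_time=None,
                         initial_delay=1.0, max_delay=300.0):
    """Retry a websocket session with exponential backoff: 1s, 2s, 4s...
    capped at max_delay (5 min). A hypothetical sketch, not the PR's code."""
    delay = initial_delay
    elapsed = 0.0
    while True:
        try:
            await connect_and_run()   # returns when the connection closes
            delay = initial_delay     # a successful session resets the backoff
            elapsed = 0.0
        except (OSError, asyncio.TimeoutError):
            pass                      # transient failure: fall through and retry
        # Check the time budget *before* sleeping, so --max-reconnect-time
        # is honored accurately rather than overshooting by one sleep.
        if max_reconnect_time is not None and elapsed + delay > max_reconnect_time:
            return
        await asyncio.sleep(delay)
        elapsed += delay
        delay = min(delay * 2, max_delay)
```

A real implementation would catch the websocket library's connection exceptions here; the exception set above is deliberately minimal.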
Replace the plain === log block on reconnection with the same styled banner used for the initial connection. On reconnection it shows "Worker <name> reconnected! (attempt N, was disconnected for Xs)" instead of "ready! Waiting for client requests...".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The admin handler loop reads from mesh_coordinator.websocket, which became stale after reconnection. The old handler task was already done, so awaiting it returned immediately, causing handle_connection to return and triggering an infinite reconnect loop. Fix: update mesh_coordinator.websocket to the fresh connection and call _start_admin_websocket_handler() to create a new handler task. Also move the admin handler await outside the if/else so it runs for both first connection and reconnection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nections
Cloudflare proxies WebSocket connections and answers TCP-level pings
on behalf of a dead backend, so the worker never detects a server
restart. The signaling server now sends {"type": "ping"} every 30s.
A watchdog task on the worker checks if pings stop arriving — if none
for 90s, it closes the websocket to trigger the reconnection loop.
- Add _signaling_heartbeat_watchdog() to RTCWorkerClient
- Handle "ping" messages in handle_connection and admin handler
- Start/cancel watchdog in reconnection loop
- Add design doc and 3 watchdog tests (30 total)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
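A minimal sketch of such a watchdog, assuming the receive loop stamps a timestamp on every server ping; the class and method names here are hypothetical, not the PR's actual `_signaling_heartbeat_watchdog()`:

```python
import asyncio
import time

class HeartbeatWatchdog:
    """Close the websocket when application-level pings stop arriving.

    Cloudflare answers TCP-level keepalives on behalf of a dead backend,
    so liveness must be judged from {"type": "ping"} messages instead.
    """

    def __init__(self, websocket, timeout=90.0):
        self.websocket = websocket
        self.timeout = timeout            # server pings every 30s; allow ~3 misses
        self.last_ping = time.monotonic()

    def on_message(self, msg) -> bool:
        """Call from the receive loop; returns True if msg was a heartbeat."""
        if msg.get("type") == "ping":
            self.last_ping = time.monotonic()
            return True
        return False

    async def run(self, check_interval=10.0):
        while True:
            await asyncio.sleep(check_interval)
            if time.monotonic() - self.last_ping > self.timeout:
                # Breaking the connection forces the pending read to fail,
                # which hands control back to the reconnection loop.
                await self.websocket.close()
                return
```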
The admin WebSocket handler's broad `except Exception` was catching the effects of Ctrl+C, causing the handler to return normally instead of propagating the interrupt. This made the reconnection loop think the connection ended and reconnect instead of exiting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After KeyboardInterrupt corrupts the asyncio event loop, subsequent
reconnection attempts fail with RuntimeError('no running event loop')
which is caught by the broad except Exception handler, causing an
infinite retry loop. Detect this specific error and break.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
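The check might look like the following; `attempt_reconnect` is an illustrative wrapper, and matching on the error message is the distinguishing signal since the same `RuntimeError` type covers other conditions:

```python
import asyncio

async def attempt_reconnect(connect) -> bool:
    """Return False when further retries are pointless. A sketch, not the PR's code."""
    try:
        await connect()
        return True
    except RuntimeError as e:
        # After KeyboardInterrupt corrupts the event loop, every retry fails
        # the same way; detect it and stop instead of looping forever.
        if "no running event loop" in str(e):
            return False
        raise
    except Exception:
        return True  # transient failure: the caller should back off and retry
```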
KeyboardInterrupt was being swallowed by broad except handlers deep in the admin handler and mesh coordinator stack, causing the worker to reconnect instead of exiting. Installing a SIGINT handler that sets shutting_down=True and closes the websocket ensures the reconnection loop exits on its next iteration regardless of where the interrupt lands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e connection websocket.close() sends a close frame through Cloudflare, which may not break the admin handler's pending read. Instead, cancel the admin handler task directly and force-close the underlying transport. Same fix applied to the SIGINT handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_handle_job_assigned was awaiting the entire training pipeline inline, blocking the admin handler's message loop for the full duration of training. No WebSocket messages (including heartbeat pings) could be processed during this time, causing the watchdog to false-trigger and breaking progress relay to the dashboard. Run the job as an asyncio.create_task so the admin handler loop continues reading messages during training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
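The fix amounts to wrapping the training coroutine in `asyncio.create_task` so the message loop returns to reading immediately; the handler shape below is illustrative, not the PR's actual code:

```python
import asyncio

class AdminHandler:
    """Sketch of a message loop that never blocks on long-running jobs."""

    def __init__(self):
        self.job_task = None

    async def handle_message(self, msg, run_training):
        if msg.get("type") == "job_assigned":
            # Awaiting run_training(msg) here would stall this loop for the
            # whole training run, starving heartbeat pings and progress relay.
            # A background task lets the loop keep reading messages.
            self.job_task = asyncio.create_task(run_training(msg))
        elif msg.get("type") == "ping":
            pass  # heartbeats are still processed while training runs
```

A real handler should also keep a reference to `job_task` and surface its exceptions, since a `create_task` result that is dropped can have its errors silently discarded.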
…nnection RelayChannel captured the websocket at job start time. After a signaling server restart and reconnection, progress messages were sent through the old dead websocket. Now references the mesh coordinator's current websocket dynamically, so progress relay survives reconnection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The admin promotion and demotion re-registration messages referenced self.worker.id_token which doesn't exist with account key auth, causing 'RTCWorkerClient has no attribute id_token' errors during multi-worker reconnection scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Includes Caddy config with all API routes, handle_path for relay, env vars for Docker, Elastic IP reassociation steps, troubleshooting table, and warning about Terraform-triggered instance replacement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Motivation
Workers on GPU clusters (RunAI, Slurm) die when the signaling server restarts or has a brief network blip. Since these are headless containers, a dead worker requires manual intervention to restart — which is unacceptable for production deployments. Workers need to survive transient signaling server disconnections automatically.
Additionally, Cloudflare proxies the WebSocket connection between workers and the signaling server, answering TCP-level pings on behalf of a dead backend. This means workers can't detect a server restart using standard WebSocket keepalive — they need application-level heartbeat detection.
Changes
Reconnection loop (`worker_class.py`)
- Refactor `run_worker()` into one-time setup + reconnection loop
- `open_timeout=20` on `websockets.connect()` to prevent indefinite hangs
- Check timeout before sleep so `--max-reconnect-time` is accurate

CLI (`cli.py`)
- Add `--max-reconnect-time` option (e.g. `30m`, `2h`, `90s`, `3600`)
- `_parse_duration()` helper with validation

Signaling heartbeat watchdog (`worker_class.py`, `mesh_coordinator.py`)
- Watchdog closes the websocket when no `{"type": "ping"}` arrives from the server
- Handle `"ping"` messages in `handle_connection` and admin handler message paths

Reconnection banner
- On reconnection, show the styled banner ("Worker <name> reconnected! (attempt N, was disconnected for Xs)") instead of the initial "ready! Waiting for client requests..." message
Clean shutdown
- SIGINT handler sets `shutting_down=True`, cancels the admin handler task, and force-closes the websocket transport

Bug fixes discovered during E2E testing
- `_handle_job_assigned` was `await`ing the entire training pipeline inline, blocking all WebSocket message processing (including heartbeat pings and progress relay). Now runs as a background task.
- RelayChannel captured the websocket at job start; it now references `mesh_coordinator.websocket` dynamically so progress relay survives reconnection.
- `.id_token` in mesh re-registration: admin promotion/demotion re-registration used `self.worker.id_token`, which doesn't exist with account key auth. Replaced with `self.worker.api_key`.
- Broad `except Exception` in the admin handler caught interrupt side-effects, causing reconnection instead of exit.

Known issues (not addressed in this PR)
- `fs_list_req` can be unhandled if admin status is lost. Follow-up needed for admin state recovery.
- `delete-peer` cleanup API returns 500 (Cognito legacy code, not related to reconnection)
- `'int' object is not callable` error in second worker during reconnection (needs traceback to diagnose)

Server-side companion
webRTC-connect PR #30 — adds a per-connection `_ping_loop` that sends `{"type": "ping"}` every 30s to each registered peer.

E2E test results
- `--max-reconnect-time 2m` causes exit

Test plan
- `tests/test_worker_reconnection.py`
- `--max-reconnect-time` → clean exit after timeout

🤖 Generated with Claude Code
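For reference, parsing `--max-reconnect-time` values like `30m`, `2h`, `90s`, or bare seconds could be implemented roughly as below; this is a guess at the shape of the `_parse_duration()` helper mentioned in the CLI changes, not the PR's code:

```python
import re

def parse_duration(value: str) -> int:
    """Parse '30m', '2h', '90s', or bare seconds like '3600' into seconds.

    Illustrative sketch; the PR's _parse_duration() may differ in detail.
    """
    match = re.fullmatch(r"(\d+)([smh]?)", value.strip())
    if not match:
        raise ValueError(f"invalid duration: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * {"": 1, "s": 1, "m": 60, "h": 3600}[unit]
```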