
Add WebSocket Reconnection with Exponential Backoff and Signaling Heartbeat#69

Merged
alicup29 merged 15 commits into main from amick/worker-try-reconnect
Mar 19, 2026

Conversation


@alicup29 alicup29 commented Mar 19, 2026

Motivation

Workers on GPU clusters (RunAI, Slurm) die when the signaling server restarts or has a brief network blip. Since these are headless containers, a dead worker requires manual intervention to restart — which is unacceptable for production deployments. Workers need to survive transient signaling server disconnections automatically.

Additionally, Cloudflare proxies the WebSocket connection between workers and the signaling server, answering TCP-level pings on behalf of a dead backend. This means workers can't detect a server restart using standard WebSocket keepalive — they need application-level heartbeat detection.

Changes

Reconnection loop (worker_class.py)

  • Refactored run_worker() into one-time setup + reconnection loop
  • Exponential backoff: 1s, 2s, 4s, 8s... capping at 5 minutes (300s)
  • Peer ID generated once and reused across reconnections
  • Re-registers with current status (busy/available) to preserve mid-job state
  • Mesh connections left untouched (independent WebRTC peer connections)
  • open_timeout=20 on websockets.connect() to prevent indefinite hangs
  • Timeout check before sleep (not after) so --max-reconnect-time is accurate
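
The loop above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `connect` and `serve` are hypothetical callables standing in for `websockets.connect(..., open_timeout=20)` and the worker's message loop, and error handling is reduced to `OSError` for brevity.

```python
import asyncio
import time

def backoff_delay(attempt: int, cap: float = 300.0) -> float:
    # attempt 0 -> 1s, 1 -> 2s, 2 -> 4s, ... capped at 5 minutes
    return min(2.0 ** attempt, cap)

async def reconnect_loop(connect, serve, max_reconnect_time=None):
    """Retry `connect` with exponential backoff; give up once the worker
    has been disconnected for max_reconnect_time seconds (None = forever)."""
    attempt = 0
    disconnected_at = None
    while True:
        try:
            ws = await connect()            # e.g. websockets.connect(..., open_timeout=20)
            attempt, disconnected_at = 0, None
            await serve(ws)                 # returns when the connection drops
            disconnected_at = time.monotonic()
        except OSError:
            if disconnected_at is None:
                disconnected_at = time.monotonic()
        delay = backoff_delay(attempt)
        attempt += 1
        # check the deadline *before* sleeping so --max-reconnect-time is accurate
        if (max_reconnect_time is not None
                and time.monotonic() - disconnected_at + delay > max_reconnect_time):
            raise TimeoutError("max reconnect time exceeded")
        await asyncio.sleep(delay)
```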

CLI (cli.py)

  • Added --max-reconnect-time option (e.g. 30m, 2h, 90s, 3600)
  • Default: no limit (retry forever)
  • _parse_duration() helper with validation

Signaling heartbeat watchdog (worker_class.py, mesh_coordinator.py)

  • Worker tracks when it last received a {"type": "ping"} from the server
  • Watchdog task checks every 30s — if no ping for 90s, force-closes the websocket and cancels the admin handler task to trigger reconnection
  • Handles both handle_connection and admin handler message paths
  • Companion server PR: webRTC-connect#30 adds the server-side ping loop (30s interval)
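
The watchdog mechanism can be sketched as below. This is an illustrative standalone class, not the PR's `_signaling_heartbeat_watchdog()` itself; the 30s/90s values mirror the description above.

```python
import asyncio
import time

class HeartbeatWatchdog:
    """Force-close the websocket if no server ping arrives for `timeout` seconds."""

    def __init__(self, timeout: float = 90.0, check_interval: float = 30.0):
        self.timeout = timeout
        self.check_interval = check_interval
        self.last_ping = time.monotonic()

    def on_message(self, msg: dict) -> None:
        # called from both handle_connection and the admin handler paths
        if msg.get("type") == "ping":
            self.last_ping = time.monotonic()

    async def run(self, websocket) -> None:
        while True:
            await asyncio.sleep(self.check_interval)
            if time.monotonic() - self.last_ping > self.timeout:
                # closing the socket makes pending reads fail, which
                # kicks the reconnection loop
                await websocket.close()
                return
```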

Reconnection banner

  • Rich-formatted banner matching the initial connection style
  • Shows "reconnected! (attempt N, was disconnected for Xs)" instead of "ready!"
  • Restarts admin WebSocket handler with fresh websocket on reconnection

Clean shutdown

  • SIGINT handler sets shutting_down=True, cancels admin handler task, and force-closes websocket transport
  • Detects damaged event loop from repeated Ctrl+C and exits cleanly

Bug fixes discovered during E2E testing

  • Job handler blocking admin loop: _handle_job_assigned was awaiting the entire training pipeline inline, blocking all WebSocket message processing (including heartbeat pings and progress relay). Now runs as a background task.
  • Stale websocket in RelayChannel: Progress messages during training were sent through a websocket reference captured at job start. After reconnection, this was dead. Now dynamically references mesh_coordinator.websocket.
  • Legacy id_token in mesh re-registration: Admin promotion/demotion re-registration used self.worker.id_token which doesn't exist with account key auth. Replaced with self.worker.api_key.
  • KeyboardInterrupt swallowed by admin handler: Broad except Exception in the admin handler caught interrupt side-effects, causing reconnection instead of exit.
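
The first fix (job handler no longer blocking the admin loop) boils down to `asyncio.create_task`. A minimal standalone sketch, with a hypothetical message queue standing in for the WebSocket read loop:

```python
import asyncio

async def handle_messages(queue, run_job, handled):
    """Admin-handler sketch: the job runs as a background task so the
    loop keeps draining pings and relaying progress during training."""
    job_task = None
    while True:
        msg = await queue.get()
        if msg is None:                    # connection closed
            break
        if msg["type"] == "job_assigned":
            job_task = asyncio.create_task(run_job(msg))   # non-blocking
        else:
            handled.append(msg["type"])    # pings etc. stay responsive
    if job_task:
        await job_task
```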

Known issues (not addressed in this PR)

  • Admin election state not fully recovered after multi-worker reconnection — fs_list_req can be unhandled if admin status is lost. Follow-up needed for admin state recovery.
  • delete-peer cleanup API returns 500 (Cognito legacy code, not related to reconnection)
  • 'int' object is not callable error in second worker during reconnection (needs traceback to diagnose)

Server-side companion

webRTC-connect PR #30 — adds per-connection _ping_loop that sends {"type": "ping"} every 30s to each registered peer.

E2E test results

  • Basic reconnection (server restart): ✅ Pass
  • Mid-job reconnection (training continues + progress relays after reconnect): ✅ Pass
  • --max-reconnect-time 2m causes exit: ✅ Pass
  • Ctrl+C during backoff exits cleanly: ✅ Pass
  • Multi-worker reconnection (peer discovery): ✅ Pass
  • Heartbeat pings continue during training: ✅ Pass

Test plan

  • 30 unit tests in tests/test_worker_reconnection.py
  • E2E: server restart → worker reconnects with banner
  • E2E: mid-job restart → training continues, progress updates resume
  • E2E: --max-reconnect-time → clean exit after timeout
  • E2E: Ctrl+C → immediate clean shutdown
  • E2E: two workers reconnect and discover each other

🤖 Generated with Claude Code

alicup29 and others added 13 commits March 18, 2026 14:12
Workers now survive transient signaling server disconnections. On
WebSocket close, the worker retries with exponential backoff (1s, 2s,
4s... capping at 5 min) and re-registers with its current status
(busy/available). Peer ID is stable across reconnections. Mesh
connections are left untouched.

- Refactor run_worker() into one-time setup + reconnection loop
- Add --max-reconnect-time CLI option (e.g. '30m', '2h')
- Add open_timeout=20 to websockets.connect to prevent hangs
- Check timeout before sleep, not after
- 27 tests covering all reconnection scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the plain === log block on reconnection with the same
styled banner used for the initial connection. On reconnection it
shows "Worker <name> reconnected! (attempt N, was disconnected for Xs)"
instead of "ready! Waiting for client requests...".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The admin handler loop reads from mesh_coordinator.websocket, which
became stale after reconnection. The old handler task was already done,
so awaiting it returned immediately, causing handle_connection to
return and triggering an infinite reconnect loop.

Fix: update mesh_coordinator.websocket to the fresh connection and
call _start_admin_websocket_handler() to create a new handler task.
Also move the admin handler await outside the if/else so it runs for
both first connection and reconnection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nections

Cloudflare proxies WebSocket connections and answers TCP-level pings
on behalf of a dead backend, so the worker never detects a server
restart. The signaling server now sends {"type": "ping"} every 30s.
A watchdog task on the worker checks if pings stop arriving — if none
for 90s, it closes the websocket to trigger the reconnection loop.

- Add _signaling_heartbeat_watchdog() to RTCWorkerClient
- Handle "ping" messages in handle_connection and admin handler
- Start/cancel watchdog in reconnection loop
- Add design doc and 3 watchdog tests (30 total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The admin WebSocket handler's broad `except Exception` was catching
the effects of Ctrl+C, causing the handler to return normally instead
of propagating the interrupt. This made the reconnection loop think
the connection ended and reconnect instead of exiting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After KeyboardInterrupt corrupts the asyncio event loop, subsequent
reconnection attempts fail with RuntimeError('no running event loop')
which is caught by the broad except Exception handler, causing an
infinite retry loop. Detect this specific error and break.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
KeyboardInterrupt was being swallowed by broad except handlers deep
in the admin handler and mesh coordinator stack, causing the worker
to reconnect instead of exiting. Installing a SIGINT handler that
sets shutting_down=True and closes the websocket ensures the
reconnection loop exits on its next iteration regardless of where
the interrupt lands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e connection

websocket.close() sends a close frame through Cloudflare, which may
not break the admin handler's pending read. Instead, cancel the admin
handler task directly and force-close the underlying transport. Same
fix applied to the SIGINT handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_handle_job_assigned was awaiting the entire training pipeline inline,
blocking the admin handler's message loop for the full duration of
training. No WebSocket messages (including heartbeat pings) could be
processed during this time, causing the watchdog to false-trigger and
breaking progress relay to the dashboard.

Run the job as an asyncio.create_task so the admin handler loop
continues reading messages during training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nnection

RelayChannel captured the websocket at job start time. After a
signaling server restart and reconnection, progress messages were
sent through the old dead websocket. Now references the mesh
coordinator's current websocket dynamically, so progress relay
survives reconnection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The admin promotion and demotion re-registration messages referenced
self.worker.id_token which doesn't exist with account key auth,
causing 'RTCWorkerClient has no attribute id_token' errors during
multi-worker reconnection scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alicup29 alicup29 changed the title Add WebSocket reconnection with exponential backoff and signaling heartbeat Add WebSocket Reconnection with Exponential Backoff and Signaling Heartbeat Mar 19, 2026
alicup29 and others added 2 commits March 19, 2026 13:28
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Includes Caddy config with all API routes, handle_path for relay,
env vars for Docker, Elastic IP reassociation steps, troubleshooting
table, and warning about Terraform-triggered instance replacement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alicup29 alicup29 merged commit 753defe into main Mar 19, 2026
5 checks passed
@alicup29 alicup29 deleted the amick/worker-try-reconnect branch March 19, 2026 21:04