
Add WebSocket Reconnection with Exponential Backoff and Signaling Heartbeat#69

Merged
alicup29 merged 15 commits into main from amick/worker-try-reconnect
Mar 19, 2026

Conversation


@alicup29 alicup29 commented Mar 19, 2026

Motivation

Workers on GPU clusters (RunAI, Slurm) die when the signaling server restarts or has a brief network blip. Since these are headless containers, a dead worker requires manual intervention to restart — which is unacceptable for production deployments. Workers need to survive transient signaling server disconnections automatically.

Additionally, Cloudflare proxies the WebSocket connection between workers and the signaling server, answering TCP-level pings on behalf of a dead backend. This means workers can't detect a server restart using standard WebSocket keepalive — they need application-level heartbeat detection.

Changes

Reconnection loop (worker_class.py)

  • Refactored run_worker() into one-time setup + reconnection loop
  • Exponential backoff: 1s, 2s, 4s, 8s... capping at 5 minutes (300s)
  • Peer ID generated once and reused across reconnections
  • Re-registers with current status (busy/available) to preserve mid-job state
  • Mesh connections left untouched (independent WebRTC peer connections)
  • open_timeout=20 on websockets.connect() to prevent indefinite hangs
  • Timeout check before sleep (not after) so --max-reconnect-time is accurate
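
The loop above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `connect` and `serve` are hypothetical callables standing in for `websockets.connect(..., open_timeout=20)` and the worker's message loop, and error handling is reduced to `OSError` for brevity.

```python
import asyncio
import time

def backoff_delay(attempt: int, cap: float = 300.0) -> float:
    # attempt 0 -> 1s, 1 -> 2s, 2 -> 4s, ... capped at 5 minutes
    return min(2.0 ** attempt, cap)

async def reconnect_loop(connect, serve, max_reconnect_time=None):
    """Retry `connect` with exponential backoff; give up once the worker
    has been disconnected for max_reconnect_time seconds (None = forever)."""
    attempt = 0
    disconnected_at = None
    while True:
        try:
            ws = await connect()            # e.g. websockets.connect(..., open_timeout=20)
            attempt, disconnected_at = 0, None
            await serve(ws)                 # returns when the connection drops
            disconnected_at = time.monotonic()
        except OSError:
            if disconnected_at is None:
                disconnected_at = time.monotonic()
        delay = backoff_delay(attempt)
        attempt += 1
        # check the deadline *before* sleeping so --max-reconnect-time is accurate
        if (max_reconnect_time is not None
                and time.monotonic() - disconnected_at + delay > max_reconnect_time):
            raise TimeoutError("max reconnect time exceeded")
        await asyncio.sleep(delay)
```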

CLI (cli.py)

  • Added --max-reconnect-time option (e.g. 30m, 2h, 90s, 3600)
  • Default: no limit (retry forever)
  • _parse_duration() helper with validation

Signaling heartbeat watchdog (worker_class.py, mesh_coordinator.py)

  • Worker tracks when it last received a {"type": "ping"} from the server
  • Watchdog task checks every 30s — if no ping for 90s, force-closes the websocket and cancels the admin handler task to trigger reconnection
  • Handles both handle_connection and admin handler message paths
  • Companion server PR: webRTC-connect#30 adds the server-side ping loop (30s interval)
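
The watchdog mechanism can be sketched as below. This is an illustrative standalone class, not the PR's `_signaling_heartbeat_watchdog()` itself; the 30s/90s values mirror the description above.

```python
import asyncio
import time

class HeartbeatWatchdog:
    """Force-close the websocket if no server ping arrives for `timeout` seconds."""

    def __init__(self, timeout: float = 90.0, check_interval: float = 30.0):
        self.timeout = timeout
        self.check_interval = check_interval
        self.last_ping = time.monotonic()

    def on_message(self, msg: dict) -> None:
        # called from both handle_connection and the admin handler paths
        if msg.get("type") == "ping":
            self.last_ping = time.monotonic()

    async def run(self, websocket) -> None:
        while True:
            await asyncio.sleep(self.check_interval)
            if time.monotonic() - self.last_ping > self.timeout:
                # closing the socket makes pending reads fail, which
                # kicks the reconnection loop
                await websocket.close()
                return
```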

Reconnection banner

  • Rich-formatted banner matching the initial connection style
  • Shows "reconnected! (attempt N, was disconnected for Xs)" instead of "ready!"
  • Restarts admin WebSocket handler with fresh websocket on reconnection

Clean shutdown

  • SIGINT handler sets shutting_down=True, cancels admin handler task, and force-closes websocket transport
  • Detects damaged event loop from repeated Ctrl+C and exits cleanly

Bug fixes discovered during E2E testing

  • Job handler blocking admin loop: _handle_job_assigned was awaiting the entire training pipeline inline, blocking all WebSocket message processing (including heartbeat pings and progress relay). Now runs as a background task.
  • Stale websocket in RelayChannel: Progress messages during training were sent through a websocket reference captured at job start. After reconnection, this was dead. Now dynamically references mesh_coordinator.websocket.
  • Legacy id_token in mesh re-registration: Admin promotion/demotion re-registration used self.worker.id_token which doesn't exist with account key auth. Replaced with self.worker.api_key.
  • KeyboardInterrupt swallowed by admin handler: Broad except Exception in the admin handler caught interrupt side-effects, causing reconnection instead of exit.
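
The first fix (job handler no longer blocking the admin loop) boils down to `asyncio.create_task`. A minimal standalone sketch, with a hypothetical message queue standing in for the WebSocket read loop:

```python
import asyncio

async def handle_messages(queue, run_job, handled):
    """Admin-handler sketch: the job runs as a background task so the
    loop keeps draining pings and relaying progress during training."""
    job_task = None
    while True:
        msg = await queue.get()
        if msg is None:                    # connection closed
            break
        if msg["type"] == "job_assigned":
            job_task = asyncio.create_task(run_job(msg))   # non-blocking
        else:
            handled.append(msg["type"])    # pings etc. stay responsive
    if job_task:
        await job_task
```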

Known issues (not addressed in this PR)

  • Admin election state not fully recovered after multi-worker reconnection — fs_list_req can be unhandled if admin status is lost. Follow-up needed for admin state recovery.
  • delete-peer cleanup API returns 500 (Cognito legacy code, not related to reconnection)
  • 'int' object is not callable error in second worker during reconnection (needs traceback to diagnose)

Server-side companion

webRTC-connect PR #30 — adds per-connection _ping_loop that sends {"type": "ping"} every 30s to each registered peer.

E2E test results

  • Basic reconnection (server restart): ✅ Pass
  • Mid-job reconnection (training continues + progress relays after reconnect): ✅ Pass
  • --max-reconnect-time 2m causes exit: ✅ Pass
  • Ctrl+C during backoff exits cleanly: ✅ Pass
  • Multi-worker reconnection (peer discovery): ✅ Pass
  • Heartbeat pings continue during training: ✅ Pass

Test plan

  • 30 unit tests in tests/test_worker_reconnection.py
  • E2E: server restart → worker reconnects with banner
  • E2E: mid-job restart → training continues, progress updates resume
  • E2E: --max-reconnect-time → clean exit after timeout
  • E2E: Ctrl+C → immediate clean shutdown
  • E2E: two workers reconnect and discover each other

🤖 Generated with Claude Code

alicup29 and others added 13 commits March 18, 2026 14:12
Workers now survive transient signaling server disconnections. On
WebSocket close, the worker retries with exponential backoff (1s, 2s,
4s... capping at 5 min) and re-registers with its current status
(busy/available). Peer ID is stable across reconnections. Mesh
connections are left untouched.

- Refactor run_worker() into one-time setup + reconnection loop
- Add --max-reconnect-time CLI option (e.g. '30m', '2h')
- Add open_timeout=20 to websockets.connect to prevent hangs
- Check timeout before sleep, not after
- 27 tests covering all reconnection scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the plain === log block on reconnection with the same
styled banner used for the initial connection. On reconnection it
shows "Worker <name> reconnected! (attempt N, was disconnected for Xs)"
instead of "ready! Waiting for client requests...".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The admin handler loop reads from mesh_coordinator.websocket, which
became stale after reconnection. The old handler task was already done,
so awaiting it returned immediately, causing handle_connection to
return and triggering an infinite reconnect loop.

Fix: update mesh_coordinator.websocket to the fresh connection and
call _start_admin_websocket_handler() to create a new handler task.
Also move the admin handler await outside the if/else so it runs for
both first connection and reconnection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nections

Cloudflare proxies WebSocket connections and answers TCP-level pings
on behalf of a dead backend, so the worker never detects a server
restart. The signaling server now sends {"type": "ping"} every 30s.
A watchdog task on the worker checks if pings stop arriving — if none
for 90s, it closes the websocket to trigger the reconnection loop.

- Add _signaling_heartbeat_watchdog() to RTCWorkerClient
- Handle "ping" messages in handle_connection and admin handler
- Start/cancel watchdog in reconnection loop
- Add design doc and 3 watchdog tests (30 total)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The admin WebSocket handler's broad `except Exception` was catching
the effects of Ctrl+C, causing the handler to return normally instead
of propagating the interrupt. This made the reconnection loop think
the connection ended and reconnect instead of exiting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After KeyboardInterrupt corrupts the asyncio event loop, subsequent
reconnection attempts fail with RuntimeError('no running event loop')
which is caught by the broad except Exception handler, causing an
infinite retry loop. Detect this specific error and break.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
KeyboardInterrupt was being swallowed by broad except handlers deep
in the admin handler and mesh coordinator stack, causing the worker
to reconnect instead of exiting. Installing a SIGINT handler that
sets shutting_down=True and closes the websocket ensures the
reconnection loop exits on its next iteration regardless of where
the interrupt lands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e connection

websocket.close() sends a close frame through Cloudflare, which may
not break the admin handler's pending read. Instead, cancel the admin
handler task directly and force-close the underlying transport. Same
fix applied to the SIGINT handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_handle_job_assigned was awaiting the entire training pipeline inline,
blocking the admin handler's message loop for the full duration of
training. No WebSocket messages (including heartbeat pings) could be
processed during this time, causing the watchdog to false-trigger and
breaking progress relay to the dashboard.

Run the job as an asyncio.create_task so the admin handler loop
continues reading messages during training.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nnection

RelayChannel captured the websocket at job start time. After a
signaling server restart and reconnection, progress messages were
sent through the old dead websocket. Now references the mesh
coordinator's current websocket dynamically, so progress relay
survives reconnection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The admin promotion and demotion re-registration messages referenced
self.worker.id_token which doesn't exist with account key auth,
causing 'RTCWorkerClient has no attribute id_token' errors during
multi-worker reconnection scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alicup29 alicup29 changed the title Add WebSocket reconnection with exponential backoff and signaling heartbeat Add WebSocket Reconnection with Exponential Backoff and Signaling Heartbeat Mar 19, 2026
alicup29 and others added 2 commits March 19, 2026 13:28
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Includes Caddy config with all API routes, handle_path for relay,
env vars for Docker, Elastic IP reassociation steps, troubleshooting
table, and warning about Terraform-triggered instance replacement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alicup29 alicup29 merged commit 753defe into main Mar 19, 2026
5 checks passed
@alicup29 alicup29 deleted the amick/worker-try-reconnect branch March 19, 2026 21:04