Fix Worker Room Reconnection Bugs #72

Merged
alicup29 merged 4 commits into main from amick/worker-reconnection-bugs on Mar 24, 2026

@alicup29

Summary

Fixes four bugs that occur after workers reconnect to the signaling server (e.g., after a server restart or network disruption):

  • File browser empty after reconnection — Dashboard cached stale worker metadata (missing mounts). Now re-fetches worker data from the server each time the file browser opens in Submit Job.
  • Worker name shows as peer_id on dashboard — Re-registration messages after admin promotion/demotion were missing worker_name, mounts, hostname, and other metadata. Extracted _build_registration_msg() helper to ensure all re-registrations include the same full metadata as the initial registration.
  • File browser frozen on non-admin workers — The non-admin WebSocket handler only processed mesh signaling messages (mesh_answer, ice_candidate). Dashboard requests like fs_list_req, job_assigned, use_worker_path, and fs_check_videos were silently ignored. Now the non-admin handler processes all relay-forwarded requests.
  • Ctrl+C doesn't exit after reconnection — SIGINT handler only cancelled the admin handler task and main websocket. After reconnection with admin churn, the mesh coordinator has additional active tasks (_non_admin_handler_task, its own websocket, heartbeat watchdog) that weren't being cancelled. Now all active tasks and websockets are cleaned up.

Motivation

Workers deployed on GPU clusters reconnect automatically after signaling server restarts (PR #69). But after reconnection, admin re-election causes workers to cycle through promotion/demotion, which left several subsystems in broken states. These fixes ensure that the full worker lifecycle — file browsing, job submission, dashboard display, and clean shutdown — works correctly after any reconnection event.

Test plan

  • All 852 tests pass
  • E2E: restart signaling server with 2 workers → both reconnect, names show correctly
  • E2E: file browsing works on both admin and non-admin workers after reconnection
  • E2E: training job submission works after reconnection
  • E2E: Ctrl+C exits cleanly after reconnection

🤖 Generated with Claude Code

alicup29 and others added 4 commits March 23, 2026 17:43
After worker reconnection, the dashboard's cached worker metadata is
stale and missing mount info, causing file browser to request path=/
which triggers ACCESS_DENIED. Re-fetch worker data from the server
each time the file browser opens to ensure fresh mount configuration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SIGINT handler now cancels all active mesh coordinator tasks (both
admin and non-admin handler tasks), closes the mesh coordinator's
own websocket (which may differ from the main worker websocket after
promotion/demotion churn), and cancels the heartbeat watchdog task.
Previously only the admin handler task and main websocket were cleaned
up, leaving background tasks running after Ctrl+C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
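
The cleanup described in the commit above can be sketched roughly as follows. This is a minimal illustration and not the project's actual handler: the `shutdown` function and its `tasks` and `websockets` parameters are hypothetical names, and the only thing assumed about each websocket is that it has an async `close()` method.

```python
import asyncio
import contextlib


async def shutdown(tasks, websockets):
    """Cancel every live task and close every distinct websocket (hypothetical helper)."""
    current = asyncio.current_task()
    # Skip unset slots, already-finished tasks, and the task running this cleanup.
    pending = [t for t in tasks if t is not None and t is not current and not t.done()]
    for task in pending:
        task.cancel()
    # Wait until the cancellations land; return_exceptions collects CancelledError.
    await asyncio.gather(*pending, return_exceptions=True)

    # The mesh coordinator's websocket may or may not be the same object as the
    # main worker websocket, so close each distinct connection exactly once.
    seen = set()
    for ws in websockets:
        if ws is None or id(ws) in seen:
            continue
        seen.add(id(ws))
        with contextlib.suppress(Exception):
            await ws.close()
    return len(pending), len(seen)
```

Passing the admin handler task, the non-admin handler task, and the heartbeat watchdog in one list means none of them can be missed, whichever role the worker holds at the moment of Ctrl+C. On Unix, a coroutine like this could be scheduled from a handler registered with `loop.add_signal_handler(signal.SIGINT, ...)`.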
…demotion

Re-registration messages after admin promotion/demotion were missing
worker_name, mounts, hostname, sleap_version, and other metadata that
the initial registration includes. This caused two bugs after
reconnection:
- Worker showed as peer_id instead of name on the dashboard
- File browser was frozen/empty because mounts were missing from metadata

Extract _build_registration_msg() helper to ensure consistent metadata
across initial registration and all re-registrations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
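
A single-builder approach like the one described above might look like the sketch below. The metadata field names (`worker_name`, `mounts`, `hostname`, `sleap_version`) come from the commit message. Everything else is assumed for illustration and may differ from the real `_build_registration_msg()`: the `WorkerInfo` container, the `"type": "register"` value, the `is_admin` flag, and the nesting under `metadata`.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerInfo:
    # Hypothetical container for worker state; the real worker may store this differently.
    peer_id: str
    worker_name: str
    hostname: str
    sleap_version: str
    mounts: list = field(default_factory=list)


def build_registration_msg(info: WorkerInfo, is_admin: bool) -> dict:
    """Build the one payload used for the first registration and every re-registration."""
    return {
        "type": "register",  # assumed message type
        "peer_id": info.peer_id,
        "is_admin": is_admin,  # assumed flag; only this should change across re-registrations
        "metadata": {
            "worker_name": info.worker_name,
            "hostname": info.hostname,
            "sleap_version": info.sleap_version,
            "mounts": list(info.mounts),  # copy so later edits do not alter a sent message
        },
    }
```

Because every call site goes through the same function, a promotion or demotion cannot send a payload that omits `worker_name` or `mounts`.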
Non-admin workers ignored dashboard requests (fs_list_req, use_worker_path,
fs_check_videos, job_assigned, ping) because the non-admin handler loop
only processed mesh signaling messages. After reconnection when a worker
gets demoted to non-admin, its WebSocket is managed by the non-admin
handler — which must handle all the same relay-forwarded requests that
the admin handler does.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
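
The routing change described above can be illustrated with a small dispatcher. The message names are the ones listed in this PR. The JSON framing, the `type` key, and the `on_mesh` and `on_relay` callbacks are assumptions made for the sketch, not the project's actual handler signature.

```python
import asyncio
import json

# Message names taken from the PR description and commit message.
MESH_TYPES = {"mesh_answer", "ice_candidate"}
RELAY_TYPES = {"fs_list_req", "use_worker_path", "fs_check_videos", "job_assigned", "ping"}


async def handle_non_admin_message(raw, on_mesh, on_relay):
    """Route one incoming frame; returns which path handled it (hypothetical helper)."""
    msg = json.loads(raw)
    msg_type = msg.get("type")  # assumed field name
    if msg_type in MESH_TYPES:
        await on_mesh(msg)
        return "mesh"
    if msg_type in RELAY_TYPES:
        # Before the fix, frames like these reached the non-admin loop and were dropped.
        await on_relay(msg)
        return "relay"
    return "ignored"
```

Sharing `on_relay` with the admin handler would keep the two loops from drifting apart again.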
@alicup29 changed the title from "Fix worker reconnection bugs: file browsing, metadata, Ctrl+C, non-admin routing" to "Fix Worker Room Reconnection Bugs" on Mar 24, 2026
@alicup29 merged commit ffa3ef3 into main on Mar 24, 2026
10 checks passed
@alicup29 deleted the amick/worker-reconnection-bugs branch on March 24, 2026 20:14