Merged
After worker reconnection, the dashboard's cached worker metadata is stale and missing mount info, causing the file browser to request path=/, which triggers ACCESS_DENIED. Re-fetch worker data from the server each time the file browser opens to ensure fresh mount configuration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SIGINT handler now cancels all active mesh coordinator tasks (both admin and non-admin handler tasks), closes the mesh coordinator's own websocket (which may differ from the main worker websocket after promotion/demotion churn), and cancels the heartbeat watchdog task. Previously only the admin handler task and main websocket were cleaned up, leaving background tasks running after Ctrl+C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
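The cleanup described in this commit can be illustrated with a minimal asyncio sketch. It is not the project's code: `cancel_all`, `demo`, and the three stand-in tasks are hypothetical names, and closing the websockets is left out so the example stays self-contained.

```python
import asyncio


async def cancel_all(tasks):
    """Cancel every task that is still running, then wait for all of them to finish."""
    pending = [t for t in tasks if t is not None and not t.done()]
    for t in pending:
        t.cancel()
    # return_exceptions=True collects each CancelledError instead of re-raising it here
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)


async def demo():
    async def forever():
        await asyncio.sleep(3600)

    # Stand-ins for the admin handler, non-admin handler, and heartbeat watchdog tasks
    admin = asyncio.create_task(forever())
    non_admin = asyncio.create_task(forever())
    watchdog = asyncio.create_task(forever())
    await asyncio.sleep(0)  # give the tasks a chance to start

    # None models a task slot that was never created
    cancelled = await cancel_all([admin, non_admin, watchdog, None])
    return cancelled, all(t.cancelled() for t in (admin, non_admin, watchdog))


cancelled_count, all_cancelled = asyncio.run(demo())
```

Passing every task slot through one helper, including slots that may be empty, is one way to avoid the earlier situation where only some of the background tasks were cleaned up.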
…demotion

Re-registration messages after admin promotion/demotion were missing worker_name, mounts, hostname, sleap_version, and other metadata that the initial registration includes. This caused two bugs after reconnection:
- Worker showed as peer_id instead of name on the dashboard
- File browser was frozen/empty because mounts were missing from metadata

Extract _build_registration_msg() helper to ensure consistent metadata across initial registration and all re-registrations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
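A rough sketch of the idea behind a shared registration builder follows. The real `_build_registration_msg()` signature and message schema are not shown in this PR page, so the function name `build_registration_msg`, the `"register"` type, the `is_admin` flag, and all sample values below are assumptions made for illustration.

```python
import socket


def build_registration_msg(peer_id, worker_name, mounts, sleap_version, is_admin):
    """Build one payload shape for the initial registration and every re-registration."""
    return {
        "type": "register",  # hypothetical message type
        "peer_id": peer_id,
        "is_admin": is_admin,  # hypothetical field; the only part that varies here
        "metadata": {
            "worker_name": worker_name,
            "hostname": socket.gethostname(),
            "mounts": list(mounts),
            "sleap_version": sleap_version,
        },
    }


# Placeholder values, not taken from the PR
mounts = [{"label": "data", "path": "/mnt/data"}]
initial = build_registration_msg("peer-1", "gpu-node-a", mounts, "0.0.0", True)
after_demotion = build_registration_msg("peer-1", "gpu-node-a", mounts, "0.0.0", False)
```

Because both messages come from the same function, a re-registration after promotion or demotion cannot silently drop `worker_name` or `mounts`.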
Non-admin workers ignored dashboard requests (fs_list_req, use_worker_path, fs_check_videos, job_assigned, ping) because the non-admin handler loop only processed mesh signaling messages. After reconnection, when a worker gets demoted to non-admin, its WebSocket is managed by the non-admin handler, which must handle all the same relay-forwarded requests that the admin handler does.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
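The routing change can be sketched as a loop that recognizes both message families. This is an illustration under assumptions, not the actual handler: the message type names come from this PR, while `non_admin_loop`, `handle_mesh`, `handle_relay`, and the JSON `{"type": ...}` framing are hypothetical.

```python
import asyncio
import json

# Message type names as listed in the PR text
MESH_TYPES = {"mesh_answer", "ice_candidate"}
RELAY_TYPES = {"fs_list_req", "use_worker_path", "fs_check_videos", "job_assigned", "ping"}


async def handle_mesh(msg):
    # Placeholder for mesh signaling handling
    return ("mesh", msg["type"])


async def handle_relay(msg):
    # Placeholder for the request handling shared with the admin handler
    return ("relay", msg["type"])


async def non_admin_loop(raw_messages):
    """Route each message; relay-forwarded requests are handled instead of dropped."""
    results = []
    for raw in raw_messages:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind in MESH_TYPES:
            results.append(await handle_mesh(msg))
        elif kind in RELAY_TYPES:
            results.append(await handle_relay(msg))
        else:
            results.append(("ignored", kind))
    return results


incoming = [json.dumps({"type": t}) for t in ("ice_candidate", "fs_list_req", "job_assigned", "unknown")]
routed = asyncio.run(non_admin_loop(incoming))
```

Keeping the relay request types in one shared set is one way to let the admin and non-admin handlers stay in sync as new request types are added.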
Summary
Fixes four bugs that occur after workers reconnect to the signaling server (e.g., after a server restart or network disruption):
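- After reconnection, the dashboard's cached worker metadata was stale and missing mount info, so the file browser requested path=/ and received ACCESS_DENIED. The dashboard now re-fetches worker data from the server each time the file browser opens.
- Re-registration messages after admin promotion/demotion were missing worker_name, mounts, hostname, and other metadata. Extracted _build_registration_msg() helper to ensure all re-registrations include the same full metadata as the initial registration.
- The non-admin handler loop only processed mesh signaling messages (e.g., mesh_answer, ice_candidate). Dashboard requests like fs_list_req, job_assigned, use_worker_path, and fs_check_videos were silently ignored. Now the non-admin handler processes all relay-forwarded requests.
- The SIGINT handler left mesh coordinator resources running (_non_admin_handler_task, its own websocket, heartbeat watchdog) that weren't being cancelled. Now all active tasks and websockets are cleaned up.
Motivation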
Workers deployed on GPU clusters reconnect automatically after signaling server restarts (PR #69). But after reconnection, admin re-election causes workers to cycle through promotion/demotion, which left several subsystems in broken states. These fixes ensure that the full worker lifecycle — file browsing, job submission, dashboard display, and clean shutdown — works correctly after any reconnection event.
Test plan
🤖 Generated with Claude Code