Implement Dashboard Job Submission with Relay Server #64
Merged
Replace direct WebRTC peer connections with HTTP + SSE relay architecture for all dashboard↔worker communication (file browsing, path validation, video resolution, job submission/status).

Changes:
- Add SSE helpers (sseConnect) and HTTP API wrappers for worker messaging, filesystem listing, job submit/cancel
- Remove ~200 lines of WebRTC code (connectToWorker, disconnectFromWorker, sendFsMessage, onDataChannelMessage)
- Rewrite Step 3 with full validation flow: browse .slp → validate path → check videos → resolve missing videos → submit with path_mappings
- Add validation status and missing videos resolution UI (HTML + CSS)
- Update config.js with RELAY_SERVER and signaling.sleap.ai endpoints
- Add worker mounts to registration properties for metadata enrichment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers register with status "available" (not "idle"), which caused them to appear grayed out and unselectable in the job submission modal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The worker's admin WebSocket handler was ignoring fs_list_req, use_worker_path, and fs_check_videos messages from the signaling server (logged as "Admin ignoring message type"). Added handlers that translate between JSON relay format and the worker's existing :: protocol methods. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
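As a rough illustration of the handler change described above, the sketch below routes incoming relay JSON by its message type and returns `None` for types it does not know, which is the case the worker previously logged as ignored. The message type names come from this commit; the `path` field, the `make_dispatcher` helper, and the placeholder handler bodies are assumptions made for the example, not the worker's actual code.

```python
import json


def make_dispatcher(handlers):
    """Return a function that routes a raw JSON message to the handler for its type."""

    def dispatch(raw):
        msg = json.loads(raw)
        handler = handlers.get(msg.get("type"))
        if handler is None:
            # Unknown type: the caller can log "ignoring message type" here.
            return None
        return handler(msg)

    return dispatch


# Message types are from the commit; the handler bodies are placeholders that
# stand in for calls into the worker's existing :: protocol methods.
HANDLERS = {
    "fs_list_req": lambda msg: ("list", msg["path"]),
    "use_worker_path": lambda msg: ("validate", msg["path"]),
    "fs_check_videos": lambda msg: ("check_videos", msg["path"]),
}

dispatch = make_dispatcher(HANDLERS)
```

Keeping the mapping in one dictionary makes it easy to see which relay message types the worker handles and which fall through.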
The infinite scroll was appending ?offset=100 to the filesystem path (e.g., /root/vast/amick?offset=100), which failed the mount check. Now sends offset as a separate JSON field through the relay chain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
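A minimal sketch of why the query-string form failed and the separate-field form does not. It assumes the mount root is `/root/vast/amick` and that the mount check compares path components; `is_under_mount` and `build_fs_list_request` are hypothetical helpers, and the `req_id` field name is borrowed from a later commit in this PR.

```python
import json
from pathlib import PurePosixPath


def is_under_mount(path, mounts):
    """True if path equals a mount root or lies somewhere below one."""
    p = PurePosixPath(path)
    return any(
        p == PurePosixPath(m) or PurePosixPath(m) in p.parents for m in mounts
    )


def build_fs_list_request(path, offset=0, req_id=None):
    """Build the relay payload with offset carried outside the path string."""
    return json.dumps(
        {"type": "fs_list_req", "path": path, "offset": offset, "req_id": req_id}
    )


MOUNTS = ["/root/vast/amick"]  # assumed mount root for illustration

# The old form glued the offset onto the last path component.
old_style_path = "/root/vast/amick?offset=100"
request = json.loads(build_fs_list_request("/root/vast/amick", offset=100))
```

With these assumptions, `old_style_path` no longer matches the mount root, while `request["path"]` does and the offset travels alongside it.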
Two fixes:
1. File browser reset: fs_list_res with no matching pending req_id was falling through to colIndex=1, resetting the browser. Now ignored.
2. Job submission: worker had no handler for job_assigned messages from the signaling server. Added RelayChannel shim that translates channel.send() protocol messages to job_status JSON via WebSocket, enabling the full job lifecycle (accept/reject/progress/complete/fail) to flow through the relay to the dashboard SSE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
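The first fix amounts to dropping responses that do not correspond to an outstanding request. The dashboard code is JavaScript; the sketch below shows the same idea in Python for consistency with the worker examples. The `PendingRequests` class and its method names are hypothetical.

```python
class PendingRequests:
    """Track outstanding listing requests by req_id."""

    def __init__(self):
        self._pending = {}  # req_id -> column index awaiting a listing

    def register(self, req_id, col_index):
        self._pending[req_id] = col_index

    def resolve(self, msg):
        """Return the column index for a response, or None if it is stale or unknown."""
        return self._pending.pop(msg.get("req_id"), None)


pending = PendingRequests()
pending.register("req-1", 2)

first = pending.resolve({"type": "fs_list_res", "req_id": "req-1"})
duplicate = pending.resolve({"type": "fs_list_res", "req_id": "req-1"})
unsolicited = pending.resolve({"type": "fs_list_res"})
```

A caller would render the listing only when `resolve` returns a column index, and ignore the message when it returns `None` rather than falling back to a default column.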
1. Removed infinite scroll pagination from file browser - it was re-rendering the current column when triggered, wiping out any deeper columns the user had navigated into.
2. Added readyState='open' to RelayChannel shim so the job executor's channel.readyState checks don't crash with AttributeError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
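To show the shape of the second fix, here is a small stand-in for a data channel that exposes `readyState` and forwards `send()` calls as JSON-style dictionaries. It is a sketch only: the class name `RelayChannelSketch`, the injected `send_json` callable, and the `message` field are assumptions, not the shim's real interface.

```python
class RelayChannelSketch:
    """Hypothetical stand-in for an RTCDataChannel that forwards over a WebSocket."""

    def __init__(self, send_json, job_id):
        self._send_json = send_json  # callable that ships a dict to the signaling server
        self._job_id = job_id
        # Executor code reads channel.readyState; without this attribute the
        # lookup raises AttributeError.
        self.readyState = "open"

    def send(self, message):
        """Wrap a protocol string in a job_status envelope and forward it."""
        self._send_json(
            {"type": "job_status", "job_id": self._job_id, "message": message}
        )


sent = []
channel = RelayChannelSketch(sent.append, "job_xxx")
if channel.readyState == "open":
    channel.send("hello")
```

Injecting the sender as a callable keeps the sketch testable without a real WebSocket connection.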
… dashboard

The spread operator in RelayChannel.send() was letting the worker's internal per-model job_id (e.g. job_xxx_0) overwrite the signaling server's job_id (e.g. job_xxx) that the dashboard SSE channel is subscribed to. Reversed spread order so _base (with correct job_id) always wins. Also added INFERENCE_BEGIN/COMPLETE/FAILED message forwarding so the dashboard shows proper stage transitions (Training → Running inference → Complete).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
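The ordering bug can be reproduced with plain dictionaries. This sketch assumes the merge is done with Python dict unpacking, where later entries override earlier ones; the variable names and the `status` field are illustrative.

```python
# Envelope fields owned by the relay layer, including the job_id the
# dashboard's SSE channel is subscribed to.
base = {"type": "job_status", "job_id": "job_xxx"}

# Fields parsed from the worker's protocol message, carrying the internal
# per-model job_id.
payload = {"job_id": "job_xxx_0", "status": "running"}

# Later keys win, so unpacking payload last lets it clobber the envelope.
wrong = {**base, **payload}

# Unpacking base last guarantees the envelope's job_id survives.
right = {**payload, **base}
```

After the swap, `right` keeps every payload field that does not collide with the envelope, and the envelope wins wherever they do.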
Filter PROGRESS_REPORT messages in RelayChannel to forward only train_begin, epoch_end, and train_end events. Add a training log terminal to the dashboard that shows epoch summaries with loss metrics and wandb links in real time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
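A sketch of the filtering rule, under stated assumptions: the text after `PROGRESS_REPORT::` is taken to be a JSON object with an `event` key, and forwarded messages are tagged as `job_progress`. Only the prefix and the three event names come from the commit; the payload shape and the `filter_progress` helper are hypothetical.

```python
import json

FORWARDED_EVENTS = {"train_begin", "epoch_end", "train_end"}
PREFIX = "PROGRESS_REPORT::"


def filter_progress(message):
    """Return a dict to forward to the relay, or None if the message is dropped."""
    if not message.startswith(PREFIX):
        return None
    try:
        report = json.loads(message[len(PREFIX):])
    except json.JSONDecodeError:
        return None
    if not isinstance(report, dict) or report.get("event") not in FORWARDED_EVENTS:
        return None
    # Set "type" last so a stray key in the report cannot override it.
    return {**report, "type": "job_progress"}


kept = filter_progress('PROGRESS_REPORT::{"event": "epoch_end", "epoch": 3}')
dropped = filter_progress('PROGRESS_REPORT::{"event": "batch_end", "batch": 17}')
```

Dropping everything except the three epoch-level events keeps the relay traffic to a few messages per epoch rather than one per batch.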
Update TestDashboardJobSubmission smoke tests to assert the new relay-based patterns (sseConnect, apiFsList, job_status/job_progress SSE handlers) instead of the removed WebRTC patterns (connectToWorker, sendFsMessage, JOB_SUBMIT protocol strings). Fix Black formatting in mesh_coordinator.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Replaces the direct WebRTC data channel approach for dashboard↔worker communication with an HTTP + SSE relay architecture. The dashboard can now browse worker filesystems, submit training jobs, and monitor live epoch-level progress — all without establishing WebRTC peer connections.
Motivation
The original dashboard job submission used WebRTC data channels between the browser and HPC workers. This worked on local networks but consistently failed in production HPC environments due to:
After 8+ commits debugging ICE/STUN issues on the earlier add-dashboard-job-submission branch, we redesigned the architecture around outbound-only connections from workers, routing all communication through the signaling server and a lightweight SSE relay on the EC2 instance. This eliminates NAT traversal entirely: workers connect outbound via WebSocket, and browsers connect outbound via HTTP/SSE.

See the companion server-side PR: webRTC-connect#29
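For readers unfamiliar with SSE, the sketch below shows how a `text/event-stream` body is framed into events. It follows the standard framing (`event:` and `data:` fields, a blank line ending each event, lines starting with `:` as comments) and is written in Python for consistency with the worker examples. The `parse_sse` helper and the sample stream are illustrative and are not taken from the dashboard or relay code.

```python
def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of text/event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keepalive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is stripped
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


stream = [
    "event: job_progress\n",
    'data: {"epoch": 1}\n',
    "\n",
    ": keepalive\n",
    "data: hi\n",
    "\n",
]
events = list(parse_sse(stream))
```

Because the browser only issues an ordinary outbound HTTP request to receive this stream, no inbound connectivity or NAT traversal is needed on either side.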
Architecture
- fs_list HTTP request → signaling pushes fs_list_req to worker WS → worker responds with fs_list_res → signaling forwards to relay → dashboard receives via SSE
- /api/jobs/submit → signaling pushes job_assigned to worker → worker runs job, sends job_status/job_progress over WS → relay → dashboard SSE
- RelayChannel shim filters PROGRESS_REPORT:: messages, forwarding only epoch-level events (train_begin, epoch_end, train_end) to avoid flooding the relay
Changes
Dashboard (dashboard/)
- Replaced WebRTC code (connectToWorker, disconnectFromWorker, sendFsMessage, onDataChannelMessage) with SSE helpers (sseConnect) and HTTP API wrappers. Added full Step 3 validation flow (browse .slp → validate path → check videos → resolve missing → submit). Added job_progress SSE handler with training log terminal display.
- #sj-training-log container for live epoch output
Worker (sleap_rtc/worker/)
- Added handlers (fs_list_req, use_worker_path, fs_check_videos, job_assigned) that translate between JSON relay format and the worker's existing :: protocol. Added RelayChannel shim class that mimics RTCDataChannel.send() but routes protocol messages through WebSocket as JSON, with epoch-level progress filtering.
Key fixes along the way
- Status mismatch (available vs idle) preventing worker selection
Test plan
🤖 Generated with Claude Code