Implement Dashboard Job Submission with Relay Server by alicup29 · Pull Request #64 · talmolab/sleap-rtc

alicup29 · 2026-03-13T20:28:19Z

Summary

Replaces the direct WebRTC data channel approach for dashboard↔worker communication with an HTTP + SSE relay architecture. The dashboard can now browse worker filesystems, submit training jobs, and monitor live epoch-level progress — all without establishing WebRTC peer connections.

Motivation

The original dashboard job submission used WebRTC data channels between the browser and HPC workers. This worked on local networks but consistently failed in production HPC environments due to:

Symmetric NAT on HPC clusters blocking STUN-based hole punching
Firewall rules dropping UDP traffic and blocking TURN relay
ICE negotiation timeouts causing jobs to silently fail to submit

After 8+ commits debugging ICE/STUN issues on the earlier add-dashboard-job-submission branch, we redesigned the architecture around outbound-only connections from workers, routing all communication through the signaling server and a lightweight SSE relay on the EC2 instance. This eliminates NAT traversal entirely — workers connect outbound via WebSocket, and browsers connect outbound via HTTP/SSE.

See the companion server-side PR: webRTC-connect#29

Architecture

Dashboard (browser)
    │
    ├── HTTP POST ──→ Signaling Server ──→ Worker (via WebSocket)
    │                      │
    └── SSE ←── Relay ←────┘ (worker sends status/progress back)

File browsing: Dashboard sends fs_list HTTP request → signaling pushes fs_list_req to worker WS → worker responds with fs_list_res → signaling forwards to relay → dashboard receives via SSE
Job submission: Dashboard POSTs config to /api/jobs/submit → signaling pushes job_assigned to worker → worker runs job, sends job_status/job_progress over WS → relay → dashboard SSE
Training progress: Worker's RelayChannel shim filters PROGRESS_REPORT:: messages, forwarding only epoch-level events (train_begin, epoch_end, train_end) to avoid flooding the relay
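The epoch-level filtering described above might look roughly like the following. This is a minimal sketch, not the code in mesh_coordinator.py: it assumes the payload after `PROGRESS_REPORT::` is a JSON object carrying an `event` key, and the function name `filter_progress` and the outgoing `type` field are hypothetical.

```python
import json

PREFIX = "PROGRESS_REPORT::"
# Event names taken from the PR description; everything else is dropped.
EPOCH_LEVEL_EVENTS = {"train_begin", "epoch_end", "train_end"}


def filter_progress(message: str):
    """Return a dict to forward over the relay, or None to drop the message."""
    if not message.startswith(PREFIX):
        return None  # not a progress report; out of scope for this sketch
    try:
        payload = json.loads(message[len(PREFIX):])
    except json.JSONDecodeError:
        return None  # malformed payload: drop rather than forward garbage
    if not isinstance(payload, dict):
        return None
    if payload.get("event") not in EPOCH_LEVEL_EVENTS:
        return None  # e.g. per-batch events would be dropped here
    # "type" is written last so the payload cannot override it (assumed field name).
    return {**payload, "type": "job_progress"}
```

Under these assumptions, a training run forwards roughly one message per epoch plus the begin and end markers, which matches the "only ~1 message per epoch" goal in the test plan.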

Changes

Dashboard (`dashboard/`)

app.js: Replaced ~200 lines of WebRTC connection code (connectToWorker, disconnectFromWorker, sendFsMessage, onDataChannelMessage) with SSE helpers (sseConnect) and HTTP API wrappers. Added full Step 3 validation flow (browse .slp → validate path → check videos → resolve missing → submit). Added job_progress SSE handler with training log terminal display.
index.html: Added #sj-training-log container for live epoch output
styles.css: Added training log terminal styles and validation/video resolution UI styles
config.js: Updated endpoints for relay server and signaling.sleap.ai

Worker (`sleap_rtc/worker/`)

mesh_coordinator.py: Added relay-forwarded message handlers (fs_list_req, use_worker_path, fs_check_videos, job_assigned) that translate between JSON relay format and the worker's existing :: protocol. Added RelayChannel shim class that mimics RTCDataChannel.send() but routes protocol messages through WebSocket as JSON, with epoch-level progress filtering.
worker_class.py: Added worker mounts to registration properties for dashboard metadata enrichment

Key fixes along the way

Worker status mismatch (available vs idle) preventing worker selection
SSE offset appended to filesystem path breaking mount validation
Stale SSE responses resetting file browser columns
Relay job_id collision from spread operator letting internal per-model IDs overwrite the signaling server's job_id
Infinite scroll re-rendering wiping deeper navigation columns

Test plan

File browser navigates worker filesystem via relay
Path validation and video resolution work end-to-end
Job submission succeeds and status updates flow to dashboard
Training log terminal shows epoch summaries with loss metrics
WandB link appears when enabled in training config
Job completion and failure states display correctly
Relay is not flooded (only ~1 message per epoch forwarded)

🤖 Generated with Claude Code

Replace direct WebRTC peer connections with HTTP + SSE relay architecture for all dashboard↔worker communication (file browsing, path validation, video resolution, job submission/status).

Changes:
- Add SSE helpers (sseConnect) and HTTP API wrappers for worker messaging, filesystem listing, job submit/cancel
- Remove ~200 lines of WebRTC code (connectToWorker, disconnectFromWorker, sendFsMessage, onDataChannelMessage)
- Rewrite Step 3 with full validation flow: browse .slp → validate path → check videos → resolve missing videos → submit with path_mappings
- Add validation status and missing videos resolution UI (HTML + CSS)
- Update config.js with RELAY_SERVER and signaling.sleap.ai endpoints
- Add worker mounts to registration properties for metadata enrichment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Workers register with status "available" (not "idle"), which caused them to appear grayed out and unselectable in the job submission modal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The worker's admin WebSocket handler was ignoring fs_list_req, use_worker_path, and fs_check_videos messages from the signaling server (logged as "Admin ignoring message type"). Added handlers that translate between JSON relay format and the worker's existing :: protocol methods. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The infinite scroll was appending ?offset=100 to the filesystem path (e.g., /root/vast/amick?offset=100), which failed the mount check. Now sends offset as a separate JSON field through the relay chain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
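As an illustration of the fix in this commit, the request can carry the offset next to the path instead of inside it. This is a sketch only: the helper name `build_fs_list_request` and the field names `path`, `offset`, and `req_id` are assumptions, and the real request is built in the dashboard's JavaScript.

```python
import json
import uuid


def build_fs_list_request(path: str, offset: int = 0) -> dict:
    """Build a JSON-serializable listing request with the offset as its own field."""
    return {
        "type": "fs_list_req",          # message type named in the PR
        "req_id": uuid.uuid4().hex,     # hypothetical correlation id
        "path": path,                   # sent untouched, so mount checks see the real path
        "offset": offset,               # pagination travels separately from the path
    }


request = build_fs_list_request("/root/vast/amick", offset=100)
wire_payload = json.dumps(request)
```

Keeping pagination out of the path string means whatever validates the path on the worker never has to strip query-style suffixes first.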

Two fixes:
1. File browser reset: fs_list_res with no matching pending req_id was falling through to colIndex=1, resetting the browser. Now ignored.
2. Job submission: worker had no handler for job_assigned messages from the signaling server. Added RelayChannel shim that translates channel.send() protocol messages to job_status JSON via WebSocket, enabling the full job lifecycle (accept/reject/progress/complete/fail) to flow through the relay to the dashboard SSE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
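The stale-response guard from the first fix can be sketched as a small pending-request table. The actual logic lives in the dashboard's app.js (JavaScript); the version below is a Python rendering for illustration, and the class name `PendingListings` and its methods are hypothetical.

```python
from typing import Optional


class PendingListings:
    """Tracks which browser column each outstanding fs_list request belongs to."""

    def __init__(self) -> None:
        self._pending: dict[str, int] = {}

    def expect(self, req_id: str, col_index: int) -> None:
        # Remember the column that should be filled when this request returns.
        self._pending[req_id] = col_index

    def resolve(self, response: dict) -> Optional[int]:
        # Return the target column, or None when the response matches no pending
        # request (stale or duplicate). Callers should ignore None instead of
        # defaulting to a column, which is what used to reset the browser.
        return self._pending.pop(response.get("req_id"), None)
```

Because `resolve` removes the entry it matches, a replayed response for the same req_id is also treated as stale.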

1. Removed infinite scroll pagination from file browser - it was re-rendering the current column when triggered, wiping out any deeper columns the user had navigated into.
2. Added readyState='open' to RelayChannel shim so the job executor's channel.readyState checks don't crash with AttributeError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
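A shim of the kind described in the second item could look roughly like this. It is a sketch under stated assumptions, not the class in mesh_coordinator.py: the name `RelayChannelSketch`, the injected `send_json` callable (standing in for the worker's WebSocket send), and the `status`/`detail` field names are all hypothetical, and the `KIND::body` message layout is inferred from the PR's mention of a `::` protocol.

```python
import json
from typing import Callable


class RelayChannelSketch:
    """Duck-types the parts of a data channel that the job executor touches."""

    # Callers check channel.readyState before sending; without this attribute
    # that check raises AttributeError.
    readyState = "open"

    def __init__(self, send_json: Callable[[str], None], job_id: str) -> None:
        self._send_json = send_json  # assumed: wraps the signaling WebSocket
        self._job_id = job_id

    def send(self, message: str) -> None:
        # Split "KIND::body" into its two parts; body is "" when absent.
        kind, _, body = message.partition("::")
        self._send_json(json.dumps({
            "type": "job_status",
            "job_id": self._job_id,
            "status": kind,
            "detail": body,
        }))
```

Injecting the send callable keeps the shim testable without a live WebSocket; for example, passing `list.append` of a local list collects the outgoing JSON strings.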

… dashboard

The spread operator in RelayChannel.send() was letting the worker's internal per-model job_id (e.g. job_xxx_0) overwrite the signaling server's job_id (e.g. job_xxx) that the dashboard SSE channel is subscribed to. Reversed spread order so _base (with correct job_id) always wins.

Also added INFERENCE_BEGIN/COMPLETE/FAILED message forwarding so the dashboard shows proper stage transitions (Training → Running inference → Complete).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
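The ordering issue is a general property of dict unpacking in Python: when two mappings are merged with `{**a, **b}`, keys from the later one win. A minimal demonstration follows, using the example IDs from the commit message; the variable names `base` and `extra` and the `status` field are illustrative, not taken from the worker code.

```python
# What the signaling server and dashboard SSE channel know the job as.
base = {"type": "job_status", "job_id": "job_xxx"}
# Fields parsed from the worker's protocol message, including a per-model id.
extra = {"job_id": "job_xxx_0", "status": "running"}

# Buggy order: extra is unpacked last, so its job_id overwrites the base one.
buggy = {**base, **extra}

# Fixed order: base is unpacked last, so its job_id always wins.
fixed = {**extra, **base}

print(buggy["job_id"])  # job_xxx_0
print(fixed["job_id"])  # job_xxx
```

With the fixed order, any field the worker adds still passes through, but it can no longer redirect the message away from the job_id the dashboard subscribed to.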

Filter PROGRESS_REPORT messages in RelayChannel to forward only train_begin, epoch_end, and train_end events. Add a training log terminal to the dashboard that shows epoch summaries with loss metrics and wandb links in real time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Update TestDashboardJobSubmission smoke tests to assert the new relay-based patterns (sseConnect, apiFsList, job_status/job_progress SSE handlers) instead of the removed WebRTC patterns (connectToWorker, sendFsMessage, JOB_SUBMIT protocol strings). Fix Black formatting in mesh_coordinator.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

alicup29 and others added 8 commits March 12, 2026 18:34

fix: match worker status 'available' instead of 'idle'

6f8a254


alicup29 changed the title ~~Add relay-based dashboard job submission with live training progress~~ Implement Dashboard Job Submission with Relay Server Mar 13, 2026

alicup29 temporarily deployed to github-pages March 13, 2026 20:37 — with GitHub Pages Inactive

alicup29 merged commit 85d11b4 into main Mar 13, 2026
10 checks passed

alicup29 deleted the amick/implement-dashboard-job-submission branch March 13, 2026 20:45

alicup29 merged 9 commits into main from amick/implement-dashboard-job-submission
