Implement Dashboard Job Submission with Relay Server#64

Merged
alicup29 merged 9 commits into main from amick/implement-dashboard-job-submission
Mar 13, 2026

@alicup29
Collaborator

Summary

Replaces the direct WebRTC data channel approach for dashboard↔worker communication with an HTTP + SSE relay architecture. The dashboard can now browse worker filesystems, submit training jobs, and monitor live epoch-level progress — all without establishing WebRTC peer connections.

Motivation

The original dashboard job submission used WebRTC data channels between the browser and HPC workers. This worked on local networks but consistently failed in production HPC environments due to:

  • Symmetric NAT on HPC clusters blocking STUN-based hole punching
  • Firewall rules dropping UDP traffic and blocking TURN relay
  • ICE negotiation timeouts causing jobs to silently fail to submit

After 8+ commits debugging ICE/STUN issues on the earlier add-dashboard-job-submission branch, we redesigned the architecture around outbound-only connections from workers, routing all communication through the signaling server and a lightweight SSE relay on the EC2 instance. This eliminates NAT traversal entirely — workers connect outbound via WebSocket, and browsers connect outbound via HTTP/SSE.

See the companion server-side PR: webRTC-connect#29

Architecture

Dashboard (browser)
    │
    ├── HTTP POST ──→ Signaling Server ──→ Worker (via WebSocket)
    │                      │
    └── SSE ←── Relay ←────┘ (worker sends status/progress back)

  • File browsing: Dashboard sends fs_list HTTP request → signaling pushes fs_list_req to worker WS → worker responds with fs_list_res → signaling forwards to relay → dashboard receives via SSE
  • Job submission: Dashboard POSTs config to /api/jobs/submit → signaling pushes job_assigned to worker → worker runs job, sends job_status/job_progress over WS → relay → dashboard SSE
  • Training progress: Worker's RelayChannel shim filters PROGRESS_REPORT:: messages, forwarding only epoch-level events (train_begin, epoch_end, train_end) to avoid flooding the relay
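
The epoch-level filtering described in the last bullet could look roughly like the sketch below. It is an illustration only, not the code in mesh_coordinator.py: it assumes the text after `PROGRESS_REPORT::` is a JSON object with an `event` key, and the function name `filter_progress` and the `batch_end` event in the usage lines are hypothetical.

```python
import json
from typing import Optional

PROGRESS_PREFIX = "PROGRESS_REPORT::"
# Event names taken from the PR description; everything else is dropped.
FORWARDED_EVENTS = {"train_begin", "epoch_end", "train_end"}


def filter_progress(message: str) -> Optional[dict]:
    """Return a job_progress payload for epoch-level events, else None.

    Assumption: the body after the prefix is a JSON object with an
    "event" key. The real wire format may differ.
    """
    if not message.startswith(PROGRESS_PREFIX):
        return None
    try:
        report = json.loads(message[len(PROGRESS_PREFIX):])
    except json.JSONDecodeError:
        return None
    if not isinstance(report, dict):
        return None
    if report.get("event") not in FORWARDED_EVENTS:
        return None
    # "type" is written last so the report cannot override it.
    return {**report, "type": "job_progress"}


# An epoch summary is forwarded; a per-batch event (hypothetical) is not.
kept = filter_progress('PROGRESS_REPORT::{"event": "epoch_end", "epoch": 3}')
dropped = filter_progress('PROGRESS_REPORT::{"event": "batch_end"}')
```

Filtering on the worker side, before anything reaches the WebSocket, is what keeps the relay to roughly one message per epoch.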

Changes

Dashboard (dashboard/)

  • app.js: Replaced ~200 lines of WebRTC connection code (connectToWorker, disconnectFromWorker, sendFsMessage, onDataChannelMessage) with SSE helpers (sseConnect) and HTTP API wrappers. Added full Step 3 validation flow (browse .slp → validate path → check videos → resolve missing → submit). Added job_progress SSE handler with training log terminal display.
  • index.html: Added #sj-training-log container for live epoch output
  • styles.css: Added training log terminal styles and validation/video resolution UI styles
  • config.js: Updated endpoints for relay server and signaling.sleap.ai

Worker (sleap_rtc/worker/)

  • mesh_coordinator.py: Added relay-forwarded message handlers (fs_list_req, use_worker_path, fs_check_videos, job_assigned) that translate between JSON relay format and the worker's existing :: protocol. Added RelayChannel shim class that mimics RTCDataChannel.send() but routes protocol messages through WebSocket as JSON, with epoch-level progress filtering.
  • worker_class.py: Added worker mounts to registration properties for dashboard metadata enrichment

Key fixes along the way

  • Worker status mismatch (available vs idle) preventing worker selection
  • SSE offset appended to filesystem path breaking mount validation
  • Stale SSE responses resetting file browser columns
  • Relay job_id collision from spread operator letting internal per-model IDs overwrite the signaling server's job_id
  • Infinite scroll re-rendering wiping deeper navigation columns
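
The job_id collision comes down to merge order. The sketch below shows the effect with plain dictionaries; the helper name `build_relay_message` and the `status` field are hypothetical, and only the two job_id shapes (`job_xxx` and `job_xxx_0`) come from this PR.

```python
def build_relay_message(base: dict, payload: dict) -> dict:
    """Merge a worker payload with the relay's base fields.

    In a dict literal, later unpacking wins on duplicate keys, so base
    is unpacked last and its job_id always survives.
    """
    return {**payload, **base}


# base carries the signaling server's job_id that the dashboard subscribed to.
base = {"type": "job_status", "job_id": "job_xxx"}
# payload carries the worker's internal per-model job_id.
payload = {"job_id": "job_xxx_0", "status": "running"}

buggy = {**base, **payload}                  # payload overwrites job_id
fixed = build_relay_message(base, payload)   # base wins
```

With the buggy order the message is published under `job_xxx_0`, an ID no dashboard SSE channel is listening on, so status updates never reach the dashboard.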

Test plan

  • File browser navigates worker filesystem via relay
  • Path validation and video resolution work end-to-end
  • Job submission succeeds and status updates flow to dashboard
  • Training log terminal shows epoch summaries with loss metrics
  • WandB link appears when enabled in training config
  • Job completion and failure states display correctly
  • Relay is not flooded (only ~1 message per epoch forwarded)

🤖 Generated with Claude Code

alicup29 and others added 8 commits March 12, 2026 18:34
Replace direct WebRTC peer connections with HTTP + SSE relay architecture
for all dashboard↔worker communication (file browsing, path validation,
video resolution, job submission/status).

Changes:
- Add SSE helpers (sseConnect) and HTTP API wrappers for worker messaging,
  filesystem listing, job submit/cancel
- Remove ~200 lines of WebRTC code (connectToWorker, disconnectFromWorker,
  sendFsMessage, onDataChannelMessage)
- Rewrite Step 3 with full validation flow: browse .slp → validate path →
  check videos → resolve missing videos → submit with path_mappings
- Add validation status and missing videos resolution UI (HTML + CSS)
- Update config.js with RELAY_SERVER and signaling.sleap.ai endpoints
- Add worker mounts to registration properties for metadata enrichment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Workers register with status "available" (not "idle"), which caused
them to appear grayed out and unselectable in the job submission modal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The worker's admin WebSocket handler was ignoring fs_list_req,
use_worker_path, and fs_check_videos messages from the signaling server
(logged as "Admin ignoring message type"). Added handlers that translate
between JSON relay format and the worker's existing :: protocol methods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The infinite scroll was appending ?offset=100 to the filesystem path
(e.g., /root/vast/amick?offset=100), which failed the mount check.
Now sends offset as a separate JSON field through the relay chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
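
A minimal sketch of the offset fix described in the commit above, assuming a JSON request shape. The field names `path`, `offset`, and `req_id` and the helper `build_fs_list_request` are illustrative assumptions, not the actual relay schema.

```python
def build_fs_list_request(path: str, offset: int = 0, req_id: str = "") -> dict:
    """Build an fs_list_req payload with pagination kept out of the path.

    The path is passed through untouched so the worker can validate it
    as a plain filesystem path; the offset travels as its own field.
    """
    return {
        "type": "fs_list_req",
        "path": path,
        "offset": offset,
        "req_id": req_id,
    }


# Before the fix, the offset was glued onto the path itself:
buggy_path = "/root/vast/amick" + "?offset=100"

# After the fix, the path stays clean and the offset is separate:
request = build_fs_list_request("/root/vast/amick", offset=100, req_id="r1")
```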
Two fixes:
1. File browser reset: fs_list_res with no matching pending req_id was
   falling through to colIndex=1, resetting the browser. Now ignored.
2. Job submission: worker had no handler for job_assigned messages from
   the signaling server. Added RelayChannel shim that translates
   channel.send() protocol messages to job_status JSON via WebSocket,
   enabling the full job lifecycle (accept/reject/progress/complete/fail)
   to flow through the relay to the dashboard SSE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
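
The shim idea from the commit above can be sketched as follows. This is not the RelayChannel class from mesh_coordinator.py: the protocol message names in `STATUS_BY_KIND` are hypothetical, the `detail` field is an assumption, and a plain callable stands in for the worker's WebSocket send, which would normally be asynchronous.

```python
from typing import Callable

# Hypothetical mapping from "::" protocol prefixes to job_status values.
STATUS_BY_KIND = {
    "JOB_ACCEPTED": "accepted",
    "JOB_REJECTED": "rejected",
    "JOB_COMPLETE": "complete",
    "JOB_FAILED": "failed",
}


class RelayChannelSketch:
    """Stand-in for a data channel that forwards messages as JSON dicts."""

    def __init__(self, send_json: Callable[[dict], None], job_id: str):
        self._send_json = send_json  # e.g. a wrapper around the WebSocket
        self._base = {"type": "job_status", "job_id": job_id}

    def send(self, message: str) -> None:
        # Protocol messages look like "KIND::body"; unknown kinds are skipped.
        kind, _, body = message.partition("::")
        status = STATUS_BY_KIND.get(kind)
        if status is None:
            return
        # _base is unpacked last so its job_id is never overwritten.
        self._send_json({"status": status, "detail": body, **self._base})


sent = []
channel = RelayChannelSketch(sent.append, "job_xxx")
channel.send("JOB_ACCEPTED::job_xxx_0")
channel.send("SOMETHING_ELSE::ignored")
```

Because the job executor only ever calls `channel.send()`, swapping in a shim of this shape lets the existing job lifecycle code run unchanged while the transport moves from a data channel to the WebSocket.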
1. Removed infinite scroll pagination from file browser - it was
   re-rendering the current column when triggered, wiping out any
   deeper columns the user had navigated into.
2. Added readyState='open' to RelayChannel shim so the job executor's
   channel.readyState checks don't crash with AttributeError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
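
The readyState fix in the commit above is a duck-typing issue, illustrated below with two toy classes. The class names and the `can_send` helper are hypothetical; the only detail taken from this PR is that the executor reads `channel.readyState` and that the shim sets it to `'open'`.

```python
class ShimWithoutState:
    """Shim that only implements send(), as before the fix."""

    def send(self, message: str) -> None:
        pass


class ShimWithState:
    """Shim that also exposes readyState, as after the fix."""

    readyState = "open"

    def send(self, message: str) -> None:
        pass


def can_send(channel) -> bool:
    # Mirrors the kind of check a job executor might make before sending.
    return channel.readyState == "open"


ok = can_send(ShimWithState())

try:
    can_send(ShimWithoutState())
    crashed = False
except AttributeError:
    crashed = True
```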
… dashboard

The spread operator in RelayChannel.send() was letting the worker's internal
per-model job_id (e.g. job_xxx_0) overwrite the signaling server's job_id
(e.g. job_xxx) that the dashboard SSE channel is subscribed to. Reversed
spread order so _base (with correct job_id) always wins.

Also added INFERENCE_BEGIN/COMPLETE/FAILED message forwarding so the dashboard
shows proper stage transitions (Training → Running inference → Complete).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Filter PROGRESS_REPORT messages in RelayChannel to forward only
train_begin, epoch_end, and train_end events. Add a training log
terminal to the dashboard that shows epoch summaries with loss
metrics and wandb links in real time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alicup29 changed the title from "Add relay-based dashboard job submission with live training progress" to "Implement Dashboard Job Submission with Relay Server" on Mar 13, 2026
Update TestDashboardJobSubmission smoke tests to assert the new
relay-based patterns (sseConnect, apiFsList, job_status/job_progress
SSE handlers) instead of the removed WebRTC patterns (connectToWorker,
sendFsMessage, JOB_SUBMIT protocol strings). Fix Black formatting in
mesh_coordinator.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alicup29 merged commit 85d11b4 into main on Mar 13, 2026
10 checks passed
alicup29 deleted the amick/implement-dashboard-job-submission branch on March 13, 2026 20:45