
Add Dashboard Job Submission Workflow#63

Merged
alicup29 merged 24 commits into main from amick/add-dashboard-job-submission
Mar 12, 2026

Conversation

@alicup29
Collaborator

@alicup29 alicup29 commented Mar 12, 2026

Summary

Adds browser-based job submission to the SLEAP-RTC dashboard — the first end-to-end workflow for submitting training jobs from a web browser to remote GPU workers. This branch validates the full job submission UX and surfaces the NAT traversal limitations that motivate the relay architecture described below.

Dashboard Features

Job submission modal

Multi-step wizard (Select Worker → Configure Job → Browse Files → Submit) embedded in the dashboard. Workers are listed with live GPU specs, CUDA version, sleap-nn version, and idle/busy status. Users upload a sleap-nn config YAML (parsed client-side with js-yaml), browse the worker's filesystem in a column-style file browser to select a .slp labels file, and submit — all from the browser.

Browser-to-worker WebRTC connection

The dashboard establishes a direct RTCPeerConnection + RTCDataChannel to the selected worker, routed through the signaling server's admin WebSocket for SDP exchange. Job submission (JOB_SUBMIT), filesystem browsing (FS_LIST_DIR / FS_LIST_RESPONSE), and progress reporting all flow over this data channel. PSK challenge is bypassed for dashboard peers since they're already authenticated via GitHub OAuth.

Worker metadata in registration

Workers now report sleap_nn_version in their registration metadata alongside GPU model, memory, CUDA version, and hostname. The dashboard renders this in both the room card worker badges and the job modal worker list.

Motivation for relay architecture

During HPC testing, this branch hit the expected NAT traversal problems: symmetric NAT on HPC clusters requires TURN relays for WebRTC, trickle ICE candidate parsing needed multiple fix rounds, and STUN fallback behavior varied across container runtimes. The commit history tells the story — 8 of 22 commits are ICE/STUN debugging fixes.

This validates the architectural decision to move dashboard communication away from direct P2P WebRTC toward the signaling server + relay server pattern documented in scratch/sleap-rtc-dashboard-architecture.md:

  • Workers connect outbound to the signaling server via persistent WebSocket (no inbound ports needed)
  • Dashboard submits jobs via HTTP POST to the signaling server, which forwards to workers over WebSocket
  • Training progress is published outbound from the training process to a relay server via HTTP POST
  • Dashboard monitors via SSE streams from the relay server (auto-reconnect, no WebSocket upgrade)
  • Filesystem browsing routes through signaling → worker WS → relay → SSE, reusing the same SSE connection

All communication is outbound-only from workers/training processes, eliminating NAT traversal entirely. The relay server is a new ~100-line component; the signaling server gains job dispatch and filesystem routing endpoints. The existing worker internals (job execution, filesystem listing, progress monitoring via ZMQ) are transport-agnostic and carry forward unchanged.
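The outbound-only progress leg can be sketched as a plain HTTP POST from the training process to the relay. This is an illustrative sketch only: the `post_progress` helper, the `RELAY_URL` base, and the `/progress/{job_id}` route are assumptions, not the actual relay API.

```python
import json
import urllib.request

def post_progress(relay_url: str, job_id: str, payload: dict) -> int:
    """POST one progress event outbound to the relay server.

    Outbound-only: the training process opens the connection itself,
    so the worker needs no inbound ports and no NAT traversal.
    (Route and helper name are hypothetical.)
    """
    req = urllib.request.Request(
        f"{relay_url}/progress/{job_id}",  # hypothetical relay route
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

The dashboard would then read these events back from the relay over an SSE stream rather than holding a WebRTC connection to the worker.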

What carries forward

| Component | Reusable | Notes |
| --- | --- | --- |
| Job modal UX (steps, worker list, YAML upload, file browser) | Yes | Transport-agnostic UI components |
| Worker registration metadata | Yes | Extend with mount_paths, visible_folders |
| Job ID generation (job_{uuid4_hex[:8]}) | Yes | Keep as-is |
| Job lifecycle states (available/busy/reserved) | Yes | Extend for relay status events |
| FileManager mount-scoped path validation | Yes | Same logic, different transport |
| JobExecutor subprocess spawning | Yes | Inject RELAY_URL + JOB_ID env vars |
| ZMQ progress reporting (training→worker) | Yes | Worker forwards to relay via HTTP POST |
| WebRTC data channel transport | No | Replaced by HTTP/SSE for dashboard |
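The job ID scheme in the table (job_{uuid4_hex[:8]}) is small enough to reproduce directly; a minimal sketch:

```python
import uuid

def generate_job_id() -> str:
    # job_{uuid4_hex[:8]}: "job_" plus the first 8 hex chars of a random UUID4
    return f"job_{uuid.uuid4().hex[:8]}"
```

Eight hex characters give 32 bits of randomness, which is plenty for distinguishing jobs within a deployment while keeping IDs short enough to read in logs.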

Test plan

  • Dashboard loads, room cards show Submit Job button
  • Job modal: select worker → upload YAML → browse filesystem → submit
  • Toast warning when submitting to room with no workers
  • pytest tests/test_smoke.py tests/test_worker_config_content.py tests/test_p2p_auth.py

🤖 Generated with Claude Code

alicup29 and others added 24 commits March 11, 2026 13:41
Workers now store the connecting peer's role from the signaling server's
forwarded offer message. When role is "client", on_datachannel skips the
AUTH_CHALLENGE/response flow and immediately marks the channel as trusted,
trusting the signaling server's JWT-based admission decision.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a "Submit Job" button to the room-actions section of each
non-expired room card. Clicking it calls openSubmitJobModal(roomId),
which is stubbed for now and will be fully implemented in subsequent
tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three-step modal wizard with worker selection (step 1), config YAML
upload (step 2), and filesystem browser (step 3), plus a status view
after job acceptance. Includes step progress indicator, worker list
with availability states, config drop zone, column file browser
scaffold, and WandB URL link. Step navigation and close wired up in
app.js. Also adds js-yaml CDN for YAML parsing in Task 5.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add _sjRenderWorkerList() which renders sj-worker-row elements with
worker name, GPU specs (gpu_model, gpu_memory_mb, cuda_version), and
a status dot (idle/busy/maintenance). Available workers are clickable;
busy/maintenance workers are rendered with .disabled class.

Add sjSelectWorker(peerId) which stores the selected worker ID, toggles
.selected on rows, and enables the Next button.

Wire _sjRenderWorkerList() into openSubmitJobModal so the list is
populated when the modal opens.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dashboard

Add _get_sleap_nn_version() helper to worker_class.py that imports
sleap_nn.__version__ with 'unknown' fallback. Wire into both registration
property blocks in worker_class.py and state_manager.py.

Display sleap-nn version in the worker selection row specs line alongside
gpu_model, gpu_memory_mb, and cuda_version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…odal Step 2

Add parseTrainingConfig(yamlText): uses js-yaml to extract batch_size,
learning_rate, max_epochs, run_name, and WandB project/entity fields
from a sleap-nn training config, with 'unknown' fallback for missing keys.

Add _sjInitDropzone(): wires dragover/dragleave/drop events on
#sj-config-dropzone and change on #sj-config-input file picker.

Add _sjHandleConfigFile(file): reads file as text, calls parseTrainingConfig,
renders to #sj-hyperparams via _sjRenderHyperparams, shows inline error on
invalid YAML, enables #sj-next-2 on success, stores raw text as _sjConfigContent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add connectToWorker(workerId, roomId, roomToken): opens a WebSocket to
the signaling server, registers as role='client', waits for registered_auth
to get ICE servers, creates RTCPeerConnection and 'job' data channel, sends
offer with role='client' so worker skips AUTH_CHALLENGE, handles answer/
candidate messages, resolves promise when data channel opens. 15 s timeout
rejects and disconnects on failure.

Add disconnectFromWorker(): closes data channel, peer connection, and WebSocket.
Update closeSubmitJobModal to call disconnectFromWorker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add sjEnterStep3(): connects to selected worker via connectToWorker(),
shows spinner while connecting, reveals file columns on success, shows
inline error on failure.

Add sendFsMessage(msg): sends text over the open data channel.

Add onDataChannelMessage(event): dispatches FS_MOUNTS_RESPONSE (renders
mount points as column 0) and FS_LIST_RESPONSE (renders directory
contents at the appropriate column index).

Add initFileBrowser(): sends FS_GET_MOUNTS to populate the first column.

Add renderColumn(entries, colIndex, hasMorePath): filters entries to
directories and .slp files only, truncates columns to the right,
renders clickable rows. Directory click sends FS_LIST_DIR and appends
next column. .slp click stores path as _sjLabelsPath, shows selected
path display, enables Submit button. Infinite scroll requests next page
when scrolled to bottom if has_more is true.

Wire Next button in Step 2 to sjEnterStep3() instead of sjGoToStep(3).
Add CSS icon prefix for directory and .slp file entries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
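The filtering rule described for renderColumn (directories and .slp files only) can be sketched as follows. The real implementation is JavaScript in the dashboard; this Python version and its entry shape are illustrative assumptions.

```python
def filter_browsable(entries: list[dict]) -> list[dict]:
    """Keep only directories and .slp label files, as renderColumn does.

    Everything else is hidden so the column browser only offers entries
    the user can either descend into or submit as a labels file.
    """
    return [
        e for e in entries
        if e.get("is_dir") or e["name"].endswith(".slp")
    ]
```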
Add submitJob(): generates a UUID job_id, builds a TrainJobSpec with
config_content, labels_path, and model_types=[], sends it as
JOB_SUBMIT::{job_id}::{json_spec} over the data channel.

Extend onDataChannelMessage to handle job lifecycle messages:
- JOB_ACCEPTED: switches modal to status view, sets label to 'Running'
- JOB_REJECTED: parses error list, shows inline error in browser view
- JOB_PROGRESS: updates status label with epoch/loss, reveals WandB
  link element if wandb_url is present in the progress payload
- JOB_COMPLETE: updates label to 'Complete'
- JOB_FAILED: updates label to 'Failed: {message}'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
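The JOB_SUBMIT wire format from this commit (JOB_SUBMIT::{job_id}::{json_spec}) can be sketched in Python; the browser side is JavaScript, and the spec field names below are taken from the commit message.

```python
import json

def frame_job_submit(job_id: str, spec: dict) -> str:
    # Wire format from the commit message: JOB_SUBMIT::{job_id}::{json_spec}
    return f"JOB_SUBMIT::{job_id}::{json.dumps(spec)}"

def parse_job_submit(message: str) -> tuple[str, dict]:
    # maxsplit=2 is essential: it keeps any "::" inside the JSON payload intact
    msg_type, job_id, payload = message.split("::", 2)
    if msg_type != "JOB_SUBMIT":
        raise ValueError(f"unexpected message type: {msg_type}")
    return job_id, json.loads(payload)
```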
Add test_js_yaml_cdn_included: asserts js-yaml CDN script is present in
index.html (required for YAML parsing in Step 2).

Add test_nav_tabs_present: asserts all four data-tab nav items (rooms,
tokens, quickstart, about) are present in index.html.

788 tests passing, black and ruff clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
openSubmitJobModal now checks roomWorkers before opening the modal.
If no workers are connected, it shows an error toast —
'No workers connected to this room. Start a worker with sleap-rtc worker
before submitting a job.' — and returns without opening the modal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug 1 — Unknown GPU on HPC:
capabilities._detect_gpu_model now falls back to nvidia-smi when torch
returns an empty string for get_device_properties().name (observed on
some HPC nodes). dashboard _sjRenderWorkerList uses || instead of ??
so empty string also triggers the 'Unknown GPU' fallback.

Bug 2 — YAML config params all 'unknown':
parseTrainingConfig now resolves trainer_config (sleap-nn key) before
trainer/root fallbacks. batch_size reads trainer_config.train_data_loader
.batch_size; learning_rate reads trainer_config.optimizer.lr.

Bug 3 — WebRTC timeout on HPC (ICE stalls at 'checking'):
mesh_coordinator._handle_client_offer was reusing self.worker.pc which
was created before registered_auth delivered ICE server credentials, so
it had no STUN/TURN config. On HPC with blocked UDP, this caused ICE to
stall indefinitely. Fix: replace self.worker.pc with a fresh connection
from _create_client_peer_connection() before setRemoteDescription.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
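The Bug 2 fix amounts to a safe nested-key lookup with an 'unknown' fallback. A minimal Python sketch (the real code is JavaScript; the key paths come from the commit message, the helper names are hypothetical):

```python
def dig(config: dict, path: str, default="unknown"):
    """Walk a dotted key path through nested dicts; return default on any miss."""
    cur = config
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return default
        cur = cur[key]
    return cur

def extract_hyperparams(config: dict) -> dict:
    # Key paths from the commit message (sleap-nn trainer_config layout).
    return {
        "batch_size": dig(config, "trainer_config.train_data_loader.batch_size"),
        "learning_rate": dig(config, "trainer_config.optimizer.lr"),
    }
```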
- Dashboard dropzone now shows the selected filename and a "replace" hint
  after a config YAML is successfully parsed, with the file input re-wired
  so the user can drop or browse a new file to replace it
- worker_class.py handle_connection now skips "offer" messages when
  admin_controller.is_admin is True; MeshCoordinator's admin WebSocket
  handles client offers (with ICE server config), so the main loop was
  sending a spurious "worker_busy" error that caused the dashboard to
  time out
- Add TestDoubleOfferGuard test to assert the admin guard is present

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of dashboard WebRTC timeout on HPC:

The admin WebSocket handler (_admin_websocket_handler_loop) had no case
for msg_type == "candidate". When a worker is admin, handle_connection
exits and all signaling goes through the admin handler. Browser clients
(dashboard) use trickle ICE and send relay/TURN candidates as separate
"candidate" messages after the offer. These were silently dropped,
so the worker never learned about the browser's reachable candidates
and ICE stayed at "checking" forever.

The GUI Python client (aiortc) embeds ALL candidates inline in the SDP
— no separate candidate messages — which is why the GUI works fine.

Also fixes in the same handler:
- Set _pending_offer_role from the offer data so handle_channel_open
  skips the PSK challenge for dashboard "client" role connections
- Add ICE diagnostic logging: server count, candidate list after
  gathering, warning when ice_servers is empty

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two fixes for the dashboard WebRTC timeout on HPC:

1. Candidate parsing bug: browser sends candidate.toJSON() as
   {"candidate": "candidate:xxx ...", "sdpMid": "0", ...} — the actual
   candidate string is nested under the "candidate" key, not as top-level
   ip/port/protocol fields. Pass the raw dict directly to addIceCandidate
   so aiortc parses the candidate string internally (same as worker_class.py).

2. Public STUN fallback: when the signaling server returns no ice_servers
   in registered_auth (current state), _create_peer_connection now falls
   back to stun.l.google.com for client connections. This lets the worker
   discover its reflexive address. TURN relay still requires server-side
   credentials, but STUN unblocks cases where the HPC node has a public IP.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
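The shape mismatch behind fix 1 is easy to see with a sketch: the browser's candidate.toJSON() nests the whole SDP candidate string under the "candidate" key, so there are no top-level ip/port fields to read. The field positions below follow the standard candidate-attribute grammar; the helper name is hypothetical.

```python
def parse_browser_candidate(msg: dict) -> dict:
    """Extract transport fields from an RTCIceCandidate.toJSON() dict.

    The candidate string lives under the "candidate" key -- the broken
    code assumed top-level ip/port/protocol fields instead.
    """
    # e.g. "candidate:1 1 udp 2122260223 192.0.2.5 54321 typ srflx ..."
    fields = msg["candidate"].split()
    return {
        "foundation": fields[0].removeprefix("candidate:"),
        "protocol": fields[2].lower(),
        "ip": fields[4],
        "port": int(fields[5]),
        "type": fields[7],  # the token after "typ"
        "sdpMid": msg.get("sdpMid"),
    }
```

In the actual fix, this parsing is left to aioice rather than done by hand, which is the robust choice.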
…ners

aiortc hardcodes stun.l.google.com:19302 as its default STUN server when
no ICE servers are explicitly configured. In RunAI/Kubernetes GPU containers
without internet access, this causes gather_candidates() to hang indefinitely
(DNS resolution blocks, and aioice's asyncio.gather waits for all coroutines).

The fix passes an explicit empty iceServers list via RTCConfiguration so
aiortc skips STUN entirely and gathers only host candidates. This is
sufficient when client and worker share the same network.

Changes:
- worker_class._create_peer_connection: always pass RTCConfiguration
  (removes STUN fallback that made things worse in containers)
- mesh_coordinator: replace 4 bare RTCPeerConnection() calls with
  worker._create_mesh_peer_connection() factory method
- worker.py standalone: pass RTCConfiguration(iceServers=[])
- Add TestNoDefaultStun tests enforcing no bare RTCPeerConnection()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
addIceCandidate expects an RTCIceCandidate, not a raw dict. Parse the
browser's candidate string via aioice Candidate.from_sdp, convert with
candidate_from_aioice, and attach sdpMid/sdpMLineIndex from the dict.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes for dashboard-to-worker connection:

1. Browser STUN fallback: When signaling server returns no ICE servers,
   the dashboard now adds stun.l.google.com:19302 so the browser discovers
   its real IP (srflx candidate) instead of only sending mDNS-obfuscated
   host candidates (.local hostnames) that the worker container can't
   resolve without multicast.

2. Diagnostic logging: Log ifaddr-discovered addresses before PC creation
   to help diagnose why worker gathers 0 host candidates in containers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The icegatheringstatechange event fires during gather() before candidates
are added to the SDP, causing a misleading "0 candidates" log. Move the
candidate log to after setLocalDescription completes so it shows the
actual answer SDP content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
STUN (stun.l.google.com) IS reachable from RunAI containers over UDP.
The earlier "no internet" assumption was wrong — DNS was slow, not blocked.

Re-enable STUN fallback when signaling server provides no ICE servers so
both worker and browser get server-reflexive candidates with public IPs,
enabling NAT hole-punching for dashboard connections.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log all signaling messages, ICE state changes, gathering state, local
candidates, and answer SDP receipt to browser console. This will show
whether the browser receives the worker's answer and forms ICE pairs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Worker returns {name, type:"directory"} but renderColumn expects
{name, path, is_dir}. Entries were being filtered out, causing the
column browser to appear empty after clicking a mount.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
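The entry-shape mismatch above could also be bridged with a small adapter; a hedged sketch (helper name and parent-path handling are assumptions, not the shipped fix):

```python
import posixpath

def normalize_entry(entry: dict, parent: str) -> dict:
    """Map the worker's {name, type} entry shape onto the {name, path, is_dir}
    shape renderColumn expects (the mismatch that left columns empty)."""
    return {
        "name": entry["name"],
        "path": posixpath.join(parent, entry["name"]),
        "is_dir": entry.get("type") == "directory",
    }
```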
@alicup29 alicup29 changed the title feat: add dashboard job submission workflow Add Dashboard Job Submission Workflow Mar 12, 2026
@alicup29 alicup29 merged commit 281839e into main Mar 12, 2026
8 checks passed
@alicup29 alicup29 deleted the amick/add-dashboard-job-submission branch March 12, 2026 21:10