
Add Dashboard Job Submission Workflow#63

Merged
alicup29 merged 24 commits into main from amick/add-dashboard-job-submission
Mar 12, 2026

Conversation

@alicup29
Collaborator

@alicup29 alicup29 commented Mar 12, 2026

Summary

Adds browser-based job submission to the SLEAP-RTC dashboard — the first end-to-end workflow for submitting training jobs from a web browser to remote GPU workers. This branch validates the full job submission UX and surfaces the NAT traversal limitations that motivate the relay architecture described below.

Dashboard Features

Job submission modal

Multi-step wizard (Select Worker → Configure Job → Browse Files → Submit) embedded in the dashboard. Workers are listed with live GPU specs, CUDA version, sleap-nn version, and idle/busy status. Users upload a sleap-nn config YAML (parsed client-side with js-yaml), browse the worker's filesystem in a column-style file browser to select a .slp labels file, and submit — all from the browser.

Browser-to-worker WebRTC connection

The dashboard establishes a direct RTCPeerConnection + RTCDataChannel to the selected worker, routed through the signaling server's admin WebSocket for SDP exchange. Job submission (JOB_SUBMIT), filesystem browsing (FS_LIST_DIR / FS_LIST_RESPONSE), and progress reporting all flow over this data channel. PSK challenge is bypassed for dashboard peers since they're already authenticated via GitHub OAuth.

Worker metadata in registration

Workers now report sleap_nn_version in their registration metadata alongside GPU model, memory, CUDA version, and hostname. The dashboard renders this in both the room card worker badges and the job modal worker list.

Motivation for relay architecture

During HPC testing, this branch hit the expected NAT traversal problems: symmetric NAT on HPC clusters requires TURN relays for WebRTC, trickle ICE candidate parsing needed multiple fix rounds, and STUN fallback behavior varied across container runtimes. The commit history tells the story — 8 of 22 commits are ICE/STUN debugging fixes.

This validates the architectural decision to move dashboard communication away from direct P2P WebRTC toward the signaling server + relay server pattern documented in scratch/sleap-rtc-dashboard-architecture.md:

  • Workers connect outbound to the signaling server via persistent WebSocket (no inbound ports needed)
  • Dashboard submits jobs via HTTP POST to the signaling server, which forwards to workers over WebSocket
  • Training progress is published outbound from the training process to a relay server via HTTP POST
  • Dashboard monitors via SSE streams from the relay server (auto-reconnect, no WebSocket upgrade)
  • Filesystem browsing routes through signaling → worker WS → relay → SSE, reusing the same SSE connection

All communication is outbound-only from workers/training processes, eliminating NAT traversal entirely. The relay server is a new ~100-line component; the signaling server gains job dispatch and filesystem routing endpoints. The existing worker internals (job execution, filesystem listing, progress monitoring via ZMQ) are transport-agnostic and carry forward unchanged.
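The outbound-only progress leg can be sketched as a plain HTTP POST from the training process to the relay. This is an illustrative sketch only: the `post_progress` helper, the `RELAY_URL` base, and the `/progress/{job_id}` route are assumptions, not the actual relay API.

```python
import json
import urllib.request

def post_progress(relay_url: str, job_id: str, payload: dict) -> int:
    """POST one progress event outbound to the relay server.

    Outbound-only: the training process opens the connection itself,
    so the worker needs no inbound ports and no NAT traversal.
    (Route and helper name are hypothetical.)
    """
    req = urllib.request.Request(
        f"{relay_url}/progress/{job_id}",  # hypothetical relay route
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

The dashboard would then read these events back from the relay over an SSE stream rather than holding a WebRTC connection to the worker.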

What carries forward

| Component | Reusable | Notes |
| --- | --- | --- |
| Job modal UX (steps, worker list, YAML upload, file browser) | Yes | Transport-agnostic UI components |
| Worker registration metadata | Yes | Extend with mount_paths, visible_folders |
| Job ID generation (job_{uuid4_hex[:8]}) | Yes | Keep as-is |
| Job lifecycle states (available/busy/reserved) | Yes | Extend for relay status events |
| FileManager mount-scoped path validation | Yes | Same logic, different transport |
| JobExecutor subprocess spawning | Yes | Inject RELAY_URL + JOB_ID env vars |
| ZMQ progress reporting (training→worker) | Yes | Worker forwards to relay via HTTP POST |
| WebRTC data channel transport | No | Replaced by HTTP/SSE for dashboard |
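The job ID scheme in the table (job_{uuid4_hex[:8]}) is small enough to reproduce directly; a minimal sketch:

```python
import uuid

def generate_job_id() -> str:
    # job_{uuid4_hex[:8]}: "job_" plus the first 8 hex chars of a random UUID4
    return f"job_{uuid.uuid4().hex[:8]}"
```

Eight hex characters give 32 bits of randomness, which is plenty for distinguishing jobs within a deployment while keeping IDs short enough to read in logs.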

Test plan

  • Dashboard loads, room cards show Submit Job button
  • Job modal: select worker → upload YAML → browse filesystem → submit
  • Toast warning when submitting to room with no workers
  • pytest tests/test_smoke.py tests/test_worker_config_content.py tests/test_p2p_auth.py

🤖 Generated with Claude Code

alicup29 and others added 24 commits March 11, 2026 13:41
Workers now store the connecting peer's role from the signaling server's
forwarded offer message. When role is "client", on_datachannel skips the
AUTH_CHALLENGE/response flow and immediately marks the channel as trusted,
trusting the signaling server's JWT-based admission decision.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a "Submit Job" button to the room-actions section of each
non-expired room card. Clicking it calls openSubmitJobModal(roomId),
which is stubbed for now and will be fully implemented in subsequent
tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three-step modal wizard with worker selection (step 1), config YAML
upload (step 2), and filesystem browser (step 3), plus a status view
after job acceptance. Includes step progress indicator, worker list
with availability states, config drop zone, column file browser
scaffold, and WandB URL link. Step navigation and close wired up in
app.js. Also adds js-yaml CDN for YAML parsing in Task 5.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add _sjRenderWorkerList() which renders sj-worker-row elements with
worker name, GPU specs (gpu_model, gpu_memory_mb, cuda_version), and
a status dot (idle/busy/maintenance). Available workers are clickable;
busy/maintenance workers are rendered with .disabled class.

Add sjSelectWorker(peerId) which stores the selected worker ID, toggles
.selected on rows, and enables the Next button.

Wire _sjRenderWorkerList() into openSubmitJobModal so the list is
populated when the modal opens.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dashboard

Add _get_sleap_nn_version() helper to worker_class.py that imports
sleap_nn.__version__ with 'unknown' fallback. Wire into both registration
property blocks in worker_class.py and state_manager.py.

Display sleap-nn version in the worker selection row specs line alongside
gpu_model, gpu_memory_mb, and cuda_version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…odal Step 2

Add parseTrainingConfig(yamlText): uses js-yaml to extract batch_size,
learning_rate, max_epochs, run_name, and WandB project/entity fields
from a sleap-nn training config, with 'unknown' fallback for missing keys.

Add _sjInitDropzone(): wires dragover/dragleave/drop events on
#sj-config-dropzone and change on #sj-config-input file picker.

Add _sjHandleConfigFile(file): reads file as text, calls parseTrainingConfig,
renders to #sj-hyperparams via _sjRenderHyperparams, shows inline error on
invalid YAML, enables #sj-next-2 on success, stores raw text as _sjConfigContent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add connectToWorker(workerId, roomId, roomToken): opens a WebSocket to
the signaling server, registers as role='client', waits for registered_auth
to get ICE servers, creates RTCPeerConnection and 'job' data channel, sends
offer with role='client' so worker skips AUTH_CHALLENGE, handles answer/
candidate messages, resolves promise when data channel opens. 15 s timeout
rejects and disconnects on failure.

Add disconnectFromWorker(): closes data channel, peer connection, and WebSocket.
Update closeSubmitJobModal to call disconnectFromWorker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add sjEnterStep3(): connects to selected worker via connectToWorker(),
shows spinner while connecting, reveals file columns on success, shows
inline error on failure.

Add sendFsMessage(msg): sends text over the open data channel.

Add onDataChannelMessage(event): dispatches FS_MOUNTS_RESPONSE (renders
mount points as column 0) and FS_LIST_RESPONSE (renders directory
contents at the appropriate column index).

Add initFileBrowser(): sends FS_GET_MOUNTS to populate the first column.

Add renderColumn(entries, colIndex, hasMorePath): filters entries to
directories and .slp files only, truncates columns to the right,
renders clickable rows. Directory click sends FS_LIST_DIR and appends
next column. .slp click stores path as _sjLabelsPath, shows selected
path display, enables Submit button. Infinite scroll requests next page
when scrolled to bottom if has_more is true.

Wire Next button in Step 2 to sjEnterStep3() instead of sjGoToStep(3).
Add CSS icon prefix for directory and .slp file entries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
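The filtering rule described for renderColumn (directories and .slp files only) can be sketched as follows. The real implementation is JavaScript in the dashboard; this Python version and its entry shape are illustrative assumptions.

```python
def filter_browsable(entries: list[dict]) -> list[dict]:
    """Keep only directories and .slp label files, as renderColumn does.

    Everything else is hidden so the column browser only offers entries
    the user can either descend into or submit as a labels file.
    """
    return [
        e for e in entries
        if e.get("is_dir") or e["name"].endswith(".slp")
    ]
```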
Add submitJob(): generates a UUID job_id, builds a TrainJobSpec with
config_content, labels_path, and model_types=[], sends it as
JOB_SUBMIT::{job_id}::{json_spec} over the data channel.

Extend onDataChannelMessage to handle job lifecycle messages:
- JOB_ACCEPTED: switches modal to status view, sets label to 'Running'
- JOB_REJECTED: parses error list, shows inline error in browser view
- JOB_PROGRESS: updates status label with epoch/loss, reveals WandB
  link element if wandb_url is present in the progress payload
- JOB_COMPLETE: updates label to 'Complete'
- JOB_FAILED: updates label to 'Failed: {message}'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
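The JOB_SUBMIT wire format from this commit (JOB_SUBMIT::{job_id}::{json_spec}) can be sketched in Python; the browser side is JavaScript, and the spec field names below are taken from the commit message.

```python
import json

def frame_job_submit(job_id: str, spec: dict) -> str:
    # Wire format from the commit message: JOB_SUBMIT::{job_id}::{json_spec}
    return f"JOB_SUBMIT::{job_id}::{json.dumps(spec)}"

def parse_job_submit(message: str) -> tuple[str, dict]:
    # maxsplit=2 is essential: it keeps any "::" inside the JSON payload intact
    msg_type, job_id, payload = message.split("::", 2)
    if msg_type != "JOB_SUBMIT":
        raise ValueError(f"unexpected message type: {msg_type}")
    return job_id, json.loads(payload)
```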
Add test_js_yaml_cdn_included: asserts js-yaml CDN script is present in
index.html (required for YAML parsing in Step 2).

Add test_nav_tabs_present: asserts all four data-tab nav items (rooms,
tokens, quickstart, about) are present in index.html.

788 tests passing, black and ruff clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
openSubmitJobModal now checks roomWorkers before opening the modal.
If no workers are connected, it shows an error toast —
'No workers connected to this room. Start a worker with sleap-rtc worker
before submitting a job.' — and returns without opening the modal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug 1 — Unknown GPU on HPC:
capabilities._detect_gpu_model now falls back to nvidia-smi when torch
returns an empty string for get_device_properties().name (observed on
some HPC nodes). dashboard _sjRenderWorkerList uses || instead of ??
so empty string also triggers the 'Unknown GPU' fallback.

Bug 2 — YAML config params all 'unknown':
parseTrainingConfig now resolves trainer_config (sleap-nn key) before
trainer/root fallbacks. batch_size reads trainer_config.train_data_loader
.batch_size; learning_rate reads trainer_config.optimizer.lr.

Bug 3 — WebRTC timeout on HPC (ICE stalls at 'checking'):
mesh_coordinator._handle_client_offer was reusing self.worker.pc which
was created before registered_auth delivered ICE server credentials, so
it had no STUN/TURN config. On HPC with blocked UDP, this caused ICE to
stall indefinitely. Fix: replace self.worker.pc with a fresh connection
from _create_client_peer_connection() before setRemoteDescription.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
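The Bug 2 fix amounts to a safe nested-key lookup with an 'unknown' fallback. A minimal Python sketch (the real code is JavaScript; the key paths come from the commit message, the helper names are hypothetical):

```python
def dig(config: dict, path: str, default="unknown"):
    """Walk a dotted key path through nested dicts; return default on any miss."""
    cur = config
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return default
        cur = cur[key]
    return cur

def extract_hyperparams(config: dict) -> dict:
    # Key paths from the commit message (sleap-nn trainer_config layout).
    return {
        "batch_size": dig(config, "trainer_config.train_data_loader.batch_size"),
        "learning_rate": dig(config, "trainer_config.optimizer.lr"),
    }
```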
- Dashboard dropzone now shows the selected filename and a "replace" hint
  after a config YAML is successfully parsed, with the file input re-wired
  so the user can drop or browse a new file to replace it
- worker_class.py handle_connection now skips "offer" messages when
  admin_controller.is_admin is True; MeshCoordinator's admin WebSocket
  handles client offers (with ICE server config), so the main loop was
  sending a spurious "worker_busy" error that caused the dashboard to
  time out
- Add TestDoubleOfferGuard test to assert the admin guard is present

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of dashboard WebRTC timeout on HPC:

The admin WebSocket handler (_admin_websocket_handler_loop) had no case
for msg_type == "candidate". When a worker is admin, handle_connection
exits and all signaling goes through the admin handler. Browser clients
(dashboard) use trickle ICE and send relay/TURN candidates as separate
"candidate" messages after the offer. These were silently dropped,
so the worker never learned about the browser's reachable candidates
and ICE stayed at "checking" forever.

The GUI Python client (aiortc) embeds ALL candidates inline in the SDP
— no separate candidate messages — which is why the GUI works fine.

Also fixes in the same handler:
- Set _pending_offer_role from the offer data so handle_channel_open
  skips the PSK challenge for dashboard "client" role connections
- Add ICE diagnostic logging: server count, candidate list after
  gathering, warning when ice_servers is empty

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two fixes for the dashboard WebRTC timeout on HPC:

1. Candidate parsing bug: browser sends candidate.toJSON() as
   {"candidate": "candidate:xxx ...", "sdpMid": "0", ...} — the actual
   candidate string is nested under the "candidate" key, not as top-level
   ip/port/protocol fields. Pass the raw dict directly to addIceCandidate
   so aiortc parses the candidate string internally (same as worker_class.py).

2. Public STUN fallback: when the signaling server returns no ice_servers
   in registered_auth (current state), _create_peer_connection now falls
   back to stun.l.google.com for client connections. This lets the worker
   discover its reflexive address. TURN relay still requires server-side
   credentials, but STUN unblocks cases where the HPC node has a public IP.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
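The shape mismatch behind fix 1 is easy to see with a sketch: the browser's candidate.toJSON() nests the whole SDP candidate string under the "candidate" key, so there are no top-level ip/port fields to read. The field positions below follow the standard candidate-attribute grammar; the helper name is hypothetical.

```python
def parse_browser_candidate(msg: dict) -> dict:
    """Extract transport fields from an RTCIceCandidate.toJSON() dict.

    The candidate string lives under the "candidate" key -- the broken
    code assumed top-level ip/port/protocol fields instead.
    """
    # e.g. "candidate:1 1 udp 2122260223 192.0.2.5 54321 typ srflx ..."
    fields = msg["candidate"].split()
    return {
        "foundation": fields[0].removeprefix("candidate:"),
        "protocol": fields[2].lower(),
        "ip": fields[4],
        "port": int(fields[5]),
        "type": fields[7],  # the token after "typ"
        "sdpMid": msg.get("sdpMid"),
    }
```

In the actual fix, this parsing is left to aioice rather than done by hand, which is the robust choice.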
…ners

aiortc hardcodes stun.l.google.com:19302 as its default STUN server when
no ICE servers are explicitly configured. In RunAI/Kubernetes GPU containers
without internet access, this causes gather_candidates() to hang indefinitely
(DNS resolution blocks, and aioice's asyncio.gather waits for all coroutines).

The fix passes an explicit empty iceServers list via RTCConfiguration so
aiortc skips STUN entirely and gathers only host candidates. This is
sufficient when client and worker share the same network.

Changes:
- worker_class._create_peer_connection: always pass RTCConfiguration
  (removes STUN fallback that made things worse in containers)
- mesh_coordinator: replace 4 bare RTCPeerConnection() calls with
  worker._create_mesh_peer_connection() factory method
- worker.py standalone: pass RTCConfiguration(iceServers=[])
- Add TestNoDefaultStun tests enforcing no bare RTCPeerConnection()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
addIceCandidate expects an RTCIceCandidate, not a raw dict. Parse the
browser's candidate string via aioice Candidate.from_sdp, convert with
candidate_from_aioice, and attach sdpMid/sdpMLineIndex from the dict.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes for dashboard-to-worker connection:

1. Browser STUN fallback: When signaling server returns no ICE servers,
   the dashboard now adds stun.l.google.com:19302 so the browser discovers
   its real IP (srflx candidate) instead of only sending mDNS-obfuscated
   host candidates (.local hostnames) that the worker container can't
   resolve without multicast.

2. Diagnostic logging: Log ifaddr-discovered addresses before PC creation
   to help diagnose why worker gathers 0 host candidates in containers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The icegatheringstatechange event fires during gather() before candidates
are added to the SDP, causing a misleading "0 candidates" log. Move the
candidate log to after setLocalDescription completes so it shows the
actual answer SDP content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
STUN (stun.l.google.com) IS reachable from RunAI containers over UDP.
The earlier "no internet" assumption was wrong — DNS was slow, not blocked.

Re-enable STUN fallback when signaling server provides no ICE servers so
both worker and browser get server-reflexive candidates with public IPs,
enabling NAT hole-punching for dashboard connections.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log all signaling messages, ICE state changes, gathering state, local
candidates, and answer SDP receipt to browser console. This will show
whether the browser receives the worker's answer and forms ICE pairs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Worker returns {name, type:"directory"} but renderColumn expects
{name, path, is_dir}. Entries were being filtered out, causing the
column browser to appear empty after clicking a mount.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
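The entry-shape mismatch above could also be bridged with a small adapter; a hedged sketch (helper name and parent-path handling are assumptions, not the shipped fix):

```python
import posixpath

def normalize_entry(entry: dict, parent: str) -> dict:
    """Map the worker's {name, type} entry shape onto the {name, path, is_dir}
    shape renderColumn expects (the mismatch that left columns empty)."""
    return {
        "name": entry["name"],
        "path": posixpath.join(parent, entry["name"]),
        "is_dir": entry.get("type") == "directory",
    }
```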
@alicup29 alicup29 changed the title feat: add dashboard job submission workflow Add Dashboard Job Submission Workflow Mar 12, 2026
@alicup29 alicup29 merged commit 281839e into main Mar 12, 2026
8 checks passed
@alicup29 alicup29 deleted the amick/add-dashboard-job-submission branch March 12, 2026 21:10