Workers now store the connecting peer's role from the signaling server's forwarded offer message. When role is "client", on_datachannel skips the AUTH_CHALLENGE/response flow and immediately marks the channel as trusted, trusting the signaling server's JWT-based admission decision. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
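The role-based trust decision described above can be sketched as follows. This is an illustrative Python sketch, not the actual `on_datachannel` code; the function and callback names here are hypothetical.

```python
# Sketch of the role-based trust decision described above. Names are
# illustrative; the real handler lives in the worker's on_datachannel.
def on_datachannel_open(channel, pending_offer_role, start_auth_challenge, mark_trusted):
    """Trust 'client' peers admitted by the signaling server's JWT check;
    run the PSK AUTH_CHALLENGE flow for everyone else."""
    if pending_offer_role == "client":
        mark_trusted(channel)      # admission already validated server-side
        return "trusted"
    start_auth_challenge(channel)  # fall back to the PSK challenge
    return "challenged"
```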
Adds a "Submit Job" button to the room-actions section of each non-expired room card. Clicking it calls openSubmitJobModal(roomId), which is stubbed for now and will be fully implemented in subsequent tasks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three-step modal wizard with worker selection (step 1), config YAML upload (step 2), and filesystem browser (step 3), plus a status view after job acceptance. Includes step progress indicator, worker list with availability states, config drop zone, column file browser scaffold, and WandB URL link. Step navigation and close wired up in app.js. Also adds js-yaml CDN for YAML parsing in Task 5. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add _sjRenderWorkerList() which renders sj-worker-row elements with worker name, GPU specs (gpu_model, gpu_memory_mb, cuda_version), and a status dot (idle/busy/maintenance). Available workers are clickable; busy/maintenance workers are rendered with .disabled class. Add sjSelectWorker(peerId) which stores the selected worker ID, toggles .selected on rows, and enables the Next button. Wire _sjRenderWorkerList() into openSubmitJobModal so the list is populated when the modal opens. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dashboard Add _get_sleap_nn_version() helper to worker_class.py that imports sleap_nn.__version__ with 'unknown' fallback. Wire into both registration property blocks in worker_class.py and state_manager.py. Display sleap-nn version in the worker selection row specs line alongside gpu_model, gpu_memory_mb, and cuda_version. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…odal Step 2 Add parseTrainingConfig(yamlText): uses js-yaml to extract batch_size, learning_rate, max_epochs, run_name, and WandB project/entity fields from a sleap-nn training config, with 'unknown' fallback for missing keys. Add _sjInitDropzone(): wires dragover/dragleave/drop events on #sj-config-dropzone and change on #sj-config-input file picker. Add _sjHandleConfigFile(file): reads file as text, calls parseTrainingConfig, renders to #sj-hyperparams via _sjRenderHyperparams, shows inline error on invalid YAML, enables #sj-next-2 on success, stores raw text as _sjConfigContent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add connectToWorker(workerId, roomId, roomToken): opens a WebSocket to the signaling server, registers as role='client', waits for registered_auth to get ICE servers, creates RTCPeerConnection and 'job' data channel, sends offer with role='client' so worker skips AUTH_CHALLENGE, handles answer/candidate messages, resolves promise when data channel opens. 15 s timeout rejects and disconnects on failure. Add disconnectFromWorker(): closes data channel, peer connection, and WebSocket. Update closeSubmitJobModal to call disconnectFromWorker. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
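The connect-then-timeout-and-disconnect pattern above can be sketched in Python with `asyncio.wait_for`; the real implementation is browser JavaScript racing a Promise against a timer, so this is an analogue under that assumption.

```python
import asyncio

# Illustrative analogue of the 15 s connect timeout described above: resolve
# when the data channel opens, or tear everything down and re-raise on timeout.
async def connect_with_timeout(open_channel, disconnect, timeout=15.0):
    try:
        return await asyncio.wait_for(open_channel(), timeout)
    except asyncio.TimeoutError:
        await disconnect()  # close data channel, peer connection, WebSocket
        raise
```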
Add sjEnterStep3(): connects to selected worker via connectToWorker(), shows spinner while connecting, reveals file columns on success, shows inline error on failure. Add sendFsMessage(msg): sends text over the open data channel. Add onDataChannelMessage(event): dispatches FS_MOUNTS_RESPONSE (renders mount points as column 0) and FS_LIST_RESPONSE (renders directory contents at the appropriate column index). Add initFileBrowser(): sends FS_GET_MOUNTS to populate the first column. Add renderColumn(entries, colIndex, hasMorePath): filters entries to directories and .slp files only, truncates columns to the right, renders clickable rows. Directory click sends FS_LIST_DIR and appends next column. .slp click stores path as _sjLabelsPath, shows selected path display, enables Submit button. Infinite scroll requests next page when scrolled to bottom if has_more is true. Wire Next button in Step 2 to sjEnterStep3() instead of sjGoToStep(3). Add CSS icon prefix for directory and .slp file entries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
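The filesystem-browser dispatch above can be sketched like this. The `TYPE::{json}` wire framing is an assumption for illustration (the commit only names the message types); the real dispatcher is `onDataChannelMessage` in the dashboard JS.

```python
import json

# Sketch of the FS message dispatch described above. The "TYPE::{json}"
# framing is assumed; only the message-type names come from the commit.
def dispatch_fs_message(raw, render_column, state):
    msg_type, _, payload = raw.partition("::")
    data = json.loads(payload) if payload else {}
    if msg_type == "FS_MOUNTS_RESPONSE":
        render_column(data.get("mounts", []), 0)   # mounts become column 0
    elif msg_type == "FS_LIST_RESPONSE":
        render_column(data.get("entries", []), state.get("next_col", 1))
    return msg_type
```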
Add submitJob(): generates a UUID job_id, builds a TrainJobSpec with
config_content, labels_path, and model_types=[], sends it as
JOB_SUBMIT::{job_id}::{json_spec} over the data channel.
Extend onDataChannelMessage to handle job lifecycle messages:
- JOB_ACCEPTED: switches modal to status view, sets label to 'Running'
- JOB_REJECTED: parses error list, shows inline error in browser view
- JOB_PROGRESS: updates status label with epoch/loss, reveals WandB
link element if wandb_url is present in the progress payload
- JOB_COMPLETE: updates label to 'Complete'
- JOB_FAILED: updates label to 'Failed: {message}'
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
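The JOB_SUBMIT framing and status handling above can be sketched as follows. The spec fields and the `JOB_SUBMIT::{job_id}::{json_spec}` frame come from the commit message; the `TYPE::payload` shape for the lifecycle replies and the helper names are assumptions for illustration.

```python
import json
import uuid

# Sketch of the job-submit framing described above (real code is submitJob
# in the dashboard JS).
def build_job_submit(config_content, labels_path):
    job_id = uuid.uuid4().hex
    spec = {"config_content": config_content,
            "labels_path": labels_path,
            "model_types": []}
    return job_id, f"JOB_SUBMIT::{job_id}::{json.dumps(spec)}"

# Sketch of the status-label mapping; the "TYPE::payload" framing for
# lifecycle messages is an assumption.
def job_status_label(raw):
    msg_type, _, payload = raw.partition("::")
    if msg_type == "JOB_ACCEPTED":
        return "Running"
    if msg_type == "JOB_COMPLETE":
        return "Complete"
    if msg_type == "JOB_FAILED":
        return f"Failed: {payload}"
    return None
```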
Add test_js_yaml_cdn_included: asserts js-yaml CDN script is present in index.html (required for YAML parsing in Step 2). Add test_nav_tabs_present: asserts all four data-tab nav items (rooms, tokens, quickstart, about) are present in index.html. 788 tests passing, black and ruff clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
openSubmitJobModal now checks roomWorkers before opening the modal. If no workers are connected, it shows an error toast — 'No workers connected to this room. Start a worker with sleap-rtc worker before submitting a job.' — and returns without opening the modal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug 1 — Unknown GPU on HPC: capabilities._detect_gpu_model now falls back to nvidia-smi when torch returns an empty string for get_device_properties().name (observed on some HPC nodes). The dashboard's _sjRenderWorkerList uses || instead of ?? so an empty string also triggers the 'Unknown GPU' fallback.
Bug 2 — YAML config params all 'unknown': parseTrainingConfig now resolves trainer_config (the sleap-nn key) before the trainer/root fallbacks. batch_size reads trainer_config.train_data_loader.batch_size; learning_rate reads trainer_config.optimizer.lr.
Bug 3 — WebRTC timeout on HPC (ICE stalls at 'checking'): mesh_coordinator._handle_client_offer was reusing self.worker.pc, which was created before registered_auth delivered ICE server credentials, so it had no STUN/TURN config. On HPC with blocked UDP, this caused ICE to stall indefinitely. Fix: replace self.worker.pc with a fresh connection from _create_client_peer_connection() before setRemoteDescription. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
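The trainer_config-first key resolution from Bug 2 can be sketched in Python on an already-parsed config dict; the real code is `parseTrainingConfig` in the dashboard JS, so this analogue and its function name are illustrative.

```python
# Sketch of the trainer_config-first resolution described in Bug 2:
# prefer the sleap-nn "trainer_config" key, then "trainer", then the root.
def resolve_params(cfg):
    trainer = cfg.get("trainer_config") or cfg.get("trainer") or cfg
    return {
        "batch_size": (trainer.get("train_data_loader") or {}).get("batch_size", "unknown"),
        "learning_rate": (trainer.get("optimizer") or {}).get("lr", "unknown"),
        "max_epochs": trainer.get("max_epochs", "unknown"),
    }
```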
- Dashboard dropzone now shows the selected filename and a "replace" hint after a config YAML is successfully parsed, with the file input re-wired so the user can drop or browse a new file to replace it - worker_class.py handle_connection now skips "offer" messages when admin_controller.is_admin is True; MeshCoordinator's admin WebSocket handles client offers (with ICE server config), so the main loop was sending a spurious "worker_busy" error that caused the dashboard to time out - Add TestDoubleOfferGuard test to assert the admin guard is present Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of dashboard WebRTC timeout on HPC: The admin WebSocket handler (_admin_websocket_handler_loop) had no case for msg_type == "candidate". When a worker is admin, handle_connection exits and all signaling goes through the admin handler. Browser clients (dashboard) use trickle ICE and send relay/TURN candidates as separate "candidate" messages after the offer. These were silently dropped, so the worker never learned about the browser's reachable candidates and ICE stayed at "checking" forever. The GUI Python client (aiortc) embeds ALL candidates inline in the SDP — no separate candidate messages — which is why the GUI works fine.
Also fixes in the same handler:
- Set _pending_offer_role from the offer data so handle_channel_open skips the PSK challenge for dashboard "client" role connections
- Add ICE diagnostic logging: server count, candidate list after gathering, warning when ice_servers is empty
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two fixes for the dashboard WebRTC timeout on HPC:
1. Candidate parsing bug: browser sends candidate.toJSON() as
{"candidate": "candidate:xxx ...", "sdpMid": "0", ...} — the actual
candidate string is nested under the "candidate" key, not as top-level
ip/port/protocol fields. Pass the raw dict directly to addIceCandidate
so aiortc parses the candidate string internally (same as worker_class.py).
2. Public STUN fallback: when the signaling server returns no ice_servers
in registered_auth (current state), _create_peer_connection now falls
back to stun.l.google.com for client connections. This lets the worker
discover its reflexive address. TURN relay still requires server-side
credentials, but STUN unblocks cases where the HPC node has a public IP.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ners aiortc hardcodes stun.l.google.com:19302 as its default STUN server when no ICE servers are explicitly configured. In RunAI/Kubernetes GPU containers without internet access, this causes gather_candidates() to hang indefinitely (DNS resolution blocks, and aioice's asyncio.gather waits for all coroutines). The fix passes an explicit empty iceServers list via RTCConfiguration so aiortc skips STUN entirely and gathers only host candidates. This is sufficient when client and worker share the same network.
Changes:
- worker_class._create_peer_connection: always pass RTCConfiguration (removes STUN fallback that made things worse in containers)
- mesh_coordinator: replace 4 bare RTCPeerConnection() calls with worker._create_mesh_peer_connection() factory method
- worker.py standalone: pass RTCConfiguration(iceServers=[])
- Add TestNoDefaultStun tests enforcing no bare RTCPeerConnection()
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
addIceCandidate expects an RTCIceCandidate, not a raw dict. Parse the browser's candidate string via aioice Candidate.from_sdp, convert with candidate_from_aioice, and attach sdpMid/sdpMLineIndex from the dict. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
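For illustration, the fields inside a browser candidate string can be pulled apart with plain string handling; the actual fix uses aioice's Candidate.from_sdp plus candidate_from_aioice as described above, so this stdlib parser is only a sketch of what that conversion reads.

```python
# Minimal stdlib parse of an SDP candidate string, for illustration only;
# the worker delegates this to aioice's Candidate.from_sdp.
# Format: "candidate:<foundation> <component> <protocol> <priority> <ip> <port> typ <type> ..."
def parse_candidate_fields(sdp):
    parts = sdp.split()
    return {
        "foundation": parts[0].split(":", 1)[1],
        "component": int(parts[1]),
        "protocol": parts[2].lower(),
        "priority": int(parts[3]),
        "ip": parts[4],
        "port": int(parts[5]),
        "type": parts[7],
    }
```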
Two fixes for dashboard-to-worker connection: 1. Browser STUN fallback: When signaling server returns no ICE servers, the dashboard now adds stun.l.google.com:19302 so the browser discovers its real IP (srflx candidate) instead of only sending mDNS-obfuscated host candidates (.local hostnames) that the worker container can't resolve without multicast. 2. Diagnostic logging: Log ifaddr-discovered addresses before PC creation to help diagnose why worker gathers 0 host candidates in containers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The icegatheringstatechange event fires during gather() before candidates are added to the SDP, causing a misleading "0 candidates" log. Move the candidate log to after setLocalDescription completes so it shows the actual answer SDP content. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
STUN (stun.l.google.com) IS reachable from RunAI containers over UDP. The earlier "no internet" assumption was wrong — DNS was slow, not blocked. Re-enable STUN fallback when signaling server provides no ICE servers so both worker and browser get server-reflexive candidates with public IPs, enabling NAT hole-punching for dashboard connections. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log all signaling messages, ICE state changes, gathering state, local candidates, and answer SDP receipt to browser console. This will show whether the browser receives the worker's answer and forms ICE pairs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Worker returns {name, type:"directory"} but renderColumn expects
{name, path, is_dir}. Entries were being filtered out, causing the
column browser to appear empty after clicking a mount.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
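The fix above implies an adapter between the two entry shapes; a sketch of that normalization follows. Joining against a parent path is an assumption (the commit only states the two shapes), and the function name is illustrative.

```python
import posixpath

# Sketch of the entry-shape adapter implied above: normalize the worker's
# {name, type} entries into the {name, path, is_dir} shape renderColumn
# expects. The parent_path join is an assumption for illustration.
def normalize_entry(entry, parent_path):
    return {
        "name": entry["name"],
        "path": posixpath.join(parent_path, entry["name"]),
        "is_dir": entry.get("type") == "directory",
    }
```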
Summary
Adds browser-based job submission to the SLEAP-RTC dashboard — the first end-to-end workflow for submitting training jobs from a web browser to remote GPU workers. This branch validates the full job submission UX and surfaces the NAT traversal limitations that motivate the relay architecture described below.
Dashboard Features
Job submission modal
Multi-step wizard (Select Worker → Configure Job → Browse Files → Submit) embedded in the dashboard. Workers are listed with live GPU specs, CUDA version, sleap-nn version, and idle/busy status. Users upload a sleap-nn config YAML (parsed client-side with js-yaml), browse the worker's filesystem in a column-style file browser to select a .slp labels file, and submit — all from the browser.
Browser-to-worker WebRTC connection
The dashboard establishes a direct RTCPeerConnection + RTCDataChannel to the selected worker, routed through the signaling server's admin WebSocket for SDP exchange. Job submission (JOB_SUBMIT), filesystem browsing (FS_LIST_DIR/FS_LIST_RESPONSE), and progress reporting all flow over this data channel. PSK challenge is bypassed for dashboard peers since they're already authenticated via GitHub OAuth.
Worker metadata in registration
Workers now report sleap_nn_version in their registration metadata alongside GPU model, memory, CUDA version, and hostname. The dashboard renders this in both the room card worker badges and the job modal worker list.
Motivation for relay architecture
During HPC testing, this branch hit the expected NAT traversal problems: symmetric NAT on HPC clusters requires TURN relays for WebRTC, trickle ICE candidate parsing needed multiple fix rounds, and STUN fallback behavior varied across container runtimes. The commit history tells the story — 8 of 22 commits are ICE/STUN debugging fixes.
This validates the architectural decision to move dashboard communication away from direct P2P WebRTC toward the signaling server + relay server pattern documented in scratch/sleap-rtc-dashboard-architecture.md: All communication is outbound-only from workers/training processes, eliminating NAT traversal entirely. The relay server is a new ~100-line component; the signaling server gains job dispatch and filesystem routing endpoints. The existing worker internals (job execution, filesystem listing, progress monitoring via ZMQ) are transport-agnostic and carry forward unchanged.
What carries forward
- mount_paths, visible_folders
- Job ID format (job_{uuid4_hex[:8]})
- FileManager mount-scoped path validation
- JobExecutor subprocess spawning
- RELAY_URL + JOB_ID env vars
Test plan
pytest tests/test_smoke.py tests/test_worker_config_content.py tests/test_p2p_auth.py
🤖 Generated with Claude Code