Dashboard inference job submission by alicup29 · Pull Request #76 · talmolab/sleap-rtc

alicup29 · 2026-04-01T19:49:49Z

Summary

Adds inference (track) job submission to the dashboard, letting users run sleap-nn inference on remote workers directly from the web UI.

Motivation

Previously the dashboard only supported training job submission. Users who wanted to run inference on trained models had to use the SLEAP GUI, sleap-app, or the CLI. This PR enables the full inference workflow from the dashboard: select models, select data, submit, and monitor — matching the training job experience.

Key Changes

Dashboard — Job type selector

Training/Inference toggle in Step 1 of the job submission wizard
Step indicator adapts: 3 steps for training (worker → config → files), 2 for inference (worker → files)
Info box explains the inference flow when selected

Dashboard — Inference file selection

Combined view for selecting model checkpoint directories and a .slp data file
Reuses the existing SSE relay file browser with inference-specific routing
Model paths shown as cards with remove buttons, "+ Add Another Model" browse button
"Select Folder" / "Select File" button appears on click (no double-click required)
Submit enabled only when both models and data file are selected

Dashboard — Inference progress

Status view shows "Running inference..." with streaming worker logs
Hides training-specific UI (epoch counter, metrics, Stop Early button)
Cancel button works via same ZMQ mechanism as training
Logs saved to activeJobs for replay in Job Summary

Worker — Inference output forwarding

Track jobs merge stderr into stdout (stderr=STDOUT) so rich/tqdm output is captured
RelayChannel catch-all forwards unrecognized text lines and CR:: progress updates as running status messages
All sleap-nn inference output (timing, predictions, output path) reaches the dashboard
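
The stderr merge described above can be sketched roughly as follows. This is a minimal illustration using `asyncio` subprocesses, not the actual code in `job_executor.py`; the helper name `run_and_collect`, the `merge_stderr` flag, and the `[stderr] ` prefix formatting are assumptions made for the example.

```python
import asyncio
import sys


async def run_and_collect(cmd, merge_stderr):
    """Run cmd and return every line read from its output pipes."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        # Track jobs: fold stderr into stdout so a single reader sees everything.
        stderr=asyncio.subprocess.STDOUT if merge_stderr else asyncio.subprocess.PIPE,
    )
    lines = []

    async def read_stream(stream, prefix=""):
        # StreamReader supports async iteration, yielding one line at a time.
        async for raw in stream:
            lines.append(prefix + raw.decode(errors="replace").rstrip("\r\n"))

    tasks = [asyncio.create_task(read_stream(proc.stdout))]
    # With stderr merged, proc.stderr is None, so no separate reader is started.
    if proc.stderr is not None:
        tasks.append(asyncio.create_task(read_stream(proc.stderr, "[stderr] ")))
    await asyncio.gather(*tasks)
    await proc.wait()
    return lines


# Hypothetical child process standing in for a sleap-nn inference command.
CHILD = [
    sys.executable,
    "-c",
    "import sys; print('out line'); print('err line', file=sys.stderr)",
]
merged = asyncio.run(run_and_collect(CHILD, merge_stderr=True))
```

In the merged case both lines arrive through the stdout reader. Their relative order depends on the child's buffering, so a consumer should not rely on it.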

Files changed

File	Change
`dashboard/app.js`	Job type toggle, inference file selection, branched submitJob(), log persistence
`dashboard/index.html`	Job type toggle HTML, inference file selection view, file browser container
`dashboard/styles.css`	Job type toggle, section labels, model cards, browse buttons, data input
`sleap_rtc/worker/job_executor.py`	stderr=STDOUT for track jobs, skip stderr task when merged
`sleap_rtc/worker/mesh_coordinator.py`	Catch-all relay for text lines and CR:: progress updates

Test plan

🤖 Generated with Claude Code

Adds a Training/Inference toggle in Step 1 of the job submission wizard. When Inference is selected, the config upload step is skipped and the user goes directly to a file selection view where they browse the worker filesystem for model checkpoint directories and a .slp data file.

Dashboard changes:
- Job type toggle (Training/Inference) in Step 1
- Step indicator adapts: 3 steps for training, 2 for inference
- New inference file selection view with model cards and data input
- File browser reused from training flow with inference-specific routing
- submitJob() sends TrackJobSpec (type: "track") for inference
- Status view hides epoch/metrics and Stop Early for inference
- Status label shows "Running inference..." for track jobs

No worker changes needed: TrackJobSpec, build_track_command(), and RelayChannel inference message handling all already exist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…r to dashboard

File browser UX:
- Replaced double-click model selection with explicit "Select Folder" / "Select File" button that appears when an item is clicked
- Button confirms the selection and closes the file browser

Inference log forwarding:
- Track (inference) jobs now forward stderr lines to the relay channel so inference progress appears in the dashboard worker logs
- RelayChannel handles [stderr] lines as running status with message
- Dashboard _sjHandleJobStatus appends message to worker log panel

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Inference output:
- Track jobs now use stderr=STDOUT to merge all output into one stream. Rich/tqdm suppress output when stderr is a pipe (not TTY), but merging into stdout captures everything through the existing stdout reader, which already forwards to the relay channel.
- Skip stderr streaming task when stderr is merged (prevents crash)

File browser:
- Add sj-inference-file-browser div to HTML (was dynamically created but not found by column renderer)
- Simplify browse handlers since container exists in HTML

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…avigation

- Don't reset _sjSelectedInferencePath when loading subfolder contents (only reset on fresh browse at colIndex 0)
- Model browsing: clicking a folder shows the Select Folder button AND navigates into it; the button persists while browsing deeper
- Data browsing: clicking a folder only navigates, no Select button shown; only .slp files get the Select File button

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The RelayChannel only forwarded known message types (JOB_ACCEPTED, JOB_PROGRESS, INFERENCE_BEGIN, etc). Plain text lines from inference stdout (e.g., "Started inference at...") and CR:: tqdm progress updates fell through and were silently dropped.

Added a catch-all handler that forwards any unrecognized text line as a running status with a message field, so all output appears in the dashboard worker logs. Also forwards CR:: tqdm progress updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
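
A rough sketch of the catch-all idea, assuming a simple prefix check. The function name `to_relay_message`, the `KNOWN_PREFIXES` tuple, and the `{"kind": "typed"}` marker are hypothetical and only illustrate the routing; the real RelayChannel dispatch is not shown in this PR page. The `status: running` plus `message` shape follows the commit description.

```python
# Message types named in the commit message; how they appear on the wire
# is an assumption made for this sketch.
KNOWN_PREFIXES = ("JOB_ACCEPTED", "JOB_PROGRESS", "INFERENCE_BEGIN")


def to_relay_message(line):
    """Map one line of worker output to a relay payload, or None to skip it."""
    text = line.strip()
    if not text:
        return None  # nothing to forward for blank lines
    if text.startswith(KNOWN_PREFIXES):
        # Recognized types keep going through their existing typed handlers.
        return {"kind": "typed", "raw": text}
    if text.startswith("CR::"):
        # Carriage-return progress update (e.g. a tqdm bar redraw).
        return {"status": "running", "message": text[len("CR::"):].strip()}
    # Catch-all: any other text line becomes a running status with a message.
    return {"status": "running", "message": text}
```

Before the catch-all was added, the last two branches did not exist, which is why plain inference output never reached the dashboard.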

Inference logs arriving via _sjHandleJobStatus with status=running and a message field were appended to the visible log panel but not saved to the activeJobs.logs array. This meant the Training Job Summary showed no logs after inference completed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Training jobs were flooding the dashboard with config dumps, model architecture, and other startup logs because the catch-all text line handler (added in PR #76 for inference stdout) forwarded everything. Now only inference/track jobs forward unrecognized text lines. Training jobs use PROGRESS_REPORT:: for epoch-level updates as before.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Lightning's tqdm progress bar outputs via stderr and carriage returns during training, flooding the relay with per-batch updates. These handlers (added in PR #76 for inference stdout) should only forward for track/inference jobs. Training uses PROGRESS_REPORT:: epoch-level events instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
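
The gating rule from the two commits above might look something like this. It is a sketch under assumptions: `should_forward` and `STRUCTURED_PREFIXES` are hypothetical names, and `"train"` is an assumed job type string (only `"track"` appears in the PR text).

```python
# Structured events that are forwarded for every job type.
STRUCTURED_PREFIXES = ("PROGRESS_REPORT::",)


def should_forward(line, job_type):
    """Decide whether a line of worker output is relayed to the dashboard."""
    if line.startswith(STRUCTURED_PREFIXES):
        return True  # epoch-level events are always relayed
    # Everything else (plain text, [stderr] lines, CR:: progress redraws)
    # is only relayed for inference ("track") jobs.
    return job_type == "track"
```

With this rule a training job still reports epochs through PROGRESS_REPORT::, while its per-batch tqdm redraws and startup dumps stay on the worker.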

* Add E2E encryption for relay transport (ECDH P-256 + AES-256-GCM)

Encrypts all relay messages between dashboard and workers so the signaling server cannot read message payloads. Uses ephemeral ECDH P-256 key exchange + HKDF + AES-256-GCM with zero new dependencies.

Python (worker):
- New sleap_rtc/encryption/ module (ecdh.py, envelope.py)
- mesh_coordinator: key exchange handler, decrypt incoming, encrypt outgoing
- RelayChannel.send() encrypts job status/progress when E2E session active
- Session key storage with 24h pruning

JavaScript (dashboard):
- Web Crypto API: ECDH P-256 key generation, HKDF, AES-GCM encrypt/decrypt
- Key exchange initiated on "Next →" (sjEnterStep3 / sjGoToInferenceStep)
- apiWorkerMessage() transparently encrypts when E2E is ready
- apiFsList/apiJobSubmit rerouted through encrypted path
- sseConnect() decrypts encrypted_relay events and re-dispatches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix E2E encryption: re-apply mesh_coordinator changes + fix job_id generation

- Re-apply all E2E encryption integration to mesh_coordinator.py (key exchange handler, decrypt incoming, encrypt outgoing, session management, _send_relay_response helper)
- Fix apiJobSubmit: generate job_id client-side when E2E is active, since the dedicated /api/jobs/submit endpoint is bypassed to avoid exposing config to the signaling server
- Re-apply apiFsList E2E routing through apiWorkerMessage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix crash in _sjRenderWorkerList when activeJobs has undefined keys

Guard against undefined job IDs in the debug log line. This can happen when a previous E2E test stored an undefined job_id in activeJobs (persisted in localStorage).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Limit RelayChannel catch-all to inference jobs only

Training jobs were flooding the dashboard with config dumps, model architecture, and other startup logs because the catch-all text line handler (added in PR #76 for inference stdout) forwarded everything. Now only inference/track jobs forward unrecognized text lines. Training jobs use PROGRESS_REPORT:: for epoch-level updates as before.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Gate [stderr] and CR:: relay handlers to inference jobs only

Lightning's tqdm progress bar outputs via stderr and carriage returns during training, flooding the relay with per-batch updates. These handlers (added in PR #76 for inference stdout) should only forward for track/inference jobs. Training uses PROGRESS_REPORT:: epoch-level events instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Re-add E2E encryption JS code stripped by linter

The linter repeatedly strips the E2E encryption code from app.js. Re-adds all required components:
- Constructor state variables (_e2eSessionId, _e2ePrivateKey, etc.)
- Web Crypto functions (keypair gen, ECDH, HKDF, AES-GCM encrypt/decrypt)
- Key exchange orchestration with timeout + retry
- SSE decryption handler in sseConnect()
- Transparent encryption in apiWorkerMessage()
- Key exchange triggers in sjEnterStep3() and sjGoToInferenceStep()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Format mesh_coordinator.py with Black

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
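
The "session key storage with 24h pruning" item in the E2E commit could be approximated as below. This is an illustrative sketch only: the class `SessionKeyStore`, its methods, and the injectable clock are hypothetical and are not taken from `mesh_coordinator.py` or the `sleap_rtc/encryption/` module.

```python
import time

SESSION_TTL_SECONDS = 24 * 60 * 60  # 24h, per the commit message


class SessionKeyStore:
    """Keeps per-session keys in memory and drops entries older than the TTL."""

    def __init__(self, ttl=SESSION_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so the expiry logic is easy to test
        self._sessions = {}  # session_id -> (key_bytes, created_at)

    def put(self, session_id, key):
        self.prune()
        self._sessions[session_id] = (key, self._clock())

    def get(self, session_id):
        self.prune()
        entry = self._sessions.get(session_id)
        return entry[0] if entry else None

    def prune(self):
        """Remove expired sessions and return how many were dropped."""
        cutoff = self._clock() - self._ttl
        expired = [
            sid for sid, (_, created) in self._sessions.items() if created < cutoff
        ]
        for sid in expired:
            del self._sessions[sid]
        return len(expired)
```

Pruning on every access keeps the sketch free of background tasks; a long-running worker could just as well prune on a timer.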

alicup29 and others added 6 commits April 1, 2026 11:26

alicup29 merged commit 5cba496 into main Apr 1, 2026
8 checks passed

alicup29 deleted the amick/implement-dashboard-inference branch April 1, 2026 20:34

alicup29 merged 6 commits into main from amick/implement-dashboard-inference
