Dashboard inference job submission#76

Merged
alicup29 merged 6 commits into main from
amick/implement-dashboard-inference
Apr 1, 2026

@alicup29 alicup29 commented Apr 1, 2026

Summary

Adds inference (track) job submission to the dashboard, letting users run sleap-nn inference on remote workers directly from the web UI.

Motivation

Previously the dashboard only supported training job submission. Users who wanted to run inference on trained models had to use the SLEAP GUI, sleap-app, or the CLI. This PR enables the full inference workflow from the dashboard: select models, select data, submit, and monitor — matching the training job experience.

Key Changes

Dashboard — Job type selector

  • Training/Inference toggle in Step 1 of the job submission wizard
  • Step indicator adapts: 3 steps for training (worker → config → files), 2 for inference (worker → files)
  • Info box explains the inference flow when selected

Dashboard — Inference file selection

  • Combined view for selecting model checkpoint directories and a .slp data file
  • Reuses the existing SSE relay file browser with inference-specific routing
  • Model paths shown as cards with remove buttons, "+ Add Another Model" browse button
  • "Select Folder" / "Select File" button appears on click (no double-click required)
  • Submit is enabled only when both models and a data file are selected

Dashboard — Inference progress

  • Status view shows "Running inference..." with streaming worker logs
  • Hides training-specific UI (epoch counter, metrics, Stop Early button)
  • Cancel button works via same ZMQ mechanism as training
  • Logs saved to activeJobs for replay in Job Summary

Worker — Inference output forwarding

  • Track jobs merge stderr into stdout (stderr=STDOUT) so rich/tqdm output is captured
  • RelayChannel catch-all forwards unrecognized text lines and CR:: progress updates as running status messages
  • All sleap-nn inference output (timing, predictions, output path) reaches the dashboard
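
As a rough illustration of the stderr merge described above, here is a minimal sketch using only the standard library. The `run_and_collect` helper and the stand-in `CHILD` command are hypothetical and are not taken from `job_executor.py`; the sketch only shows that `stderr=STDOUT` lets a single stdout reader see both streams, so no separate stderr task is needed.

```python
import asyncio
import sys
from typing import List


async def run_and_collect(cmd: List[str]) -> List[str]:
    """Run cmd with stderr merged into stdout and return every output line."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,  # one merged stream, one reader
    )
    assert proc.stdout is not None
    lines = []
    async for raw in proc.stdout:  # StreamReader yields one line per iteration
        lines.append(raw.decode(errors="replace").rstrip("\r\n"))
    await proc.wait()
    return lines


# Stand-in child process that writes to both streams (hypothetical, not sleap-nn).
CHILD = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]
```

Calling `asyncio.run(run_and_collect(CHILD))` returns both lines; their relative order depends on how the child buffers each stream.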

Files changed

| File | Change |
| --- | --- |
| dashboard/app.js | Job type toggle, inference file selection, branched submitJob(), log persistence |
| dashboard/index.html | Job type toggle HTML, inference file selection view, file browser container |
| dashboard/styles.css | Job type toggle, section labels, model cards, browse buttons, data input |
| sleap_rtc/worker/job_executor.py | stderr=STDOUT for track jobs, skip stderr task when merged |
| sleap_rtc/worker/mesh_coordinator.py | Catch-all relay for text lines and CR:: progress updates |

Test plan

  • Training/Inference toggle appears in Step 1
  • Inference skips config upload, shows 2-step indicator
  • Browse worker filesystem for model directories (Select Folder button)
  • Browse worker filesystem for .slp data file (Select File button)
  • Submit sends TrackJobSpec with data_path and model_paths
  • Worker runs sleap-nn track and exits successfully
  • Inference logs stream to dashboard worker log panel
  • Job Summary shows inference logs after completion
  • Cancel works during inference
  • Training flow unchanged when Training toggle is selected
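
For readers checking the submission step of the test plan, the payload might look roughly like the sketch below. Only the `type: "track"` value and the `data_path` and `model_paths` field names come from this PR; the `TrackJobPayload` class, its validation rules, and the example paths are hypothetical stand-ins, not the sleap-rtc `TrackJobSpec`.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class TrackJobPayload:
    """Hypothetical stand-in for the submitted spec; not the sleap-rtc class."""

    data_path: str
    model_paths: List[str]
    type: str = "track"

    def validate(self) -> None:
        # Assumed rules: at least one model directory and a .slp data file.
        if not self.model_paths:
            raise ValueError("select at least one model directory")
        if not self.data_path.endswith(".slp"):
            raise ValueError("data file must be a .slp file")


def to_json(payload: TrackJobPayload) -> str:
    """Validate the payload and serialize it for submission."""
    payload.validate()
    return json.dumps(asdict(payload))


# Example paths are invented for illustration.
example = TrackJobPayload(
    data_path="/data/session1.slp",
    model_paths=["/models/centroid", "/models/centered_instance"],
)
```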

🤖 Generated with Claude Code

alicup29 and others added 6 commits April 1, 2026 11:26
Adds a Training/Inference toggle in Step 1 of the job submission
wizard. When Inference is selected, the config upload step is
skipped and the user goes directly to a file selection view where
they browse the worker filesystem for model checkpoint directories
and a .slp data file.

Dashboard changes:
- Job type toggle (Training/Inference) in Step 1
- Step indicator adapts: 3 steps for training, 2 for inference
- New inference file selection view with model cards and data input
- File browser reused from training flow with inference-specific routing
- submitJob() sends TrackJobSpec (type: "track") for inference
- Status view hides epoch/metrics and Stop Early for inference
- Status label shows "Running inference..." for track jobs

No worker changes needed — TrackJobSpec, build_track_command(),
and RelayChannel inference message handling all already exist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r to dashboard

File browser UX:
- Replaced double-click model selection with explicit "Select Folder"
  / "Select File" button that appears when an item is clicked
- Button confirms the selection and closes the file browser

Inference log forwarding:
- Track (inference) jobs now forward stderr lines to the relay channel
  so inference progress appears in the dashboard worker logs
- RelayChannel handles [stderr] lines as running status with message
- Dashboard _sjHandleJobStatus appends message to worker log panel

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inference output:
- Track jobs now use stderr=STDOUT to merge all output into one
  stream. Rich/tqdm suppress output when stderr is a pipe (not TTY),
  but merging into stdout captures everything through the existing
  stdout reader which already forwards to the relay channel.
- Skip stderr streaming task when stderr is merged (prevents crash)

File browser:
- Add sj-inference-file-browser div to HTML (was dynamically created
  but not found by column renderer)
- Simplify browse handlers since container exists in HTML

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…avigation

- Don't reset _sjSelectedInferencePath when loading subfolder contents
  (only reset on fresh browse at colIndex 0)
- Model browsing: clicking a folder shows Select Folder button AND
  navigates into it — button persists while browsing deeper
- Data browsing: clicking a folder only navigates, no Select button
  shown — only .slp files get the Select File button
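
The selection rule in this commit can be summarized as a small decision function. The sketch below is written in Python for consistency with the worker code, whereas the dashboard itself is JavaScript. The function name, the `"model"` and `"data"` mode strings, and the choice to show no button for files while browsing models are assumptions.

```python
from typing import Optional


def select_button_label(mode: str, is_dir: bool, name: str) -> Optional[str]:
    """Return the confirm-button label for a clicked entry, or None for no button."""
    if mode == "model":
        # Folders get the button (and are also navigated into).
        return "Select Folder" if is_dir else None
    if mode == "data":
        # Folders only navigate; only .slp files can be selected.
        return "Select File" if (not is_dir and name.endswith(".slp")) else None
    raise ValueError(f"unknown browse mode: {mode!r}")
```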

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The RelayChannel only forwarded known message types (JOB_ACCEPTED,
JOB_PROGRESS, INFERENCE_BEGIN, etc). Plain text lines from inference
stdout (e.g., "Started inference at...") and CR:: tqdm progress
updates fell through and were silently dropped.

Added catch-all handler that forwards any unrecognized text line as
a running status with message field, so all output appears in the
dashboard worker logs. Also forwards CR:: tqdm progress updates.
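
A minimal sketch of what such a catch-all could look like, assuming lines arrive as plain strings. The `catch_all` function and the `KNOWN_TYPES` tuple are hypothetical; only the message type names, the `CR::` prefix, and the "running status with message field" shape come from the commit message above.

```python
from typing import Optional

# Subset of message types named in the commit; the real list is longer.
KNOWN_TYPES = ("JOB_ACCEPTED", "JOB_PROGRESS", "INFERENCE_BEGIN")
CR_PREFIX = "CR::"


def catch_all(line: str) -> Optional[dict]:
    """Map an otherwise unhandled output line to a running-status message.

    Returns None for blank lines and for lines that typed handlers own.
    """
    text = line.strip()
    if not text or text.startswith(KNOWN_TYPES):
        return None
    if text.startswith(CR_PREFIX):
        # Carriage-return progress update: forward the text after the prefix.
        text = text[len(CR_PREFIX):].strip()
        if not text:
            return None
    return {"status": "running", "message": text}
```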

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inference logs arriving via _sjHandleJobStatus with status=running
and a message field were appended to the visible log panel but not
saved to the activeJobs.logs array. This meant the Training Job
Summary showed no logs after inference completed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alicup29 alicup29 merged commit 5cba496 into main Apr 1, 2026
8 checks passed
@alicup29 alicup29 deleted the amick/implement-dashboard-inference branch April 1, 2026 20:34
alicup29 added a commit that referenced this pull request Apr 2, 2026
Training jobs were flooding the dashboard with config dumps, model
architecture, and other startup logs because the catch-all text line
handler (added in PR #76 for inference stdout) forwarded everything.

Now only inference/track jobs forward unrecognized text lines. Training
jobs use PROGRESS_REPORT:: for epoch-level updates as before.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
alicup29 added a commit that referenced this pull request Apr 2, 2026
Lightning's tqdm progress bar outputs via stderr and carriage returns
during training, flooding the relay with per-batch updates. These
handlers (added in PR #76 for inference stdout) should only forward
for track/inference jobs. Training uses PROGRESS_REPORT:: epoch-level
events instead.
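
The gating described in these two follow-up commits amounts to a check on the job type before any free-form line is forwarded. The sketch below is hypothetical: the function name and the `"train"` job type string are assumptions, while `"track"` and the `PROGRESS_REPORT::` prefix come from the text above.

```python
def forwards_to_dashboard(job_type: str, line: str) -> bool:
    """Decide whether a worker output line should reach the dashboard log."""
    text = line.strip()
    if not text:
        return False
    if text.startswith("PROGRESS_REPORT::"):
        # Structured epoch-level events are assumed to pass for every job type.
        return True
    # [stderr] lines, CR:: updates, and plain text are forwarded for track jobs only.
    return job_type == "track"
```

With this rule, a training job's per-batch `CR::` updates and startup logs are dropped, while the same lines from a track job still appear in the worker log panel.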

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
alicup29 added a commit that referenced this pull request Apr 2, 2026
* Add E2E encryption for relay transport (ECDH P-256 + AES-256-GCM)

Encrypts all relay messages between dashboard and workers so the
signaling server cannot read message payloads. Uses ephemeral ECDH
P-256 key exchange + HKDF + AES-256-GCM with zero new dependencies.
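
To make the key-derivation step concrete, here is a sketch of HKDF (RFC 5869) with SHA-256 written against the Python standard library. It covers only the HKDF stage of the ECDH + HKDF + AES-256-GCM chain and is not the code in `sleap_rtc/encryption/`. The function name is hypothetical, and the salt and `info` values a caller passes are assumptions, since the commit does not state them.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF with SHA-256: extract a pseudorandom key, then expand to `length` bytes."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: an empty salt is replaced by hash_len zero bytes, per the RFC.
    prk = hmac.new(salt or b"\x00" * hash_len, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i), concatenated and truncated.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]
```

The default length of 32 bytes matches the key size AES-256-GCM expects. In practice `ikm` would be the ECDH shared secret, which both sides compute independently.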

Python (worker):
- New sleap_rtc/encryption/ module (ecdh.py, envelope.py)
- mesh_coordinator: key exchange handler, decrypt incoming, encrypt outgoing
- RelayChannel.send() encrypts job status/progress when E2E session active
- Session key storage with 24h pruning
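
Session key storage with 24h pruning could be as simple as a dictionary keyed by session ID with a creation timestamp. The `SessionKeyStore` class below is a hypothetical sketch, not the worker's implementation; the optional `now` parameter exists only to make expiry easy to exercise.

```python
import time
from typing import Dict, Optional, Tuple

SESSION_TTL_SECONDS = 24 * 60 * 60  # 24 hours


class SessionKeyStore:
    """In-memory map of session ID to key, pruned once entries reach the TTL."""

    def __init__(self, ttl: float = SESSION_TTL_SECONDS) -> None:
        self._ttl = ttl
        self._keys: Dict[str, Tuple[bytes, float]] = {}

    def put(self, session_id: str, key: bytes, now: Optional[float] = None) -> None:
        created = time.time() if now is None else now
        self._keys[session_id] = (key, created)

    def prune(self, now: Optional[float] = None) -> int:
        """Drop expired sessions and return how many were removed."""
        current = time.time() if now is None else now
        expired = [
            sid
            for sid, (_, created) in self._keys.items()
            if current - created >= self._ttl
        ]
        for sid in expired:
            del self._keys[sid]
        return len(expired)

    def get(self, session_id: str, now: Optional[float] = None) -> Optional[bytes]:
        self.prune(now)
        entry = self._keys.get(session_id)
        return entry[0] if entry else None
```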

JavaScript (dashboard):
- Web Crypto API: ECDH P-256 key generation, HKDF, AES-GCM encrypt/decrypt
- Key exchange initiated on "Next →" (sjEnterStep3 / sjGoToInferenceStep)
- apiWorkerMessage() transparently encrypts when E2E is ready
- apiFsList/apiJobSubmit rerouted through encrypted path
- sseConnect() decrypts encrypted_relay events and re-dispatches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix E2E encryption: re-apply mesh_coordinator changes + fix job_id generation

- Re-apply all E2E encryption integration to mesh_coordinator.py
  (key exchange handler, decrypt incoming, encrypt outgoing, session
  management, _send_relay_response helper)
- Fix apiJobSubmit: generate job_id client-side when E2E is active,
  since the dedicated /api/jobs/submit endpoint is bypassed to avoid
  exposing config to the signaling server
- Re-apply apiFsList E2E routing through apiWorkerMessage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix crash in _sjRenderWorkerList when activeJobs has undefined keys

Guard against undefined job IDs in the debug log line. This can happen
when a previous E2E test stored an undefined job_id in activeJobs
(persisted in localStorage).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Limit RelayChannel catch-all to inference jobs only

Training jobs were flooding the dashboard with config dumps, model
architecture, and other startup logs because the catch-all text line
handler (added in PR #76 for inference stdout) forwarded everything.

Now only inference/track jobs forward unrecognized text lines. Training
jobs use PROGRESS_REPORT:: for epoch-level updates as before.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Gate [stderr] and CR:: relay handlers to inference jobs only

Lightning's tqdm progress bar outputs via stderr and carriage returns
during training, flooding the relay with per-batch updates. These
handlers (added in PR #76 for inference stdout) should only forward
for track/inference jobs. Training uses PROGRESS_REPORT:: epoch-level
events instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Re-add E2E encryption JS code stripped by linter

The linter repeatedly strips the E2E encryption code from app.js.
Re-adds all required components:
- Constructor state variables (_e2eSessionId, _e2ePrivateKey, etc.)
- Web Crypto functions (keypair gen, ECDH, HKDF, AES-GCM encrypt/decrypt)
- Key exchange orchestration with timeout + retry
- SSE decryption handler in sseConnect()
- Transparent encryption in apiWorkerMessage()
- Key exchange triggers in sjEnterStep3() and sjGoToInferenceStep()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Format mesh_coordinator.py with Black

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>