Dashboard Multi-config Training, Stop Early, and Cancel fixes by alicup29 · Pull Request #75 · talmolab/sleap-rtc

alicup29 · 2026-03-31T22:47:51Z

Summary

Enables uploading multiple training config YAMLs (e.g., centroid + centered_instance) in a single dashboard job submission, with per-model progress tracking, Stop Early, and Cancel Training support.

Replaces #74 (which had merge conflicts from squash-merged #73).

Motivation

The SLEAP top-down pipeline requires two sequential models (centroid → centered_instance). Previously, the dashboard only supported single-config submissions, forcing users to submit and monitor two separate jobs. This PR enables the full top-down workflow in one submission, matching what the SLEAP GUI and sleap-app already support.

Key Changes

Dashboard — Multi-config upload

Config upload UI now shows config cards with auto-detected model type, filename, and per-card hyperparameters (batch size, learning rate, max epochs, run name, WandB project)
"+ Add Another Config" mini dropzone appears after first config is added
submitJob() sends config_contents (array) and model_types (array) instead of singular config_content
Queue label shows "Model 1 / 2", "Model 2 / 2" during training
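
As a rough illustration of the payload shape described above, here is a minimal Python sketch. The real `submitJob()` is dashboard JavaScript; only the field names `config_contents` and `model_types` and the "Model 1 / 2" label format come from this PR. The helper names and the input structure are hypothetical.

```python
from typing import Any


def build_job_payload(configs: list[dict[str, str]]) -> dict[str, Any]:
    """Build a multi-config submission payload.

    `configs` is a hypothetical list of {"content": <yaml text>, "model_type": <name>}
    entries, in the order the models should be trained.
    """
    if not configs:
        raise ValueError("at least one training config is required")
    return {
        # Parallel arrays: index i of each list describes the same model.
        "config_contents": [c["content"] for c in configs],
        "model_types": [c["model_type"] for c in configs],
    }


def queue_label(index: int, total: int) -> str:
    """Format the queue label for the zero-based model `index`."""
    return f"Model {index + 1} / {total}"
```

Keeping the two arrays index-aligned means a single-config job is just the one-element case of the same payload.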

Dashboard — Stop Early button

Yellow "Stop Early" button alongside "Cancel Training" and "Close"
Sends mode: "stop" via generic relay — stops current model at checkpoint, continues to next model and inference
Idempotent (ignores clicks if already stopping)

Dashboard — Model switch detection

MODEL_TYPE:: messages from the worker are relayed as SSE events with event: "model_type_switch"
Dashboard detects model switch and updates: status label ("Starting centered_instance..."), epoch counter (reset to 0), metrics (reset), max epochs (from new config), queue label
Stop Early and Cancel buttons re-enable for each new model
Worker logs include model name: "Training started (centroid)", "Training complete (centroid)"
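
The model switch flow above can be sketched in Python as follows. This is not the project's code: the dashboard logic lives in `dashboard/app.js`, and the exact wire format of a `MODEL_TYPE::` message is not shown in this PR. The sketch assumes the model name directly follows the prefix, and the state keys and function names are hypothetical.

```python
from typing import Optional

MODEL_TYPE_PREFIX = "MODEL_TYPE::"


def to_sse_event(message: str) -> Optional[dict]:
    """Convert a worker MODEL_TYPE:: message into an SSE-style payload.

    Assumption: the message looks like "MODEL_TYPE::centered_instance".
    """
    if not message.startswith(MODEL_TYPE_PREFIX):
        return None  # Not a model switch message.
    model_type = message[len(MODEL_TYPE_PREFIX):].strip()
    if not model_type:
        return None
    return {"event": "model_type_switch", "model_type": model_type}


def apply_model_switch(state: dict, event: dict, max_epochs: dict) -> dict:
    """Return a new progress state with per-model fields reset."""
    model_type = event["model_type"]
    # Raises ValueError if the worker reports a model that was not submitted.
    index = state["model_types"].index(model_type)
    return {
        **state,
        "status_label": f"Starting {model_type}...",
        "epoch": 0,
        "metrics": {},
        # Fall back to the previous total if the new config has no value.
        "max_epochs": max_epochs.get(model_type, state["max_epochs"]),
        "queue_label": f"Model {index + 1} / {len(state['model_types'])}",
        "buttons_enabled": True,  # Stop Early and Cancel re-enable per model.
    }
```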

Worker — Stop Early vs Cancel Training

_handle_job_cancel reads mode field: "stop" (stop early) vs "cancel" (cancel everything)
Both send ZMQ {"command": "stop"} to sleap-nn's TrainingControllerZMQ for graceful DDP shutdown
Cancel sets _cancel_requested flag → pipeline skips remaining models + inference
Stop Early only sets _stop_requested → pipeline continues to next model
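
A minimal sketch of the mode handling described above, assuming a hypothetical `CancelHandler` class and an injected `send_control` callable in place of the real ZMQ socket. It also assumes that a missing `mode` means a full cancel and that Cancel sets `_stop_requested` as well as `_cancel_requested`. Neither detail is confirmed by this PR.

```python
from typing import Callable


class CancelHandler:
    """Hypothetical stand-in for the worker object that owns the control flags."""

    def __init__(self, send_control: Callable[[dict], None]) -> None:
        self._send_control = send_control  # Delivers the control dict to the trainer.
        self._stop_requested = False
        self._cancel_requested = False

    def handle_job_cancel(self, data: dict) -> None:
        # Assumption: treat a missing mode as a full cancel.
        mode = data.get("mode", "cancel")
        if mode not in ("stop", "cancel"):
            raise ValueError(f"unknown cancel mode: {mode!r}")

        # Both modes stop the model that is currently training.
        self._stop_requested = True
        if mode == "cancel":
            # Only a full cancel tells the pipeline to skip what remains.
            self._cancel_requested = True

        # Same graceful stop command in both cases.
        self._send_control({"command": "stop"})
```

The two modes differ only in the flags they leave behind. The training process receives the same command either way, which is why the pipeline, not the trainer, decides whether to continue.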

Worker — Cancel fixes

cancelled check now runs BEFORE returncode == 0 check in exit analysis — previously ZMQ cancel resulted in clean exit (code 0) which was misidentified as success
MODEL_TYPE:: relay added to RelayChannel so dashboard receives model switch events
Stop/cancel routed through generic relay (/api/worker/message) instead of dedicated endpoint — no signaling server changes needed
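
The ordering fix in the exit analysis can be illustrated with a short sketch. The function name and signature are hypothetical. The returned dictionaries follow the values quoted in the commit message for this fix, and the non-zero exit branch is an assumption.

```python
def analyze_exit(returncode: int, cancel_requested: bool) -> dict:
    """Classify a finished training process.

    The cancel check must come first: a ZMQ-driven cancel shuts the
    process down cleanly, so its exit code is 0 just like a real success.
    """
    if cancel_requested:
        return {"cancelled": True, "success": False}
    if returncode == 0:
        return {"cancelled": False, "success": True}
    # Assumption: any other exit code is reported as a plain failure.
    return {"cancelled": False, "success": False}
```

With the checks in the opposite order, `analyze_exit(0, True)` would fall into the success branch, which is the bug this PR fixes.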

Files changed

| File | Change |
|---|---|
| `dashboard/app.js` | Multi-config cards, Stop Early handler, model switch detection, per-model progress |
| `dashboard/index.html` | Config list container, Stop Early button |
| `dashboard/styles.css` | Config card styles, mini dropzone, `.btn-warning` |
| `sleap_rtc/worker/mesh_coordinator.py` | `MODEL_TYPE::` relay, `_handle_job_cancel` with mode field |
| `sleap_rtc/worker/job_executor.py` | `_cancel_requested` flag, cancelled-before-success exit check |

Test plan

Upload single config — works as before, no changes
Upload two configs (centroid + centered_instance) — per-card hyperparams shown
Submit multi-config job — "Model 1 / 2" shown, training starts
Stop Early on model 1 — stops at checkpoint, model 2 starts, buttons re-enable
Cancel Training on model 1 — stops training, skips model 2 + inference
Stop Early on model 2 — stops at checkpoint, inference runs
Cancel Training on model 2 — stops training, skips inference
Worker logs show "— centroid stopped, starting centered_instance —"
Status label updates: "Training centroid..." → "Starting centered_instance..." → "Training centered_instance..."

🤖 Generated with Claude Code

Enables uploading multiple config YAMLs (e.g., centroid + centered_instance) in a single job submission.

Dashboard changes:
- Config upload UI now shows config cards with auto-detected model type, filename, and key hyperparams. "+ Add Another Config" dropzone appears after first config is added.
- submitJob() sends config_contents (array) and model_types (array) instead of singular config_content
- Progress UI shows "Model 1 / 2" queue label, resets epoch/metrics on model switch

Worker changes:
- Relay MODEL_TYPE:: messages as model_type_switch SSE events so the dashboard can track model transitions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The mini "+ Add Another Config" dropzone wasn't clickable because it lacked a <label for="sj-config-input"> wrapper and the file input event listeners weren't re-attached after innerHTML replacement. Now uses a label wrapper for click-to-browse and explicitly re-wires change/dragover/drop handlers after rendering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Each config card now shows its own hyperparams (batch size, learning rate, max epochs, run name, WandB project/entity) in an expandable section below the filename and model type. Replaces the shared hyperparams panel that only showed the first config's values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Stop Early:
- Adds "Stop Early" button (yellow/warning) alongside Cancel Training
- Sends mode="stop" to /api/jobs/{id}/cancel endpoint
- Worker sends ZMQ stop WITHOUT setting _cancel_requested, so the pipeline continues to the next model and runs inference

Model type switch fix:
- MODEL_TYPE:: relay messages now detected inside _sjHandleJobStatus (they arrive as job_status SSE events with event="model_type_switch")
- Resets epoch counter, metrics, and status label when model switches
- Updates queue label from "Model 1/2" to "Model 2/2"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…oint Switches sjCancelJob() and sjStopEarly() to send job_cancel messages through apiWorkerMessage (generic relay) instead of apiJobCancel (dedicated endpoint). Both use the same auth (room membership) and the worker already handles the mode field. This removes the cross-repo dependency on the webRTC-connect signaling server change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

After stopping a model early, the buttons remained disabled/hidden because _sjShowCancelButton() was never called when the next model started. Now _sjHandleModelTypeSwitch() re-enables both buttons and resets the status spinner for each new model in the pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Multi-config jobs now show clickable tabs at the top of the progress view, one per model plus an inference tab. Each tab shows a status dot (queued/active/complete/stopped/failed) and the model name.
- Clicking a tab shows that model's last-known metrics (epoch, loss, val loss, train time, learning rate)
- Active model tab auto-selects during training
- Per-model metrics are snapshotted on epoch_end events and on model type switch, so completed models retain their final metrics
- Tab bar hidden for single-model jobs (keeps current layout)
- Job completion marks all models + inference as complete and auto-selects the inference tab

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Removes model tabs in favor of the simpler single-panel approach:
- Status label, epoch counter, and metrics update in-place when models switch
- Worker logs are continuous with clear "— Switching to X —" barriers
- Queue label shows "Model 1/2", "Model 2/2"
- Stop Early and Cancel buttons re-enable for each model

Fixes:
- Job complete no longer incorrectly marks all models as complete
- Epoch total updates per-model from each config's max_epochs
- Status label and spinner properly reset on model switch
- label.className reset so "complete" styling doesn't persist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Removes model tabs entirely in favor of simple single-panel updates.

Fixes:
- Remove all tab rendering code (_sjRenderModelTabs, _sjSelectModelTab, _sjUpdateCurrentModelResult) and tab HTML container
- Status label now shows "Starting {model}..." on model switch instead of immediately saying "Training {model}..."
- Worker logs include model name: "Training started (centroid)", "Training complete (centroid)", "— centroid stopped, starting centered_instance —"
- Stop Early is idempotent (ignores clicks if already disabled)
- _sjShowCancelButton() in model switch re-enables both buttons for each new model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds RAW JSON logging to _sjHandleJobStatus to see exactly what SSE data arrives. Adds worker-side logging when MODEL_TYPE:: is relayed. This will help diagnose why the model switch isn't being detected on the dashboard side. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Logs _stop_requested and _cancel_requested flag values at:
- send_control_message() when flags are set
- execute_from_spec() exit analysis when determining stopped_early vs cancelled

This will show whether the ZMQ stop command properly sets the flag before the process exit check runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When cancel was sent via ZMQ, the process exited with code 0 (clean shutdown) which matched the success path (returncode == 0). The cancelled flag was set correctly but never checked before returning {"cancelled": False, "success": True}. This caused the pipeline to continue to inference instead of skipping it. Now checks cancelled FIRST: if _cancel_requested is True, returns {"cancelled": True, "success": False} regardless of exit code. The pipeline loop then sets pipeline_cancelled=True and breaks, skipping remaining models and inference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
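
The pipeline behavior described in this commit can be sketched as a loop. The names `run_pipeline`, `train_one`, and `run_inference` are hypothetical; only `pipeline_cancelled` and the result keys come from the message above. The sketch models only the cancelled and not-cancelled outcomes and does not cover failed training runs.

```python
from typing import Callable


def run_pipeline(
    model_types: list[str],
    train_one: Callable[[str], dict],
    run_inference: Callable[[], None],
) -> dict:
    """Train each model in order, then run inference unless cancelled."""
    pipeline_cancelled = False
    trained: list[str] = []

    for model_type in model_types:
        result = train_one(model_type)  # e.g. {"cancelled": ..., "success": ...}
        if result["cancelled"]:
            # Cancel Training: skip the remaining models and inference.
            pipeline_cancelled = True
            break
        # A Stop Early result is not cancelled, so the loop moves on.
        trained.append(model_type)

    if not pipeline_cancelled:
        run_inference()

    return {"cancelled": pipeline_cancelled, "trained": trained}
```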

alicup29 and others added 13 commits March 31, 2026 15:46

debug: log raw data in _handle_job_cancel to verify mode field arrives

70d59ed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

alicup29 changed the title ~~Multi-config training, Stop Early, and cancel fixes~~ Dashboard Multi-config Training, Stop Early, and Cancel fixes Mar 31, 2026

alicup29 merged commit b14a351 into main Mar 31, 2026
7 checks passed

alicup29 deleted the amick/dashboard-multi-config-v2 branch March 31, 2026 23:14
