Dashboard Multi-config Training, Stop Early, and Cancel fixes #75
Merged
Enables uploading multiple config YAMLs (e.g., centroid + centered_instance) in a single job submission.
Dashboard changes:
- Config upload UI now shows config cards with auto-detected model type, filename, and key hyperparams. "+ Add Another Config" dropzone appears after first config is added.
- submitJob() sends config_contents (array) and model_types (array) instead of singular config_content
- Progress UI shows "Model 1 / 2" queue label, resets epoch/metrics on model switch
Worker changes:
- Relay MODEL_TYPE:: messages as model_type_switch SSE events so the dashboard can track model transitions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
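A minimal sketch of the worker-side relay described above, for illustration only. The MODEL_TYPE:: prefix and the model_type_switch event name come from the commit message; the function name relay_model_type and the payload field model_type are assumptions, not the actual code in mesh_coordinator.py.

```python
from typing import Dict, Optional

# Prefix named in the commit message.
MODEL_TYPE_PREFIX = "MODEL_TYPE::"


def relay_model_type(line: str) -> Optional[Dict[str, str]]:
    """Turn a MODEL_TYPE:: line into a model_type_switch payload.

    Returns None for any other line so the caller can relay it as a
    regular log message. The payload shape is hypothetical.
    """
    if not line.startswith(MODEL_TYPE_PREFIX):
        return None
    model_type = line[len(MODEL_TYPE_PREFIX):].strip()
    if not model_type:
        return None  # malformed marker, nothing to relay
    return {"event": "model_type_switch", "model_type": model_type}
```

Lines that do not carry the prefix fall through unchanged, so ordinary training output is unaffected by the relay.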
The mini "+ Add Another Config" dropzone wasn't clickable because it lacked a <label for="sj-config-input"> wrapper and the file input event listeners weren't re-attached after innerHTML replacement. Now uses a label wrapper for click-to-browse and explicitly re-wires change/dragover/drop handlers after rendering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each config card now shows its own hyperparams (batch size, learning rate, max epochs, run name, WandB project/entity) in an expandable section below the filename and model type. Replaces the shared hyperparams panel that only showed the first config's values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stop Early:
- Adds "Stop Early" button (yellow/warning) alongside Cancel Training
- Sends mode="stop" to /api/jobs/{id}/cancel endpoint
- Worker sends ZMQ stop WITHOUT setting _cancel_requested, so the
pipeline continues to the next model and runs inference
Model type switch fix:
- MODEL_TYPE:: relay messages now detected inside _sjHandleJobStatus
(they arrive as job_status SSE events with event="model_type_switch")
- Resets epoch counter, metrics, and status label when model switches
- Updates queue label from "Model 1/2" to "Model 2/2"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
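The difference between the two modes can be sketched roughly as follows. This is not the worker's real class: CancelState, send_control, and the default mode are hypothetical, and the sketch assumes both modes send the same ZMQ stop command and differ only in which flag they set.

```python
from typing import Callable, Dict


class CancelState:
    """Tracks whether the user asked to stop one model or cancel the job."""

    def __init__(self, send_control: Callable[[Dict[str, str]], None]) -> None:
        # send_control stands in for the ZMQ control socket (assumption).
        self._send_control = send_control
        self._cancel_requested = False
        self._stop_requested = False

    def handle_job_cancel(self, message: Dict[str, str]) -> None:
        # Assumption: a missing mode is treated as a full cancel.
        mode = message.get("mode", "cancel")
        if mode == "stop":
            # Stop Early: leave _cancel_requested untouched so the
            # pipeline continues to the next model and inference.
            self._stop_requested = True
        else:
            # Cancel Training: the pipeline should skip everything else.
            self._cancel_requested = True
        # Either way the running trainer is asked to shut down gracefully.
        self._send_control({"command": "stop"})
```

Keeping the two flags separate lets the exit analysis tell a model that was stopped early apart from a job that was cancelled outright.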
…oint Switches sjCancelJob() and sjStopEarly() to send job_cancel messages through apiWorkerMessage (generic relay) instead of apiJobCancel (dedicated endpoint). Both use the same auth (room membership) and the worker already handles the mode field. This removes the cross-repo dependency on the webRTC-connect signaling server change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After stopping a model early, the buttons remained disabled/hidden because _sjShowCancelButton() was never called when the next model started. Now _sjHandleModelTypeSwitch() re-enables both buttons and resets the status spinner for each new model in the pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-config jobs now show clickable tabs at the top of the progress view, one per model plus an inference tab. Each tab shows a status dot (queued/active/complete/stopped/failed) and the model name.
- Clicking a tab shows that model's last-known metrics (epoch, loss, val loss, train time, learning rate)
- Active model tab auto-selects during training
- Per-model metrics are snapshotted on epoch_end events and on model type switch, so completed models retain their final metrics
- Tab bar hidden for single-model jobs (keeps current layout)
- Job completion marks all models + inference as complete and auto-selects the inference tab
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes model tabs in favor of the simpler single-panel approach:
- Status label, epoch counter, and metrics update in-place when models switch
- Worker logs are continuous with clear "— Switching to X —" barriers
- Queue label shows "Model 1/2", "Model 2/2"
- Stop Early and Cancel buttons re-enable for each model
Fixes:
- Job complete no longer incorrectly marks all models as complete
- Epoch total updates per-model from each config's max_epochs
- Status label and spinner properly reset on model switch
- label.className reset so "complete" styling doesn't persist
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes model tabs entirely in favor of simple single-panel updates.
Fixes:
- Remove all tab rendering code (_sjRenderModelTabs, _sjSelectModelTab,
_sjUpdateCurrentModelResult) and tab HTML container
- Status label now shows "Starting {model}..." on model switch instead
of immediately saying "Training {model}..."
- Worker logs include model name: "Training started (centroid)",
"Training complete (centroid)", "— centroid stopped, starting
centered_instance —"
- Stop Early is idempotent (ignores clicks if already disabled)
- _sjShowCancelButton() in model switch re-enables both buttons for
each new model
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
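The single-panel behaviour on a model switch can be modelled as a small state reset. The real logic lives in the dashboard's JavaScript; the Python below is only a language-neutral sketch, and ProgressPanel and all of its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProgressPanel:
    """Hypothetical model of the single progress panel."""

    model_types: List[str]
    max_epochs: List[int]  # one entry per config, from each config's max_epochs
    index: int = 0
    epoch: int = 0
    metrics: Dict[str, float] = field(default_factory=dict)
    status: str = ""
    buttons_enabled: bool = True

    @property
    def queue_label(self) -> str:
        return f"Model {self.index + 1}/{len(self.model_types)}"

    @property
    def epoch_total(self) -> int:
        return self.max_epochs[self.index]

    def switch_model(self, model_type: str) -> None:
        # Reset everything that belonged to the previous model.
        self.index = self.model_types.index(model_type)
        self.epoch = 0
        self.metrics = {}
        self.status = f"Starting {model_type}..."
        # Stop Early and Cancel become usable again for the new model.
        self.buttons_enabled = True
```

Deriving the queue label and epoch total from the current index avoids stale values carrying over from the previous model.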
Adds RAW JSON logging to _sjHandleJobStatus to see exactly what SSE data arrives. Adds worker-side logging when MODEL_TYPE:: is relayed. This will help diagnose why the model switch isn't being detected on the dashboard side. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Logs _stop_requested and _cancel_requested flag values at:
- send_control_message() when flags are set
- execute_from_spec() exit analysis when determining stopped_early vs cancelled
This will show whether the ZMQ stop command properly sets the flag before the process exit check runs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When cancel was sent via ZMQ, the process exited with code 0 (clean
shutdown) which matched the success path (returncode == 0). The
cancelled flag was set correctly but never checked before returning
{"cancelled": False, "success": True}. This caused the pipeline to
continue to inference instead of skipping it.
Now checks cancelled FIRST: if _cancel_requested is True, returns
{"cancelled": True, "success": False} regardless of exit code. The
pipeline loop then sets pipeline_cancelled=True and breaks, skipping
remaining models and inference.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
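The ordering fix can be illustrated with a short sketch. The two return dictionaries for the cancelled and success cases are taken from the commit message; the function names, the non-zero exit branch, and the decision to stop the pipeline on failure are assumptions made to keep the example complete.

```python
from typing import Dict, List, Tuple


def analyze_exit(returncode: int, cancel_requested: bool) -> Dict[str, bool]:
    """Classify a finished training process.

    The cancelled check runs first because a ZMQ cancel produces a
    clean exit (code 0) that would otherwise look like success.
    """
    if cancel_requested:
        return {"cancelled": True, "success": False}
    if returncode == 0:
        return {"cancelled": False, "success": True}
    # Assumption: any other exit code is a plain failure.
    return {"cancelled": False, "success": False}


def run_pipeline(outcomes: List[Tuple[str, int, bool]]) -> Dict[str, object]:
    """Walk simulated (model_type, returncode, cancel_requested) outcomes."""
    pipeline_cancelled = False
    failed = False
    trained: List[str] = []
    for model_type, returncode, cancel_requested in outcomes:
        result = analyze_exit(returncode, cancel_requested)
        if result["cancelled"]:
            pipeline_cancelled = True
            break  # skip remaining models and inference
        if not result["success"]:
            failed = True  # assumption: a failed model also ends the run
            break
        trained.append(model_type)
    return {
        "trained": trained,
        "pipeline_cancelled": pipeline_cancelled,
        "run_inference": not pipeline_cancelled and not failed,
    }
```

With the old ordering, the first call in a cancelled run would have returned success, and the loop would have moved on to the next model and then to inference.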
Summary
Enables uploading multiple training config YAMLs (e.g., centroid + centered_instance) in a single dashboard job submission, with per-model progress tracking, Stop Early, and Cancel Training support.
Replaces #74 (which had merge conflicts from squash-merged #73).
Motivation
The SLEAP top-down pipeline requires two sequential models (centroid → centered_instance). Previously, the dashboard only supported single-config submissions, forcing users to submit and monitor two separate jobs. This PR enables the full top-down workflow in one submission, matching what the SLEAP GUI and sleap-app already support.
Key Changes
Dashboard: Multi-config upload
- submitJob() sends config_contents (array) and model_types (array) instead of singular config_content
Dashboard: Stop Early button
- Sends mode: "stop" via generic relay; stops current model at checkpoint, continues to next model and inference
Dashboard: Model switch detection
- MODEL_TYPE:: messages from the worker are relayed as SSE events with event: "model_type_switch"
Worker: Stop Early vs Cancel Training
- _handle_job_cancel reads mode field: "stop" (stop early) vs "cancel" (cancel everything)
- Sends {"command": "stop"} to sleap-nn's TrainingControllerZMQ for graceful DDP shutdown
- Cancel sets the _cancel_requested flag → pipeline skips remaining models + inference
- Stop Early sets _stop_requested → pipeline continues to next model
Worker: Cancel fixes
- cancelled check now runs BEFORE returncode == 0 check in exit analysis; previously ZMQ cancel resulted in clean exit (code 0) which was misidentified as success
- MODEL_TYPE:: relay added to RelayChannel so dashboard receives model switch events
- Cancel and stop messages go through the generic relay (/api/worker/message) instead of dedicated endpoint; no signaling server changes needed
Files changed
- dashboard/app.js
- dashboard/index.html
- dashboard/styles.css: .btn-warning
- sleap_rtc/worker/mesh_coordinator.py: MODEL_TYPE:: relay, _handle_job_cancel with mode field
- sleap_rtc/worker/job_executor.py: _cancel_requested flag, cancelled-before-success exit check
Test plan
🤖 Generated with Claude Code