Dashboard Multi-config Training, Stop Early, and Cancel fixes #75

Merged
alicup29 merged 13 commits into main from amick/dashboard-multi-config-v2
Mar 31, 2026
@alicup29
Collaborator

Summary

Enables uploading multiple training config YAMLs (e.g., centroid + centered_instance) in a single dashboard job submission, with per-model progress tracking, Stop Early, and Cancel Training support.

Replaces #74 (which had merge conflicts from squash-merged #73).

Motivation

The SLEAP top-down pipeline requires two sequential models (centroid → centered_instance). Previously, the dashboard only supported single-config submissions, forcing users to submit and monitor two separate jobs. This PR enables the full top-down workflow in one submission, matching what the SLEAP GUI and sleap-app already support.

Key Changes

Dashboard — Multi-config upload

  • Config upload UI now shows config cards with auto-detected model type, filename, and per-card hyperparameters (batch size, learning rate, max epochs, run name, WandB project)
  • "+ Add Another Config" mini dropzone appears after first config is added
  • submitJob() sends config_contents (array) and model_types (array) instead of singular config_content
  • Queue label shows "Model 1 / 2", "Model 2 / 2" during training
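
As a rough illustration of the new payload shape, the sketch below builds a multi-config submission and derives the queue labels. Only the field names `config_contents` and `model_types` and the label format come from this PR; `build_submission` and `queue_label` are hypothetical helpers, the YAML strings are placeholders, and the real dashboard code is JavaScript in dashboard/app.js.

```python
from typing import Dict, List, Tuple


def build_submission(configs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build a multi-config payload from (model_type, yaml_text) pairs.

    Hypothetical helper: only the two field names are taken from the PR.
    """
    if not configs:
        raise ValueError("at least one config is required")
    return {
        # Parallel arrays: entry i of each list describes model i.
        "config_contents": [yaml_text for _, yaml_text in configs],
        "model_types": [model_type for model_type, _ in configs],
    }


def queue_label(index: int, total: int) -> str:
    """Return the queue label for a zero-based model index."""
    return f"Model {index + 1} / {total}"


payload = build_submission([
    ("centroid", "# placeholder YAML for the centroid config\n"),
    ("centered_instance", "# placeholder YAML for the centered_instance config\n"),
])
total = len(payload["model_types"])
labels = [queue_label(i, total) for i in range(total)]
print(labels)
```

Keeping the two arrays index-aligned means a single-config job is the one-element case of the same payload.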

Dashboard — Stop Early button

  • Yellow "Stop Early" button alongside "Cancel Training" and "Close"
  • Sends mode: "stop" via the generic relay; the worker stops the current model at a checkpoint, then continues to the next model and inference
  • Idempotent (ignores clicks if already stopping)

Dashboard — Model switch detection

  • MODEL_TYPE:: messages from the worker are relayed as SSE events with event: "model_type_switch"
  • Dashboard detects model switch and updates: status label ("Starting centered_instance..."), epoch counter (reset to 0), metrics (reset), max epochs (from new config), queue label
  • Stop Early and Cancel buttons re-enable for each new model
  • Worker logs include model name: "Training started (centroid)", "Training complete (centroid)"
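
The relay and reset steps above could look roughly like the following sketch. The `MODEL_TYPE::` prefix, the `model_type_switch` event name, and the "Starting ..." label come from this PR; the assumption that the model name is everything after the prefix, the `model_type` key, and the shape of the `state` dictionary are all hypothetical.

```python
from typing import Any, Dict, Optional

MODEL_TYPE_PREFIX = "MODEL_TYPE::"


def to_switch_event(message: str) -> Optional[Dict[str, str]]:
    """Convert a worker MODEL_TYPE:: message into a relay event, else None.

    Assumes the model type name is everything after the prefix.
    """
    if not message.startswith(MODEL_TYPE_PREFIX):
        return None
    model_type = message[len(MODEL_TYPE_PREFIX):].strip()
    return {"event": "model_type_switch", "model_type": model_type}


def apply_switch(
    state: Dict[str, Any],
    event: Dict[str, str],
    max_epochs_by_model: Dict[str, int],
) -> Dict[str, Any]:
    """Return a new progress state after a model switch (hypothetical state dict)."""
    model = event["model_type"]
    new_state = dict(state)
    new_state.update(
        status=f"Starting {model}...",           # not "Training ..." yet
        epoch=0,                                 # epoch counter resets
        metrics={},                              # metrics reset
        max_epochs=max_epochs_by_model[model],   # taken from the new config
        model_index=state["model_index"] + 1,    # drives the queue label
        buttons_enabled=True,                    # Stop Early / Cancel re-enable
    )
    return new_state


event = to_switch_event("MODEL_TYPE::centered_instance")
state = {"status": "Training centroid...", "epoch": 42, "metrics": {"loss": 0.1},
         "max_epochs": 100, "model_index": 0, "buttons_enabled": False}
state = apply_switch(state, event, {"centroid": 100, "centered_instance": 200})
print(state["status"], state["epoch"], state["max_epochs"])
```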

Worker — Stop Early vs Cancel Training

  • _handle_job_cancel reads mode field: "stop" (stop early) vs "cancel" (cancel everything)
  • Both send ZMQ {"command": "stop"} to sleap-nn's TrainingControllerZMQ for graceful DDP shutdown
  • Cancel sets _cancel_requested flag → pipeline skips remaining models + inference
  • Stop Early only sets _stop_requested → pipeline continues to next model
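
A minimal sketch of the mode handling described above, not the actual `_handle_job_cancel` implementation. The flag names, the mode values, and the `{"command": "stop"}` payload come from this PR; the `CancelFlags` class, the default when `mode` is missing, and the choice to set `_stop_requested` for both modes are assumptions.

```python
from typing import Dict


class CancelFlags:
    """Hypothetical stand-in for the worker state touched by _handle_job_cancel."""

    def __init__(self) -> None:
        self._stop_requested = False
        self._cancel_requested = False

    def handle_job_cancel(self, message: Dict[str, str]) -> Dict[str, str]:
        """Update flags from the mode field and return the ZMQ command to send."""
        mode = message.get("mode", "cancel")  # assumption: missing mode means cancel
        if mode not in ("stop", "cancel"):
            raise ValueError(f"unknown mode: {mode!r}")
        # Assumption: both modes mark the current model as stopping.
        self._stop_requested = True
        if mode == "cancel":
            # Only a full cancel tells the pipeline to skip what remains.
            self._cancel_requested = True
        # Both modes send the same graceful stop command to the trainer.
        return {"command": "stop"}


stop_flags = CancelFlags()
stop_command = stop_flags.handle_job_cancel({"mode": "stop"})
cancel_flags = CancelFlags()
cancel_command = cancel_flags.handle_job_cancel({"mode": "cancel"})
print(stop_command, stop_flags._cancel_requested, cancel_flags._cancel_requested)
```

Because the trainer receives the same command either way, the difference between the two buttons lives entirely in the worker's flags.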

Worker — Cancel fixes

  • The cancelled check now runs before the returncode == 0 check in exit analysis. Previously, a ZMQ cancel produced a clean exit (code 0) that was misidentified as success
  • MODEL_TYPE:: relay added to RelayChannel so dashboard receives model switch events
  • Stop/cancel routed through generic relay (/api/worker/message) instead of dedicated endpoint — no signaling server changes needed
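
The ordering fix in the first bullet can be sketched as follows. The two result dictionaries for the cancelled and success cases match the values quoted in the commit message; the function name `analyze_exit` and the non-zero exit branch are assumptions.

```python
from typing import Dict


def analyze_exit(returncode: int, cancel_requested: bool) -> Dict[str, bool]:
    """Classify a finished training process, checking cancellation first."""
    if cancel_requested:
        # A graceful ZMQ stop exits with code 0, so this check must come
        # before the success check or the cancel is reported as success.
        return {"cancelled": True, "success": False}
    if returncode == 0:
        return {"cancelled": False, "success": True}
    # Assumed shape for a genuine failure (non-zero exit, no cancel).
    return {"cancelled": False, "success": False}


print(analyze_exit(0, cancel_requested=True))
print(analyze_exit(0, cancel_requested=False))
```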

Files changed

| File | Change |
| --- | --- |
| dashboard/app.js | Multi-config cards, Stop Early handler, model switch detection, per-model progress |
| dashboard/index.html | Config list container, Stop Early button |
| dashboard/styles.css | Config card styles, mini dropzone, .btn-warning |
| sleap_rtc/worker/mesh_coordinator.py | MODEL_TYPE:: relay, _handle_job_cancel with mode field |
| sleap_rtc/worker/job_executor.py | _cancel_requested flag, cancelled-before-success exit check |

Test plan

  • Upload single config — works as before, no changes
  • Upload two configs (centroid + centered_instance) — per-card hyperparams shown
  • Submit multi-config job — "Model 1 / 2" shown, training starts
  • Stop Early on model 1 — stops at checkpoint, model 2 starts, buttons re-enable
  • Cancel Training on model 1 — stops training, skips model 2 + inference
  • Stop Early on model 2 — stops at checkpoint, inference runs
  • Cancel Training on model 2 — stops training, skips inference
  • Worker logs show "— centroid stopped, starting centered_instance —"
  • Status label updates: "Training centroid..." → "Starting centered_instance..." → "Training centered_instance..."

🤖 Generated with Claude Code

alicup29 and others added 13 commits March 31, 2026 15:46
Enables uploading multiple config YAMLs (e.g., centroid +
centered_instance) in a single job submission.

Dashboard changes:
- Config upload UI now shows config cards with auto-detected model
  type, filename, and key hyperparams. "+ Add Another Config"
  dropzone appears after first config is added.
- submitJob() sends config_contents (array) and model_types (array)
  instead of singular config_content
- Progress UI shows "Model 1 / 2" queue label, resets epoch/metrics
  on model switch

Worker changes:
- Relay MODEL_TYPE:: messages as model_type_switch SSE events so
  the dashboard can track model transitions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The mini "+ Add Another Config" dropzone wasn't clickable because it
lacked a <label for="sj-config-input"> wrapper and the file input
event listeners weren't re-attached after innerHTML replacement.
Now uses a label wrapper for click-to-browse and explicitly re-wires
change/dragover/drop handlers after rendering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each config card now shows its own hyperparams (batch size, learning
rate, max epochs, run name, WandB project/entity) in an expandable
section below the filename and model type. Replaces the shared
hyperparams panel that only showed the first config's values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stop Early:
- Adds "Stop Early" button (yellow/warning) alongside Cancel Training
- Sends mode="stop" to /api/jobs/{id}/cancel endpoint
- Worker sends ZMQ stop WITHOUT setting _cancel_requested, so the
  pipeline continues to the next model and runs inference

Model type switch fix:
- MODEL_TYPE:: relay messages now detected inside _sjHandleJobStatus
  (they arrive as job_status SSE events with event="model_type_switch")
- Resets epoch counter, metrics, and status label when model switches
- Updates queue label from "Model 1/2" to "Model 2/2"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…oint

Switches sjCancelJob() and sjStopEarly() to send job_cancel messages
through apiWorkerMessage (generic relay) instead of apiJobCancel
(dedicated endpoint). Both use the same auth (room membership) and
the worker already handles the mode field. This removes the
cross-repo dependency on the webRTC-connect signaling server change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After stopping a model early, the buttons remained disabled/hidden
because _sjShowCancelButton() was never called when the next model
started. Now _sjHandleModelTypeSwitch() re-enables both buttons and
resets the status spinner for each new model in the pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-config jobs now show clickable tabs at the top of the progress
view, one per model plus an inference tab. Each tab shows a status
dot (queued/active/complete/stopped/failed) and the model name.

- Clicking a tab shows that model's last-known metrics (epoch, loss,
  val loss, train time, learning rate)
- Active model tab auto-selects during training
- Per-model metrics are snapshotted on epoch_end events and on model
  type switch, so completed models retain their final metrics
- Tab bar hidden for single-model jobs (keeps current layout)
- Job completion marks all models + inference as complete and
  auto-selects the inference tab

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes model tabs in favor of the simpler single-panel approach:
- Status label, epoch counter, and metrics update in-place when
  models switch
- Worker logs are continuous with clear "— Switching to X —" barriers
- Queue label shows "Model 1/2", "Model 2/2"
- Stop Early and Cancel buttons re-enable for each model

Fixes:
- Job complete no longer incorrectly marks all models as complete
- Epoch total updates per-model from each config's max_epochs
- Status label and spinner properly reset on model switch
- label.className reset so "complete" styling doesn't persist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes model tabs entirely in favor of simple single-panel updates.

Fixes:
- Remove all tab rendering code (_sjRenderModelTabs, _sjSelectModelTab,
  _sjUpdateCurrentModelResult) and tab HTML container
- Status label now shows "Starting {model}..." on model switch instead
  of immediately saying "Training {model}..."
- Worker logs include model name: "Training started (centroid)",
  "Training complete (centroid)", "— centroid stopped, starting
  centered_instance —"
- Stop Early is idempotent (ignores clicks if already disabled)
- _sjShowCancelButton() in model switch re-enables both buttons for
  each new model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds RAW JSON logging to _sjHandleJobStatus to see exactly what SSE
data arrives. Adds worker-side logging when MODEL_TYPE:: is relayed.
This will help diagnose why the model switch isn't being detected
on the dashboard side.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Logs _stop_requested and _cancel_requested flag values at:
- send_control_message() when flags are set
- execute_from_spec() exit analysis when determining stopped_early vs cancelled

This will show whether the ZMQ stop command properly sets the flag
before the process exit check runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When cancel was sent via ZMQ, the process exited with code 0 (clean
shutdown) which matched the success path (returncode == 0). The
cancelled flag was set correctly but never checked before returning
{"cancelled": False, "success": True}. This caused the pipeline to
continue to inference instead of skipping it.

Now checks cancelled FIRST: if _cancel_requested is True, returns
{"cancelled": True, "success": False} regardless of exit code. The
pipeline loop then sets pipeline_cancelled=True and breaks, skipping
remaining models and inference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
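
To show how a per-model result with a `cancelled` key drives the pipeline loop described in the commit above, here is a minimal sketch. The `pipeline_cancelled` name and the break-and-skip behavior come from the commit message; `run_pipeline`, the `train` and `infer` callables, and the step strings are hypothetical, and failure handling is deliberately left out.

```python
from typing import Callable, Dict, List


def run_pipeline(
    model_types: List[str],
    train: Callable[[str], Dict[str, bool]],
    infer: Callable[[], None],
) -> List[str]:
    """Train each model in order; skip the rest and inference on cancel."""
    steps: List[str] = []
    pipeline_cancelled = False
    for model_type in model_types:
        result = train(model_type)
        steps.append(f"train:{model_type}")
        if result["cancelled"]:
            pipeline_cancelled = True
            break  # remaining models are skipped
    if not pipeline_cancelled:
        infer()
        steps.append("inference")
    return steps


def cancelled_on_centroid(model_type: str) -> Dict[str, bool]:
    """Fake trainer: the user cancels while the centroid model is training."""
    cancelled = model_type == "centroid"
    return {"cancelled": cancelled, "success": not cancelled}


def never_cancelled(model_type: str) -> Dict[str, bool]:
    """Fake trainer: every model finishes (or is stopped early) without cancel."""
    return {"cancelled": False, "success": True}


models = ["centroid", "centered_instance"]
cancelled_steps = run_pipeline(models, cancelled_on_centroid, lambda: None)
full_steps = run_pipeline(models, never_cancelled, lambda: None)
print(cancelled_steps)
print(full_steps)
```

A Stop Early result is not cancelled, so it follows the second path: the loop moves on to the next model and inference still runs.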
alicup29 changed the title from "Multi-config training, Stop Early, and cancel fixes" to "Dashboard Multi-config Training, Stop Early, and Cancel fixes" on Mar 31, 2026
alicup29 merged commit b14a351 into main on Mar 31, 2026
7 checks passed
alicup29 deleted the amick/dashboard-multi-config-v2 branch on March 31, 2026 23:14