
Dashboard Job Cancellation, Stale State Fixes, and README update #73

Merged
alicup29 merged 13 commits into main from amick/dashboard-cancel-bugs
Mar 30, 2026

@alicup29
Collaborator

Summary

  • Job cancellation from dashboard: Added a "Cancel Training" button with two-click confirmation. The request routes through WebSocket → the worker's ZMQ control channel, using the same graceful shutdown mechanism as the SLEAP GUI and sleap-app. Cancelling skips post-training inference; "Stop Early" (future) will stop training but still run inference.
  • Stale wizard state fixes: Fixed a config preloading bug in which state from previously selected files (config YAML, missing videos, validation status) persisted across modal reopens. SSE handlers now also ignore replayed events from previous sessions.
  • README rewrite: Side-by-side Dashboard, CLI, and Docker quickstart with --torch-backend auto.

Key changes

Job cancellation pipeline

The dashboard's cancel button triggers POST /api/jobs/{id}/cancel → signaling server forwards job_cancel over WebSocket → worker's mesh coordinator sends ZMQ {"command": "stop"} to sleap-nn's TrainingControllerZMQ callback, which sets trainer.should_stop = True and coordinates DDP shutdown via reduce_boolean_decision.

A new _cancel_requested flag on JobExecutor distinguishes "Cancel Training" (skip inference) from "Stop Early" (run inference). This also fixes the SLEAP GUI/sleap-app data channel path (ZMQ_CTRL::), which previously went directly to the progress reporter and so never set the flag.
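
The difference between the two flags can be sketched roughly as follows. This is an illustration only, not the code in job_executor.py: the class name `JobExecutorSketch`, the `FakeReporter` stand-in, and the `should_run_inference()` helper are hypothetical, and it assumes the control message is forwarded to the progress reporter unchanged.

```python
import json


class FakeReporter:
    """Hypothetical stand-in for the ZMQ progress reporter; it only records messages."""

    def __init__(self) -> None:
        self.sent = []

    def send_control_message(self, message: str) -> None:
        self.sent.append(message)


class JobExecutorSketch:
    """Tracks why training was stopped so the pipeline can decide on inference."""

    def __init__(self, reporter: FakeReporter) -> None:
        self._reporter = reporter
        self._stop_requested = False
        self._cancel_requested = False

    def send_control_message(self, message: str) -> None:
        command = json.loads(message).get("command")
        # Both commands end training; only "cancel" marks the pipeline as cancelled.
        if command in ("stop", "cancel"):
            self._stop_requested = True
        if command == "cancel":
            self._cancel_requested = True
        # Assumption: the message is passed on to the reporter as received.
        self._reporter.send_control_message(message)

    def should_run_inference(self) -> bool:
        # "Stop Early" still runs post-training inference; "Cancel Training" does not.
        return not self._cancel_requested
```

Routing every client through one `send_control_message()` entry point is what keeps the flags consistent regardless of which client issued the command.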

Stale state fixes

  • openSubmitJobModal() now resets all wizard state variables + clears DOM (dropzone, missing videos, validation status, file browser)
  • SSE handlers (_sjHandleVideoCheck, _sjHandlePathOk, _sjHandlePathError) ignore events when no labels path is selected (stale replays)
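
The two fixes above follow a common pattern: build a fresh state object on every modal open, and have event handlers bail out when the state they depend on has not been set. The dashboard itself is JavaScript; the sketch below models the same logic in Python, and every name in it (`WizardState`, `open_submit_job_modal`, `handle_video_check`, the `"missing"` event field) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WizardState:
    """Hypothetical model of the wizard's per-session state."""

    config_content: Optional[str] = None
    labels_path: Optional[str] = None
    missing_videos: List[str] = field(default_factory=list)
    pending_requests: Dict[str, object] = field(default_factory=dict)


def open_submit_job_modal() -> WizardState:
    # Returning a new object guarantees no field survives from the last session.
    return WizardState()


def handle_video_check(state: WizardState, event: dict) -> bool:
    """Apply a video-check event; return False if it was ignored as stale."""
    if state.labels_path is None:
        # No labels file selected yet, so this must be a replayed event.
        return False
    state.missing_videos = list(event.get("missing", []))
    return True
```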

Files changed

| File | Change |
| --- | --- |
| dashboard/app.js | Cancel button handler, state resets, SSE guards |
| dashboard/index.html | Cancel button element |
| dashboard/styles.css | .btn-confirming pulse animation |
| sleap_rtc/worker/mesh_coordinator.py | job_cancel WebSocket handler, ZMQ cancel routing |
| sleap_rtc/worker/job_executor.py | _cancel_requested flag, SIGKILL escalation fallback |
| sleap_rtc/worker/worker_class.py | Route ZMQ_CTRL:: through job_executor.send_control_message() |
| README.md | Full rewrite with Dashboard/CLI/Docker quickstart |

Test plan

  • Cancel button shows during running job, hides on completion
  • Two-click confirmation: first click → "Confirm Cancel?", timeout reverts
  • Cancel sends request to worker, training stops gracefully via ZMQ
  • Post-training inference is skipped on cancel
  • Reopening modal clears stale config filename
  • Reopening modal clears stale missing videos
  • SSE replayed events don't trigger UI before file selection

🤖 Generated with Claude Code

alicup29 and others added 12 commits March 30, 2026 11:03
Previously, openSubmitJobModal() only cleared _sjConfigContent and
_sjLabelsPath but left _sjPathMappings, _sjMissingVideos, _sjModelType,
_sjMaxEpochs, _sjBrowseMode, and _sjResolvingVideoIndex stale from
previous sessions. This caused config preloading artifacts like
previously seen filenames showing up when nothing was selected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a "Cancel Training" button alongside "Close" in the job status
view. Cancel uses a two-click confirmation: first click changes text
to "Confirm Cancel?" with a pulsing animation, second click within
3 seconds sends the cancel request. Reverts after timeout.

- Show both Cancel + Close buttons when job is running
- Hide Cancel button when job completes/fails/cancels
- Wire to existing apiJobCancel() endpoint
- Add explicit logging at every step for debugging

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
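
The two-click confirmation described in this commit amounts to a small state machine. The real button lives in dashboard/app.js; the Python sketch below only models the logic, with a hypothetical `TwoClickConfirm` class and an injectable clock so the 3-second window can be exercised without waiting.

```python
import time
from typing import Callable, Optional


class TwoClickConfirm:
    """First click arms the button; a second click inside the window confirms."""

    def __init__(
        self,
        on_confirm: Callable[[], None],
        timeout_seconds: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._on_confirm = on_confirm
        self._timeout = timeout_seconds
        self._clock = clock
        self._armed_at: Optional[float] = None

    def _is_armed(self) -> bool:
        if self._armed_at is None:
            return False
        return self._clock() - self._armed_at < self._timeout

    @property
    def label(self) -> str:
        # The label reverts on its own once the window has elapsed.
        return "Confirm Cancel?" if self._is_armed() else "Cancel Training"

    def click(self) -> bool:
        """Return True if this click sent the cancel request."""
        if self._is_armed():
            self._armed_at = None
            self._on_confirm()
            return True
        # Either the first click or a click after the window expired: (re)arm.
        self._armed_at = self._clock()
        return False
```

A click that arrives after the window has expired is treated as a new first click, so a stray late click cannot cancel a job.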
Replaces the minimal 39-line README with a comprehensive guide
covering Dashboard, CLI, and Docker workflows. Adds installation
variants for CUDA, CPU, and Apple Silicon. Adds "How It Works"
overview and updated links.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restores the auto-detection flag instead of hardcoding torch-cuda130,
which is simpler for users who don't know their GPU backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dropzone innerHTML was replaced with the filename when a config
file was loaded, but openSubmitJobModal() only reset state variables
without restoring the default dropzone HTML. This caused the previous
filename (e.g., "centroid.yaml") to persist visually when the modal
was reopened.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…odal reopen

The missing videos panel, validation status, selected path, file
browser columns, and training log retained their innerHTML from the
previous session. Even though they were hidden, sjEnterStep3() would
unhide them, showing stale data (e.g., "Missing Videos: mice.mp4")
before the user selected any file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added guards to _sjHandleVideoCheck, _sjHandlePathOk, and
_sjHandlePathError that ignore events when no labels path has been
selected yet. This prevents stale SSE events replayed on reconnect
from showing "Missing Videos" UI before the user selects a file.

Also clears _sjPendingRequests on modal open, closes previous worker
SSE before opening a new one in sjEnterStep3, and adds detailed
logging for debugging SSE event flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SSE relay replays previous events on reconnect, causing a flood of
warnings for ignored fs_list_res, worker_path_ok, and
fs_check_videos_response events. These are correctly handled (ignored)
but the console noise makes real issues hard to spot. Downgrade to
console.debug and remove the fs_list_res warning entirely since it
fires for every previous directory listing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dashboard cancels jobs via POST /api/jobs/{id}/cancel, which the
signaling server forwards as a {"type": "job_cancel"} WebSocket
message. But the worker's WebSocket handlers (admin and non-admin)
didn't process this message type — only the WebRTC data channel
handler did. This meant dashboard cancellation was silently dropped.

Added _handle_job_cancel() to mesh_coordinator and wired it in both
admin and non-admin WebSocket handler loops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
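
The failure mode here, a message type with no registered handler being dropped silently, is easy to see in a dispatch-table sketch. The code below is an illustration under assumptions, not the mesh_coordinator implementation: `dispatch`, `handle_job_cancel`, and the `"job_id"` field are hypothetical names.

```python
import asyncio
import json
from typing import Awaitable, Callable, Dict, List

Handler = Callable[[dict], Awaitable[None]]

# Hypothetical record of cancelled jobs, standing in for the real cancel logic.
cancelled_jobs: List[str] = []


async def handle_job_cancel(message: dict) -> None:
    cancelled_jobs.append(message.get("job_id"))


async def dispatch(raw: str, handlers: Dict[str, Handler]) -> bool:
    """Route one WebSocket message by its "type"; return False if unhandled."""
    message = json.loads(raw)
    handler = handlers.get(message.get("type"))
    if handler is None:
        # Without a registered handler the message is dropped.
        return False
    await handler(message)
    return True


# Sharing one table between the admin and non-admin loops keeps them in sync.
HANDLERS: Dict[str, Handler] = {"job_cancel": handle_job_cancel}
```

Returning a boolean from `dispatch` gives the caller a place to log unhandled types, which would have surfaced the dropped `job_cancel` messages earlier.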
PyTorch Lightning with DDP catches SIGTERM for graceful shutdown but
doesn't always exit, leaving zombie training processes. Now starts a
background thread that waits 10 seconds after SIGTERM, and if the
process is still running, sends SIGKILL to the process group to
force termination.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
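
A minimal sketch of the SIGTERM then SIGKILL escalation, assuming a POSIX system and a child started in its own session (for example with `start_new_session=True`) so that signalling the process group does not hit the worker itself. The function name `terminate_with_escalation` is hypothetical and this is not the code in job_executor.py.

```python
import os
import signal
import subprocess
import sys
import threading


def terminate_with_escalation(
    proc: subprocess.Popen, grace_seconds: float = 10.0
) -> threading.Thread:
    """Send SIGTERM to the child's process group, then SIGKILL if it lingers."""
    # Assumption: the child leads its own process group (start_new_session=True).
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)

    def _escalate() -> None:
        try:
            # Give the process the grace period to shut down on its own.
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            try:
                os.killpg(pgid, signal.SIGKILL)
            except ProcessLookupError:
                # The group exited between the timeout and the kill.
                pass

    watcher = threading.Thread(target=_escalate, daemon=True)
    watcher.start()
    return watcher
```

Signalling the whole group rather than a single PID matters for DDP, where each rank is a separate process.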
Replaces OS signal-based cancellation with ZMQ {"command": "stop"},
matching the approach used by the SLEAP GUI and sleap-app. sleap-nn's
TrainingControllerZMQ callback handles this by setting
trainer.should_stop=True and coordinating DDP shutdown via
reduce_boolean_decision, avoiding the deadlock that occurred when
SIGTERM was caught by Lightning but couldn't cleanly synchronize
across DDP ranks.

Falls back to SIGTERM→SIGKILL for inference jobs or when no ZMQ
ProgressReporter is available. The existing watchdog (30s timeout)
still escalates ZMQ stop → SIGINT if the process doesn't exit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
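
The choice between the two cancellation paths in this commit can be summarized as a small decision function. This is a sketch of the rule as described above; the function name, the job type strings, and the returned labels are all hypothetical.

```python
def choose_cancel_strategy(job_type: str, has_zmq_reporter: bool) -> str:
    """Pick how to stop a job, following the fallback rule described above."""
    if job_type == "training" and has_zmq_reporter:
        # Graceful path: the training callback stops all DDP ranks together.
        return "zmq_stop"
    # Inference jobs, or training without a ZMQ reporter, fall back to signals.
    return "sigterm_then_sigkill"
```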
…ancel

Adds _cancel_requested flag to JobExecutor alongside _stop_requested.
Both "stop" and "cancel" ZMQ commands stop training via
trainer.should_stop, but only cancel sets pipeline_cancelled=True
which skips post-training inference.

- send_control_message() now tracks "cancel" command separately
- Result detection uses _cancel_requested flag (not just SIGTERM exit code)
- Dashboard job_cancel sends ZMQ stop + sets cancel flag
- ZMQ_CTRL data channel path (SLEAP GUI/sleap-app) now routes through
  job_executor.send_control_message() so flags are set correctly
  (previously bypassed to progress_reporter directly)

This fixes cancel for all three clients: dashboard, SLEAP GUI, sleap-app.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alicup29 alicup29 changed the title from "Dashboard job cancellation, stale state fixes, and README update" to "Dashboard Job Cancellation, Stale State Fixes, and README update" Mar 30, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alicup29 alicup29 merged commit bfe739e into main Mar 30, 2026
10 checks passed
@alicup29 alicup29 deleted the amick/dashboard-cancel-bugs branch March 30, 2026 22:21