Dashboard Job Cancellation, Stale State Fixes, and README update #73
Merged
Previously, openSubmitJobModal() only cleared _sjConfigContent and _sjLabelsPath but left _sjPathMappings, _sjMissingVideos, _sjModelType, _sjMaxEpochs, _sjBrowseMode, and _sjResolvingVideoIndex stale from previous sessions. This caused config preloading artifacts, such as previously seen filenames showing up when nothing was selected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a "Cancel Training" button alongside "Close" in the job status view. Cancel uses a two-click confirmation: the first click changes the button text to "Confirm Cancel?" with a pulsing animation, and a second click within 3 seconds sends the cancel request. The button reverts after the timeout.
- Show both Cancel + Close buttons when job is running
- Hide Cancel button when job completes/fails/cancels
- Wire to existing apiJobCancel() endpoint
- Add explicit logging at every step for debugging
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the minimal 39-line README with a comprehensive guide covering Dashboard, CLI, and Docker workflows. Adds installation variants for CUDA, CPU, and Apple Silicon. Adds a "How It Works" overview and updated links.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restores the auto-detection flag instead of hardcoding torch-cuda130. Auto-detection is simpler for users who don't know their GPU backend.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dropzone innerHTML was replaced with the filename when a config file was loaded, but openSubmitJobModal() only reset state variables without restoring the default dropzone HTML. This caused the previous filename (e.g., "centroid.yaml") to persist visually when the modal was reopened.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…odal reopen
The missing videos panel, validation status, selected path, file browser columns, and training log retained their innerHTML from the previous session. Even though they were hidden, sjEnterStep3() would unhide them, showing stale data (e.g., "Missing Videos: mice.mp4") before the user selected any file.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added guards to _sjHandleVideoCheck, _sjHandlePathOk, and _sjHandlePathError that ignore events when no labels path has been selected yet. This prevents stale SSE events replayed on reconnect from showing the "Missing Videos" UI before the user selects a file.
Also clears _sjPendingRequests on modal open, closes the previous worker SSE connection before opening a new one in sjEnterStep3, and adds detailed logging for debugging SSE event flow.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SSE relay replays previous events on reconnect, causing a flood of warnings for ignored fs_list_res, worker_path_ok, and fs_check_videos_response events. These events are handled correctly (ignored), but the console noise makes real issues hard to spot. Downgrades the warnings to console.debug and removes the fs_list_res warning entirely, since it fires for every previous directory listing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dashboard cancels jobs via POST /api/jobs/{id}/cancel, which the
signaling server forwards as a {"type": "job_cancel"} WebSocket
message. But the worker's WebSocket handlers (admin and non-admin)
didn't process this message type — only the WebRTC data channel
handler did. This meant dashboard cancellation was silently dropped.
Added _handle_job_cancel() to mesh_coordinator and wired it in both
admin and non-admin WebSocket handler loops.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
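The fix described in this commit amounts to one more branch in the worker's WebSocket message dispatch. The sketch below is a minimal illustration, not the actual mesh_coordinator code: the `Coordinator` class, the `handle_ws_message` method, and the `job_id` field are assumptions made for the example. Only the `{"type": "job_cancel"}` message type and the `_handle_job_cancel` name come from the commit message.

```python
import json


class Coordinator:
    """Hypothetical stand-in for the worker's mesh coordinator."""

    def __init__(self):
        self.cancelled_jobs = []

    def _handle_job_cancel(self, message: dict) -> None:
        # Assumption: the forwarded message carries the job id under "job_id".
        self.cancelled_jobs.append(message.get("job_id"))

    def handle_ws_message(self, raw: str) -> bool:
        """Dispatch one WebSocket text frame; return True if it was handled."""
        message = json.loads(raw)
        handlers = {
            "job_cancel": self._handle_job_cancel,
        }
        handler = handlers.get(message.get("type"))
        if handler is None:
            # Unregistered types fall through silently, which is how
            # job_cancel was being dropped before a handler existed.
            return False
        handler(message)
        return True


coordinator = Coordinator()
handled = coordinator.handle_ws_message(
    json.dumps({"type": "job_cancel", "job_id": "job-1"})
)
ignored = coordinator.handle_ws_message(json.dumps({"type": "something_else"}))
```

Sharing a single dispatch method between the admin and non-admin loops is one way to keep the two paths from drifting apart again; whether the real code does this is not stated in the commit.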
PyTorch Lightning with DDP catches SIGTERM for graceful shutdown but doesn't always exit, leaving zombie training processes. The worker now starts a background thread that waits 10 seconds after sending SIGTERM and, if the process is still running, sends SIGKILL to the process group to force termination.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
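The SIGTERM-then-SIGKILL escalation can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in job_executor.py: the function name `terminate_with_escalation`, the `grace_seconds` parameter, and the demo child process are hypothetical, and the example is POSIX-only because it relies on process groups.

```python
import os
import signal
import subprocess
import sys
import threading


def terminate_with_escalation(
    proc: subprocess.Popen, grace_seconds: float = 10.0
) -> threading.Thread:
    """Send SIGTERM, then SIGKILL the process group if the process survives."""
    proc.send_signal(signal.SIGTERM)

    def _escalate() -> None:
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            # Still running after the grace period: kill the whole group so
            # any child ranks spawned by the process are terminated as well.
            try:
                os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
            except ProcessLookupError:
                pass  # The process exited between the timeout and the kill.
            proc.wait()

    watcher = threading.Thread(target=_escalate, daemon=True)
    watcher.start()
    return watcher


# Demo child that ignores SIGTERM, standing in for a training process
# that catches the signal but never exits.
child_code = (
    "import signal, time; "
    "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
    "print('ready', flush=True); time.sleep(60)"
)
proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    start_new_session=True,  # Own process group, so killpg cannot hit the parent.
)
proc.stdout.readline()  # Wait until the child has installed its handler.
terminate_with_escalation(proc, grace_seconds=0.5).join()
```

Starting the child in its own session is what makes `os.killpg` safe here; without it, the kill would also reach the process that launched the job.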
Replaces OS signal-based cancellation with ZMQ {"command": "stop"},
matching the approach used by the SLEAP GUI and sleap-app. sleap-nn's
TrainingControllerZMQ callback handles this by setting
trainer.should_stop=True and coordinating DDP shutdown via
reduce_boolean_decision, avoiding the deadlock that occurred when
SIGTERM was caught by Lightning but couldn't cleanly synchronize
across DDP ranks.
Falls back to SIGTERM→SIGKILL for inference jobs or when no ZMQ
ProgressReporter is available. The existing watchdog (30s timeout)
still escalates ZMQ stop → SIGINT if the process doesn't exit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ancel
Adds _cancel_requested flag to JobExecutor alongside _stop_requested. Both "stop" and "cancel" ZMQ commands stop training via trainer.should_stop, but only cancel sets pipeline_cancelled=True, which skips post-training inference.
- send_control_message() now tracks "cancel" command separately
- Result detection uses _cancel_requested flag (not just SIGTERM exit code)
- Dashboard job_cancel sends ZMQ stop + sets cancel flag
- ZMQ_CTRL data channel path (SLEAP GUI/sleap-app) now routes through job_executor.send_control_message() so flags are set correctly (previously bypassed to progress_reporter directly)
This fixes cancel for all three clients: dashboard, SLEAP GUI, sleap-app.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
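The stop/cancel distinction described in this commit can be sketched as two flags set at a single choke point. The classes below are hypothetical stand-ins written for illustration: `FakeReporter`, `JobExecutorSketch`, `cancel_from_dashboard`, and `should_run_inference` are assumed names, and no real ZMQ socket is involved. Only the flag names, the `send_control_message()` name, and the "stop"/"cancel" commands come from the commit message.

```python
class FakeReporter:
    """Hypothetical stand-in for the ZMQ progress reporter; records messages."""

    def __init__(self):
        self.sent = []

    def send(self, message: dict) -> None:
        self.sent.append(message)


class JobExecutorSketch:
    """Illustrative model of the stop/cancel flags, not the real JobExecutor."""

    def __init__(self, reporter: FakeReporter):
        self._reporter = reporter
        self._stop_requested = False
        self._cancel_requested = False

    def send_control_message(self, command: str) -> None:
        """Record the caller's intent, then forward the command to training."""
        if command == "cancel":
            self._cancel_requested = True
        elif command == "stop":
            self._stop_requested = True
        self._reporter.send({"command": command})

    def cancel_from_dashboard(self) -> None:
        # Per the commit: the dashboard path sends ZMQ stop and sets the cancel flag.
        self._cancel_requested = True
        self._reporter.send({"command": "stop"})

    def should_run_inference(self) -> bool:
        # Only a cancel skips post-training inference; a plain stop still runs it.
        return not self._cancel_requested


stopper = JobExecutorSketch(FakeReporter())
stopper.send_control_message("stop")

canceller = JobExecutorSketch(FakeReporter())
canceller.send_control_message("cancel")

dashboard_reporter = FakeReporter()
dashboard = JobExecutorSketch(dashboard_reporter)
dashboard.cancel_from_dashboard()
```

Routing every client through one method is the point of the design: a path that talks to the reporter directly would stop training but leave the flags unset, which is the bypass this commit removes.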
This was referenced Mar 31, 2026
Summary
… --torch-backend auto.
Key changes
Job cancellation pipeline
The dashboard's cancel button triggers POST /api/jobs/{id}/cancel → signaling server forwards job_cancel over WebSocket → worker's mesh coordinator sends ZMQ {"command": "stop"} to sleap-nn's TrainingControllerZMQ callback, which sets trainer.should_stop = True and coordinates DDP shutdown via reduce_boolean_decision.
A new _cancel_requested flag on JobExecutor distinguishes "Cancel Training" (skip inference) from "Stop Early" (run inference). This also fixes the SLEAP GUI/sleap-app data channel path (ZMQ_CTRL::), which previously bypassed the flag.
Stale state fixes
- openSubmitJobModal() now resets all wizard state variables and clears the DOM (dropzone, missing videos, validation status, file browser)
- SSE handlers (_sjHandleVideoCheck, _sjHandlePathOk, _sjHandlePathError) ignore events when no labels path is selected (stale replays)
Files changed
- dashboard/app.js
- dashboard/index.html
- dashboard/styles.css: .btn-confirming pulse animation
- sleap_rtc/worker/mesh_coordinator.py: job_cancel WebSocket handler, ZMQ cancel routing
- sleap_rtc/worker/job_executor.py: _cancel_requested flag, SIGKILL escalation fallback
- sleap_rtc/worker/worker_class.py: routes ZMQ_CTRL:: through job_executor.send_control_message()
- README.md
Test plan
🤖 Generated with Claude Code