fix: cancel on multicluster #401
Conversation
Walkthrough: This PR adds internal tracking of SLURM clusters used for submitted jobs and uses that tracking to conditionally include cluster targets in the `scancel` command.

Sequence Diagram(s): omitted — the changes are internal tracking and conditional CLI argument construction, and do not introduce a multi-component control flow requiring visualization.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
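A minimal, self-contained sketch of the tracking approach the walkthrough describes, assuming a simplified executor; the real plugin records cluster names in `self._submitted_job_clusters` during submission, and the class and method names here are illustrative only:

```python
class SlurmExecutorSketch:
    """Illustrative stand-in for the plugin's executor, not its real API."""

    def __init__(self):
        # cluster names observed across all submitted jobs
        self._submitted_job_clusters: set[str] = set()

    def run_job(self, resources: dict) -> None:
        # record the cluster a job was submitted to, if any was requested
        cluster = resources.get("clusters")
        if cluster:
            self._submitted_job_clusters.add(str(cluster).strip())

    def build_scancel(self, jobids: str) -> str:
        command = f"scancel {jobids}"
        # include cluster targets only when clusters were recorded
        if self._submitted_job_clusters:
            command += " --clusters=" + ",".join(sorted(self._submitted_job_clusters))
        return command


executor = SlurmExecutorSketch()
executor.run_job({"clusters": "cluster_a"})
print(executor.build_scancel("101 102"))
# scancel 101 102 --clusters=cluster_a
```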
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)
814-834: Avoid shell injection via `--clusters` argument

`clusters_str` originates from user-controlled job resources and is appended to a `shell=True` command without quoting, creating an injection vector. This is inconsistent with the same parameter in `submit_string.py` (line 54), which properly uses `safe_quote()`. Other similar parameters like account and partition are also properly quoted with `shlex.quote()`. Prefer `shell=False` with a list of arguments, or at minimum use `shlex.quote()` on the cluster string. The `shlex` module is already imported in the file.

🔒 Proposed fix (avoid shell=True)
```diff
-        jobids = " ".join([job_info.external_jobid for job_info in active_jobs])
+        jobids = [job_info.external_jobid for job_info in active_jobs]
         try:
             # timeout set to 60, because a scheduler cycle usually is
             # about 30 sec, but can be longer in extreme cases.
             # Under 'normal' circumstances, 'scancel' is executed in
             # virtually no time.
-            scancel_command = f"scancel {jobids}"
+            scancel_command = ["scancel", *jobids]

             # Add cluster specification if any clusters were found during submission
             if self._submitted_job_clusters:
                 clusters_str = ",".join(sorted(self._submitted_job_clusters))
-                scancel_command += f" --clusters={clusters_str}"
+                scancel_command.append(f"--clusters={clusters_str}")

             subprocess.check_output(
                 scancel_command,
                 text=True,
-                shell=True,
+                shell=False,
                 timeout=60,
                 stderr=subprocess.PIPE,
             )
```
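For comparison, a minimal sketch of the other option the comment names: keeping a shell string but quoting the user-derived cluster list with `shlex.quote()`. The standalone helper and its arguments are hypothetical, extracted here only to make the snippet runnable:

```python
import shlex


def build_scancel_command(jobids: str, submitted_job_clusters: set[str]) -> str:
    # quote the user-derived cluster list before shell interpolation
    command = f"scancel {jobids}"
    if submitted_job_clusters:
        clusters_str = ",".join(sorted(submitted_job_clusters))
        command += f" --clusters={shlex.quote(clusters_str)}"
    return command


print(build_scancel_command("101 102", {"cluster_a", "cluster_b"}))
# scancel 101 102 --clusters=cluster_a,cluster_b
```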
🤖 Fix all issues with AI agents
In `@snakemake_executor_plugin_slurm/__init__.py`:
- Around line 600-607: The code stores cluster_val directly into
self._submitted_job_clusters, which breaks when job.resources["clusters"] is a
comma/whitespace-delimited string or when the resource is a list/tuple. Instead,
normalize cluster values like the slurm_account handling: obtain cluster_val
from job.resources; if it's a str, split on commas/whitespace and strip each
token; if it's a list/tuple, iterate its items, converting to strings and
stripping; then add each non-empty normalized cluster name individually into
self._submitted_job_clusters (referencing cluster_val and
self._submitted_job_clusters to locate the code). A sketch follows below.
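A runnable sketch of the normalization the prompt asks for; the helper name is hypothetical, and the splitting rules follow the prompt's description (commas/whitespace for strings, per-item stripping for lists/tuples):

```python
def normalize_clusters(cluster_val) -> set[str]:
    names: set[str] = set()
    if isinstance(cluster_val, (list, tuple)):
        # lists/tuples: convert each item to a string, strip it below
        items = [str(item) for item in cluster_val]
    else:
        # strings: split on commas and whitespace
        items = str(cluster_val).replace(",", " ").split()
    for item in items:
        name = item.strip()
        if name:
            names.add(name)
    return names


print(normalize_clusters("alpha,beta gamma"))   # {'alpha', 'beta', 'gamma'}
print(normalize_clusters(["alpha", " beta "]))  # {'alpha', 'beta'}
```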
🧹 Nitpick comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)
841-850: Log multicluster advice separately to keep WorkflowError concise

The appended multi-line warning makes the exception message very long. Consider logging the advisory via `self.logger.warning(...)` and keep the `WorkflowError` succinct. Based on learnings, prefer concise error messages in this file.

✂️ Proposed refactor
```diff
-        if self._submitted_job_clusters:
-            msg += (
-                "\nWARNING: Job cancellation failed while using "
-                "--clusters flag. Your multicluster SLURM setup may not "
-                "support this feature, or the SLURM database may not be "
-                "properly configured for multicluster operations. "
-                "Please verify your SLURM configuration with your "
-                "HPC administrator."
-            )
+        if self._submitted_job_clusters:
+            self.logger.warning(
+                "Job cancellation failed while using --clusters. "
+                "Your multicluster SLURM setup may not support this, or "
+                "slurmdbd may be misconfigured. Please verify with your "
+                "HPC administrator."
+            )
```
cademirch left a comment
If clusters can actually be a list of str, we should handle that: either enforce that it isn't (and update the docs) or check for it when adding to the cluster set.
@cademirch Oh, then the docs need a change, too. SLURM does indeed only submit to one cluster at a time. This contrasts with how it works with partitions. I remember that I thought it to be similar and did not remember fixing this, for the reason I was writing you earlier. However, the docs are broken in this regard. I think I will fix them, too. Not now, it is too late.
OK - I'll fix the docs in this PR.
…er(s) at once: it is possible, albeit not with every multicluster setup
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/further.md`:
- Around line 239-242: Fix the typo in the prose where "prinicple" is
misspelled; locate the sentence that reads "While it is possible to submit to
more than one cluster in prinicple, not all SLURM multicluster setups will
support this." (the paragraph discussing `clusters`/`cluster`/`slurm_cluster`)
and change "prinicple" to "principle".
In `@snakemake_executor_plugin_slurm/__init__.py`:
- Around line 821-828: The string concatenation uses an unnecessary f-string in
the scancel_command update; inside the block that checks
self._submitted_job_clusters (where scancel_command is built from jobids),
replace scancel_command += f" --clusters=all" with a plain string concatenation
scancel_command += " --clusters=all" to remove the redundant f-prefix while
preserving behavior.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@snakemake_executor_plugin_slurm/__init__.py`:
- Around line 821-828: The code appends "--clusters=all" whenever
self._submitted_job_clusters is non-empty, but it should only do so when more
than one cluster was used; update the condition on scancel_command to check that
self._submitted_job_clusters contains multiple entries (e.g.,
len(self._submitted_job_clusters) > 1 or an equivalent check) before appending
the flag so that scancel_command only gains "--clusters=all" for multi-cluster
submissions.
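A small sketch of the condition the prompt describes, with a hypothetical helper wrapping the check so it can run standalone:

```python
def maybe_add_clusters_flag(scancel_command: str, submitted_job_clusters: set[str]) -> str:
    # append --clusters=all only when jobs went to more than one cluster
    if len(submitted_job_clusters) > 1:
        scancel_command += " --clusters=all"
    return scancel_command


print(maybe_add_clusters_flag("scancel 101", {"a"}))       # scancel 101
print(maybe_add_clusters_flag("scancel 101", {"a", "b"}))  # scancel 101 --clusters=all
```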
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
cademirch left a comment
looks great. thank you. sorry i couldn't get to this sooner.
@cademirch please do not apologize: we all have (other) work to do and (presumably) something called "a life" going on. Thank you for your work!
🤖 I have created a release *beep* *boop*

## [2.2.0](v2.1.0...v2.2.0) (2026-02-09)

### Features

* job name prefix ([#408](#408)) ([5fa0d33](5fa0d33))

### Bug Fixes

* cancel on multicluster ([#401](#401)) ([cb6124b](cb6124b))
* decreasing job query verbosity ([#405](#405)) ([6649881](6649881))
* extracting job id from convoluted output, if necessary ([#375](#375)) ([950c909](950c909))
* quoting piped substring with shlex ([#402](#402)) ([7aa7fc3](7aa7fc3))

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added job name prefix capability.
* **Bug Fixes**
  * Fixed cancel operation on multicluster environments.
  * Improved job query verbosity output.
  * Fixed job ID extraction from complex output formats.
  * Fixed substring quoting in piped commands.
This PR suggests a fix for multicluster setups. Before, `scancel --clusters=all` was used. A user (#397) with an outdated SLURM described that this flag is not supported. Regardless, a cluster-specific cancel command should only be issued when actually using a multicluster setup. Also, a proper warning should be included if the command fails (multicluster setups can be tricky).

Summary by CodeRabbit
Bug Fixes
Documentation