fix: cancel on multicluster #401
Conversation
Walkthrough: This PR adds internal tracking of SLURM clusters used for submitted jobs and uses that tracking to conditionally include cluster targets in the `scancel` command.

Sequence Diagram(s): omitted — the changes are internal tracking and conditional CLI argument construction, and do not introduce a multi-component control flow requiring visualization.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
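A minimal, self-contained sketch of the tracking approach the walkthrough describes, assuming a simplified executor; the real plugin records cluster names in `self._submitted_job_clusters` during submission, and the class and method names here are illustrative only:

```python
class SlurmExecutorSketch:
    """Illustrative stand-in for the plugin's executor, not its real API."""

    def __init__(self):
        # cluster names observed across all submitted jobs
        self._submitted_job_clusters: set[str] = set()

    def run_job(self, resources: dict) -> None:
        # record the cluster a job was submitted to, if any was requested
        cluster = resources.get("clusters")
        if cluster:
            self._submitted_job_clusters.add(str(cluster).strip())

    def build_scancel(self, jobids: str) -> str:
        command = f"scancel {jobids}"
        # include cluster targets only when clusters were recorded
        if self._submitted_job_clusters:
            command += " --clusters=" + ",".join(sorted(self._submitted_job_clusters))
        return command


executor = SlurmExecutorSketch()
executor.run_job({"clusters": "cluster_a"})
print(executor.build_scancel("101 102"))
# scancel 101 102 --clusters=cluster_a
```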
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)
814-834: Avoid shell injection via `--clusters` argument

`clusters_str` originates from user-controlled job resources and is appended to a `shell=True` command without quoting, creating an injection vector. This is inconsistent with the same parameter in `submit_string.py` (line 54), which properly uses `safe_quote()`. Other similar parameters like account and partition are also properly quoted with `shlex.quote()`. Prefer `shell=False` with a list of arguments, or at minimum use `shlex.quote()` on the cluster string. The `shlex` module is already imported in the file.

🔒 Proposed fix (avoid shell=True)
```diff
-        jobids = " ".join([job_info.external_jobid for job_info in active_jobs])
+        jobids = [job_info.external_jobid for job_info in active_jobs]
         try:
             # timeout set to 60, because a scheduler cycle usually is
             # about 30 sec, but can be longer in extreme cases.
             # Under 'normal' circumstances, 'scancel' is executed in
             # virtually no time.
-            scancel_command = f"scancel {jobids}"
+            scancel_command = ["scancel", *jobids]

             # Add cluster specification if any clusters were found during submission
             if self._submitted_job_clusters:
                 clusters_str = ",".join(sorted(self._submitted_job_clusters))
-                scancel_command += f" --clusters={clusters_str}"
+                scancel_command.append(f"--clusters={clusters_str}")

             subprocess.check_output(
                 scancel_command,
                 text=True,
-                shell=True,
+                shell=False,
                 timeout=60,
                 stderr=subprocess.PIPE,
             )
```
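For comparison, a minimal sketch of the other option the comment names: keeping a shell string but quoting the user-derived cluster list with `shlex.quote()`. The standalone helper and its arguments are hypothetical, extracted here only to make the snippet runnable:

```python
import shlex


def build_scancel_command(jobids: str, submitted_job_clusters: set[str]) -> str:
    # quote the user-derived cluster list before shell interpolation
    command = f"scancel {jobids}"
    if submitted_job_clusters:
        clusters_str = ",".join(sorted(submitted_job_clusters))
        command += f" --clusters={shlex.quote(clusters_str)}"
    return command


print(build_scancel_command("101 102", {"cluster_a", "cluster_b"}))
# scancel 101 102 --clusters=cluster_a,cluster_b
```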
🤖 Fix all issues with AI agents
In `@snakemake_executor_plugin_slurm/__init__.py`:
- Around line 600-607: The code stores cluster_val directly into
self._submitted_job_clusters, which breaks when job.resources["clusters"] is a
comma/whitespace-delimited string or when the resource is a list/tuple. Instead,
normalize cluster values like the slurm_account handling: obtain cluster_val
from job.resources; if it's a str, split on commas/whitespace and strip each
token; if it's a list/tuple, iterate its items, converting to strings and
stripping; then add each non-empty normalized cluster name individually into
self._submitted_job_clusters (referencing cluster_val and
self._submitted_job_clusters to locate the code). A sketch follows below.
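A runnable sketch of the normalization the prompt asks for; the helper name is hypothetical, and the splitting rules follow the prompt's description (commas/whitespace for strings, per-item stripping for lists/tuples):

```python
def normalize_clusters(cluster_val) -> set[str]:
    names: set[str] = set()
    if isinstance(cluster_val, (list, tuple)):
        # lists/tuples: convert each item to a string, strip it below
        items = [str(item) for item in cluster_val]
    else:
        # strings: split on commas and whitespace
        items = str(cluster_val).replace(",", " ").split()
    for item in items:
        name = item.strip()
        if name:
            names.add(name)
    return names


print(normalize_clusters("alpha,beta gamma"))   # {'alpha', 'beta', 'gamma'}
print(normalize_clusters(["alpha", " beta "]))  # {'alpha', 'beta'}
```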
🧹 Nitpick comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)
841-850: Log multicluster advice separately to keep WorkflowError concise

The appended multi-line warning makes the exception message very long. Consider logging the advisory via `self.logger.warning(...)` and keep the `WorkflowError` succinct. Based on learnings, prefer concise error messages in this file.

✂️ Proposed refactor
```diff
-        if self._submitted_job_clusters:
-            msg += (
-                "\nWARNING: Job cancellation failed while using "
-                "--clusters flag. Your multicluster SLURM setup may not "
-                "support this feature, or the SLURM database may not be "
-                "properly configured for multicluster operations. "
-                "Please verify your SLURM configuration with your "
-                "HPC administrator."
-            )
+        if self._submitted_job_clusters:
+            self.logger.warning(
+                "Job cancellation failed while using --clusters. "
+                "Your multicluster SLURM setup may not support this, or "
+                "slurmdbd may be misconfigured. Please verify with your "
+                "HPC administrator."
+            )
```
cademirch left a comment
If clusters can actually be a list of str, we should handle that: either enforce that it isn't (and update the docs) or check for it when adding to the cluster set.
@cademirch Oh, then the docs need a change, too. SLURM does indeed only submit to one cluster at a time. This contrasts with how it works with partitions. I remember that I thought it to be similar and did not remember fixing this, for the reason I was writing you earlier. However, the docs are broken in this regard. I think I will fix them, too. Not now, it is too late.
OK - I'll fix the docs in this PR.
…er(s) at once: it is possible, albeit not with every multicluster setup
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/further.md`:
- Around line 239-242: Fix the typo in the prose where "prinicple" is
misspelled; locate the sentence that reads "While it is possible to submit to
more than one cluster in prinicple, not all SLURM multicluster setups will
support this." (the paragraph discussing `clusters`/`cluster`/`slurm_cluster`)
and change "prinicple" to "principle".
In `@snakemake_executor_plugin_slurm/__init__.py`:
- Around line 821-828: The string concatenation uses an unnecessary f-string in
the scancel_command update; inside the block that checks
self._submitted_job_clusters (where scancel_command is built from jobids),
replace scancel_command += f" --clusters=all" with a plain string concatenation
scancel_command += " --clusters=all" to remove the redundant f-prefix while
preserving behavior.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@snakemake_executor_plugin_slurm/__init__.py`:
- Around line 821-828: The code appends "--clusters=all" whenever
self._submitted_job_clusters is non-empty, but it should only do so when more
than one cluster was used; update the condition on scancel_command to check that
self._submitted_job_clusters contains multiple entries (e.g.,
len(self._submitted_job_clusters) > 1 or an equivalent check) before appending
the flag so that scancel_command only gains "--clusters=all" for multi-cluster
submissions.
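A small sketch of the condition the prompt describes, with a hypothetical helper wrapping the check so it can run standalone:

```python
def maybe_add_clusters_flag(scancel_command: str, submitted_job_clusters: set[str]) -> str:
    # append --clusters=all only when jobs went to more than one cluster
    if len(submitted_job_clusters) > 1:
        scancel_command += " --clusters=all"
    return scancel_command


print(maybe_add_clusters_flag("scancel 101", {"a"}))       # scancel 101
print(maybe_add_clusters_flag("scancel 101", {"a", "b"}))  # scancel 101 --clusters=all
```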
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
cademirch left a comment
looks great. thank you. sorry i couldn't get to this sooner.
@cademirch please do not apologize: we all have (other) work to do and (presumably) something called "a life" going on. Thank you for your work!
🤖 I have created a release *beep* *boop*

## [2.2.0](v2.1.0...v2.2.0) (2026-02-09)

### Features

* job name prefix ([#408](#408)) ([5fa0d33](5fa0d33))

### Bug Fixes

* cancel on multicluster ([#401](#401)) ([cb6124b](cb6124b))
* decreasing job query verbosity ([#405](#405)) ([6649881](6649881))
* extracting job id from convoluted output, if necessary ([#375](#375)) ([950c909](950c909))
* quoting piped substring with shlex ([#402](#402)) ([7aa7fc3](7aa7fc3))

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added job name prefix capability.
* **Bug Fixes**
  * Fixed cancel operation on multicluster environments.
  * Improved job query verbosity output.
  * Fixed job ID extraction from complex output formats.
  * Fixed substring quoting in piped commands.
This PR suggests a fix for multicluster setups. Before, `scancel --clusters=all` was used. A user (#397) with an outdated SLURM described that this flag is not supported. Regardless, a cluster-specific cancel command should only be issued when actually using a multicluster setup. Also, a proper warning should be included if the command fails (multicluster setups can be tricky).

Summary by CodeRabbit
Bug Fixes
Documentation