
Conversation

@cmeesters cmeesters (Member) commented Jan 21, 2026

This PR suggests a fix for multicluster setups: previously, `scancel --clusters=all` was used unconditionally. A user (#397) with an outdated SLURM version reported that this flag is not supported. Regardless, a cluster-specific cancel command should only be issued when a multicluster setup is actually in use. Also, a proper warning should be emitted if the command fails (multicluster setups can be tricky).

Summary by CodeRabbit

  • Bug Fixes

    • Job cancellation now targets recorded SLURM clusters when available instead of cancelling all clusters by default.
    • Cancellation failure messages now include contextual guidance when multicluster SLURM setups may be involved.
  • Documentation

  • SLURM resources docs updated to list cluster naming conventions (cluster / clusters / slurm_cluster) and clarify multi-cluster support limitations (see the example below).
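For illustration, a minimal sketch of how a rule could set one of these keys in a Snakefile; the rule name, files, and tool here are hypothetical, not from this PR:

```python
# Hypothetical rule; only the `resources` entry matters here.
rule align:
    input:
        "sample.fastq",
    output:
        "sample.bam",
    resources:
        slurm_cluster="cluster_a",  # `cluster` or `clusters` also accepted per the docs
    shell:
        "my_aligner {input} > {output}"
```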


@coderabbitai coderabbitai bot (Contributor) commented Jan 21, 2026

Walkthrough

This PR adds internal tracking of SLURM clusters used for submitted jobs and uses that tracking to conditionally include cluster targets in scancel invocations instead of always using --clusters=all. Documentation updated to accept cluster, clusters, or slurm_cluster resource keys and note multi-cluster considerations.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Cluster tracking & cancellation logic<br>`snakemake_executor_plugin_slurm/__init__.py` | Added a private `_submitted_job_clusters` set, initialized in `__post_init__`. `run_job` records cluster identifiers from `job.resources` (`cluster`, `clusters`, or `slurm_cluster`) after submission. `cancel_jobs` only appends `--clusters=...` when recorded clusters exist, and extends error messages with multicluster guidance (see the sketch after this table). |
| Docs update<br>`docs/further.md` | Updated the SLURM resources table and explanatory text to list the accepted cluster resource keys (`clusters`, `cluster`, `slurm_cluster`) and note that multi-cluster submission/cancellation may not be supported in all setups. |
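A minimal, self-contained sketch of the pattern described above; the `Executor` class, the `job.resources.get(...)` access, and the `external_jobid` attribute are simplified assumptions for illustration, not the plugin's actual interfaces:

```python
import shlex
import subprocess
from dataclasses import dataclass


@dataclass
class Executor:
    def __post_init__(self):
        # Clusters seen across all jobs submitted in this session.
        self._submitted_job_clusters = set()

    def run_job(self, job):
        # ... actual sbatch submission happens here ...
        # Afterwards, record any cluster the job was directed to.
        for key in ("cluster", "clusters", "slurm_cluster"):
            value = job.resources.get(key)
            if value:
                self._submitted_job_clusters.add(str(value))

    def cancel_jobs(self, active_jobs):
        jobids = " ".join(job.external_jobid for job in active_jobs)
        scancel_command = f"scancel {jobids}"
        # Target recorded clusters only; no unconditional --clusters=all.
        if self._submitted_job_clusters:
            clusters_str = ",".join(sorted(self._submitted_job_clusters))
            scancel_command += f" --clusters={shlex.quote(clusters_str)}"
        subprocess.check_output(
            scancel_command,
            text=True,
            shell=True,
            timeout=60,
            stderr=subprocess.PIPE,
        )
```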

Sequence Diagram(s)

(omitted — changes are internal tracking and conditional CLI argument construction and do not introduce a multi-component control flow requiring visualization)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested reviewers

  • johanneskoester

Poem

🐰 I hopped through code with tiny paws,
I kept a list of SLURM-cluster laws,
When jobs were sent to many lands,
I marked each home with careful hands,
Now cancels aim true where each job paws.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: cancel on multicluster' directly addresses the main change: modifying the cancel behavior for multicluster SLURM setups by removing the unconditional `--clusters=all` flag and conditionally adding it only when clusters were actually submitted. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)

814-834: Avoid shell injection via --clusters argument

clusters_str originates from user-controlled job resources and is appended to a shell=True command without quoting, creating an injection vector. This is inconsistent with the same parameter in submit_string.py (line 54), which properly uses safe_quote(). Other similar parameters like account and partition are also properly quoted with shlex.quote().

Prefer shell=False with a list of arguments, or at minimum use shlex.quote() on the cluster string. The shlex module is already imported in the file.

🔒 Proposed fix (avoid shell=True)

```diff
-            jobids = " ".join([job_info.external_jobid for job_info in active_jobs])
+            jobids = [job_info.external_jobid for job_info in active_jobs]

             try:
                 # timeout set to 60, because a scheduler cycle usually is
                 # about 30 sec, but can be longer in extreme cases.
                 # Under 'normal' circumstances, 'scancel' is executed in
                 # virtually no time.
-                scancel_command = f"scancel {jobids}"
+                scancel_command = ["scancel", *jobids]

                 # Add cluster specification if any clusters were found during submission
                 if self._submitted_job_clusters:
                     clusters_str = ",".join(sorted(self._submitted_job_clusters))
-                    scancel_command += f" --clusters={clusters_str}"
+                    scancel_command.append(f"--clusters={clusters_str}")

                 subprocess.check_output(
                     scancel_command,
                     text=True,
-                    shell=True,
+                    shell=False,
                     timeout=60,
                     stderr=subprocess.PIPE,
                 )
```
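If `shell=True` is kept instead, the minimal alternative mentioned above would quote the user-derived string; a sketch with illustrative variable values:

```python
import shlex

submitted_job_clusters = {"cluster_a", "cluster_b"}  # illustrative values
scancel_command = "scancel 123 456"

if submitted_job_clusters:
    clusters_str = ",".join(sorted(submitted_job_clusters))
    # shlex.quote() neutralizes shell metacharacters in the
    # user-derived value while shell=True is retained.
    scancel_command += f" --clusters={shlex.quote(clusters_str)}"
```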
🤖 Fix all issues with AI agents
In `@snakemake_executor_plugin_slurm/__init__.py`:
- Around lines 600-607: The code stores `cluster_val` directly into `self._submitted_job_clusters`, which breaks when `job.resources["clusters"]` is a comma- or whitespace-delimited string, or when the resource is a list/tuple. Instead, normalize cluster values like the `slurm_account` handling: obtain `cluster_val` from `job.resources`; if it is a `str`, split on commas/whitespace and strip each token; if it is a list/tuple, iterate its items, converting each to a string and stripping it; then add each non-empty normalized cluster name individually to `self._submitted_job_clusters` (a sketch of this normalization follows below).
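A sketch of that normalization as a standalone helper; the function name `normalize_clusters` is illustrative, not from the codebase:

```python
import re


def normalize_clusters(cluster_val):
    # Accept "a,b", "a b", or ["a", "b"] and return clean names.
    if isinstance(cluster_val, (list, tuple)):
        tokens = [str(item) for item in cluster_val]
    else:
        tokens = re.split(r"[,\s]+", str(cluster_val))
    return {token.strip() for token in tokens if token.strip()}


# All three spellings yield the same set:
assert (
    normalize_clusters("a,b")
    == normalize_clusters("a b")
    == normalize_clusters(["a", "b"])
    == {"a", "b"}
)
```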
🧹 Nitpick comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)

841-850: Log multicluster advice separately to keep WorkflowError concise

The appended multi-line warning makes the exception message very long. Consider logging the advisory via self.logger.warning(...) and keep the WorkflowError succinct. Based on learnings, prefer concise error messages in this file.

✂️ Proposed refactor

```diff
-                if self._submitted_job_clusters:
-                    msg += (
-                        "\nWARNING: Job cancellation failed while using "
-                        "--clusters flag. Your multicluster SLURM setup may not "
-                        "support this feature, or the SLURM database may not be "
-                        "properly configured for multicluster operations. "
-                        "Please verify your SLURM configuration with your "
-                        "HPC administrator."
-                    )
+                if self._submitted_job_clusters:
+                    self.logger.warning(
+                        "Job cancellation failed while using --clusters. "
+                        "Your multicluster SLURM setup may not support this, or "
+                        "slurmdbd may be misconfigured. Please verify with your "
+                        "HPC administrator."
+                    )
```

@cmeesters cmeesters requested a review from cademirch January 21, 2026 16:49
@cademirch cademirch (Contributor) left a comment

If `clusters` can actually be a list of str, we should handle that, either by enforcing that it isn't (and updating the docs) or by checking when adding to the cluster set.

@cmeesters cmeesters (Member, Author)

@cademirch Oh, then the docs need a change, too. SLURM does indeed only submit to one cluster at a time; this contrasts with how it works for partitions. I remember that I thought it worked similarly and did not remember to fix this, for the reason I was writing to you about earlier.

However, the docs are broken in this regard. I think I will fix them, too. Not now; it is too late.

@cademirch cademirch (Contributor)

OK - I'll fix the docs in this PR.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/further.md`:
- Around lines 239-242: Fix the typo "prinicple" → "principle" in the sentence "While it is possible to submit to more than one cluster in prinicple, not all SLURM multicluster setups will support this." (the paragraph discussing `clusters`/`cluster`/`slurm_cluster`).

In `@snakemake_executor_plugin_slurm/__init__.py`:
- Around lines 821-828: Inside the block that checks `self._submitted_job_clusters` (where `scancel_command` is built from `jobids`), the string concatenation uses an unnecessary f-string; replace `scancel_command += f" --clusters=all"` with the plain concatenation `scancel_command += " --clusters=all"` to remove the redundant `f` prefix while preserving behavior.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@snakemake_executor_plugin_slurm/__init__.py`:
- Around lines 821-828: The code appends `--clusters=all` whenever `self._submitted_job_clusters` is non-empty, but it should only do so when more than one cluster was used. Update the condition on `scancel_command` to check that `self._submitted_job_clusters` contains multiple entries (e.g., `len(self._submitted_job_clusters) > 1`) before appending the flag, so that `scancel_command` only gains `--clusters=all` for multi-cluster submissions (see the sketch below).
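A minimal sketch of the suggested condition, with illustrative values:

```python
submitted_job_clusters = {"cluster_a", "cluster_b"}  # illustrative
scancel_command = "scancel 123 456"

# Add the flag only for genuinely multi-cluster submissions:
if len(submitted_job_clusters) > 1:
    scancel_command += " --clusters=all"
```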

cmeesters and others added 2 commits January 22, 2026 12:01
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@cmeesters cmeesters requested a review from cademirch January 22, 2026 11:28
@cademirch cademirch (Contributor) left a comment

looks great. thank you. sorry i couldn't get to this sooner.

@cmeesters cmeesters merged commit cb6124b into main Jan 26, 2026
6 checks passed
@cmeesters cmeesters deleted the fix/cancel_on_multicluster branch January 26, 2026 08:59
@cmeesters cmeesters (Member, Author)

@cademirch please do not apologize: we all have (other) work to do and (presumably) something called "life" going on. Thank you for your work!

cmeesters pushed a commit that referenced this pull request Feb 9, 2026
🤖 I have created a release *beep* *boop*
---


## [2.2.0](v2.1.0...v2.2.0) (2026-02-09)


### Features

* job name prefix ([#408](#408)) ([5fa0d33](5fa0d33))

### Bug Fixes

* cancel on multicluster ([#401](#401)) ([cb6124b](cb6124b))
* decreasing job query verbosity ([#405](#405)) ([6649881](6649881))
* extracting job id from convoluted output, if necessary ([#375](#375)) ([950c909](950c909))
* quoting piped substring with shlex ([#402](#402)) ([7aa7fc3](7aa7fc3))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).


## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added job name prefix capability.

* **Bug Fixes**
  * Fixed cancel operation on multicluster environments.
  * Improved job query verbosity output.
  * Fixed job ID extraction from complex output formats.
  * Fixed substring quoting in piped commands.
