fix: handling of numeric-only partition names #442

Closed
cmeesters wants to merge 7 commits into main from fix/numeric_partition_handling

Conversation

@cmeesters
Member

@cmeesters cmeesters commented Mar 18, 2026

This might(!) be a fix for snakemake/snakemake#3992

When a partition name is numeric only, e.g. 20, Snakemake's internal resource handling interprets it as an integer, not a string. The exception handling in the resource module is hard to extend to SLURM-specific resources. Hence this PR.
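
To illustrate the failure mode, here is a minimal sketch of a generic resource parser that coerces digit-only strings to integers. `parse_resource_value` is a hypothetical stand-in, not Snakemake's actual code; it only assumes the behavior described above.

```python
def parse_resource_value(raw: str):
    """Hypothetical parser: digit-only strings become ints, others stay str."""
    try:
        return int(raw)
    except ValueError:
        return raw


named = parse_resource_value("gpu")   # stays a string
numeric = parse_resource_value("20")  # silently becomes an int

# The numeric partition name has lost its string identity.
print(type(named).__name__, type(numeric).__name__)
```

Once the partition is an integer, any downstream step that treats integers as quantities (for example, aggregation across grouped jobs) can change its value.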

Summary by CodeRabbit

  • Bug Fixes
    • Improved partition selection logic for grouped jobs with enhanced recovery mechanisms.
    • Strengthened cluster-consistency validation when selecting partitions to prevent mismatches.
    • Added robust fallback behavior for edge cases where partition determination fails.
    • Introduced informative warnings for ambiguous partition scenarios.

@coderabbitai
Contributor

coderabbitai bot commented Mar 18, 2026

Walkthrough

The changes introduce numeric partition recovery logic for group jobs in SLURM task execution, where numeric partition values are divided by constituent rule counts to recover original partition names. New fallback mechanisms are added for cluster consistency validation and automatic partition selection when recovery fails or constraints are unmet.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Partition Selection Logic: snakemake_executor_plugin_slurm/__init__.py | Added numeric partition name recovery for group jobs by dividing aggregated values by rule count, with warning on ambiguous recovery. Enhanced cluster constraint validation and multi-tier fallback to get_best_partition and get_default_partition, returning quoted partition arguments or empty strings. |
| Partition Selection Tests: tests/test_partition_selection.py | Introduced TestNumericGroupPartitionHandling class with three tests validating numeric partition recovery, divisibility failures, and correct behavior for non-group jobs. Includes _MockResources helper for simulating resource objects and mocked get_default_partition interactions. |
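
The final step in the summary, returning a quoted partition argument or an empty string, could look roughly like the sketch below. `format_partition_arg` and the exact ` -p ` layout are assumptions for illustration; the plugin's real formatting may differ.

```python
import shlex


def format_partition_arg(partition) -> str:
    """Hypothetical helper: build a quoted '-p' argument, or '' if unset."""
    if partition is None or partition == "":
        return ""
    # str() keeps numeric-only names such as 20 usable; shlex.quote
    # protects names containing shell-special characters.
    return f" -p {shlex.quote(str(partition))}"


print(repr(format_partition_arg(20)))
print(repr(format_partition_arg(None)))
```

Converting to `str` before quoting means an integer partition is passed to SLURM as the literal name rather than rejected.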

Sequence Diagram

sequenceDiagram
    participant Client
    participant Executor as Executor.get_partition_arg()
    participant PartitionMgr as Partition Manager
    participant FallbackFns as Fallback Functions

    Client->>Executor: Request partition argument
    activate Executor
    
    Executor->>Executor: Check if group job<br/>with numeric partition
    
    alt Numeric Partition Recovery
        Executor->>Executor: Divide aggregated value<br/>by rule count
        alt Recovery Possible
            Executor->>Executor: Recover partition name
            Note over Executor: Set recovered partition
        else Recovery Fails
            Executor->>Executor: Log fallback warning
        end
    end
    
    Executor->>PartitionMgr: Validate cluster constraint
    alt Cluster Constraint Exists
        PartitionMgr->>PartitionMgr: Match partition object
        alt Cluster Mismatch
            Executor->>FallbackFns: Invoke get_best_partition()
            FallbackFns->>Executor: Return auto-selected partition
        end
    end
    
    alt No Partition Determined
        Executor->>FallbackFns: Invoke get_default_partition()
        FallbackFns->>Executor: Return fallback partition
    end
    
    Executor->>Executor: Format as -p argument<br/>or return empty string
    Executor->>Client: Return partition argument
    deactivate Executor

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A partition scheme both wise and neat,
Where numeric groups now find retreat,
Divided down by constituent count,
Recovery flows like a flowing mount,
And clusters align in harmony sweet!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: handling of numeric-only partition names' directly and clearly summarizes the main change - addressing how numeric partition names are handled in the Slurm executor plugin. |

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)

1143-1166: Consider handling the num_rules == 0 edge case explicitly.

If num_rules is 0 (theoretically invalid for a group job), the short-circuit evaluation prevents a ZeroDivisionError, but the warning message at line 1164 would read "not evenly divisible by 0 rules", which is mathematically confusing.

While this is likely an invalid state that shouldn't occur in practice, an explicit guard could provide a clearer error:

🔧 Suggested improvement
             num_rules = len(list(job.rules))
             aggregated = job.resources.slurm_partition
-            if num_rules > 0 and aggregated % num_rules == 0:
+            if num_rules == 0:
+                self.logger.warning(
+                    f"Group job '{job.name}' has no constituent rules. "
+                    "Cannot recover numeric partition. Falling back to default."
+                )
+            elif aggregated % num_rules == 0:
                 recovered = aggregated // num_rules
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_slurm/__init__.py` around lines 1143 - 1166, The
code path handling numeric group 'slurm_partition' should explicitly guard the
num_rules == 0 case to avoid confusing wording and potential divide-by-zero
logic: in the block that checks if job.is_group() and
isinstance(job.resources.slurm_partition, int) (referencing job.rules,
num_rules, aggregated, and partition), add an early branch when num_rules == 0
that logs a clear warning via self.logger.warning stating the group contains
zero constituent rules and that the code will fall back to default partition
selection, then set partition to the fallback behavior (same as the existing
else) and return/continue; otherwise proceed with the existing divisible check
and recovered calculation.

@andynu

andynu commented Mar 18, 2026

I tested this PR on a real SLURM cluster with partition 20. The divide-to-recover approach works when grouped rules are independent, but fails when one rule depends on the other's output.

When rules have a dependency chain (e.g. rule b reads rule a's output), snakemake merges their resources using max rather than sum. So the merged partition is max(20, 20) = 20 — already correct. But the PR divides anyway: 20 / 2 rules = 10. The job fails because partition 10 doesn't exist.

The relevant merge logic is in https://github.com/snakemake/snakemake/blob/f08fae8b/src/snakemake/resources.py.

The core problem is that the division heuristic can't distinguish "was summed and needs recovery" from "was maxed and is already correct." It only sees the final number and the rule count.
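
A small sketch makes the ambiguity concrete. `recover_partition` is a simplified, hypothetical version of the PR's division heuristic, and the sum/max merging is modeled with plain Python built-ins rather than Snakemake's actual code.

```python
def recover_partition(aggregated: int, num_rules: int):
    """Simplified division heuristic: undo a presumed sum over num_rules."""
    if num_rules > 0 and aggregated % num_rules == 0:
        return aggregated // num_rules
    return None


partition = 20
num_rules = 2

summed = sum([partition] * num_rules)  # independent rules: 40
maxed = max([partition] * num_rules)   # dependency chain: 20

print(recover_partition(summed, num_rules))  # recovers 20 (correct)
print(recover_partition(maxed, num_rules))   # yields 10 (wrong partition)
```

Both inputs are evenly divisible by the rule count, so the heuristic has no signal to tell the two cases apart.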

It might be more robust to prevent the aggregation from happening in the first place — introducing a configurable set of identifier resource keys in resources.py that are passed through untouched, the same way string resources already are. This would be executor-agnostic and wouldn't require any SLURM-specific logic in snakemake core. That way there's nothing to unwind.
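
Roughly what I have in mind, as a hedged sketch: `IDENTIFIER_RESOURCES` and `merge_resources` are hypothetical names, not existing Snakemake APIs, and raising on conflicting values is an assumption of this sketch.

```python
# Hypothetical, configurable set of keys that name things rather than count them.
IDENTIFIER_RESOURCES = {"slurm_partition"}


def merge_resources(rule_resources, combine=sum):
    """Merge per-rule resource dicts; identifiers and strings pass through."""
    merged = {}
    keys = {key for res in rule_resources for key in res}
    for key in keys:
        values = [res[key] for res in rule_resources if key in res]
        if key in IDENTIFIER_RESOURCES or isinstance(values[0], str):
            # Never aggregate identifiers; all rules must agree on the value.
            if len(set(values)) > 1:
                raise ValueError(f"Conflicting values for '{key}': {values}")
            merged[key] = values[0]
        else:
            merged[key] = combine(values)  # sum or max, depending on layout
    return merged


rules = [
    {"slurm_partition": 20, "mem_mb": 1000},
    {"slurm_partition": 20, "mem_mb": 2000},
]
print(merge_resources(rules, combine=sum))
print(merge_resources(rules, combine=max))
```

With either `sum` or `max`, the partition stays 20, so the executor plugin has nothing to recover.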

@cmeesters
Member Author

Agreed. slurm_partition in resources.py should not be an additive resource. Then again, neither should any other *partition resource. We need to discuss this. I will provide feedback in the issue thread.

@cmeesters cmeesters closed this Mar 19, 2026