fix: handling of numeric-only partition names#442
Conversation
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThe changes introduce numeric partition recovery logic for group jobs in SLURM task execution, where numeric partition values are divided by constituent rule counts to recover original partition names. New fallback mechanisms are added for cluster consistency validation and automatic partition selection when recovery fails or constraints are unmet. Changes
Sequence DiagramsequenceDiagram
participant Client
participant Executor as Executor.get_partition_arg()
participant PartitionMgr as Partition Manager
participant FallbackFns as Fallback Functions
Client->>Executor: Request partition argument
activate Executor
Executor->>Executor: Check if group job<br/>with numeric partition
alt Numeric Partition Recovery
Executor->>Executor: Divide aggregated value<br/>by rule count
alt Recovery Possible
Executor->>Executor: Recover partition name
Note over Executor: Set recovered partition
else Recovery Fails
Executor->>Executor: Log fallback warning
end
end
Executor->>PartitionMgr: Validate cluster constraint
alt Cluster Constraint Exists
PartitionMgr->>PartitionMgr: Match partition object
alt Cluster Mismatch
Executor->>FallbackFns: Invoke get_best_partition()
FallbackFns->>Executor: Return auto-selected partition
end
end
alt No Partition Determined
Executor->>FallbackFns: Invoke get_default_partition()
FallbackFns->>Executor: Return fallback partition
end
Executor->>Executor: Format as -p argument<br/>or return empty string
Executor->>Client: Return partition argument
deactivate Executor
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)
1143-1166: Consider handling thenum_rules == 0edge case explicitly.If
num_rulesis 0 (theoretically invalid for a group job), the short-circuit evaluation prevents aZeroDivisionError, but the warning message at line 1164 would read "not evenly divisible by 0 rules", which is mathematically confusing.While this is likely an invalid state that shouldn't occur in practice, an explicit guard could provide a clearer error:
🔧 Suggested improvement
num_rules = len(list(job.rules)) aggregated = job.resources.slurm_partition - if num_rules > 0 and aggregated % num_rules == 0: + if num_rules == 0: + self.logger.warning( + f"Group job '{job.name}' has no constituent rules. " + "Cannot recover numeric partition. Falling back to default." + ) + elif aggregated % num_rules == 0: recovered = aggregated // num_rules🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakemake_executor_plugin_slurm/__init__.py` around lines 1143 - 1166, The code path handling numeric group 'slurm_partition' should explicitly guard the num_rules == 0 case to avoid confusing wording and potential divide-by-zero logic: in the block that checks if job.is_group() and isinstance(job.resources.slurm_partition, int) (referencing job.rules, num_rules, aggregated, and partition), add an early branch when num_rules == 0 that logs a clear warning via self.logger.warning stating the group contains zero constituent rules and that the code will fall back to default partition selection, then set partition to the fallback behavior (same as the existing else) and return/continue; otherwise proceed with the existing divisible check and recovered calculation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@snakemake_executor_plugin_slurm/__init__.py`:
- Around line 1143-1166: The code path handling numeric group 'slurm_partition'
should explicitly guard the num_rules == 0 case to avoid confusing wording and
potential divide-by-zero logic: in the block that checks if job.is_group() and
isinstance(job.resources.slurm_partition, int) (referencing job.rules,
num_rules, aggregated, and partition), add an early branch when num_rules == 0
that logs a clear warning via self.logger.warning stating the group contains
zero constituent rules and that the code will fall back to default partition
selection, then set partition to the fallback behavior (same as the existing
else) and return/continue; otherwise proceed with the existing divisible check
and recovered calculation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b19f6c0e-280d-49d6-bbd2-3144f8ee9918
📒 Files selected for processing (2)
snakemake_executor_plugin_slurm/__init__.pytests/test_partition_selection.py
|
I tested this PR on a real SLURM cluster with partition 20. The divide-to-recover approach works when grouped rules are independent, but fails when one rule depends on the other's output. When rules have a dependency chain (e.g. rule b reads rule a's output), snakemake merges their resources using max rather than sum. So the merged partition is max(20, 20) = 20 — already correct. But the PR divides anyway: 20 / 2 rules = 10. The job fails because partition 10 doesn't exist. The relevant merge logic is in https://github.com/snakemake/snakemake/blob/f08fae8b/src/snakemake/resources.py:
The core problem is that the division heuristic can't distinguish "was summed and needs recovery" from "was maxed and is already correct." It only sees the final number and the rule count. It might be more robust to prevent the aggregation from happening in the first place — introducing a configurable set of identifier resource keys in resources.py that are passed through untouched, the same way string resources already are. This would be executor-agnostic and wouldn't require any SLURM-specific logic in snakemake core. That way there's nothing to unwind. |
|
agreed. |
This might(!) be a fix for snakemake/snakemake#3992
When a partition is numeric only, i.e.
20, Snakemake's internal resource handling interprets it as an integer, not a string. The exception handling in the resource module is hard to extend to handle SLURM specific resources. Hence this PR.Summary by CodeRabbit