Commit 45806a7

daniel-thom and claude authored

Add Slurm scheduling recommendations (#226)

* Add slurm scheduling recommendations
* Add sbatch --test-only probes and MCP tool for allocation planning

Integrate Slurm's `sbatch --test-only` into plan-allocations to provide data-driven recommendations (single-large vs many-small) based on actual scheduler estimates rather than heuristics alone. Refactor core logic into analyze_plan_allocations() library function for reuse by both CLI and MCP.

- Add sbatch --test-only parsing and execution in slurm.rs
- Extract analyze_plan_allocations() from handle_plan_allocations()
- Add plan_allocations MCP tool calling the library directly
- Add --skip-test-only and --offline CLI flags
- Add allocation-strategies documentation page

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent d74e176 commit 45806a7

File tree

7 files changed: +1744 -7 lines
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
# Slurm Allocation Strategies

When submitting a workflow with many jobs to Slurm, you must decide how to split work across
allocations. The `torc slurm plan-allocations` command (or the `plan_allocations` MCP tool for AI
assistants) analyzes your workflow and cluster state to recommend a strategy.

## The Core Tradeoff: Single Large vs Many Small

Given N nodes' worth of work, there are two extremes:

| Strategy                 | Description                           | Pros | Cons |
| ------------------------ | ------------------------------------- | ---- | ---- |
| **1 x N** (single large) | One allocation requesting all N nodes | Slurm prioritizes larger jobs; all work completes in one walltime window; no fair-share degradation | Must wait for N nodes to be available simultaneously |
| **N x 1** (many small)   | N separate single-node allocations    | First jobs start as soon as any node is free | Fair-share degrades as allocations start; last jobs may wait much longer than the first |

### When Single Large Wins

- **Slurm backfill priority**: Slurm's scheduler reserves nodes for large pending jobs. A 167-node
  request gets a reserved slot in the queue, while 167 individual jobs compete with everyone.
- **Fair-share preservation**: A single allocation consumes your fair-share budget once. Many small
  allocations drain it progressively, causing later jobs to lose priority.
- **Deterministic completion**: All jobs start processing simultaneously and finish within one
  walltime window.
- **Busy clusters**: Counter-intuitively, a fully loaded cluster often favors large allocations
  because Slurm will schedule the large job as a block when enough nodes free up, rather than
  letting small jobs trickle through.

### When Many Small Wins

- **Extremely long queues**: If the cluster is oversubscribed for weeks, small jobs may fit into
  backfill gaps that a large allocation cannot.
- **Partial results needed**: If you need some results quickly rather than waiting for all of them.
- **Near partition limits**: If your ideal node count exceeds `max_nodes_per_user`, you cannot
  request a single allocation that large.

## Using `sbatch --test-only`

The `plan-allocations` command runs `sbatch --test-only` to ask Slurm's scheduler when each strategy
would start, without actually submitting jobs. For a plan with K nodes per allocation and N total
allocations:

```bash
# Single large: when would all K*N nodes start together?
sbatch --test-only --nodes=<K*N> --time=04:00:00 --account=myproject --wrap="hostname"

# Many small: when would one K-node allocation start?
sbatch --test-only --nodes=<K> --time=04:00:00 --account=myproject --wrap="hostname"
```

When no partition is explicitly configured, the `--partition` flag is omitted so Slurm uses its
default partition.

The single-large estimated start + walltime gives the completion time directly. The many-small
estimate is **optimistic** — it only predicts when the _first_ allocation would start. Later
allocations will be delayed by fair-share degradation.

### Fair-Share Degradation Estimate

The tool estimates the last small allocation's completion as:

```
last_completion ≈ first_wait × min(N, 10) + walltime
```

This is a rough approximation. The actual degradation depends on your account's fair-share balance,
other users' activity, and the scheduler's configuration.
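As a worked example of the formula, in plain Python with illustrative numbers (times in minutes; the helper name is hypothetical):

```python
def last_completion_estimate(first_wait_min: float, n_allocations: int,
                             walltime_min: float) -> float:
    """last_completion ~= first_wait x min(N, 10) + walltime, in minutes."""
    return first_wait_min * min(n_allocations, 10) + walltime_min

# Illustrative: a 15-minute first wait, 167 single-node allocations,
# 4-hour walltime: 15 x min(167, 10) + 240 = 390 minutes (6h 30min).
est = last_completion_estimate(first_wait_min=15, n_allocations=167,
                               walltime_min=240)
```

The `min(N, 10)` cap reflects the assumption that fair-share penalties stop compounding after roughly ten successive allocations.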

## Interpreting Results

Example output:

```
Recommendations
===============
"work_resources": 1 allocation(s) x 167 node(s) [single]
  sbatch --test-only: large (167 nodes) completes in ~4h 30min,
  faster than 167 small allocations (~6h 30min).
  Slurm prioritizes larger allocations

Scheduler Estimate (sbatch --test-only):
  Single large (167 nodes): start in ~30min, complete in ~4h 30min
  Many small (1 node): start in ~5min, complete in ~4h 5min
  Note: estimate is for first job only; later jobs delayed by fair-share
```

Key things to check:

- **Large completion vs small completion**: The tool accounts for fair-share degradation in its
  recommendation, but review the raw estimates yourself.
- **Wait time for large**: If the large allocation won't start for hours while small jobs start
  immediately, small may still be better for partial results.
- **Dependency depth**: A DAG with deep dependency chains cannot exploit N-node parallelism fully.
  Check `max_parallelism` in the workflow analysis — if it's much less than `ideal_nodes`, you may
  need fewer nodes than calculated.
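The dependency-depth adjustment amounts to capping the node request at what the DAG can actually use, sketched here with a hypothetical helper:

```python
def effective_nodes(ideal_nodes: int, max_parallelism: int) -> int:
    """Cap the node request at the DAG's maximum parallelism.

    Requesting more nodes than the DAG can ever use concurrently
    just leaves nodes idle inside the allocation.
    """
    return min(ideal_nodes, max_parallelism)

# A chain-heavy DAG: 167 jobs, but at most 12 can run at once,
# so requesting more than 12 nodes wastes the allocation.
assert effective_nodes(167, 12) == 12
```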

## Skipping the Probes

Use `--skip-test-only` to rely on heuristics only (faster, no `sbatch` calls):

```bash
torc slurm plan-allocations --account myproject --skip-test-only workflow.yaml
```

Use `--offline` to skip all cluster queries (sinfo, squeue, sbatch):

```bash
torc slurm plan-allocations --account myproject --offline workflow.yaml
```

## Applying the Recommendation

After reviewing the recommendation:

```bash
# Single large allocation (1 x N nodes)
torc slurm generate --account myproject --single-allocation workflow.yaml

# Many small allocations (N x 1 node, the default)
torc slurm generate --account myproject workflow.yaml
```

## Using with AI Assistants

The `plan_allocations` MCP tool provides the same analysis through AI assistants (Claude Code,
GitHub Copilot). The assistant can analyze your workflow spec and present the recommendation
conversationally:

```
> I have a workflow with 167 parallel jobs. Should I use one big Slurm
> allocation or many small ones? My account is "myproject".

⏺ torc - plan_allocations (MCP)(spec_json: {...}, account: "myproject")
  ⎿ { "recommendations": [...], "cluster_state": [...], ... }

⏺ Based on the current cluster state and sbatch --test-only probes:

  Recommendation: Single large allocation (1 x 167 nodes)

  | Strategy     | Estimated Start | Estimated Completion |
  |--------------|-----------------|----------------------|
  | Single large | ~30 min         | ~4h 30min            |
  | Many small   | ~5 min (first)  | ~6h 30min (last)     |

  The single large allocation completes ~2 hours faster because Slurm
  prioritizes it in the backfill queue, and you avoid fair-share
  degradation from 167 separate allocations.

  Shall I generate the workflow with a single allocation?
```

The tool accepts a workflow spec as a JSON object, the Slurm account, and optional partition and HPC
profile overrides. Use `skip_test_only: true` to skip the sbatch probes for faster results based on
heuristics only.

See [Configuring AI Assistants](../tools/ai-assistants.md) for setup instructions.

docs/src/specialized/tools/ai-assistant.md

Lines changed: 44 additions & 0 deletions
@@ -25,6 +25,7 @@ With Torc's AI integration, you can manage workflows using natural language:
| Fix problems | _"Restart the failed jobs with doubled memory"_ |
| Investigate | _"Check if any jobs exceeded their memory limits"_ |
| Optimize resources | _"Analyze usage and suggest better resource groupings"_ |
| Plan allocations | _"Should I use one big allocation or many small ones?"_ |

---

@@ -262,6 +263,49 @@ In this example, 6 data-processing jobs are split across 2 resource groups (`sta

---

## Example: Planning Slurm allocation strategy

When submitting a workflow to Slurm, the AI can analyze the cluster state and recommend whether to
use a single large allocation or many small ones.

```
> I'm about to submit this workflow to Slurm. Should I use one big
> allocation or many small ones? My account is "climate_sim".

⏺ torc - plan_allocations (MCP)(spec_json: {...}, account: "climate_sim")
  ⎿ { "recommendations": [...], "cluster_state": [...], ... }

⏺ Allocation Strategy Analysis

  Workflow: 167 parallel work jobs, 1 preprocess, 1 postprocess
  Cluster: short partition — 42 idle, 180 allocated, 12 pending jobs

  Recommendation: Single large allocation (1 x 167 nodes)

  | Strategy     | Est. Start | Est. Completion |
  |--------------|------------|-----------------|
  | Single large | ~30 min    | ~4h 30min       |
  | Many small   | ~5 min     | ~6h 30min       |

  Reason: Slurm's backfill scheduler gives the large allocation a
  reserved slot. Many small allocations would start faster initially,
  but fair-share degradation delays the last jobs by ~2 hours.

  Your DAG has max_parallelism=167, so all work jobs can run
  simultaneously — a single allocation fully utilizes all nodes.

> Go ahead and generate it with the single allocation.

⏺ Running: torc slurm generate --account climate_sim --single-allocation workflow.yaml
  ⎿ Generated scheduler plan with 1 allocation x 167 nodes.

⏺ Done. Submit with: torc submit workflow.yaml
```

For background on the tradeoffs, see [Allocation Strategies](../hpc/allocation-strategies.md).

---
## Example: Recovering a failed workflow that timed out

The workflow in tests/workflows/timeout_auto_recovery_test is used to test torc's auto-recovery of

docs/src/specialized/tools/ai-assistants.md

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ The AI assistant has access to these Torc operations:
| `list_pending_failed_jobs` | List jobs awaiting AI-assisted classification |
| `classify_and_resolve_failures` | Classify failures as transient or permanent |
| `get_slurm_sacct` | Get Slurm accounting data for a workflow |
| `plan_allocations` | Analyze workflow and recommend Slurm allocation strategy |
| `get_docs` | Retrieve Torc documentation by topic |
| `list_examples` | List available example workflow specifications |
| `get_example` | Retrieve a complete example workflow spec |
