# Slurm Allocation Strategies

When submitting a workflow with many jobs to Slurm, you must decide how to split work across
allocations. The `torc slurm plan-allocations` command (or the `plan_allocations` MCP tool for AI
assistants) analyzes your workflow and cluster state to recommend a strategy.

## The Core Tradeoff: Single Large vs Many Small

Given N nodes' worth of work, there are two extremes:

| Strategy                 | Description                           | Pros                                                                                                | Cons                                                                                    |
| ------------------------ | ------------------------------------- | --------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| **1 x N** (single large) | One allocation requesting all N nodes | Slurm prioritizes larger jobs; all work completes in one walltime window; no fair-share degradation | Must wait for N nodes to be available simultaneously                                    |
| **N x 1** (many small)   | N separate single-node allocations    | First jobs start as soon as any node is free                                                        | Fair-share degrades as allocations start; last jobs may wait much longer than the first |
### When Single Large Wins

- **Slurm backfill priority**: Slurm's scheduler reserves nodes for large pending jobs. A 167-node
  request gets a reserved slot in the queue, while 167 individual jobs compete with every other job
  in the queue.
- **Fair-share preservation**: A single allocation consumes your fair-share budget once. Many small
  allocations drain it progressively, causing later jobs to lose priority.
- **Deterministic completion**: All jobs start processing simultaneously and finish within one
  walltime window.
- **Busy clusters**: Counter-intuitively, a fully loaded cluster often favors large allocations
  because Slurm will schedule the large job as a block when enough nodes free up, rather than
  letting small jobs trickle through.
### When Many Small Wins

- **Extremely long queues**: If the cluster is oversubscribed for weeks, small jobs may fit into
  backfill gaps that a large allocation cannot.
- **Partial results needed**: If you need some results quickly rather than waiting for all of them.
- **Near partition limits**: If your ideal node count exceeds `max_nodes_per_user`, you cannot
  request a single allocation that large.
## Using `sbatch --test-only`

The `plan-allocations` command runs `sbatch --test-only` to ask Slurm's scheduler when each strategy
would start, without actually submitting jobs. For a plan with K nodes per allocation and N total
allocations:

```bash
# Single large: when would all K*N nodes start together?
sbatch --test-only --nodes=<K*N> --time=04:00:00 --account=myproject --wrap="hostname"

# Many small: when would one K-node allocation start?
sbatch --test-only --nodes=<K> --time=04:00:00 --account=myproject --wrap="hostname"
```

When no partition is explicitly configured, the `--partition` flag is omitted so Slurm uses its
default partition.

The single-large estimated start + walltime gives the completion time directly. The many-small
estimate is **optimistic** — it only predicts when the _first_ allocation would start. Later
allocations will be delayed by fair-share degradation.

### Fair-Share Degradation Estimate

The tool estimates the last small allocation's completion as:

```
last_completion ≈ first_wait × min(N, 10) + walltime
```

This is a rough approximation. The actual degradation depends on your account's fair-share balance,
other users' activity, and the scheduler's configuration.

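To make the heuristic concrete, a small sketch that applies the formula above with illustrative
numbers (the inputs are made up for the example; the `min(N, 10)` cap comes straight from the
formula):

```python
from datetime import timedelta

def last_small_completion(first_wait: timedelta, n_allocations: int,
                          walltime: timedelta) -> timedelta:
    """Heuristic: last_completion ≈ first_wait × min(N, 10) + walltime."""
    return first_wait * min(n_allocations, 10) + walltime

def single_large_completion(large_wait: timedelta, walltime: timedelta) -> timedelta:
    """The large allocation's completion is simply its start estimate plus walltime."""
    return large_wait + walltime

# Illustrative inputs: first small job starts in 5 min, the large block in 30 min.
small = last_small_completion(timedelta(minutes=5), 167, timedelta(hours=4))
large = single_large_completion(timedelta(minutes=30), timedelta(hours=4))
print(small)  # 4:50:00  (5 min × min(167, 10) + 4 h)
print(large)  # 4:30:00
```

Note how the cap keeps the estimate bounded: beyond 10 allocations, the heuristic stops adding
wait time, which is one reason to treat it as a floor rather than a prediction.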
## Interpreting Results

Example output:

```
Recommendations
===============
  "work_resources": 1 allocation(s) x 167 node(s) [single]
    sbatch --test-only: large (167 nodes) completes in ~4h 30min,
    faster than 167 small allocations (~6h 30min).
    Slurm prioritizes larger allocations

  Scheduler Estimate (sbatch --test-only):
    Single large (167 nodes): start in ~30min, complete in ~4h 30min
    Many small (1 node): start in ~5min, complete in ~4h 5min
    Note: estimate is for first job only; later jobs delayed by fair-share
```

Key things to check:

- **Large completion vs small completion**: The tool accounts for fair-share degradation in its
  recommendation, but review the raw estimates yourself.
- **Wait time for large**: If the large allocation won't start for hours while small jobs start
  immediately, small may still be better for partial results.
- **Dependency depth**: A DAG with deep dependency chains cannot exploit N-node parallelism fully.
  Check `max_parallelism` in the workflow analysis — if it's much less than `ideal_nodes`, you may
  need fewer nodes than calculated.

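The dependency-depth check can be reasoned about as the widest topological level of the job DAG: no
matter how many nodes you request, jobs on the same dependency chain run one at a time. A minimal
sketch of that bound (an illustration, not torc's actual analysis):

```python
from collections import defaultdict

def max_parallelism(deps: dict[str, list[str]]) -> int:
    """Widest topological level of a DAG: an upper bound on how many
    jobs can ever run at once. deps maps each job to its prerequisites."""
    level_of: dict[str, int] = {}

    def level(job: str) -> int:
        if job not in level_of:
            # A job's level is one past its deepest prerequisite.
            level_of[job] = 1 + max((level(d) for d in deps.get(job, [])), default=-1)
        return level_of[job]

    counts: dict[int, int] = defaultdict(int)
    for job in deps:
        counts[level(job)] += 1
    return max(counts.values())

# 167 independent jobs: every node you request can be put to work.
flat = {f"job{i}": [] for i in range(167)}
print(max_parallelism(flat))  # 167

# A chain a -> b -> c: only one job is ever runnable, so extra nodes idle.
chain = {"a": [], "b": ["a"], "c": ["b"]}
print(max_parallelism(chain))  # 1
```

If this bound is far below `ideal_nodes`, requesting fewer nodes wastes less allocation time on
idle nodes.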
## Skipping the Probes

Use `--skip-test-only` to rely on heuristics only (faster, no `sbatch` calls):

```bash
torc slurm plan-allocations --account myproject --skip-test-only workflow.yaml
```

Use `--offline` to skip all cluster queries (`sinfo`, `squeue`, `sbatch`):

```bash
torc slurm plan-allocations --account myproject --offline workflow.yaml
```
## Applying the Recommendation

After reviewing the recommendation:

```bash
# Single large allocation (1 x N nodes)
torc slurm generate --account myproject --single-allocation workflow.yaml

# Many small allocations (N x 1 node, the default)
torc slurm generate --account myproject workflow.yaml
```
## Using with AI Assistants

The `plan_allocations` MCP tool provides the same analysis through AI assistants (Claude Code,
GitHub Copilot). The assistant can analyze your workflow spec and present the recommendation
conversationally:

```
> I have a workflow with 167 parallel jobs. Should I use one big Slurm
> allocation or many small ones? My account is "myproject".

⏺ torc - plan_allocations (MCP)(spec_json: {...}, account: "myproject")
  ⎿ { "recommendations": [...], "cluster_state": [...], ... }

⏺ Based on the current cluster state and sbatch --test-only probes:

  Recommendation: Single large allocation (1 x 167 nodes)

  | Strategy     | Estimated Start | Estimated Completion |
  |--------------|-----------------|----------------------|
  | Single large | ~30 min         | ~4h 30min            |
  | Many small   | ~5 min (first)  | ~6h 30min (last)     |

  The single large allocation completes ~2 hours faster because Slurm
  prioritizes it in the backfill queue, and you avoid fair-share
  degradation from 167 separate allocations.

  Shall I generate the workflow with a single allocation?
```

The tool accepts a workflow spec as a JSON object, the Slurm account, and optional partition and HPC
profile overrides. Use `skip_test_only: true` to skip the sbatch probes for faster results based on
heuristics only.

See [Configuring AI Assistants](../tools/ai-assistants.md) for setup instructions.