
Commit 24ce509

Update parallel srun section with more explanation, extended examples, and output.
1 parent 38c456a commit 24ce509

1 file changed: +49 -11

docs/running/slurm.md

@@ -5,6 +5,8 @@ CSCS uses the [Slurm](https://slurm.schedmd.com/documentation.html) workload man
SLURM is an open-source, highly scalable job scheduler that allocates computing resources, queues user jobs, and optimizes workload distribution across the cluster.
It supports advanced scheduling policies, job dependencies, resource reservations, and accounting, making it well-suited for high-performance computing environments.

+Refer to the [Quick Start User Guide](https://slurm.schedmd.com/quickstart.html) for commonly used terminology and commands.
+
<div class="grid cards" markdown>

- :fontawesome-solid-mountain-sun: __Configuring jobs__
@@ -199,14 +201,17 @@ For workflows and use cases with tasks that require only a subset of these resou
CSCS will support this feature on some Alps [clusters][ref-alps-clusters] in the near-medium future.

[](){#ref-slurm-exclusive}
-### Running more than one MPI job per node
+### Running more than one job step per node
+
+Running multiple job steps in parallel on the same allocated set of nodes can improve resource utilization by taking advantage of all the available CPUs, GPUs, or memory within a single job allocation.

The approach is to:

1. first allocate all the resources on each node to the job;
2. then subdivide those resources at each invocation of srun.

-If slurm believes that a request for resources (cores, gpus, memory) overlaps with what another step has already allocated, it will defer the execution until the resources are relinquished.
+If Slurm believes that a request for resources (cores, gpus, memory) overlaps with what another step has already allocated, it will defer the execution until the resources are relinquished.
+This must be avoided.

First ensure that *all* resources are allocated to the whole job with the following preamble:

@@ -215,13 +220,21 @@ First ensure that *all* resources are allocated to the whole job with the follow
#SBATCH --exclusive --mem=450G
```

-* `--exclusive` allocates all the CPUs and GPUs;
+* `--exclusive` allocates all the CPUs and GPUs exclusively to this job;
* `--mem=450G` most of allowable memory (there are 4 Grace CPUs with ~120 GB of memory on the node)

!!! note
-`--mem=0` can be used to allocate all memory on the node, however there is currently a configuration issue that causes this to fail.
+`--mem=0` can generally be used to allocate all memory on the node but the Slurm configuration on clariden doesn't allow this.
+
+Next, launch your applications using `srun`, carefully subdividing resources for each job step.
+The `--exclusive` flag must be used again, but note that its meaning differs in the context of `srun`.
+Here, `--exclusive` ensures that only the resources explicitly requested for a given job step are reserved and allocated to it.
+Without this flag, Slurm reserves all resources for the job step, even if it only allocates a subset -- effectively blocking further parallel `srun` invocations from accessing unrequested but needed resources.

-`--exclusive` has two different meanings depending on whether it's used in the job context (here) or in the job step context (srun). We need to use both.
+Be sure to background each `srun` command with `&`, so that subsequent job steps start immediately without waiting for previous ones to finish.
+A final `wait` command ensures that your submission script does not exit until all job steps complete.
+
+Slurm will automatically set `CUDA_VISIBLE_DEVICES` for each `srun` call, restricting GPU access to only the devices assigned to that job step.

!!! todo "use [affinity](https://github.com/bcumming/affinity) for these examples"

@@ -233,13 +246,24 @@ First ensure that *all* resources are allocated to the whole job with the follow
#SBATCH --exclusive --mem=450G
#SBATCH -N1

-srun -n1 --exclusive --gpus=2 --cpus-per-gpu=5 --mem=50G bash -c "echo JobStep:\${SLURM_STEP_ID}"
-srun -n1 --exclusive --gpus=1 --cpus-per-gpu=5 --mem=50G bash -c "echo JobStep:\${SLURM_STEP_ID}"
-srun -n1 --exclusive --gpus=1 --cpus-per-gpu=5 --mem=50G bash -c "echo JobStep:\${SLURM_STEP_ID}"
+CMD="echo \$(date) \$(hostname) JobStep:\${SLURM_STEP_ID} ProcID:\${SLURM_PROCID} CUDA_VISIBLE_DEVICES=\${CUDA_VISIBLE_DEVICES}; sleep 5"
+srun -N1 --ntasks-per-node=1 --exclusive --gpus-per-task=2 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &
+srun -N1 --ntasks-per-node=1 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &
+srun -N1 --ntasks-per-node=1 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &

wait
```

+Output (exact output will vary):
+```
+$ cat out-537506.*.log
+Tue Jul 1 11:40:46 CEST 2025 nid007104 JobStep:0 ProcID:0 CUDA_VISIBLE_DEVICES=0
+Tue Jul 1 11:40:46 CEST 2025 nid007104 JobStep:1 ProcID:0 CUDA_VISIBLE_DEVICES=1
+Tue Jul 1 11:40:46 CEST 2025 nid007104 JobStep:2 ProcID:0 CUDA_VISIBLE_DEVICES=2,3
+```
+
+
+
=== "multi-node"

!!! example "three jobs on two nodes"
@@ -248,9 +272,23 @@ First ensure that *all* resources are allocated to the whole job with the follow
#SBATCH --exclusive --mem=450G
#SBATCH -N2

-srun -N2 -n2 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G bash -c "echo JobStep:\${SLURM_STEP_ID}"
-srun -N2 -n1 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G bash -c "echo JobStep:\${SLURM_STEP_ID}"
-srun -N2 -n1 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G bash -c "echo JobStep:\${SLURM_STEP_ID}"
+CMD="echo \$(date) \$(hostname) JobStep:\${SLURM_STEP_ID} ProcID:\${SLURM_PROCID} CUDA_VISIBLE_DEVICES=\${CUDA_VISIBLE_DEVICES}; sleep 5"
+srun -N2 --ntasks-per-node=2 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &
+srun -N2 --ntasks-per-node=1 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &
+srun -N2 --ntasks-per-node=1 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &

wait
```
+
+Output (exact output will vary):
+```
+$ cat out-537539.*.log
+Tue Jul 1 12:02:01 CEST 2025 nid005085 JobStep:0 ProcID:2 CUDA_VISIBLE_DEVICES=0
+Tue Jul 1 12:02:01 CEST 2025 nid005085 JobStep:0 ProcID:3 CUDA_VISIBLE_DEVICES=1
+Tue Jul 1 12:02:01 CEST 2025 nid005080 JobStep:0 ProcID:0 CUDA_VISIBLE_DEVICES=0
+Tue Jul 1 12:02:01 CEST 2025 nid005080 JobStep:0 ProcID:1 CUDA_VISIBLE_DEVICES=1
+Tue Jul 1 12:02:01 CEST 2025 nid005085 JobStep:1 ProcID:1 CUDA_VISIBLE_DEVICES=2
+Tue Jul 1 12:02:01 CEST 2025 nid005080 JobStep:1 ProcID:0 CUDA_VISIBLE_DEVICES=2
+Tue Jul 1 12:02:01 CEST 2025 nid005085 JobStep:2 ProcID:1 CUDA_VISIBLE_DEVICES=3
+Tue Jul 1 12:02:01 CEST 2025 nid005080 JobStep:2 ProcID:0 CUDA_VISIBLE_DEVICES=3
+```
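
Taken together, the added fragments above describe a reusable pattern: allocate a full node to the job, then carve it into concurrent `srun` steps. As a worked illustration only (not part of this commit), the same pattern can be written as a loop that launches one single-GPU step per input file. This is a minimal sketch under stated assumptions: `./my_app` and the input file names are placeholders, and it assumes the same 4-GPU node with roughly 450 GB of allocatable memory used in the examples above.

```bash
#!/bin/bash
# Illustrative sketch (not from the commit): fan four independent single-GPU
# runs out as parallel job steps inside one exclusive single-node allocation.
#SBATCH --exclusive --mem=450G
#SBATCH -N1

for input in sample0.dat sample1.dat sample2.dat sample3.dat; do  # placeholder inputs
    # srun's --exclusive reserves only what each step requests (1 GPU, 5 CPUs,
    # 110G), so the four steps can run side by side; Slurm sets
    # CUDA_VISIBLE_DEVICES per step, so each run sees exactly one GPU.
    srun -N1 --ntasks-per-node=1 --exclusive \
         --gpus-per-task=1 --cpus-per-gpu=5 --mem=110G \
         --output "app-%J.log" ./my_app "${input}" &  # background each step
done

# Keep the batch script alive until every backgrounded step has finished.
wait
```

Submitted with `sbatch`, all four steps should start immediately because their combined request (4 GPUs, 20 CPUs, 440 GB) fits inside the exclusive allocation; a fifth single-GPU step would be deferred until one of the others releases its resources, as described in the added text above.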
