docs/running/slurm.md
CSCS uses the [Slurm](https://slurm.schedmd.com/documentation.html) workload manager.
SLURM is an open-source, highly scalable job scheduler that allocates computing resources, queues user jobs, and optimizes workload distribution across the cluster.
It supports advanced scheduling policies, job dependencies, resource reservations, and accounting, making it well-suited for high-performance computing environments.
Refer to the [Quick Start User Guide](https://slurm.schedmd.com/quickstart.html) for commonly used terminology and commands.
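
For orientation, these are a few of the everyday commands (a minimal illustration only; the Quick Start guide covers them in full):

```bash
sbatch job.sh        # submit the batch script job.sh (hypothetical file name)
squeue -u $USER      # list your pending and running jobs
scancel <jobid>      # cancel a job by its numeric id
sinfo                # show partition and node availability
```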
For workflows and use cases with tasks that require only a subset of these resources, …
CSCS will support this feature on some Alps [clusters][ref-alps-clusters] in the near to medium term.
[](){#ref-slurm-exclusive}
### Running more than one job step per node

Running multiple job steps in parallel on the same allocated set of nodes can improve resource utilization by taking advantage of all the available CPUs, GPUs, or memory within a single job allocation.
The approach is to:
1. first allocate all the resources on each node to the job;
2. then subdivide those resources at each invocation of `srun`.
If Slurm believes that a request for resources (cores, GPUs, memory) overlaps with what another step has already allocated, it will defer execution until those resources are relinquished.
This must be avoided.
First ensure that *all* resources are allocated to the whole job with the following preamble:
```
#SBATCH --exclusive --mem=450G
```
* `--exclusive` allocates all the CPUs and GPUs exclusively to this job;
* `--mem=450G` allocates most of the allowable memory (there are 4 Grace CPUs on the node, each with ~120 GB of memory).
!!! note
    `--mem=0` can generally be used to allocate all the memory on the node, but the Slurm configuration on Clariden does not currently allow this.
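
To make the structure concrete, a minimal job-script preamble might look like the following sketch; the job name, node count, and wall time are illustrative placeholders rather than values from this page:

```bash
#!/bin/bash
#SBATCH --job-name=parallel-steps   # hypothetical job name
#SBATCH --nodes=1                   # assumption: the pattern is illustrated on a single node
#SBATCH --time=00:30:00             # illustrative wall-time limit
#SBATCH --exclusive                 # all CPUs and GPUs on the node belong to this job
#SBATCH --mem=450G                  # most of the node's memory (see the note above)
```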
Next, launch your applications using `srun`, carefully subdividing resources for each job step.
The `--exclusive` flag must be used again, but note that its meaning differs in the context of `srun`.
Here, `--exclusive` ensures that only the resources explicitly requested for a given job step are reserved and allocated to it.
Without this flag, Slurm reserves all resources for the job step even if it only allocates a subset, effectively blocking further parallel `srun` invocations from accessing unrequested but needed resources.
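
For example, a single job step claiming roughly a quarter of a node might be launched as follows; the core, GPU, and memory figures are assumptions for a node with 4 GPUs and 4 × 72 cores, and `./my_app` is a placeholder:

```bash
# One job step restricted to a quarter of the node's CPUs, GPUs, and memory.
srun --exclusive --ntasks=1 --cpus-per-task=72 --gpus=1 --mem=100G ./my_app
```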
Be sure to background each `srun` command with `&`, so that subsequent job steps start immediately without waiting for previous ones to finish.
A final `wait` command ensures that your submission script does not exit until all job steps complete.
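
Putting the pieces together, the body of such a submission script could look like the following sketch (building on the preamble above; the per-step resource figures and `./my_app` remain placeholders):

```bash
# Split the node into four equal job steps that run concurrently.
for i in 0 1 2 3; do
    srun --exclusive --ntasks=1 --cpus-per-task=72 --gpus=1 --mem=100G \
        ./my_app --input "chunk_${i}" &   # '&' backgrounds the step so the next one starts immediately
done

wait   # keep the batch script alive until every job step has finished
```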
Slurm will automatically set `CUDA_VISIBLE_DEVICES` for each `srun` call, restricting GPU access to only the devices assigned to that job step.
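
As a quick sanity check (an illustrative command, not part of the original example), a step can print the devices it has been given:

```bash
# Prints only the GPU(s) that Slurm assigned to this particular job step.
srun --exclusive --ntasks=1 --cpus-per-task=72 --gpus=1 --mem=100G \
    bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```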
!!! todo "use [affinity](https://github.com/bcumming/affinity) for these examples"