Commit 442dd75

make it consistently Slurm
1 parent 347a671 commit 442dd75

File tree: 1 file changed (+32, -32 lines)


docs/running/slurm.md

Lines changed: 32 additions & 32 deletions
@@ -1,8 +1,8 @@
 [](){#ref-slurm}
-# SLURM
+# Slurm
 
-CSCS uses the [SLURM](https://slurm.schedmd.com/documentation.html) workload manager to efficiently schedule and manage jobs on Alps vClusters.
-SLURM is an open-source, highly scalable job scheduler that allocates computing resources, queues user jobs, and optimizes workload distribution across the cluster.
+CSCS uses the [Slurm](https://slurm.schedmd.com/documentation.html) workload manager to efficiently schedule and manage jobs on Alps vClusters.
+Slurm is an open-source, highly scalable job scheduler that allocates computing resources, queues user jobs, and optimizes workload distribution across the cluster.
 It supports advanced scheduling policies, job dependencies, resource reservations, and accounting, making it well-suited for high-performance computing environments.
 
 Refer to the [Quick Start User Guide](https://slurm.schedmd.com/quickstart.html) for commonly used terminology and commands.
@@ -11,7 +11,7 @@ Refer to the [Quick Start User Guide](https://slurm.schedmd.com/quickstart.html)
 
 - :fontawesome-solid-mountain-sun: __Configuring jobs__
 
-Specific guidance for configuring SLURM jobs on different node types.
+Specific guidance for configuring Slurm jobs on different node types.
 
 [:octicons-arrow-right-24: GH200 nodes (Daint, Clariden, Santis)][ref-slurm-gh200]
 
@@ -29,7 +29,7 @@ Refer to the [Quick Start User Guide](https://slurm.schedmd.com/quickstart.html)
 
 ## Accounts and resources
 
-SLURM associates each job with a CSCS project in order to perform accounting.
+Slurm associates each job with a CSCS project in order to perform accounting.
 The project to use for accounting is specified using the `--account/-A` flag.
 If no account is specified, the primary project is used as the default.
 
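For example, the accounting project can be set on the command line or in the batch script preamble. A minimal sketch, using the hypothetical project name `g123` and the placeholder executable `./my_app`:

```bash title="Selecting the accounting project (illustrative)"
#!/usr/bin/env bash
#SBATCH --account=g123   # charge the job to project g123 (hypothetical project name)

srun ./my_app            # ./my_app is a placeholder for your executable
```

The same can be done per command, e.g. `sbatch -A g123 job.sh` or `srun -A g123 ...`.
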
@@ -81,13 +81,13 @@ Additionally, short-duration jobs may be selected for backfilling — a process
 [](){#ref-slurm-partitions}
 ## Partitions
 
-At CSCS, SLURM is configured to accommodate the diverse range of node types available in our HPC clusters.
+At CSCS, Slurm is configured to accommodate the diverse range of node types available in our HPC clusters.
 These nodes vary in architecture, including CPU-only nodes and nodes equipped with different types of GPUs.
-Because of this heterogeneity, SLURM must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.
+Because of this heterogeneity, Slurm must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.
 
-Each type of node has different resource constraints and capabilities, which SLURM takes into account when scheduling jobs.
+Each type of node has different resource constraints and capabilities, which Slurm takes into account when scheduling jobs.
 For example, CPU-only nodes may have configurations optimized for multi-threaded CPU workloads, while GPU nodes require additional parameters to allocate GPU resources efficiently.
-SLURM ensures that user jobs request and receive the appropriate resources while preventing conflicts or inefficient utilization.
+Slurm ensures that user jobs request and receive the appropriate resources while preventing conflicts or inefficient utilization.
 
 [](){#ref-slurm-partitions-nodecount}
 !!! example "How to check the partitions and number of nodes therein?"
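
As an illustrative sketch, the following standard Slurm commands summarize the partitions and their node counts:

```console title="Checking partitions and node counts (illustrative)"
$ sinfo --summarize        # one line per partition: state, time limit and node counts (A/I/O/T)
$ scontrol show partition  # full settings per partition, including MaxNodes and MaxTime
```
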
@@ -103,25 +103,25 @@ SLURM ensures that user jobs request and receive the appropriate resources while
 
 [](){#ref-slurm-partition-debug}
 ### Debug partition
-The SLURM `debug` partition is useful for quick turnaround workflows. The partition has a short maximum time (timelimit can be seen with `sinfo -p debug`), and a low number of maximum nodes (the `MaxNodes` can be seen with `scontrol show partition=debug`).
+The Slurm `debug` partition is useful for quick-turnaround workflows. The partition has a short maximum time (the time limit can be seen with `sinfo -p debug`) and a low maximum node count (`MaxNodes` can be seen with `scontrol show partition=debug`).
 
 [](){#ref-slurm-partition-normal}
 ### Normal partition
 This is the default partition, and will be used when you do not explicitly set a partition.
 This is the correct choice for standard jobs. The maximum time is usually set to 24 hours (`sinfo -p normal` for the time limit), and the maximum number of nodes can be as large as the number of nodes available.
 
-The following sections will provide detailed guidance on how to use SLURM to request and manage CPU cores, memory, and GPUs in jobs.
+The following sections will provide detailed guidance on how to use Slurm to request and manage CPU cores, memory, and GPUs in jobs.
 These instructions will help users optimize their workload execution and ensure efficient use of CSCS computing resources.
 
 ## Affinity
 
-The following sections will document how to use SLURM on different compute nodes available on Alps.
-To demonstrate the effects different SLURM parameters, we will use a little command line tool [affinity](https://github.com/bcumming/affinity) that prints the CPU cores and GPUs that are assigned to each MPI rank in a job, and which node they are run on.
+The following sections will document how to use Slurm on different compute nodes available on Alps.
+To demonstrate the effects of different Slurm parameters, we will use a small command line tool [affinity](https://github.com/bcumming/affinity) that prints the CPU cores and GPUs assigned to each MPI rank in a job, and the node on which each rank runs.
 
-We strongly recommend using a tool like affinity to understand and test the SLURM configuration for jobs, because the behavior of SLURM is highly dependent on the system configuration.
-Parameters that worked on a different cluster -- or with a different SLURM version or configuration on the same cluster -- are not guaranteed to give the same results.
+We strongly recommend using a tool like affinity to understand and test the Slurm configuration for jobs, because the behavior of Slurm is highly dependent on the system configuration.
+Parameters that worked on a different cluster -- or with a different Slurm version or configuration on the same cluster -- are not guaranteed to give the same results.
 
-It is straightforward to build the affinity tool to experiment with SLURM configurations.
+It is straightforward to build the affinity tool to experiment with Slurm configurations.
 
 ```console title="Compiling affinity"
 $ uenv start prgenv-gnu/24.11:v2 --view=default #(1)
@@ -223,9 +223,9 @@ The build generates the following executables:
 
 !!! info "Quick affinity checks"
 
-The SLURM flag [`--cpu-bind=verbose`](https://slurm.schedmd.com/srun.html#OPT_cpu-bind) prints information about MPI ranks and their thread affinity.
+The Slurm flag [`--cpu-bind=verbose`](https://slurm.schedmd.com/srun.html#OPT_cpu-bind) prints information about MPI ranks and their thread affinity.
 
-The mask it prints is not very readable, but it can be used with the `true` command to quickly test SLURM parameters without building the Affinity tool.
+The mask it prints is not very readable, but it can be used with the `true` command to quickly test Slurm parameters without building the affinity tool.
 
 ```console title="hello"
 $ srun --cpu-bind=verbose -c32 -n4 -N1 --hint=nomultithread -- true
@@ -240,13 +240,13 @@ The build generates the following executables:
 [](){#ref-slurm-gh200}
 ## NVIDIA GH200 GPU Nodes
 
-The [GH200 nodes on Alps][ref-alps-gh200-node] have four GPUs per node, and SLURM job submissions must be configured appropriately to best make use of the resources.
+The [GH200 nodes on Alps][ref-alps-gh200-node] have four GPUs per node, and Slurm job submissions must be configured appropriately to best make use of the resources.
 Applications that can saturate the GPUs with a single process per GPU should generally prefer this mode.
-[Configuring SLURM jobs to use a single GPU per rank][ref-slurm-gh200-single-rank-per-gpu] is also the most straightforward setup.
+[Configuring Slurm jobs to use a single GPU per rank][ref-slurm-gh200-single-rank-per-gpu] is also the most straightforward setup.
 Some applications perform badly with a single rank per GPU, and require use of [NVIDIA's Multi-Process Service (MPS)] to oversubscribe GPUs with multiple ranks per GPU.
 
-The best SLURM configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
-See [Scientific Applications][ref-software-sciapps] for information about recommended application-specific SLURM configurations.
+The best Slurm configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
+See [Scientific Applications][ref-software-sciapps] for information about recommended application-specific Slurm configurations.
 
 !!! warning
 The GH200 nodes have their GPUs configured in ["default" compute mode](https://docs.nvidia.com/deploy/mps/index.html#gpu-compute-modes).
@@ -262,7 +262,7 @@ See [Scientific Applications][ref-software-sciapps] for information about recomm
 [](){#ref-slurm-gh200-single-rank-per-gpu}
 ### One rank per GPU
 
-Configuring SLURM to use one GH200 GPU per rank is easiest done using the `--ntasks-per-node=4` and `--gpus-per-task=1` SLURM flags.
+Configuring Slurm to use one GH200 GPU per rank is most easily done using the `--ntasks-per-node=4` and `--gpus-per-task=1` Slurm flags.
 For advanced users, using `--gpus-per-task` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`, assuming the job is using four ranks per node.
 The examples below launch jobs on two nodes with four ranks per node using `sbatch` and `srun`:
 
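As an illustrative sketch of this configuration (the executable name `./my_app` is a placeholder):

```bash title="One rank per GPU on two GH200 nodes (illustrative)"
#!/usr/bin/env bash
#SBATCH --nodes=2              # two GH200 nodes
#SBATCH --ntasks-per-node=4    # four ranks per node, i.e. one rank per GPU
#SBATCH --gpus-per-task=1      # each rank gets exactly one GPU

srun ./my_app                  # ./my_app is a placeholder for your executable
```

The equivalent interactive launch would be `srun -N2 --ntasks-per-node=4 --gpus-per-task=1 ./my_app`.
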
@@ -282,7 +282,7 @@ Omitting the `--gpus-per-task` results in `CUDA_VISIBLE_DEVICES` being unset, wh
282282
### Multiple ranks per GPU
283283

284284
Using multiple ranks per GPU can improve performance e.g. of applications that don't generate enough work for a GPU using a single rank, or ones that scale badly to all 72 cores of the Grace CPU.
285-
In these cases SLURM jobs must be configured to assign multiple ranks to a single GPU.
285+
In these cases Slurm jobs must be configured to assign multiple ranks to a single GPU.
286286
This is best done using [NVIDIA's Multi-Process Service (MPS)].
287287
To use MPS, launch your application using the following wrapper script, which will start MPS on one rank per node and assign GPUs to ranks according to the CPU mask of a rank, ensuring the closest GPU is used:
288288

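A much-simplified sketch of such a wrapper is shown below; it only starts and stops the MPS daemon on each node and omits the CPU-mask-based GPU selection referred to above. The script name `mps-wrapper.sh` and all details are illustrative, not the exact wrapper referenced above:

```bash title="mps-wrapper.sh (simplified, illustrative sketch)"
#!/usr/bin/env bash
# Start one MPS control daemon per node, run the application, then shut MPS down.
if [[ "${SLURM_LOCALID}" -eq 0 ]]; then
    nvidia-cuda-mps-control -d   # launch the MPS daemon on the first rank of each node
fi
sleep 1                          # give the daemon a moment to come up

"$@"                             # run the wrapped application with its arguments

if [[ "${SLURM_LOCALID}" -eq 0 ]]; then
    echo quit | nvidia-cuda-mps-control   # stop the daemon once the application exits
fi
```

It would be invoked as, for example, `srun --ntasks-per-node=16 ./mps-wrapper.sh ./my_app` to place four ranks on each of the four GPUs (`./my_app` is a placeholder).
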
@@ -357,7 +357,7 @@ For a detailed description of the node hardware, see the [AMD Rome node][ref-alp
357357
![Screenshot](../images/slurm/eiger-topo.png)
358358

359359

360-
Each MPI rank is assigned a set of cores on a node, and SLURM provides flags that can be used directly as flags to `srun`, or as arguments in an `sbatch` script.
360+
Each MPI rank is assigned a set of cores on a node, and Slurm provides flags that can be used directly as flags to `srun`, or as arguments in an `sbatch` script.
361361
Here are some basic flags that we will use to distribute work.
362362

363363
| flag | meaning |
@@ -368,10 +368,10 @@ Here are some basic flags that we will use to distribute work.
 | `-c`, `--cpus-per-task` | The number of cores to assign to each rank. |
 | `--hint=nomultithread` | Use only one PU per core. |
 
-!!! info "SLURM is highly configurable"
+!!! info "Slurm is highly configurable"
 These are a subset of the most useful flags.
 Call `srun --help` or `sbatch --help` to get a complete list of all the flags available on your target cluster.
-Note that the exact set of flags available depends on the SLURM version, how SLURM was configured, and SLURM plugins.
+Note that the exact set of flags available depends on the Slurm version, how Slurm was configured, and Slurm plugins.
 
 The first example assigns 2 MPI ranks per node, with 64 cores per rank, using both PUs per core:
 ```console title="One MPI rank per socket"
@@ -578,12 +578,12 @@ The approach is to:
 1. first allocate all the resources on each node to the job;
 2. then subdivide those resources at each invocation of `srun`.
 
-If SLURM believes that a request for resources (cores, gpus, memory) overlaps with what another step has already allocated, it will defer the execution until the resources are relinquished.
+If Slurm believes that a request for resources (cores, GPUs, memory) overlaps with what another step has already allocated, it will defer execution until the resources are relinquished.
 This must be avoided.
 
 First ensure that *all* resources are allocated to the whole job with the following preamble:
 
-```bash title="SLURM preamble on a GH200 node"
+```bash title="Slurm preamble on a GH200 node"
 #!/usr/bin/env bash
 #SBATCH --exclusive --mem=450G
 ```
@@ -592,17 +592,17 @@ First ensure that *all* resources are allocated to the whole job with the follow
 * `--mem=450G` most of the allowable memory (there are 4 Grace CPUs with ~120 GB of memory on the node)
 
 !!! note
-`--mem=0` can generally be used to allocate all memory on the node but the SLURM configuration on clariden doesn't allow this.
+`--mem=0` can generally be used to allocate all memory on the node, but the Slurm configuration on Clariden doesn't allow this.
 
 Next, launch your applications using `srun`, carefully subdividing resources for each job step.
 The `--exclusive` flag must be used again, but note that its meaning differs in the context of `srun`.
 Here, `--exclusive` ensures that only the resources explicitly requested for a given job step are reserved and allocated to it.
-Without this flag, SLURM reserves all resources for the job step, even if it only allocates a subset -- effectively blocking further parallel `srun` invocations from accessing unrequested but needed resources.
+Without this flag, Slurm reserves all resources for the job step, even if it only allocates a subset -- effectively blocking further parallel `srun` invocations from accessing unrequested but needed resources.
 
 Be sure to background each `srun` command with `&`, so that subsequent job steps start immediately without waiting for previous ones to finish.
 A final `wait` command ensures that your submission script does not exit until all job steps complete.
 
-SLURM will automatically set `CUDA_VISIBLE_DEVICES` for each `srun` call, restricting GPU access to only the devices assigned to that job step.
+Slurm will automatically set `CUDA_VISIBLE_DEVICES` for each `srun` call, restricting GPU access to only the devices assigned to that job step.
 
 !!! todo "use [affinity](https://github.com/bcumming/affinity) for these examples"
 
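A minimal sketch of this pattern on a single GH200 node, assuming two independent single-GPU programs `./app_a` and `./app_b` (both placeholders):

```bash title="Concurrent job steps on one GH200 node (illustrative)"
#!/usr/bin/env bash
#SBATCH --exclusive --mem=450G
#SBATCH --nodes=1

# Each step requests an exclusive, non-overlapping slice of the node
# (one GPU, 72 cores, 100 GB of memory), so the two steps run concurrently.
srun --exclusive -n1 --gpus-per-task=1 -c72 --mem=100G ./app_a &
srun --exclusive -n1 --gpus-per-task=1 -c72 --mem=100G ./app_b &

wait   # keep the batch script alive until both job steps have completed
```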