
Commit c9db1d6

Merge branch 'main' into update-eiger
2 parents 0f48fb1 + 73d7ee4 commit c9db1d6

8 files changed: +167 -28 lines changed

docs/clusters/bristen.md

Lines changed: 2 additions & 2 deletions
@@ -12,7 +12,7 @@ Bristen consists of 32 A100 nodes [NVIDIA A100 nodes][ref-alps-a100-node]. The n
 |-----------|--------| ----------------- | ---------- |
 | [a100][ref-alps-a100-node] | 32 | 32 | 128 |
 
-Nodes are in the [`normal` slurm partition][ref-slurm-partition-normal].
+Nodes are in the [`normal` Slurm partition][ref-slurm-partition-normal].
 
 ### Storage and file systems
 
@@ -48,7 +48,7 @@ Users are encouraged to use containers on Bristen.
 
 Bristen uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
 
-There is currently a single slurm partition on the system:
+There is currently a single Slurm partition on the system:
 
 * the `normal` partition is for all production workloads.
     + nodes in this partition are not shared.

docs/clusters/clariden.md

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@ The number of nodes can change when nodes are added or removed from other cluste
 |-----------|--------| ----------------- | ---------- |
 | [gh200][ref-alps-gh200-node] | 1,200 | 4,800 | 4,800 |
 
-Most nodes are in the [`normal` slurm partition][ref-slurm-partition-normal], while a few nodes are in the [`debug` partition][ref-slurm-partition-debug].
+Most nodes are in the [`normal` Slurm partition][ref-slurm-partition-normal], while a few nodes are in the [`debug` partition][ref-slurm-partition-debug].
 
 ### Storage and file systems
 
@@ -71,7 +71,7 @@ Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deploy
 
 Clariden uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
 
-There are two slurm partitions on the system:
+There are two Slurm partitions on the system:
 
 * the `normal` partition is for all production workloads.
 * the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes.
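
For quick tests on the `debug` partition described above, a minimal batch-script preamble might look like the following sketch; the account name and test executable are placeholders, and the 30-minute request reflects the partition's stated cap.

```bash
#!/usr/bin/env bash
# Minimal sketch for a short test run on the debug partition.
# <account> and ./my_test are placeholders.
#SBATCH --account=<account>
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --time=00:30:00    # the debug partition allows at most 30 minutes

srun ./my_test
```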

docs/running/jobreport.md

Lines changed: 3 additions & 3 deletions
@@ -56,7 +56,7 @@ The report is divided into two parts: a general summary and GPU specific values.
 | Field | Description |
 | ----- | ----------- |
 | Job Id | The Slurm job id |
-| Step Id | The slurm step id. A job step in Slurm is a subdivision of a job started with srun |
+| Step Id | The Slurm step id. A job step in Slurm is a subdivision of a job started with srun |
 | User | The user account that submitted the job |
 | Slurm Account | The project account that will be billed |
 | Start Time, End Time, Elapsed Time | The time the job started and ended, and how long it ran |
@@ -77,7 +77,7 @@ The report is divided into two parts: a general summary and GPU specific values.
 | SM Utilization % | The percentage of the process's lifetime during which Streaming Multiprocessors (SM) were executing a kernel |
 | Memory Utilization % | The percentage of process's lifetime during which global (device) memory was being read or written |
 
-## Example with slurm: srun
+## Example with Slurm: srun
 
 The simplest example to test `jobreport` is to run it with the sleep command.
 It is important to separate `jobreport` (and its options) and your command with `--`.
@@ -155,7 +155,7 @@ GPU Specific Values
 4. Uncheck "Set locale environment variables on startup"
 5. Quit and reopen the terminal and try again. This should fix the issue.
 
-## Example with slurm: batch script
+## Example with Slurm: batch script
 
 The `jobreport` command can be used in a batch script
 The report printing, too, can be included in the script and does not need the `srun` command.
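
A minimal sketch of the `srun` usage described in this file, with `sleep` standing in for a real workload (the task count is an arbitrary placeholder):

```bash
# The `--` separates jobreport (and its options) from the command it monitors.
srun --ntasks=4 jobreport -- sleep 30
```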

docs/running/slurm.md

Lines changed: 154 additions & 15 deletions
@@ -1,8 +1,31 @@
 [](){#ref-slurm}
-# SLURM
+# Slurm
 
-CSCS uses the [SLURM](https://slurm.schedmd.com/documentation.html) as its workload manager to efficiently schedule and manage jobs on Alps vClusters.
-SLURM is an open-source, highly scalable job scheduler that allocates computing resources, queues user jobs, and optimizes workload distribution across the cluster. It supports advanced scheduling policies, job dependencies, resource reservations, and accounting, making it well-suited for high-performance computing environments.
+CSCS uses the [Slurm](https://slurm.schedmd.com/documentation.html) workload manager to efficiently schedule and manage jobs on Alps vClusters.
+Slurm is an open-source, highly scalable job scheduler that allocates computing resources, queues user jobs, and optimizes workload distribution across the cluster.
+It supports advanced scheduling policies, job dependencies, resource reservations, and accounting, making it well-suited for high-performance computing environments.
+
+Refer to the [Quick Start User Guide](https://slurm.schedmd.com/quickstart.html) for commonly used terminology and commands.
+
+<div class="grid cards" markdown>
+
+- :fontawesome-solid-mountain-sun: __Configuring jobs__
+
+    Specific guidance for configuring Slurm jobs on different node types.
+
+    [:octicons-arrow-right-24: GH200 nodes (Daint, Clariden, Santis)][ref-slurm-gh200]
+
+    [:octicons-arrow-right-24: AMD CPU-only nodes (Eiger)][ref-slurm-amdcpu]
+
+- :fontawesome-solid-mountain-sun: __Node sharing__
+
+    Guides on how to effectively use all resouces on nodes by running more than one job per node.
+
+    [:octicons-arrow-right-24: Node sharing][ref-slurm-sharing]
+
+    [:octicons-arrow-right-24: Multiple MPI jobs per node][ref-slurm-exclusive]
+
+</div>
 
 ## Accounts and resources
 
@@ -58,9 +81,9 @@ Additionally, short-duration jobs may be selected for backfilling — a process
 [](){#ref-slurm-partitions}
 ## Partitions
 
-At CSCS, SLURM is configured to accommodate the diverse range of node types available in our HPC clusters.
+At CSCS, Slurm is configured to accommodate the diverse range of node types available in our HPC clusters.
 These nodes vary in architecture, including CPU-only nodes and nodes equipped with different types of GPUs.
-Because of this heterogeneity, SLURM must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.
+Because of this heterogeneity, Slurm must be tailored to ensure efficient resource allocation, job scheduling, and workload management specific to each node type.
 
 Each type of node has different resource constraints and capabilities, which Slurm takes into account when scheduling jobs.
 For example, CPU-only nodes may have configurations optimized for multi-threaded CPU workloads, while GPU nodes require additional parameters to allocate GPU resources efficiently.
@@ -80,13 +103,15 @@ Slurm ensures that user jobs request and receive the appropriate resources while
 
 [](){#ref-slurm-partition-debug}
 ### Debug partition
-The SLURM `debug` partition is useful for quick turnaround workflows. The partition has a short maximum time (timelimit can be seen with `sinfo -p debug`), and a low number of maximum nodes (the `MaxNodes` can be seen with `scontrol show partition=debug`).
+The Slurm `debug` partition is useful for quick turnaround workflows. The partition has a short maximum time (timelimit can be seen with `sinfo -p debug`), and a low number of maximum nodes (the `MaxNodes` can be seen with `scontrol show partition=debug`).
 
 [](){#ref-slurm-partition-normal}
 ### Normal partition
-This is the default partition, and will be used when you do not explicitly set a partition. This is the correct choice for standard jobs. The maximum time is usually set to 24 hours (`sinfo -p normal` for timelimit), and the maximum nodes can be as much as nodes are available.
+This is the default partition, and will be used when you do not explicitly set a partition.
+This is the correct choice for standard jobs. The maximum time is usually set to 24 hours (`sinfo -p normal` for timelimit), and the maximum nodes can be as much as nodes are available.
 
-The following sections will provide detailed guidance on how to use SLURM to request and manage CPU cores, memory, and GPUs in jobs. These instructions will help users optimize their workload execution and ensure efficient use of CSCS computing resources.
+The following sections will provide detailed guidance on how to use Slurm to request and manage CPU cores, memory, and GPUs in jobs.
+These instructions will help users optimize their workload execution and ensure efficient use of CSCS computing resources.
 
 ## Affinity
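
The partition limits mentioned in the debug and normal partition hunks above can be inspected with the commands they name; a small sketch (the output columns here are an arbitrary choice):

```bash
# Inspect time limits and sizes of the debug and normal partitions.
sinfo -p debug -o "%P %l %D"      # partition, time limit, node count
sinfo -p normal -o "%P %l %D"
scontrol show partition=debug | grep -o "MaxNodes=[^ ]*"
```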

@@ -215,13 +240,13 @@ The build generates the following executables:
 [](){#ref-slurm-gh200}
 ## NVIDIA GH200 GPU Nodes
 
-The [GH200 nodes on Alps][ref-alps-gh200-node] have four GPUs per node, and SLURM job submissions must be configured appropriately to best make use of the resources.
+The [GH200 nodes on Alps][ref-alps-gh200-node] have four GPUs per node, and Slurm job submissions must be configured appropriately to best make use of the resources.
 Applications that can saturate the GPUs with a single process per GPU should generally prefer this mode.
-[Configuring SLURM jobs to use a single GPU per rank][ref-slurm-gh200-single-rank-per-gpu] is also the most straightforward setup.
+[Configuring Slurm jobs to use a single GPU per rank][ref-slurm-gh200-single-rank-per-gpu] is also the most straightforward setup.
 Some applications perform badly with a single rank per GPU, and require use of [NVIDIA's Multi-Process Service (MPS)] to oversubscribe GPUs with multiple ranks per GPU.
 
-The best SLURM configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
-See [Scientific Applications][ref-software-sciapps] for information about recommended application-specific SLURM configurations.
+The best Slurm configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
+See [Scientific Applications][ref-software-sciapps] for information about recommended application-specific Slurm configurations.
 
 !!! warning
     The GH200 nodes have their GPUs configured in ["default" compute mode](https://docs.nvidia.com/deploy/mps/index.html#gpu-compute-modes).
@@ -232,12 +257,12 @@ See [Scientific Applications][ref-software-sciapps] for information about recomm
     Some applications benefit from using multiple ranks per GPU. However, [MPS should be used][ref-slurm-gh200-multi-rank-per-gpu] in these cases.
 
     If you are unsure about which GPU is being used for a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script.
-    If the variable is unset or empty all GPUs are visible to the rank and the rank will in most cases only use the first GPU.
+    If the variable is unset or empty all GPUs are visible to the rank and the rank will in most cases only use the first GPU.
 
 [](){#ref-slurm-gh200-single-rank-per-gpu}
 ### One rank per GPU
 
-Configuring SLURM to use one GH200 GPU per rank is easiest done using the `--ntasks-per-node=4` and `--gpus-per-task=1` SLURM flags.
+Configuring Slurm to use one GH200 GPU per rank is easiest done using the `--ntasks-per-node=4` and `--gpus-per-task=1` Slurm flags.
 For advanced users, using `--gpus-per-task` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`, assuming the job is using four ranks per node.
 The examples below launch jobs on two nodes with four ranks per node using `sbatch` and `srun`:
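
A minimal sketch of an `sbatch` script using the flags named above on two nodes (the account and the `./myapp` binary are placeholders):

```bash
#!/usr/bin/env bash
# One rank per GH200 GPU: four ranks per node, two nodes.
# <account> and ./myapp are placeholders.
#SBATCH --account=<account>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --time=00:30:00

srun ./myapp
```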

@@ -257,7 +282,7 @@ Omitting the `--gpus-per-task` results in `CUDA_VISIBLE_DEVICES` being unset, wh
 ### Multiple ranks per GPU
 
 Using multiple ranks per GPU can improve performance e.g. of applications that don't generate enough work for a GPU using a single rank, or ones that scale badly to all 72 cores of the Grace CPU.
-In these cases SLURM jobs must be configured to assign multiple ranks to a single GPU.
+In these cases Slurm jobs must be configured to assign multiple ranks to a single GPU.
 This is best done using [NVIDIA's Multi-Process Service (MPS)].
 To use MPS, launch your application using the following wrapper script, which will start MPS on one rank per node and assign GPUs to ranks according to the CPU mask of a rank, ensuring the closest GPU is used:
 
@@ -519,3 +544,117 @@ rank 3 @ nid001512
 !!! warning
     The `OMP_*` environment variables only affect thread affinity of applications that use OpenMP for thread-level parallelism.
     Other threading runtimes will be configured differently, and the `affinity.mpi` tool will only be able to show the set of cores assigned to the rank.
+
+[](){#ref-slurm-over-subscription}
+## Node over-subscription
+
+The nodes on Alps provide a lot of resources, particularly the GPU nodes that have 4 GPUs.
+For workflows and use cases with tasks that require only a subset of these resources, for example a simulation that only needs one GPU, allocating a whole node to run one task is a waste of resources.
+
+!!! example
+    A workflow that runs a single [GROMACS][ref-uenv-gromacs] simulation, that uses one GPU.
+
+    * The optimal use of resources would allocate one quarter of a node, and allow other jobs to access the other three GPUs.
+
+    A workflow that runs 100 independent [GROMACS][ref-uenv-gromacs] simulations, where each simulation requires two GPUs.
+
+    * The optimal use of resources would allocate 50 nodes, with two simulations run on each node.
+
+[](){#ref-slurm-sharing}
+### Node sharing
+
+!!! under-construction
+    Node sharing, whereby jobs can request part of the resources on a node, and multiple jobs can run on a node (possibly from different users) is _not currently available on Alps clusters_.
+
+    CSCS will support this feature on some Alps [clusters][ref-alps-clusters] in the near-medium future.
+
+[](){#ref-slurm-exclusive}
+### Running more than one job step per node
+
+Running multiple job steps in parallel on the same allocated set of nodes can improve resource utilization by taking advantage of all the available CPUs, GPUs, or memory within a single job allocation.
+
+The approach is to:
+
+1. first allocate all the resources on each node to the job;
+2. then subdivide those resources at each invocation of srun.
+
+If Slurm believes that a request for resources (cores, gpus, memory) overlaps with what another step has already allocated, it will defer the execution until the resources are relinquished.
+This must be avoided.
+
+First ensure that *all* resources are allocated to the whole job with the following preamble:
+
+```bash title="Slurm preamble on a GH200 node"
+#!/usr/bin/env bash
+#SBATCH --exclusive --mem=450G
+```
+
+* `--exclusive` allocates all the CPUs and GPUs exclusively to this job;
+* `--mem=450G` most of allowable memory (there are 4 Grace CPUs with ~120 GB of memory on the node)
+
+!!! note
+    `--mem=0` can generally be used to allocate all memory on the node but the Slurm configuration on clariden doesn't allow this.
+
+Next, launch your applications using `srun`, carefully subdividing resources for each job step.
+The `--exclusive` flag must be used again, but note that its meaning differs in the context of `srun`.
+Here, `--exclusive` ensures that only the resources explicitly requested for a given job step are reserved and allocated to it.
+Without this flag, Slurm reserves all resources for the job step, even if it only allocates a subset -- effectively blocking further parallel `srun` invocations from accessing unrequested but needed resources.
+
+Be sure to background each `srun` command with `&`, so that subsequent job steps start immediately without waiting for previous ones to finish.
+A final `wait` command ensures that your submission script does not exit until all job steps complete.
+
+Slurm will automatically set `CUDA_VISIBLE_DEVICES` for each `srun` call, restricting GPU access to only the devices assigned to that job step.
+
+=== "single node"
+
+    !!! example "Three jobs on one node"
+        ```bash
+        #!/usr/bin/env bash
+        #SBATCH --exclusive --mem=450G
+        #SBATCH -N1
+
+        CMD="echo \$(date) \$(hostname) JobStep:\${SLURM_STEP_ID} ProcID:\${SLURM_PROCID} CUDA_VISIBLE_DEVICES=\${CUDA_VISIBLE_DEVICES}; sleep 5"
+        srun -N1 --ntasks-per-node=1 --exclusive --gpus-per-task=2 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &
+        srun -N1 --ntasks-per-node=1 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &
+        srun -N1 --ntasks-per-node=1 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &
+
+        wait
+        ```
+
+        Output (exact output will vary):
+        ```
+        $ cat out-537506.*.log
+        Tue Jul 1 11:40:46 CEST 2025 nid007104 JobStep:0 ProcID:0 CUDA_VISIBLE_DEVICES=0
+        Tue Jul 1 11:40:46 CEST 2025 nid007104 JobStep:1 ProcID:0 CUDA_VISIBLE_DEVICES=1
+        Tue Jul 1 11:40:46 CEST 2025 nid007104 JobStep:2 ProcID:0 CUDA_VISIBLE_DEVICES=2,3
+        ```
+
+
+
+=== "multi-node"
+
+    !!! example "Three jobs on two nodes"
+        ```bash
+        #!/usr/bin/env bash
+        #SBATCH --exclusive --mem=450G
+        #SBATCH -N2
+
+        CMD="echo \$(date) \$(hostname) JobStep:\${SLURM_STEP_ID} ProcID:\${SLURM_PROCID} CUDA_VISIBLE_DEVICES=\${CUDA_VISIBLE_DEVICES}; sleep 5"
+        srun -N2 --ntasks-per-node=2 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &
+        srun -N2 --ntasks-per-node=1 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &
+        srun -N2 --ntasks-per-node=1 --exclusive --gpus-per-task=1 --cpus-per-gpu=5 --mem=50G --output "out-%J.log" bash -c "${CMD}" &
+
+        wait
+        ```
+
+        Output (exact output will vary):
+        ```
+        $ cat out-537539.*.log
+        Tue Jul 1 12:02:01 CEST 2025 nid005085 JobStep:0 ProcID:2 CUDA_VISIBLE_DEVICES=0
+        Tue Jul 1 12:02:01 CEST 2025 nid005085 JobStep:0 ProcID:3 CUDA_VISIBLE_DEVICES=1
+        Tue Jul 1 12:02:01 CEST 2025 nid005080 JobStep:0 ProcID:0 CUDA_VISIBLE_DEVICES=0
+        Tue Jul 1 12:02:01 CEST 2025 nid005080 JobStep:0 ProcID:1 CUDA_VISIBLE_DEVICES=1
+        Tue Jul 1 12:02:01 CEST 2025 nid005085 JobStep:1 ProcID:1 CUDA_VISIBLE_DEVICES=2
+        Tue Jul 1 12:02:01 CEST 2025 nid005080 JobStep:1 ProcID:0 CUDA_VISIBLE_DEVICES=2
+        Tue Jul 1 12:02:01 CEST 2025 nid005085 JobStep:2 ProcID:1 CUDA_VISIBLE_DEVICES=3
+        Tue Jul 1 12:02:01 CEST 2025 nid005080 JobStep:2 ProcID:0 CUDA_VISIBLE_DEVICES=3
+        ```

docs/software/communication/openmpi.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ OpenMPI is provided through a [uenv][ref-uenv] similar to [`prgenv-gnu`][ref-uen
 Once the uenv is loaded, compiling and linking with OpenMPI and libfabric is transparent.
 At runtime, some additional options must be set to correctly use the Slingshot network.
 
-First, when launching applications through slurm, [PMIx](https://pmix.github.com) must be used for application launching.
+First, when launching applications through Slurm, [PMIx](https://pmix.github.com) must be used for application launching.
 This is done with the `--mpi` flag of `srun`:
 ```bash
 srun --mpi=pmix ...
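
As a sketch of the `--mpi=pmix` flag in a full launch line (the node and task counts and the `./mpi_app` binary are placeholders):

```bash
# Launch a hypothetical MPI binary over two nodes with PMIx.
srun --mpi=pmix --nodes=2 --ntasks-per-node=4 ./mpi_app
```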

docs/software/sciapps/cp2k.md

Lines changed: 3 additions & 3 deletions
@@ -65,7 +65,7 @@ On our systems, CP2K is built with the following dependencies:
 
 ### Running on the HPC platform
 
-To start a job, two bash scripts are potentially required: a [slurm] submission script, and a wrapper to start the [CUDA
+To start a job, two bash scripts are potentially required: a [Slurm] submission script, and a wrapper to start the [CUDA
 MPS] daemon so that multiple MPI ranks can use the same GPU.
 
 ```bash title="run_cp2k.sh"
@@ -138,7 +138,7 @@ sbatch run_cp2k.sh
 
 Each GH200 node has 4 modules, each of them composed of a ARM Grace CPU with 72 cores and a H200 GPU directly
 attached to it. Please see [Alps hardware][ref-alps-hardware] for more information.
-It is important that the number of MPI ranks passed to [slurm] with `--ntasks-per-node` is a multiple of 4.
+It is important that the number of MPI ranks passed to [Slurm] with `--ntasks-per-node` is a multiple of 4.
 
 ??? note
 
@@ -524,5 +524,5 @@ As a workaround, you can disable CUDA acceleration for the grid backend:
 [OpenBLAS]: http://www.openmathlib.org/OpenBLAS/
 [Intel MKL]: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
 [Cray MPICH]: https://docs.nersc.gov/development/programming-models/mpi/cray-mpich/
-[slurm]: https://slurm.schedmd.com/
+[Slurm]: https://slurm.schedmd.com/
 [CUDA MPS]: https://docs.nvidia.com/deploy/mps/index.html
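
To make the `--ntasks-per-node` constraint concrete, a sketch of a submission preamble for two GH200 nodes with eight ranks per node (a multiple of 4, i.e. two ranks per GPU); the account, wrapper script name, and input file are placeholders, while `cp2k.psmp` is CP2K's usual MPI executable:

```bash
#!/usr/bin/env bash
# Sketch only: 8 ranks per node is a multiple of 4, i.e. 2 ranks per GH200 GPU.
# <account>, ./mps-wrapper.sh, and input.inp are placeholders.
#SBATCH --account=<account>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00

srun ./mps-wrapper.sh cp2k.psmp -i input.inp
```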
