[](){#gh200-slurm}
### NVIDIA GH200 GPU Nodes

The [GH200 nodes on Alps][gh200-node] have four GPUs per node, and SLURM job submissions must be configured appropriately to best make use of the resources.
Applications that can saturate a GPU with a single process should generally use one rank per GPU.
[Configuring SLURM jobs to use a single GPU per rank][gh200-slurm-single-rank-per-gpu] is also the most straightforward setup.
Some applications perform badly with a single rank per GPU, and require the use of [NVIDIA's Multi-Process Service (MPS)] to oversubscribe GPUs with multiple ranks per GPU.

The best SLURM configuration is application- and workload-specific, so it is worth testing which works best in your particular case.
See [Scientific Applications][sciapps] for information about recommended application-specific SLURM configurations.

!!! warning
    The GH200 nodes have their GPUs configured in ["default" compute mode](https://docs.nvidia.com/deploy/mps/index.html#gpu-compute-modes).
    The "default" mode is used to avoid issues with certain containers.
    Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously.
    This also means that different ranks on the same node can inadvertently use the same GPU, leading to suboptimal performance or unused GPUs rather than outright job failures.

    Some applications benefit from using multiple ranks per GPU; in these cases, however, [MPS should be used][gh200-slurm-multi-rank-per-gpu] rather than relying on the "default" compute mode.

    If you are unsure which GPU is being used by a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. the `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script.
    If the variable is unset or empty, all GPUs are visible to the rank, and the rank will in most cases only use the first GPU.
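
    For example, a one-off launch along the following lines prints the placement and visible GPU of every rank (an illustrative command; adapt the node and rank counts to your job):

    ```bash
    # Print, for every rank, its placement and the GPU it sees.
    srun --nodes=2 --ntasks-per-node=4 --gpus-per-task=1 bash -c \
        'echo "node=$SLURM_NODEID rank=$SLURM_PROCID local=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
    ```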

[](){#gh200-slurm-single-rank-per-gpu}
#### One rank per GPU

Configuring SLURM to use one GH200 GPU per rank is most easily done using the `--ntasks-per-node=4` and `--gpus-per-task=1` SLURM flags.
For advanced users: using `--gpus-per-task` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`, assuming the job uses four ranks per node.
The examples below launch jobs on two nodes with four ranks per node using `sbatch` and `srun`:

```bash
#!/bin/bash
#SBATCH --job-name=gh200-single-rank-per-gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1

srun <application>
```
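
For interactive use, the same configuration can be requested directly with `srun`; a minimal equivalent of the batch script above is:

```bash
srun --nodes=2 --ntasks-per-node=4 --gpus-per-task=1 <application>
```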

Omitting the `--gpus-per-task` flag results in `CUDA_VISIBLE_DEVICES` being unset, which will lead to most applications using the first GPU on all ranks.
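
If `--gpus-per-task` cannot be used, the same one-rank-per-GPU mapping can be reproduced by hand with a small wrapper; a minimal sketch (the `select_gpu` wrapper name is illustrative, not a documented script) is:

```bash
#!/bin/bash
# select_gpu: assumes exactly four ranks per node, one GH200 GPU per rank.
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```

which would be launched as `srun --ntasks-per-node=4 ./select_gpu <application>`.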

[](){#gh200-slurm-multi-rank-per-gpu}
#### Multiple ranks per GPU

Using multiple ranks per GPU can improve the performance of applications that do not generate enough work for a GPU with a single rank, or that scale poorly to all 72 cores of the Grace CPU.
In these cases, SLURM jobs must be configured to assign multiple ranks to a single GPU.
This is best done using [NVIDIA's Multi-Process Service (MPS)].
To use MPS, launch your application using the following wrapper script, which starts MPS from one rank per node and assigns a GPU to each rank according to its CPU mask, ensuring that the closest GPU is used:
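
As a rough illustration of what such a wrapper does (all names below are assumptions rather than the documented script, and the GPU selection is simplified to the local rank index instead of the CPU mask):

```bash
#!/bin/bash
# Illustrative MPS wrapper: start MPS once per node, pick a GPU per rank, run the application.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

# Start the MPS control daemon from a single rank on each node.
if [[ $SLURM_LOCALID -eq 0 ]]; then
    nvidia-cuda-mps-control -d
fi

# Spread ranks evenly over the four GPUs (assumes --ntasks-per-node is a multiple of 4).
export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID * 4 / SLURM_NTASKS_PER_NODE ))

# Give the daemon a moment to come up before the first CUDA call.
sleep 2

exec "$@"
```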