
Commit a2be996

Add documentation about configuring SLURM on GH200 (#12)

1 parent 897aed6 commit a2be996


docs/tools/slurm.md

Lines changed: 101 additions & 31 deletions
@@ -20,37 +20,107 @@ The following sections will provide detailed guidance on how to use SLURM to req

[](){#gh200-slurm}
### NVIDIA GH200 GPU Nodes

Removed:

!!! todo
    document how slurm can be used on the Grace-Hopper nodes.

    Note how you can link to this section from elsewhere using the anchor above, e.g.:

    ```
    [using slurm on Grace-Hopper][gh200-slurm]
    ```

    Link to the [Grace-Hopper overview][gh200-node].

    An example of using tabs to show srun and sbatch usage to get one GPU per MPI rank:

    === "sbatch"

        ```bash
        #!/bin/bash
        #SBATCH --job-name=affinity-test
        #SBATCH --ntasks-per-node=4
        #SBATCH --nodes=2
        #SBATCH --gpus-per-task=1

        srun affinity
        ```

    === "srun"

        ```
        > srun -n8 -N2 --gpus-per-task=1 affinity
        ```

Added:

The [GH200 nodes on Alps][gh200-node] have four GPUs per node, and SLURM job submissions must be configured appropriately to make the best use of the resources.
Applications that can saturate the GPUs with a single process per GPU should generally prefer the one-rank-per-GPU mode.
[Configuring SLURM jobs to use a single GPU per rank][gh200-slurm-single-rank-per-gpu] is also the most straightforward setup.
Some applications perform badly with a single rank per GPU and require [NVIDIA's Multi-Process Service (MPS)] to oversubscribe the GPUs with multiple ranks per GPU.

The best SLURM configuration is application- and workload-specific, so it is worth testing which configuration works best in your particular case.
See [Scientific Applications][sciapps] for recommended application-specific SLURM configurations.

!!! warning
    The GH200 nodes have their GPUs configured in ["default" compute mode](https://docs.nvidia.com/deploy/mps/index.html#gpu-compute-modes).
    The "default" mode is used to avoid issues with certain containers.
    Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously.
    This also means that different ranks on the same node can inadvertently use the same GPU, leading to suboptimal performance or unused GPUs rather than outright job failures.

    Some applications benefit from using multiple ranks per GPU; in these cases, however, [MPS should be used][gh200-slurm-multi-rank-per-gpu].

If you are unsure which GPU is being used by a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. the `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script.
If the variable is unset or empty, all GPUs are visible to the rank, and the rank will in most cases use only the first GPU.
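
As a quick check, an `srun` one-liner along these lines (a sketch; adjust the node and rank counts to your job) prints the mapping for every rank:

```bash
# Print GPU visibility per rank: one line per rank.
srun --nodes=1 --ntasks-per-node=4 bash -c \
    'echo "node=$SLURM_NODEID rank=$SLURM_PROCID local=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"'
```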

[](){#gh200-slurm-single-rank-per-gpu}
#### One rank per GPU

Configuring SLURM to use one GH200 GPU per rank is most easily done with the `--ntasks-per-node=4` and `--gpus-per-task=1` SLURM flags.
For advanced users: using `--gpus-per-task=1` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`, assuming the job uses four ranks per node.
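
Concretely, that equivalence means a launch like the following sketch (with `<application>` as a placeholder) pins each local rank to its own GPU without `--gpus-per-task`:

```bash
# Manual alternative to --gpus-per-task=1, assuming four ranks per node:
# each rank sees only the GPU whose index matches its local ID.
srun --nodes=2 --ntasks-per-node=4 bash -c \
    'CUDA_VISIBLE_DEVICES=$SLURM_LOCALID exec <application>'
```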

The examples below launch jobs on two nodes with four ranks per node using `sbatch` and `srun`:

```bash
#!/bin/bash
#SBATCH --job-name=gh200-single-rank-per-gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1

srun <application>
```
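
For interactive use, the same resources can be requested with `srun` alone:

```bash
# Two nodes, four ranks per node, one GPU per rank, without a batch script.
srun --nodes=2 --ntasks-per-node=4 --gpus-per-task=1 <application>
```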

Omitting `--gpus-per-task` results in `CUDA_VISIBLE_DEVICES` being unset, which will lead to most applications using the first GPU on all ranks.

[](){#gh200-slurm-multi-rank-per-gpu}
#### Multiple ranks per GPU

Using multiple ranks per GPU can improve performance for applications that do not generate enough work for a GPU with a single rank, or that scale badly across all 72 cores of the Grace CPU.
In these cases SLURM jobs must be configured to assign multiple ranks to a single GPU.
This is best done using [NVIDIA's Multi-Process Service (MPS)].
To use MPS, launch your application through the following wrapper script, which starts MPS from one rank per node and assigns GPUs to ranks according to each rank's CPU mask, ensuring that the closest GPU is used:

```bash
#!/bin/bash
# Example mps-wrapper.sh usage:
# > srun [srun args] mps-wrapper.sh [cmd] [cmd args]

# Only this path is supported by MPS
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-$(id -un)

# Launch MPS from a single rank per node
if [[ $SLURM_LOCALID -eq 0 ]]; then
    CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-cuda-mps-control -d
fi

# Set CUDA device: pick the GPU(s) on the NUMA node(s) this rank is bound to
numa_nodes=$(hwloc-calc --physical --intersect NUMAnode $(hwloc-bind --get --taskset))
export CUDA_VISIBLE_DEVICES=$numa_nodes

# Wait for MPS to start
sleep 1

# Run the command
numactl --membind=$numa_nodes "$@"
result=$?

# Quit MPS control daemon before exiting
if [[ $SLURM_LOCALID -eq 0 ]]; then
    echo quit | nvidia-cuda-mps-control
fi

exit $result
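
To confirm that the control daemon came up, you can query it with the standard `get_server_list` command of `nvidia-cuda-mps-control`, e.g. from a shell on a compute node while a job is active:

```bash
# Lists PIDs of active MPS servers; empty output means none is running yet.
echo get_server_list | nvidia-cuda-mps-control
```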

Save the above script as `mps-wrapper.sh` and make it executable with `chmod +x mps-wrapper.sh`.
If the `mps-wrapper.sh` script is in the current working directory, you can then launch jobs using MPS, for example, as follows:

```bash
#!/bin/bash
#SBATCH --job-name=gh200-multiple-ranks-per-gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=8

srun ./mps-wrapper.sh <application>
```

Note that in the example job above:

- `--gpus-per-node` is not set at all; the `mps-wrapper.sh` script ensures that the right GPU is visible to each rank using `CUDA_VISIBLE_DEVICES`
- `--ntasks-per-node` is set to 32, which results in 8 ranks per GPU
- `--cpus-per-task` is set to 8, which ensures that threads are not allowed to migrate across the whole GH200 node

The configuration that is optimal for your application may be different; a variant is sketched below for illustration.
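
For example, a hypothetical variant with fewer, larger ranks (16 ranks per node, i.e. 4 ranks per GPU with 16 cores each) could look as follows; the job name is illustrative:

```bash
#!/bin/bash
#SBATCH --job-name=gh200-four-ranks-per-gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=16

# Still launched through the MPS wrapper so each rank uses the closest GPU.
srun ./mps-wrapper.sh <application>
```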

[NVIDIA's Multi-Process Service (MPS)]: https://docs.nvidia.com/deploy/mps/index.html

[](){#amdcpu-slurm}
## AMD CPU
