
Commit 9a50b0b

Add documentation about configuring SLURM on GH200
1 parent 64a48d3 commit 9a50b0b


docs/tools/slurm.md

Lines changed: 77 additions & 11 deletions
[](){#gh200-slurm}
### NVIDIA GH200 GPU Nodes

The [GH200 nodes on Alps][gh200-node] have four GPUs per node, and SLURM job submissions must be configured appropriately to make the best use of the resources. Applications that can saturate a GPU with a single process should generally use one rank per GPU; [configuring SLURM jobs to use a single GPU per rank][gh200-slurm-single-rank-per-gpu] is also the most straightforward setup. Some applications perform badly with a single rank per GPU and require [NVIDIA's Multi-Process Service (MPS)](https://docs.nvidia.com/deploy/mps/index.html) to oversubscribe GPUs with multiple ranks per GPU.

The best SLURM configuration is application- and workload-specific, so it is worth testing which configuration works best in your particular case. Also see [TODO][TODO] for information about recommended application-specific SLURM configurations.

!!! warning
    The GH200 nodes have their GPUs configured in ["default" compute mode](https://docs.nvidia.com/deploy/mps/index.html#gpu-compute-modes). Unlike "exclusive process" mode, "default" mode allows multiple processes to submit work to a single GPU simultaneously. This also means that different ranks on the same node can inadvertently use the same GPU, leading to suboptimal performance or unused GPUs rather than job failures.

    Some applications benefit from using multiple ranks per GPU. However, [MPS should be used][gh200-slurm-multi-rank-per-gpu] in these cases.

    If you are unsure which GPU is being used by a particular rank, print the `CUDA_VISIBLE_DEVICES` variable, along with e.g. the `SLURM_LOCALID`, `SLURM_PROCID`, and `SLURM_NODEID` variables, in your job script, as in the example below. If `CUDA_VISIBLE_DEVICES` is unset or empty, all GPUs are visible to the rank and the rank will in most cases only use the first GPU.

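A minimal sketch of such a check, assuming a two-node allocation with four ranks per node (adjust the flags to your job):

```bash
# Print which GPU each rank sees, together with its SLURM rank identifiers.
# Single quotes ensure the variables are expanded by each rank, not by the submitting shell.
srun --nodes=2 --ntasks-per-node=4 --gpus-per-task=1 bash -c \
    'echo "node=$SLURM_NODEID rank=$SLURM_PROCID local=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```
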
[](){#gh200-slurm-single-rank-per-gpu}
#### One rank per GPU

Configuring SLURM to use one GH200 GPU per rank is most easily done with the `--ntasks-per-node=4` and `--gpus-per-task=1` SLURM flags. For advanced users: using `--gpus-per-task` is equivalent to setting `CUDA_VISIBLE_DEVICES` to `SLURM_LOCALID`, assuming the job uses four ranks per node. The examples below launch jobs on two nodes with four ranks per node using `sbatch` and `srun`:

=== "sbatch"

    ```bash
    #!/bin/bash
    #SBATCH --job-name=affinity-test
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    #SBATCH --gpus-per-task=1

    srun <application>
    ```

=== "srun"

    ```bash
    srun --nodes=2 --ntasks-per-node=4 --gpus-per-task=1 <application>
    ```

Omitting the `--gpus-per-task` flag will lead to all ranks on the node using the first GPU.

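If `--gpus-per-task` cannot be used, the same mapping can be reproduced by hand. The following is a minimal sketch of such a wrapper; the name `select-gpu.sh` is illustrative, and it assumes exactly four ranks per node:

```bash
#!/bin/bash
# select-gpu.sh: expose to each rank only the GPU matching its local rank ID,
# then execute the application passed as arguments.
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```

It would be launched as `srun --nodes=2 --ntasks-per-node=4 ./select-gpu.sh <application>`.
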
[](){#gh200-slurm-multi-rank-per-gpu}
#### Multiple ranks per GPU

Using multiple ranks per GPU can improve performance, e.g. for applications that don't generate enough work for a GPU with a single rank, or that scale badly to all 72 cores of the Grace CPU. In these cases SLURM jobs must be configured to assign multiple ranks to a single GPU. This is best done using MPS. To use MPS, launch your application with the following wrapper script, which starts MPS from one rank per node and assigns GPUs to ranks according to the CPU mask of each rank, ensuring the closest GPU is used:

```bash
#!/bin/bash
# Example mps-wrapper.sh usage:
# > srun --cpu-bind=socket [srun args] mps-wrapper.sh [cmd] [cmd args]

# Only this path is supported by MPS
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-$(id -un)

# Launch MPS from a single rank per node
if [[ $SLURM_LOCALID -eq 0 ]]; then
    CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-cuda-mps-control -d
fi

# Set CUDA device
numa_nodes=$(hwloc-calc --physical --intersect NUMAnode $(hwloc-bind --get --taskset))
export CUDA_VISIBLE_DEVICES=$numa_nodes

# Wait for MPS to start
sleep 1

# Run the command
numactl --membind=$numa_nodes "$@"
result=$?

# Quit MPS control daemon before exiting
if [[ $SLURM_LOCALID -eq 0 ]]; then
    echo quit | nvidia-cuda-mps-control
fi

exit $result
```

Save the above script as `mps-wrapper.sh` and make it executable with `chmod +x mps-wrapper.sh`. If the `mps-wrapper.sh` script is in the current working directory, you can then launch jobs using MPS, for example as follows:

```bash
#!/bin/bash
#SBATCH --job-name=oversubscription-affinity-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=8

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --cpu-bind=socket ./mps-wrapper.sh <application>
```

Note that in the example job above:

- `--gpus-per-node` is not set at all; the `mps-wrapper.sh` script ensures that the right GPU is visible to each rank using `CUDA_VISIBLE_DEVICES`
- `--ntasks-per-node` is set to 32; this results in 8 ranks per GPU
- `--cpus-per-task` is set to 8; this ensures that the CPU mask is set appropriately for each rank
- `OMP_NUM_THREADS` is exported for applications that use OpenMP; this may not be needed for your application, or you may need to configure other libraries to use the correct number of threads
- `--cpu-bind=socket` is set on the `srun` command; this exposes a full CPU (socket) to each rank, allowing threads to migrate between cores within the socket, but not across sockets

The configuration that is optimal for your application may be different.

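As an illustration, a hypothetical variant of the job above that uses 4 ranks per GPU instead of 8 (the values are illustrative, not a recommendation) only changes the task layout; the wrapper script stays the same:

```bash
#!/bin/bash
#SBATCH --job-name=oversubscription-affinity-test
#SBATCH --nodes=2
# 16 ranks per node across 4 GPUs gives 4 ranks per GPU
#SBATCH --ntasks-per-node=16
# 16 ranks x 16 cores uses 256 of the 288 cores on a node
#SBATCH --cpus-per-task=16

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --cpu-bind=socket ./mps-wrapper.sh <application>
```
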
[](){#amdcpu-slurm}
## AMD CPU

0 commit comments
