
Commit edbebe5

committed: finish first draft
1 parent 8cd17b2 commit edbebe5

2 files changed: +193 -33 lines changed

docs/alps/hardware.md

Lines changed: 6 additions & 7 deletions
Alps is an HPE Cray EX3000 system, a liquid-cooled, blade-based, high-density system.

!!! under-construction
    This page is a work in progress - contact us if you would like us to prioritise documenting specific information that would be useful for your work.

## Alps Cabinets

A schematic of a *standard memory node* below illustrates the CPU cores and NUMA nodes:

* The two sockets are labelled Package L#0 and Package L#1.
* Each socket has 4 NUMA nodes, with 16 cores each, for a total of 64 cores per socket.

Each core supports [simultaneous multi threading (SMT)](https://www.amd.com/en/blogs/2025/simultaneous-multithreading-driving-performance-a.html), whereby it can execute two threads concurrently, which are presented as two processing units (PU) per physical core:

* the first PUs are numbered 0:63 on socket 0, and 64:127 on socket 1;
* the second PUs are numbered 128:191 on socket 0, and 192:255 on socket 1;
* hence, core `n` has PUs `n` and `n+128`.
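A quick way to verify this pairing on a node is to read the SMT sibling list that the Linux kernel exposes for each PU; a minimal sketch (given the numbering above, PU 0 should report `0,128`):

```console
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
```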
Each node has two Slingshot 11 network interface cards (NICs), which are not illustrated on the diagram.

docs/running/slurm.md

Lines changed: 187 additions & 26 deletions
## Affinity

The following sections will document how to use Slurm on the different compute nodes available on Alps.
To demonstrate the effects of different Slurm parameters, we will use a small command line tool, [affinity](https://github.com/bcumming/affinity), that prints the CPU cores and GPUs assigned to each MPI rank in a job, and the node that each rank runs on.

We strongly recommend using a tool like affinity to understand and test the Slurm configuration for jobs, because the behavior of Slurm is highly dependent on the system configuration.
Parameters that worked on a different cluster -- or with a different Slurm version or configuration on the same cluster -- are not guaranteed to give the same results.
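The affinity tool can be built from source; the following is a minimal sketch assuming a standard CMake workflow with a C++ compiler and MPI available (the exact steps may differ from the repository's own instructions):

```console
$ git clone https://github.com/bcumming/affinity.git
$ cd affinity
$ cmake -B build
$ cmake --build build
```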
For example, running `affinity.mpi` with 8 ranks across 2 nodes:

```console
$ uenv start prgenv-gnu/24.11:v2 --view=default
$ srun -n8 -N2 -c72 ./affinity.mpi
affinity test for 8 MPI ranks
rank 0 @ nid006363: thread 0 -> cores [ 0: 71]
rank 1 @ nid006363: thread 0 -> cores [ 72:143]
rank 2 @ nid006363: thread 0 -> cores [144:215]
rank 3 @ nid006363: thread 0 -> cores [216:287]
rank 4 @ nid006375: thread 0 -> cores [ 0: 71]
rank 5 @ nid006375: thread 0 -> cores [ 72:143]
rank 6 @ nid006375: thread 0 -> cores [144:215]
rank 7 @ nid006375: thread 0 -> cores [216:287]
```

In this example there are 8 MPI ranks distributed over 2 nodes, with 72 cores per rank.
The configuration that is optimal for your application may be different.
## AMD CPU Nodes

Alps has nodes with two AMD Epyc Rome CPU sockets per node for CPU-only workloads, most notably in the [Eiger][ref-cluster-eiger] cluster provided by the [HPC Platform][ref-platform-hpcp].
For a detailed description of the node hardware, see the [AMD Rome node][ref-alps-zen2-node] hardware documentation.

??? info "Node description"
    - The node has 2 x 64-core sockets.
    - Each socket is divided into 4 NUMA regions.
    - The 16 cores in each NUMA region have faster access to their own 32 GB of memory.
    - Each core has two processing units (PUs).

    ![Screenshot](../images/slurm/eiger-topo.png)
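Topology diagrams like the one above can be generated with hwloc's `lstopo` tool; a minimal sketch, assuming hwloc is installed on the compute node:

```console
$ srun -n1 lstopo-no-graphics
```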
Each MPI rank is assigned a set of cores on a node, and Slurm provides flags for controlling this assignment that can be passed directly to `srun`, or set as arguments in an `sbatch` script.
Here are some basic flags that we will use to distribute work.
| flag | meaning |
| ---- | ------- |
| `-n`, `--ntasks` | The total number of MPI ranks |
| `-N`, `--nodes` | The total number of nodes |
| `--ntasks-per-node` | The number of MPI ranks per node |
| `-c`, `--cpus-per-task` | The number of cores to assign to each rank |
| `--hint=nomultithread` | Use only one PU per core |
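The same flags can be set in the header of an `sbatch` script; a minimal sketch, where `./affinity.mpi` stands in for your application:

```bash
#!/bin/bash
#SBATCH --nodes=2              # -N: request two nodes
#SBATCH --ntasks-per-node=2    # two MPI ranks per node
#SBATCH --cpus-per-task=64     # -c: 64 cores per rank
#SBATCH --hint=nomultithread   # use only one PU per core

# srun inherits the resource requests made in the #SBATCH headers above
srun ./affinity.mpi
```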
!!! info "Slurm is highly configurable"
    These are a subset of the most useful flags.
    Call `srun --help` or `sbatch --help` to get a complete list of all the flags available on your target cluster.
    Note that the exact set of flags available depends on the Slurm version, how Slurm was configured, and Slurm plugins.

The first example assigns 2 MPI ranks per node, with 64 cores per rank, using both PUs on each core:
```console title="One MPI rank per socket"
# one node
$ srun -n2 -N1 -c64 ./affinity.mpi
affinity test for 2 MPI ranks
rank 0 @ nid002199: thread 0 -> cores [ 0: 31,128:159]
rank 1 @ nid002199: thread 0 -> cores [ 64: 95,192:223]

# two nodes
$ srun -n4 -N2 -c64 ./affinity.mpi
affinity test for 4 MPI ranks
rank 0 @ nid001512: thread 0 -> cores [ 0: 31,128:159]
rank 1 @ nid001512: thread 0 -> cores [ 64: 95,192:223]
rank 2 @ nid001515: thread 0 -> cores [ 0: 31,128:159]
rank 3 @ nid001515: thread 0 -> cores [ 64: 95,192:223]
```
!!! note
    In the above example we use `--ntasks/-n` and `--nodes/-N`.
    It is possible to achieve the same effect using `--nodes` and `--ntasks-per-node`, for example the following both give 8 ranks on 4 nodes:

    ```bash
    srun --nodes=4 --ntasks=8
    srun --nodes=4 --ntasks-per-node=2
    ```

It is often more efficient to run only one thread per core instead of using both PUs, which can be achieved with the `--hint=nomultithread` option.
```console title="One MPI rank per socket with 1 PU per core"
$ srun -n2 -N1 -c64 --hint=nomultithread ./affinity.mpi
affinity test for 2 MPI ranks
rank 0 @ nid002199: thread 0 -> cores [ 0: 63]
rank 1 @ nid002199: thread 0 -> cores [ 64:127]
```
!!! note "Always test"
    The best configuration for performance is highly application specific, with no one-size-fits-all configuration.
    Take the time to experiment with `--hint=nomultithread`.

Memory on the node is divided into NUMA (non-uniform memory access) regions.
The 256 GB of a standard-memory node are divided into 8 NUMA nodes of 32 GB, with 16 cores associated with each node:

* memory access is optimal when all the cores of a rank are on the same NUMA node;
* memory access to NUMA regions on the other socket is significantly slower.
??? info "How to investigate the NUMA layout of a node"
    Use the command `numactl -H`.

    ```console
    $ srun -n1 numactl -H
    available: 8 nodes (0-7)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
    node 0 size: 63733 MB
    node 0 free: 62780 MB
    node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
    node 1 size: 64502 MB
    node 1 free: 61774 MB
    node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
    node 2 size: 64456 MB
    node 2 free: 63385 MB
    node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
    node 3 size: 64490 MB
    node 3 free: 62613 MB
    node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
    node 4 size: 64502 MB
    node 4 free: 63897 MB
    node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
    node 5 size: 64502 MB
    node 5 free: 63769 MB
    node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
    node 6 size: 64502 MB
    node 6 free: 63870 MB
    node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
    node 7 size: 64428 MB
    node 7 free: 63712 MB
    node distances:
    node   0   1   2   3   4   5   6   7
      0:  10  12  12  12  32  32  32  32
      1:  12  10  12  12  32  32  32  32
      2:  12  12  10  12  32  32  32  32
      3:  12  12  12  10  32  32  32  32
      4:  32  32  32  32  10  12  12  12
      5:  32  32  32  32  12  10  12  12
      6:  32  32  32  32  12  12  10  12
      7:  32  32  32  32  12  12  12  10
    ```

    The `node distances` table shows that the cores have the fastest access to memory in their own region (`10`), and fast access (`12`) to NUMA regions on the same socket.
    The cost of accessing memory of a NUMA node on the other socket is much higher (`32`).

    Note that this command was run on a large-memory node that has 8 x 64 GB NUMA regions, for a total of 512 GB.
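The `numactl` command can also launch a process with its cores and memory restricted to chosen NUMA nodes; a minimal sketch, where `./app` is a placeholder for your executable:

```console
# restrict both execution and memory allocation to NUMA node 0
$ numactl --cpunodebind=0 --membind=0 ./app
```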
The examples above placed one rank per socket, which is not optimal for NUMA access.
To constrain each rank to a single NUMA region, increase the number of ranks per node so that each rank gets the 16 cores of one NUMA region, as in the example below.

!!! note "Always test"
    For applications that have high threading efficiency and benefit from using fewer MPI ranks, it might still be optimal to have one rank per socket or even one rank per node.
    Always test!
```console title="One MPI rank per NUMA region"
$ srun -n8 -N1 -c16 --hint=nomultithread ./affinity.mpi
affinity test for 8 MPI ranks
rank 0 @ nid002199: thread 0 -> cores [ 0: 15]
rank 1 @ nid002199: thread 0 -> cores [ 64: 79]
rank 2 @ nid002199: thread 0 -> cores [ 16: 31]
rank 3 @ nid002199: thread 0 -> cores [ 80: 95]
rank 4 @ nid002199: thread 0 -> cores [ 32: 47]
rank 5 @ nid002199: thread 0 -> cores [ 96:111]
rank 6 @ nid002199: thread 0 -> cores [ 48: 63]
rank 7 @ nid002199: thread 0 -> cores [112:127]
```
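Slurm can also report the binding it applies without an external tool: the `--cpu-bind=verbose` option of `srun` prints the CPU mask assigned to each task. A sketch of its use (the output format varies with the Slurm version):

```console
$ srun -n8 -N1 -c16 --hint=nomultithread --cpu-bind=verbose ./affinity.mpi
```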
In the above examples all threads on each rank are unbound -- we are effectively allowing the OS to schedule the threads on the available set of cores as it sees fit.
This often gives the best performance, however sometimes it is beneficial to bind threads to explicit cores.

### OpenMP

The OpenMP threading runtime provides additional options for controlling the pinning of threads to the cores assigned to each MPI rank.

Use the `--omp` flag with `affinity.mpi` to get more detailed information about OpenMP thread affinity.
For example, four MPI ranks on one node, with four cores per rank and four OpenMP threads:

```console title="No OpenMP binding"
$ export OMP_NUM_THREADS=4
$ srun -n4 -N1 -c4 --hint=nomultithread ./affinity.mpi --omp
affinity test for 4 MPI ranks
rank 0 @ nid001512: threads [0:3] -> cores [ 0: 3]
rank 1 @ nid001512: threads [0:3] -> cores [ 64: 67]
rank 2 @ nid001512: threads [0:3] -> cores [ 4: 7]
rank 3 @ nid001512: threads [0:3] -> cores [ 68: 71]
```

The output `threads [0:3] -> cores [ 0: 3]` is shorthand for "there are 4 OpenMP threads, and the OS can schedule them on cores 0, 1, 2 and 3".
Allowing the OS to schedule threads is usually efficient, however to get the best performance you can try pinning threads to specific cores.
The [`OMP_PROC_BIND`](https://www.openmp.org/spec-html/5.0/openmpse52.html) environment variable can be used to tune how OpenMP sets thread affinity.
For example, `OMP_PROC_BIND=true` will give each thread exclusive affinity with a core:
```console title="OMP_PROC_BIND=true"
$ export OMP_NUM_THREADS=4
$ export OMP_PROC_BIND=true
$ srun -n4 -N1 -c4 --hint=nomultithread ./affinity.mpi --omp
affinity test for 4 MPI ranks
rank 0 @ nid001512
    thread 0 -> core 0
    thread 1 -> core 1
    thread 2 -> core 2
    thread 3 -> core 3
rank 1 @ nid001512
    thread 0 -> core 64
    thread 1 -> core 65
    thread 2 -> core 66
    thread 3 -> core 67
rank 2 @ nid001512
    thread 0 -> core 4
    thread 1 -> core 5
    thread 2 -> core 6
    thread 3 -> core 7
rank 3 @ nid001512
    thread 0 -> core 68
    thread 1 -> core 69
    thread 2 -> core 70
    thread 3 -> core 71
```
!!! note
    There are many OpenMP environment variables that can be used to fine-tune affinity.
    See the [OpenMP documentation](https://www.openmp.org/spec-html/5.0/openmpch6.html) for more information.
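As one example of such fine-tuning, the standard `OMP_PLACES` variable can be combined with `OMP_PROC_BIND`; this is a sketch using standard OpenMP settings, and the best combination is application specific:

```console
$ export OMP_NUM_THREADS=4
$ export OMP_PLACES=cores      # one place per physical core
$ export OMP_PROC_BIND=close   # pack threads onto neighbouring places
$ srun -n4 -N1 -c4 --hint=nomultithread ./affinity.mpi --omp
```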
!!! warning
    The `OMP_*` environment variables only affect the thread affinity of applications that use OpenMP for thread-level parallelism.
    Other threading runtimes are configured differently, and for those the `affinity.mpi` tool will only be able to show the set of cores assigned to each rank.
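If your application does not use OpenMP, you can still inspect the cores available to each rank with standard Linux tools; a sketch using `taskset` (the exact output wording depends on your util-linux version):

```console
$ srun -n4 -N1 -c16 --hint=nomultithread bash -c 'echo "rank $SLURM_PROCID: $(taskset -pc $$)"'
```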
