Alps is an HPE Cray EX3000 system, a liquid-cooled, blade-based, high-density system.

!!! under-construction
    This page is a work in progress - contact us if you want us to prioritise documenting specific information that would be useful for your work.
## Alps Cabinets
A schematic of a *standard memory node* below illustrates the CPU cores and NUMA nodes:

* The two sockets are labelled Package L#0 and Package L#1.
* Each socket has 4 NUMA nodes, with 16 cores each, for a total of 64 cores per socket.

Each core supports [simultaneous multithreading (SMT)](https://www.amd.com/en/blogs/2025/simultaneous-multithreading-driving-performance-a.html), whereby each core can execute two threads concurrently, which are presented as two processing units (PU) per physical core:

* the first PU on each core is numbered 0:63 on socket 0, and 64:127 on socket 1;
* the second PU on each core is numbered 128:191 on socket 0, and 192:255 on socket 1;
* hence, core `n` has PUs `n` and `n+128`.
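
One way to check this numbering on a node is to read the SMT sibling list from Linux sysfs (a standard kernel interface; the expected value follows from the layout described above):

```bash
# Print the two PUs (hardware threads) that share physical core 0;
# with the numbering above this should report "0,128".
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
```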
Each node has two Slingshot 11 network interface cards (NICs), which are not illustrated on the diagram.
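
The interfaces can be listed with standard Linux tools; on HPE Slingshot systems the NICs typically appear as `hsn` interfaces, although treat the exact names as an assumption:

```bash
# List the network interfaces visible on the node; the two Slingshot NICs
# are expected to appear alongside the management interfaces.
ls /sys/class/net
```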
## Affinity
The following sections will document how to use Slurm on different compute nodes available on Alps.
To demonstrate the effects of different Slurm parameters, we will use a small command line tool, [affinity](https://github.com/bcumming/affinity), that prints the CPU cores and GPUs assigned to each MPI rank in a job, and which node each rank runs on.
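
A sketch of how to obtain and build the tool follows; the repository README is the authoritative reference, and the CMake workflow and MPI requirement below are assumptions:

```bash
# Clone the affinity test tool and build it with an MPI compiler available
git clone https://github.com/bcumming/affinity.git
cd affinity
cmake -B build        # assumed CMake-based build
cmake --build build
```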
We strongly recommend using a tool like affinity to understand and test the Slurm configuration for jobs, because the behavior of Slurm is highly dependent on the system configuration.
Parameters that worked on a different cluster -- or with a different Slurm version or configuration on the same cluster -- are not guaranteed to give the same results.
The configuration that is optimal for your application may be different.

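For example, the parts of the configuration that most affect task placement can be inspected with standard Slurm commands (the `NODE` variable below is a placeholder for a node name on the target cluster, e.g. taken from `sinfo`):

```bash
# Scheduler-wide settings that influence how tasks are selected and bound
scontrol show config | grep -E 'SelectType|TaskPlugin'

# The topology Slurm assumes for a specific node (sockets, cores, threads)
scontrol show node "$NODE" | grep -E 'Sockets|CoresPerSocket|ThreadsPerCore'
```
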
## AMD CPU Nodes
Alps has nodes with two AMD Epyc Rome CPU sockets per node for CPU-only workloads, most notably in the [Eiger][ref-cluster-eiger] cluster provided by the [HPC Platform][ref-platform-hpcp].
For a detailed description of the node hardware, see the [AMD Rome node][ref-alps-zen2-node] hardware documentation.
??? info "Node description"
    - The node has 2 x 64-core sockets
    - Each socket is divided into 4 NUMA regions
    - The 16 cores in each NUMA region have faster memory access to their own 32 GB of memory
    - Each core has two processing units (PUs)

    ![zen2-schematic](https://raw.githubusercontent.com/eth-cscs/cscs-docs/refs/heads/main/docs/images/slurm/eiger-topo.png){width=100%}
Each MPI rank is assigned a set of cores on a node, and Slurm provides flags that can be passed directly to `srun`, or set as options in an `sbatch` script.
Here are some basic flags that we will use to distribute work.
| flag | meaning |
| ---- | ------- |
|`-n`, `--ntasks`| The total number of MPI ranks |
|`-N`, `--nodes`| The total number of nodes |
|`--ntasks-per-node`| The number of MPI ranks per node |
|`-c`, `--cpus-per-task`| The number of cores to assign to each rank |
|`--hint=nomultithread`| Use only one PU per core |
!!! info "Slurm is highly configurable"
    These are a subset of the most useful flags.
    Call `srun --help` or `sbatch --help` to get a complete list of all the flags available on your target cluster.
    Note that the exact set of flags available depends on the Slurm version, how Slurm was configured, and which Slurm plugins are in use.
The first example assigns 2 MPI ranks per node, with 64 cores per rank, and with both PUs of each core available to the rank:
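
A minimal sketch of an invocation along these lines is shown below; the node count and the name of the affinity executable are assumptions rather than part of the original example:

```bash
# 2 nodes x 2 ranks per node = 4 ranks in total. With no --cpus-per-task
# given, the intent is that each rank ends up with one full 64-core socket
# and both PUs per core; the exact binding depends on the Slurm configuration.
srun --nodes=2 --ntasks=4 ./affinity
```
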
In the above example we use `--ntasks/-n` and `--nodes/-N`.
It is possible to achieve the same effect using `--nodes` and `--ntasks-per-node`; for example, the following two commands both give 8 ranks on 4 nodes:

```bash
srun --nodes=4 --ntasks=8
srun --nodes=4 --ntasks-per-node=2
```

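The same flags can equally be set as `#SBATCH` options in a batch script; a minimal sketch (the script body and executable name are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2

# srun inherits the allocation defined by the #SBATCH options above
srun ./affinity
```
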
It is often more efficient to use only one PU per core instead of the default two, which can be achieved using the `--hint=nomultithread` option.
```console title="One MPI rank per socket with 1 PU per core"
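# Sketch of the kind of command behind this example: one rank per socket
# (2 ranks per node), one PU per core. The executable name is assumed, and
# the output is omitted (it includes the per-rank affinity report and the
# NUMA "node distances" table discussed below).
$ srun --nodes=1 --ntasks-per-node=2 --cpus-per-task=64 --hint=nomultithread ./affinity
```
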
The `node distances` table shows that cores have the fastest access to memory in their own region (`10`), and fast access (`12`) to the other NUMA regions on the same socket.
The cost of accessing memory of a NUMA node on the other socket is much higher (`32`).
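
For reference, the NUMA layout and the distance matrix of a node can be printed directly with the standard `numactl` utility, independent of Slurm:

```bash
# Show the NUMA nodes, the memory attached to each, and the node distances table
numactl --hardware
```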
Note that this command was run on a large-memory node that has 8 x 64 GB NUMA regions, for a total of 512 GB.
The examples above placed one rank per socket, which is not optimal for NUMA access.
To constrain each rank to a single NUMA region, increase the number of ranks per node and reduce the number of cores per rank, as sketched below.
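
A sketch of one such placement, with one rank per 16-core NUMA region and one PU per core (the executable name is a placeholder, and the best rank/thread split is application specific):

```bash
# 8 ranks per node, 16 cores per rank: each rank maps onto one NUMA region
srun --nodes=1 --ntasks-per-node=8 --cpus-per-task=16 --hint=nomultithread ./affinity
```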
!!! note "Always test"
    For applications that have high threading efficiency and benefit from using fewer MPI ranks, it might still be optimal to have one rank per socket, or even one rank per node.