Commit 9335b9f

AMD CPU slurm guide for Eiger.Alps (#167)
* how to use slurm on zen2 nodes
* general hardware information about zen2 nodes
1 parent e796cc0 commit 9335b9f

3 files changed

+423
-25
lines changed

docs/alps/hardware.md

Lines changed: 41 additions & 13 deletions
@@ -3,8 +3,8 @@
33

44
Alps is an HPE Cray EX3000 system, a liquid-cooled, blade-based, high-density system.
55

6-
!!! todo
7-
this is a skeleton - all of the details need to be filled in
6+
!!! under-construction
7+
This page is a work in progress - contact us if you want us to prioritise documenting specific information that would be useful for your work.
88

99
## Alps Cabinets
1010

@@ -40,13 +40,13 @@ Alps was installed in phases, starting with the installation of 1024 AMD Rome du
4040

4141
There are currently five node types in Alps:
4242

43-
| type | abbreviation | blades | nodes | CPU sockets | GPU devices |
44-
| ---- | ------- | ------:| -----:| -----------:| -----------:|
45-
| NVIDIA GH200 | gh200 | 1344 | 2688 | 10,752 | 10,752 |
46-
| AMD Rome | zen2 | 256 | 1024 | 2,048 | -- |
47-
| NVIDIA A100 | a100 | 72 | 144 | 144 | 576 |
48-
| AMD MI250x | mi200 | 12 | 24 | 24 | 96 |
49-
| AMD MI300A | mi300 | 64 | 128 | 512 | 512 |
43+
| type | abbreviation | blades | nodes | CPU sockets | GPU devices |
44+
| ---- | ------- | ------:| -----:| -----------:| -----------:|
45+
| [NVIDIA GH200][ref-alps-gh200-node] | gh200 | 1344 | 2688 | 10,752 | 10,752 |
46+
| [AMD Rome][ref-alps-zen2-node] | zen2 | 256 | 1024 | 2,048 | -- |
47+
| [NVIDIA A100][ref-alps-a100-node] | a100 | 72 | 144 | 144 | 576 |
48+
| [AMD MI250x][ref-alps-mi200-node] | mi200 | 12 | 24 | 24 | 96 |
49+
| [AMD MI300A][ref-alps-mi300-node] | mi300 | 64 | 128 | 512 | 512 |
5050

5151
[](){#ref-alps-gh200-node}
5252
### NVIDIA GH200 GPU Nodes
@@ -81,16 +81,44 @@ Each node contains four Grace-Hopper modules and four corresponding network inte
8181
[](){#ref-alps-zen2-node}
8282
### AMD Rome CPU Nodes
8383

84-
!!! todo
84+
These nodes have two [AMD Epyc 7742](https://en.wikichip.org/wiki/amd/epyc/7742) 64-core CPU sockets, and are used primarily for the [Eiger][ref-cluster-eiger] system. They come in two memory configurations:
85+
86+
* *Standard-memory*: 256 GB in 16x16 GB DDR4 DIMMs.
87+
* *Large-memory*: 512 GB in 16x32 GB DDR4 DIMMs.
88+
89+
!!! note "Not all memory is available"
90+
The total memory available to jobs is roughly 245 GB on the standard-memory nodes and 497 GB on the large-memory nodes.
91+
92+
The amount of memory available to your job also depends on the number of MPI ranks per node -- each MPI rank has a memory overhead.
93+
94+
A schematic of a *standard memory node* below illustrates the CPU cores and [NUMA nodes](https://www.kernel.org/doc/html/v4.18/vm/numa.html).(1)
95+
{.annotate}
8596

86-
EX425
97+
1. Obtained with the command `lstopo --no-caches --no-io --no-legend eiger-topo.png` on Eiger.
98+
99+
![Screenshot](../images/slurm/eiger-topo.png)
100+
101+
* The two sockets are labelled Package L#0 and Package L#1.
102+
* Each socket has 4 NUMA nodes, with 16 cores each, for a total of 64 cores per socket.
103+
104+
Each core supports [simultaneous multithreading (SMT)](https://www.amd.com/en/blogs/2025/simultaneous-multithreading-driving-performance-a.html): it can execute two threads concurrently, which are presented as two processing units (PUs) per physical core:
105+
106+
* the first PU of each core is numbered in the range 0:63 on socket 0, and 64:127 on socket 1;
107+
* the second PU of each core is numbered in the range 128:191 on socket 0, and 192:255 on socket 1;
108+
* hence, core `n` has PUs `n` and `n+128`.
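
The numbering above can be sketched as a small Python helper. This is only an illustration, not part of any CSCS or Slurm tooling: the function names are hypothetical, and the NUMA lookup assumes that cores are numbered contiguously within each NUMA node (16 consecutive cores per NUMA node), which should be checked against the `lstopo` output.

```python
# Illustrative sketch of the PU numbering on a zen2 node.
# All names here are hypothetical helpers, not an existing API.
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 64
CORES_PER_NUMA = 16  # assumption: cores are contiguous within a NUMA node
CORES_PER_NODE = SOCKETS_PER_NODE * CORES_PER_SOCKET  # 128


def core_to_pus(core: int) -> tuple[int, int]:
    """Return the two PU indices (n, n+128) that belong to physical core n."""
    if not 0 <= core < CORES_PER_NODE:
        raise ValueError(f"core must be in 0..{CORES_PER_NODE - 1}, got {core}")
    return core, core + CORES_PER_NODE


def pu_to_core(pu: int) -> int:
    """Return the physical core that owns a PU index in 0..255."""
    if not 0 <= pu < 2 * CORES_PER_NODE:
        raise ValueError(f"PU must be in 0..{2 * CORES_PER_NODE - 1}, got {pu}")
    return pu % CORES_PER_NODE


def pu_to_socket(pu: int) -> int:
    """Return the socket (0 or 1) that a PU index belongs to."""
    return pu_to_core(pu) // CORES_PER_SOCKET


def pu_to_numa(pu: int) -> int:
    """Return the NUMA node (0..7), under the contiguity assumption above."""
    return pu_to_core(pu) // CORES_PER_NUMA
```

For example, `core_to_pus(5)` returns `(5, 133)`, and `pu_to_socket(192)` returns `1`, because PU 192 is the second PU of core 64, the first core on socket 1.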
109+
110+
Each node has two Slingshot 11 network interface cards (NICs), which are not illustrated on the diagram.
87111

88112
[](){#ref-alps-a100-node}
89113
### NVIDIA A100 GPU Nodes
90114

91-
!!! todo
115+
Each Grizzly Peak blade contains two nodes, and each node has:
92116

93-
Grizzly Peak
117+
* One 64-core Zen3 CPU socket
118+
* 512 GB DDR4 Memory
119+
* 4 NVIDIA A100 GPUs with 80 GB HBM2e memory each
120+
* The MCH system is the same, except that its A100 GPUs have 96 GB of memory each.
121+
* 4 NICs -- one per GPU.
94122

95123
[](){#ref-alps-mi200-node}
96124
### AMD MI250x GPU Nodes

docs/images/slurm/eiger-topo.png

52.1 KB