|
3 | 3 |
|
4 | 4 | Alps is a HPE Cray EX3000 system, a liquid cooled blade-based, high-density system. |
5 | 5 |
|
6 | | -!!! todo |
7 | | - this is a skeleton - all of the details need to be filled in |
| 6 | +!!! under-construction |
| 7 | + This page is a work in progress - contact us if you want us to prioritise documentation specific information that would be useful for your work. |
8 | 8 |
|
9 | 9 | ## Alps Cabinets |
10 | 10 |
|
@@ -40,13 +40,13 @@ Alps was installed in phases, starting with the installation of 1024 AMD Rome du |
40 | 40 |
|
41 | 41 | There are currently five node types in Alps: |
42 | 42 |
|
43 | | -| type | abbreviation | blades | nodes | CPU sockets | GPU devices | |
44 | | -| ---- | ------- | ------:| -----:| -----------:| -----------:| |
45 | | -| NVIDIA GH200 | gh200 | 1344 | 2688 | 10,752 | 10,752 | |
46 | | -| AMD Rome | zen2 | 256 | 1024 | 2,048 | -- | |
47 | | -| NVIDIA A100 | a100 | 72 | 144 | 144 | 576 | |
48 | | -| AMD MI250x | mi200 | 12 | 24 | 24 | 96 | |
49 | | -| AMD MI300A | mi300 | 64 | 128 | 512 | 512 | |
| 43 | +| type | abbreviation | blades | nodes | CPU sockets | GPU devices | |
| 44 | +| ---- | ------- | ------:| -----:| -----------:| -----------:| |
| 45 | +| [NVIDIA GH200][ref-alps-gh200-node] | gh200 | 1344 | 2688 | 10,752 | 10,752 | |
| 46 | +| [AMD Rome][ref-alps-zen2-node] | zen2 | 256 | 1024 | 2,048 | -- | |
| 47 | +| [NVIDIA A100][ref-alps-a100-node] | a100 | 72 | 144 | 144 | 576 | |
| 48 | +| [AMD MI250x][ref-alps-mi200-node] | mi200 | 12 | 24 | 24 | 96 | |
| 49 | +| [AMD MI300A][ref-alps-mi300-node] | mi300 | 64 | 128 | 512 | 512 | |
50 | 50 |
|
51 | 51 | [](){#ref-alps-gh200-node} |
52 | 52 | ### NVIDIA GH200 GPU Nodes |
@@ -81,16 +81,44 @@ Each node contains four Grace-Hopper modules and four corresponding network inte |
81 | 81 | [](){#ref-alps-zen2-node} |
82 | 82 | ### AMD Rome CPU Nodes |
83 | 83 |
|
84 | | -!!! todo |
| 84 | +These nodes have two [AMD Epyc 7742](https://en.wikichip.org/wiki/amd/epyc/7742) 64-core CPU sockets, and are used primarily for the [Eiger][ref-cluster-eiger] system. They come in two memory configurations: |
| 85 | + |
| 86 | +* *Standard-memory*: 256 GB in 16x16 GB DDR4 DIMMs. |
| 87 | +* *Large-memory*: 512 GB in 16x32 GB DDR4 DIMMs. |
| 88 | + |
| 89 | +!!! note "Not all memory is available" |
| 90 | + The total memory available to jobs on the nodes is roughly 245 GB and 497 GB on the standard and large memory nodes respectively. |
| 91 | + |
| 92 | + The amount of memory available to your job also depends on the number of MPI ranks per node -- each MPI rank has a memory overhead. |
| 93 | + |
| 94 | +A schematic of a *standard memory node* below illustrates the CPU cores and [NUMA nodes](https://www.kernel.org/doc/html/v4.18/vm/numa.html).(1) |
| 95 | +{.annotate} |
85 | 96 |
|
86 | | -EX425 |
| 97 | +1. Obtained with the command `lstopo --no-caches --no-io --no-legend eiger-topo.png` on Eiger. |
| 98 | + |
| 99 | + |
| 100 | + |
| 101 | +* The two sockets are labelled Package L#0 and Package L#1. |
| 102 | +* Each socket has 4 NUMA nodes, with 16 cores each, for a total of 64 cores per socket. |
| 103 | + |
| 104 | +Each core supports [simultaneous multi threading (SMT)](https://www.amd.com/en/blogs/2025/simultaneous-multithreading-driving-performance-a.html), whereby each core can execute two threads concurrently, which are presented as two processing units (PU) per physical core: |
| 105 | + |
| 106 | +* the first PU on each core are numbered 0:63 on socket 0, and 64:127 on socket 1; |
| 107 | +* the second PU on each core are numbered 128:191 on socket 0, and 192:256 on socket 1; |
| 108 | +* hence, core `n` has PUs `n` and `n+128`. |
| 109 | + |
| 110 | +Each node has two Slingshot 11 network interface cards (NICs), which are not illustrated on the diagram. |
87 | 111 |
|
88 | 112 | [](){#ref-alps-a100-node} |
89 | 113 | ### NVIDIA A100 GPU Nodes |
90 | 114 |
|
91 | | -!!! todo |
| 115 | +The Grizzly Peak blades contain two nodes, where each node has: |
92 | 116 |
|
93 | | -Grizzly Peak |
| 117 | +* One 64-core Zen3 CPU socket |
| 118 | +* 512 GB DDR4 Memory |
| 119 | +* 4 NVIDIA A100 GPUs with 80 GB HBM3 memory each |
| 120 | + * The MCH system is the same, except the A100 have 96 GB of memory. |
| 121 | +* 4 NICs -- one per GPU. |
94 | 122 |
|
95 | 123 | [](){#ref-alps-mi200-node} |
96 | 124 | ### AMD MI250x GPU Nodes |
|
0 commit comments