Commit ace070b

Add basic GH200 hardware description (#118)
1 parent c9e374a commit ace070b

5 files changed

+7576
-10
lines changed

docs/alps/hardware.md

Lines changed: 34 additions & 10 deletions
@@ -38,22 +38,44 @@ This approach to cooling provides greater efficiency for the rack-level cooling,
3838

3939
Alps was installed in phases, starting with the installation of 1024 AMD Rome dual-socket CPU nodes in 2020, through to the main installation of 2,688 Grace-Hopper nodes in 2024.
4040

41-
There are currently four node types in Alps, with another becoming available in 2025:
41+
There are currently five node types in Alps:
4242

43-
| type | blades | nodes | CPU sockets | GPU devices |
44-
| ---- | ------:| -----:| -----------:| -----------:|
45-
| NVIDIA GH200 | 1344 | 2688 | 10,752 | 10,752 |
46-
| AMD Rome | 256 | 1024 | 2,048 | -- |
47-
| NVIDIA A100 | 72 | 144 | 144 | 576 |
48-
| AMD MI250x | 12 | 24 | 24 | 96 |
49-
| AMD MI300A | 64 | 128 | 512 | 512 |
43+
| type | abbreviation | blades | nodes | CPU sockets | GPU devices |
44+
| ---- | ------- | ------:| -----:| -----------:| -----------:|
45+
| NVIDIA GH200 | gh200 | 1344 | 2688 | 10,752 | 10,752 |
46+
| AMD Rome | zen2 | 256 | 1024 | 2,048 | -- |
47+
| NVIDIA A100 | a100 | 72 | 144 | 144 | 576 |
48+
| AMD MI250x | mi200 | 12 | 24 | 24 | 96 |
49+
| AMD MI300A | mi300 | 64 | 128 | 512 | 512 |
5050

5151
[](){#ref-alps-gh200-node}
5252
### NVIDIA GH200 GPU Nodes
5353

54-
!!! todo
54+
!!! under-construction
55+
The description of the GH200 nodes is a work in progress.
56+
We will add more detailed information soon.
57+
Please [get in touch](https://github.com/eth-cscs/cscs-docs/issues) if there is information that you want to see here.
58+
59+
There are 24 cabinets, in 4 rows with 6 cabinets per row, and each cabinet contains 112 nodes (for a total of 448 GH200 modules per cabinet):
60+
* 8 chassis per cabinet
61+
* 7 blades per chassis
62+
* 2 nodes per blade
63+
64+
!!! info "Why 7 blades per chassis?"
65+
A chassis can contain up to 8 blades; however, Alps' gh200 chassis are underpopulated so that we can increase the amount of power delivered to each GPU.
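
The counts above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted in this section (24 cabinets, 8 chassis per cabinet, 7 blades per chassis, 2 nodes per blade, four Grace-Hopper modules per node); the variable names are illustrative and not part of any CSCS tooling.

```python
# Figures taken from the hardware description; names are illustrative only.
CABINETS = 24
CHASSIS_PER_CABINET = 8
BLADES_PER_CHASSIS = 7   # a chassis can hold 8 blades, but is underpopulated
NODES_PER_BLADE = 2
GH200_PER_NODE = 4       # four Grace-Hopper modules per node

nodes_per_cabinet = CHASSIS_PER_CABINET * BLADES_PER_CHASSIS * NODES_PER_BLADE
gh200_per_cabinet = nodes_per_cabinet * GH200_PER_NODE

total_blades = CABINETS * CHASSIS_PER_CABINET * BLADES_PER_CHASSIS
total_nodes = CABINETS * nodes_per_cabinet
total_gh200 = total_nodes * GH200_PER_NODE

print(nodes_per_cabinet, gh200_per_cabinet)     # 112 448
print(total_blades, total_nodes, total_gh200)   # 1344 2688 10752
```

The totals agree with the `gh200` row of the node-type table: 1344 blades, 2688 nodes and 10,752 GPU devices.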
5566

56-
Blanca Peak
67+
Each node contains four Grace-Hopper modules and four corresponding network interface cards (NICs), as illustrated below:
68+
69+
![](../images/alps/gh200-schematic.svg)
70+
71+
??? info "Node xname"
72+
There are two boards per blade with one node per board.
73+
This is different to the `zen2` CPU-only nodes (used for example in Eiger) that have two nodes per board for a total of four nodes per blade.
74+
As such, there are no `n1` nodes in the xname list, e.g.:
75+
```
76+
x1100c0s6b0n0
77+
x1100c0s6b1n0
78+
```
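
To make the xname layout concrete, here is a minimal sketch that splits a node xname into its numeric fields. The field meanings (`x` cabinet, `c` chassis, `s` slot, `b` board, `n` node) are an assumption based on the usual HPE Cray naming convention and on the board/node wording above; `parse_xname` is a hypothetical helper, not an existing CSCS tool.

```python
import re

# Assumed field meanings (not stated explicitly in the source):
# x = cabinet, c = chassis, s = slot (blade), b = board, n = node.
XNAME_RE = re.compile(r"^x(\d+)c(\d+)s(\d+)b(\d+)n(\d+)$")


def parse_xname(xname):
    """Split a node xname such as 'x1100c0s6b0n0' into integer fields."""
    match = XNAME_RE.match(xname)
    if match is None:
        raise ValueError(f"not a node xname: {xname!r}")
    keys = ("cabinet", "chassis", "slot", "board", "node")
    return dict(zip(keys, map(int, match.groups())))


# The two nodes of one gh200 blade differ in the board index, not the node index.
for name in ("x1100c0s6b0n0", "x1100c0s6b1n0"):
    print(name, parse_xname(name))
```

Under this reading, both example xnames have `node` equal to 0 and differ only in `board`, which is consistent with there being one node per board.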
5779

5880
[](){#ref-alps-zen2-node}
5981
### AMD Rome CPU Nodes
@@ -79,6 +101,8 @@ Bard Peak
79101
[](){#ref-alps-mi300-node}
80102
### AMD MI300A GPU Nodes
81103

104+
![](../images/alps/mi300-schematic.svg)
105+
82106
!!! todo
83107

84108
Parry Peak

docs/contributing/index.md

Lines changed: 60 additions & 0 deletions
@@ -229,6 +229,66 @@ They stand out better from the main text, and can be collapsed by default if nee
229229

230230
If an admonition is collapsed by default, it should have a title.
231231

232+
We provide some custom admonitions.
233+
234+
#### Change
235+
236+
For adding information about a change, originally designed for recording updates to clusters.
237+
238+
=== "Rendered"
239+
!!! change "2025-04-17"
240+
* SLURM was upgraded to version 25.1.
241+
* uenv was upgraded to v0.8
242+
243+
Old changes can be folded:
244+
245+
??? change "2025-02-04"
246+
* The new Scratch cleanup policy was implemented
247+
* NVIDIA driver was updated
248+
249+
=== "Markdown"
250+
```
251+
!!! change "2025-04-17"
252+
* SLURM was upgraded to version 25.1.
253+
* uenv was upgraded to v0.8
254+
```
255+
256+
Old changes can be folded:
257+
258+
```
259+
??? change "2025-02-04"
260+
* The new Scratch cleanup policy was implemented
261+
* NVIDIA driver was updated
262+
```
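
If change entries are generated from a script rather than typed by hand, the Markdown can be produced as in the sketch below. `change_admonition` is a hypothetical helper, not part of the documentation tooling; it assumes the body of an admonition is indented by four spaces, as Python-Markdown admonitions expect, and uses `???` for the folded form shown above.

```python
def change_admonition(date, items, folded=False):
    """Return Markdown for a `change` admonition titled with `date`.

    `folded=True` emits the collapsible `???` form used for old changes.
    """
    marker = "???" if folded else "!!!"
    lines = [f'{marker} change "{date}"']
    # Admonition bodies are indented by four spaces.
    lines += [f"    * {item}" for item in items]
    return "\n".join(lines)


print(change_admonition("2025-04-17", [
    "SLURM was upgraded to version 25.1.",
    "uenv was upgraded to v0.8",
]))
print()
print(change_admonition("2025-02-04", [
    "The new Scratch cleanup policy was implemented",
    "NVIDIA driver was updated",
], folded=True))
```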
263+
264+
#### Under construction
265+
266+
For marking incomplete sections.
267+
268+
=== "Rendered"
269+
!!! under-construction
270+
This is not finished yet!
271+
272+
=== "Markdown"
273+
```
274+
!!! under-construction
275+
This is not finished yet!
276+
```
277+
278+
#### Todo
279+
280+
As a placeholder for documentation that needs to be written.
281+
282+
=== "Rendered"
283+
!!! todo
284+
Add some common error messages and how to fix them.
285+
286+
=== "Markdown"
287+
```
288+
!!! todo
289+
Add some common error messages and how to fix them.
290+
```
291+
232292
### Code blocks
233293

234294
Use [code blocks](https://squidfunk.github.io/mkdocs-material/reference/code-blocks/) when you want to display monospace text in a programming language, terminal output, configuration files etc.
