
Commit 8703a9a

Merge branch 'main' into cxil_map-write-error
2 parents 5a7d3b8 + 921315f commit 8703a9a

File tree

10 files changed (+475, -32 lines)


docs/alps/hardware.md

Lines changed: 42 additions & 13 deletions
@@ -3,8 +3,8 @@
 Alps is an HPE Cray EX3000 system, a liquid-cooled, blade-based, high-density system.

-!!! todo
-    this is a skeleton - all of the details need to be filled in
+!!! under-construction
+    This page is a work in progress - contact us if you want us to prioritise documenting specific information that would be useful for your work.

 ## Alps Cabinets
@@ -40,13 +40,13 @@ Alps was installed in phases, starting with the installation of 1024 AMD Rome du
 There are currently five node types in Alps:

-| type | abbreviation | blades | nodes | CPU sockets | GPU devices |
-| ---- | ------- | ------:| -----:| -----------:| -----------:|
-| NVIDIA GH200 | gh200 | 1344 | 2688 | 10,752 | 10,752 |
-| AMD Rome | zen2 | 256 | 1024 | 2,048 | -- |
-| NVIDIA A100 | a100 | 72 | 144 | 144 | 576 |
-| AMD MI250x | mi200 | 12 | 24 | 24 | 96 |
-| AMD MI300A | mi300 | 64 | 128 | 512 | 512 |
+| type | abbreviation | blades | nodes | CPU sockets | GPU devices |
+| ---- | ------- | ------:| -----:| -----------:| -----------:|
+| [NVIDIA GH200][ref-alps-gh200-node] | gh200 | 1344 | 2688 | 10,752 | 10,752 |
+| [AMD Rome][ref-alps-zen2-node] | zen2 | 256 | 1024 | 2,048 | -- |
+| [NVIDIA A100][ref-alps-a100-node] | a100 | 72 | 144 | 144 | 576 |
+| [AMD MI250x][ref-alps-mi200-node] | mi200 | 12 | 24 | 24 | 96 |
+| [AMD MI300A][ref-alps-mi300-node] | mi300 | 64 | 128 | 512 | 512 |

 [](){#ref-alps-gh200-node}
 ### NVIDIA GH200 GPU Nodes
@@ -57,6 +57,7 @@ There are currently five node types in Alps:
 Please [get in touch](https://github.com/eth-cscs/cscs-docs/issues) if there is information that you want to see here.

 There are 24 cabinets, in 4 rows with 6 cabinets per row, and each cabinet contains 112 nodes (for a total of 448 GH200 per cabinet, as checked below):
+
 * 8 chassis per cabinet
 * 7 blades per chassis
 * 2 nodes per blade
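A quick sanity check of these counts with shell arithmetic, using the four Grace-Hopper modules per node mentioned just below:

```console
$ echo $((8*7*2)) $((24*8*7*2)) $((8*7*2*4)) $((24*8*7*2*4))
112 2688 448 10752
```

That is 112 nodes per cabinet, 2688 nodes in total, 448 GH200 per cabinet, and 10,752 GH200 across the 24 cabinets, matching the node table above.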
@@ -80,16 +81,44 @@ Each node contains four Grace-Hopper modules and four corresponding network inte
 [](){#ref-alps-zen2-node}
 ### AMD Rome CPU Nodes

-!!! todo
+These nodes have two [AMD Epyc 7742](https://en.wikichip.org/wiki/amd/epyc/7742) 64-core CPU sockets, and are used primarily for the [Eiger][ref-cluster-eiger] system. They come in two memory configurations:
+
+* *Standard-memory*: 256 GB in 16x16 GB DDR4 DIMMs.
+* *Large-memory*: 512 GB in 16x32 GB DDR4 DIMMs.
+
+!!! note "Not all memory is available"
+    The total memory available to jobs on the nodes is roughly 245 GB and 497 GB on the standard- and large-memory nodes respectively.
+
+    The amount of memory available to your job also depends on the number of MPI ranks per node -- each MPI rank has a memory overhead.

-    EX425
+A schematic of a *standard-memory node* below illustrates the CPU cores and [NUMA nodes](https://www.kernel.org/doc/html/v4.18/vm/numa.html).(1)
+{.annotate}
+
+1. Obtained with the command `lstopo --no-caches --no-io --no-legend eiger-topo.png` on Eiger.
+
+![Screenshot](../images/slurm/eiger-topo.png)
+
+* The two sockets are labelled Package L#0 and Package L#1.
+* Each socket has 4 NUMA nodes, with 16 cores each, for a total of 64 cores per socket.
+
+Each core supports [simultaneous multithreading (SMT)](https://www.amd.com/en/blogs/2025/simultaneous-multithreading-driving-performance-a.html), whereby each core can execute two threads concurrently, presented as two processing units (PU) per physical core:
+
+* the first PUs are numbered 0:63 on socket 0, and 64:127 on socket 1;
+* the second PUs are numbered 128:191 on socket 0, and 192:255 on socket 1;
+* hence, core `n` has PUs `n` and `n+128` (see the quick check below).
+
+Each node has two Slingshot 11 network interface cards (NICs), which are not illustrated on the diagram.

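As a minimal sketch of how to check this PU numbering and bind to both hardware threads of one core on a node (core 5 is an arbitrary example; `./my_app` is a placeholder; `lscpu` and `taskset` are standard Linux tools, not CSCS-specific):

```console
# list each PU (CPU) together with the physical core and socket it belongs to
lscpu --extended=CPU,CORE,SOCKET | head -5

# run a command pinned to core 5 and its SMT sibling (PUs 5 and 133)
taskset -c 5,133 ./my_app
```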
 [](){#ref-alps-a100-node}
 ### NVIDIA A100 GPU Nodes

-!!! todo
+The Grizzly Peak blades contain two nodes, where each node has:

-    Grizzly Peak
+* One 64-core Zen3 CPU socket
+* 512 GB DDR4 memory
+* 4 NVIDIA A100 GPUs with 80 GB HBM2e memory each
+* The MCH system is the same, except that the A100 GPUs have 96 GB of memory each.
+* 4 NICs -- one per GPU.

 [](){#ref-alps-mi200-node}
 ### AMD MI250x GPU Nodes

docs/alps/storage.md

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@ HPC storage is provided by independent clusters, composed of servers and physica
 Capstor and Iopsstor are on the same Slingshot network as Alps, while VAST is on the CSCS Ethernet network.

+See the [Lustre guide][ref-guides-storage-lustre] for some hints on how to get the best performance out of the filesystem.
+
 The mounts, and how they are used for the Scratch, Store, and Home file systems mounted on clusters, are documented in the [file system docs][ref-storage-fs].

[](){#ref-alps-capstor}

docs/guides/storage.md

Lines changed: 42 additions & 0 deletions
@@ -113,10 +113,52 @@ To set up a default so all newly created folders and dirs inside or your desired
 !!! info
     For more information read the setfacl man page: `man setfacl`.

+[](){#ref-guides-storage-lustre}
+## Lustre tuning
+
+[Capstor][ref-alps-capstor] and [Iopsstor][ref-alps-iopsstor] are both [Lustre](https://lustre.org) filesystems.
+
+![Lustre architecture](../images/storage/lustre.png)
+
+As shown in the schema above, Lustre uses *metadata* servers to store and query metadata, which is essentially what is shown by `ls`: the directory structure, file permissions, and modification dates.
+Metadata performance is roughly the same on [Capstor][ref-alps-capstor] and [Iopsstor][ref-alps-iopsstor].
+This data is globally synchronized, which means that Lustre is not well suited to handling many small files; see the discussion on [how to handle many small files][ref-guides-storage-small-files].
+
+The data itself is subdivided into blocks of size `<blocksize>` and is stored by Object Storage Servers (OSS) on one or more Object Storage Targets (OST).
+The blocksize and the number of OSTs to use are defined by the striping settings, which are applied to a path, with new files and directories inheriting them from their parent directory.
+The `lfs getstripe <path>` command can be used to get information on the stripe settings of a path.
+For directories and empty files, `lfs setstripe --stripe-count <count> --stripe-size <size> <directory/file>` can be used to set the layout.
+The simplest way to give a file the correct layout is to copy it into a directory that already has that layout.
+
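For instance, a minimal sketch of preparing a directory whose layout new files will inherit -- the directory `$SCRATCH/large-files` and the file `results.dat` are hypothetical placeholders:

```console
# create a directory and give it a 4 MB stripe size across all OSTs
mkdir -p $SCRATCH/large-files
lfs setstripe --stripe-count -1 --stripe-size 4M $SCRATCH/large-files

# check the layout that new files will inherit
lfs getstripe $SCRATCH/large-files

# files copied or created in the directory pick up its striping settings
cp results.dat $SCRATCH/large-files/
```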
+!!! tip "A blocksize of 4MB gives good throughput, without being overly big..."
+    ... so it is a good choice when reading a file sequentially or in large chunks. If you read smaller chunks in random order, it might be better to reduce the stripe size: the raw throughput will be lower, but the performance of your application might actually increase.
+    See the [Lustre documentation](https://doc.lustre.org/lustre_manual.xhtml#managingstripingfreespace) for more information.
+
+!!! example "Settings for large files"
+    ```console
+    lfs setstripe --stripe-count -1 --stripe-size 4M <big_files_dir>
+    ```
+
+Lustre also supports composite layouts, which switch from one layout to another at a given size, set with `--component-end` (`-E`).
+This makes it possible to create a progressive file layout that changes `--stripe-count` (`-c`) and `--stripe-size` (`-S`) as a file grows, so that fewer locks are required for small files while the load is distributed across more OSTs for larger files.
+
+!!! example "Good default settings"
+    ```console
+    lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <base_dir>
+    ```
+
+### Iopsstor vs Capstor
+
+[Iopsstor][ref-alps-iopsstor] uses SSDs as OSTs, so random access is fast and the performance of a single OST is high.
+[Capstor][ref-alps-capstor], on the other hand, uses hard disks; it has a larger capacity and many more OSSs, so its total bandwidth is larger.
+See for example the [ML filesystem guide][ref-mlp-storage-suitability].
+
+[](){#ref-guides-storage-small-files}
 ## Many small files vs. HPC File Systems

 Workloads that read or create many small files are not well-suited to parallel file systems, which are designed for parallel and distributed I/O.

+In some cases, if enough memory is available, it can be worth unpacking/repacking the small files into an in-memory filesystem such as `/dev/shm/$USER` or `/tmp`, which is *much* faster, or using a squashfs filesystem stored as a single large file on Lustre.
+
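As a rough, hypothetical sketch of the unpack/repack approach inside a job (`dataset.tar` is a placeholder archive on scratch; remember that `/dev/shm` consumes node memory):

```console
# unpack many small files into the in-memory filesystem
mkdir -p /dev/shm/$USER
tar -xf $SCRATCH/dataset.tar -C /dev/shm/$USER

# ... run the workload against /dev/shm/$USER instead of Lustre ...

# clean up before the job ends to free the node's memory
rm -rf /dev/shm/$USER
```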
 Workloads that do not play nicely with Lustre include:

 * Configuration and compiling applications.

docs/images/slurm/eiger-topo.png

52.1 KB

docs/images/storage/lustre.png

26.7 KB

docs/platforms/mlp/index.md

Lines changed: 3 additions & 1 deletion
@@ -52,17 +52,19 @@ Use scratch to store datasets that will be accessed by jobs, and for job output.
 Scratch is per user - each user gets a separate scratch path and quota.

 * The environment variable `SCRATCH=/iopsstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch.
-* There is an additional scratch path mounted on [Capstor][ref-alps-capstor] at `/capstor/scratch/cscs/$USER`.
+* There is an additional scratch path mounted on [Capstor][ref-alps-capstor] at `/capstor/scratch/cscs/$USER`.

 !!! warning "scratch cleanup policy"
     Files that have not been accessed in 30 days are automatically deleted.

 **Scratch is not intended for permanent storage**: transfer files back to the capstor project storage after job runs.

+[](){#ref-mlp-storage-suitability}
 !!! note "file system suitability"
     The Capstor scratch filesystem is based on HDDs and is optimized for large, sequential read and write operations.
     We recommend using Capstor for storing **checkpoint files** and other **large, contiguous outputs** generated by your training runs.
     In contrast, Iopsstor uses high-performance NVMe drives, which excel at handling **IOPS-intensive workloads** involving frequent, random access. This makes it a better choice for storing **training datasets**, especially when accessed randomly during machine learning training.
+    See the [Lustre guide][ref-guides-storage-lustre] for some hints on how to get the best performance out of the filesystem.

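As an illustrative, hypothetical job-script fragment following this recommendation (directory names, `train.py`, and its flags are placeholders; `$SCRATCH` points to Iopsstor scratch as noted above):

```console
# read training data from Iopsstor scratch (good for random, IOPS-heavy access)
DATA_DIR=$SCRATCH/datasets/my-dataset

# write checkpoints and large outputs to Capstor scratch (good for large sequential writes)
CKPT_DIR=/capstor/scratch/cscs/$USER/checkpoints/my-run
mkdir -p "$CKPT_DIR"

# train.py stands in for your own training command
python train.py --data "$DATA_DIR" --checkpoint-dir "$CKPT_DIR"
```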
### Scratch Usage Recommendations
