Commit 608cbda

Merge branch 'main' into namd-eiger

2 parents 665cc37 + 95fbc40

File tree

17 files changed: +293 -441 lines changed


.github/CODEOWNERS

Lines changed: 1 addition & 2 deletions

```diff
@@ -1,7 +1,6 @@
 * @bcumming @msimberg @RMeli
 docs/services/firecrest @jpdorsch @ekouts
-docs/software/communication @msimberg
+docs/software/communication @Madeeks @msimberg
 docs/software/devtools/linaro @jgphpc
 docs/software/prgenv/linalg.md @finkandreas @msimberg
 docs/software/sciapps/cp2k.md @abussy @RMeli
-docs/software/sciapps/lammps.md @nickjbrowning
```

docs/clusters/santis.md

Lines changed: 2 additions & 5 deletions

```diff
@@ -7,10 +7,7 @@ Santis is an Alps cluster that provides GPU accelerators and file systems design
 
 ### Compute nodes
 
-Santis consists of around 600 [Grace-Hopper nodes][ref-alps-gh200-node].
-
-!!! note
-    In late March 2025 Santis was temporarily expanded to 1233 nodes for [Gordon Bell and HPL runs][ref-gb2025].
+Santis consists of around 430 [Grace-Hopper nodes][ref-alps-gh200-node].
 
 The number of nodes can change when nodes are added or removed from other clusters on Alps.
 
@@ -19,7 +16,7 @@ You will be assigned to one of the four login nodes when you ssh onto the system
 
 | node type | number of nodes | total CPU sockets | total GPUs |
 |-----------|-----------------| ----------------- | ---------- |
-| [gh200][ref-alps-gh200-node] | 600 | 2,400 | 2,400 |
+| [gh200][ref-alps-gh200-node] | 430 | 1,720 | 1,720 |
 
 ### Storage and file systems
 
```

docs/guides/gb2025.md

Lines changed: 0 additions & 83 deletions
This file was deleted.

docs/guides/storage.md

Lines changed: 112 additions & 0 deletions

````diff
@@ -1,6 +1,118 @@
 [](){#ref-guides-storage}
 # Storage
 
+[](){#ref-guides-storage-sharing}
+## Sharing files and data
+
+Newly created user folders are not accessible by other groups or users on CSCS systems.
+Linux [Access Control Lists](https://www.redhat.com/en/blog/linux-access-control-lists) (ACLs) let you grant access to one or more groups or users.
+
+In traditional POSIX, access permissions are granted to `user`/`group`/`other` in mode `read`/`write`/`execute`.
+The permissions can be checked with the `-l` option of the `ls` command.
+For instance, if `user1` owns the folder `test`, the output would be the following:
+
+```console title="Checking POSIX permissions with ls"
+$ ls -lahd test/
+drwxr-xr-x 2 user1 csstaff 4.0K Feb 23 13:46 test/
+```
+
+ACLs are an extension of these permissions that lets you give one or more users or groups access to your data.
+The ACLs of the same `test` folder of `user1` can be shown with the command `getfacl`:
+
+```console title="Checking permissions with getfacl"
+$ getfacl test
+# file: test
+# owner: user1
+# group: csstaff
+user::rwx
+group::r-x
+other::r-x
+```
+
+The command `setfacl` is used to change the ACLs of a file or directory.
+
+To grant users or groups read/write/execute access to a selected file or folder, use the `-m,--modify` flag (or `-M,--modify-file` to read the ACL entries from a file).
+
+!!! example "give user2 read+write access to test"
+    Where `test` is owned by `user1`.
+    ```console
+    $ setfacl -m user:user2:rw test/
+
+    $ getfacl test/
+    # file: test
+    # owner: user1
+    # group: csstaff
+    user::rwx
+    user:user2:rw-
+    group::r-x
+    mask::rwx
+    other::r-x
+    ```
+
+The `-x,--remove` option removes ACL entries (and `-X,--remove-file` reads the entries to remove from a file).
+
+!!! example "remove user2 access to test"
+    This reverts the access that was granted in the previous example.
+    ```console
+    $ setfacl -x user:user2 test/
+
+    $ getfacl test/
+    # file: test
+    # owner: user1
+    # group: csstaff
+    user::rwx
+    group::r-x
+    mask::rwx
+    other::r-x
+    ```
+
+Access rights can also be granted recursively to a folder and its children (if they exist) using the option `-R,--recursive`.
+
+!!! note
+    This applies only to existing files: files added after this call won't inherit the permissions.
+
+!!! example "recursively grant user2 access to test and its contents"
+    ```console
+    $ setfacl -Rm user:user2:rwx test
+
+    $ getfacl test/subdir
+    # file: test/subdir
+    # owner: user1
+    # group: csstaff
+    user::rwx
+    user:user2:rwx
+    group::---
+    group:csstaff:r-x
+    mask::rwx
+    other::---
+    ```
+
+To set a default so that all files and folders later created inside a path inherit the permissions, use the `-d,--default` option.
+
+!!! example "grant user2 default access to new files created in test"
+    `user2` will have access to files created inside `test` after this call:
+
+    ```console
+    $ setfacl -dm user:user2:rw test/
+
+    $ getfacl test
+    # file: test
+    # owner: user1
+    # group: csstaff
+    user::rwx
+    group::r-x
+    mask::rwx
+    other::r-x
+    default:user::rwx
+    default:user:user2:rw-
+    default:group::r-x
+    default:mask::rwx
+    default:other::r-x
+    ```
+
+!!! info
+    For more information, read the setfacl man page: `man setfacl`.
+
 ## Many small files vs. HPC File Systems
 
 Workloads that read or create many small files are not well-suited to parallel file systems, which are designed for parallel and distributed I/O.
````
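The recursive and default options shown in the examples above can also be combined, so that `user2` gets access to both the existing and the future contents of a folder. A sketch, reusing the hypothetical `user2` and `test` from the examples:

```console
$ setfacl -Rm user:user2:rwX test/   # existing contents; capital X grants execute only on directories and already-executable files
$ setfacl -Rdm user:user2:rwX test/  # default entries on every directory, inherited by files created later
```

Using `X` instead of `x` avoids marking plain data files executable while still allowing directory traversal.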

docs/guides/terminal.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -51,6 +51,7 @@ This approach won't work if the same home directory is mounted on two different
 Care needs to be taken to store executables, configuration and data for different architectures in separate locations, and automatically configure the login environment to use the correct location when you log into different systems.
 
 The following example:
+
 * sets architecture-specific `bin` path for installing programs
 * sets architecture-specific paths for installing application data and configuration
 * selects the correct path by running `uname -m` when you log in to a cluster
```
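The steps in the list above can be sketched as a login-profile snippet; the directory layout and variable choices here are illustrative assumptions, not the exact example from the page:

```shell
# Select per-architecture locations at login time.
ARCH=$(uname -m)   # e.g. "x86_64" or "aarch64", depending on the cluster
export PATH="$HOME/$ARCH/bin:$PATH"             # architecture-specific executables
export XDG_DATA_HOME="$HOME/$ARCH/.local/share" # architecture-specific application data
export XDG_CONFIG_HOME="$HOME/$ARCH/.config"    # architecture-specific configuration
```

Because `uname -m` runs at login, the same profile works unchanged on clusters with different CPU architectures.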

docs/policies/support.md

Lines changed: 4 additions & 1 deletion

```diff
@@ -1,4 +1,5 @@
-# UserLab Support Policy
+[](){#ref-support}
+# User Support Policy
 
 ## 1. User Support Policy
 
@@ -23,6 +24,7 @@ CSCS reserves the right to decline support for requests that fall outside the sc
 Support will be focused on ensuring that the resources are used in alignment with the approved objectives and goals.
 Requests that significantly deviate from the original proposal may not be accommodated.
 
+[](){#ref-support-user-apps}
 ## 3. User Applications
 
 User applications are those brought to CSCS systems by the users, whether they are developed by the users themselves or by another third party.
@@ -32,6 +34,7 @@ CSCS will provide guidance on deploying applications on our systems, including c
 While we can assist with infrastructure-related issues, we cannot configure, optimize, debug, or fix the applications themselves.
 Users are responsible for resolving application-specific issues themselves or contacting the respective developers.
 
+[](){#ref-support-apps}
 ## 4. Officially Supported Applications
 
 CSCS offers a range of officially supported applications and their respective versions and configurations, which are packaged and released by CSCS or its supply partners.
```

docs/running/slurm.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -75,7 +75,7 @@ In these cases SLURM jobs must be configured to assign multiple ranks to a singl
 This is best done using [NVIDIA's Multi-Process Service (MPS)].
 To use MPS, launch your application using the following wrapper script, which will start MPS on one rank per node and assign GPUs to ranks according to the CPU mask of a rank, ensuring the closest GPU is used:
 
-```bash
+```bash title="mps-wrapper.sh"
 #!/bin/bash
 # Example mps-wrapper.sh usage:
 # > srun [srun args] mps-wrapper.sh [cmd] [cmd args]
````
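The wrapper goes between `srun` and the application command, as the usage comment shows. A hypothetical batch-script sketch (the node count, rank count, and application name are illustrative assumptions, not taken from the documentation):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16   # more ranks than GPUs per node, hence the need for MPS
#SBATCH --gpus-per-node=4      # a GH200 node has 4 GPUs

# mps-wrapper.sh starts MPS on one rank per node and maps each rank to its closest GPU
srun ./mps-wrapper.sh ./my_app --input data.in
```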

docs/services/firecrest.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -45,7 +45,8 @@ FirecREST is available for all three major [Alps platforms][ref-alps-platforms],
 <tr><th>Platform</th><th>Version</th><th>API Endpoint</th><th>Clusters</th></tr>
 <tr><td style="vertical-align: middle;" rowspan="2">HPC Platform</td><td>v1</td><td>https://api.cscs.ch/hpc/firecrest/v1</td><td style="vertical-align: middle;" rowspan="2"><a href="../../clusters/daint">Daint</a>, <a href="../../clusters/eiger">Eiger</a></td></tr>
 <tr> <td>v2</td><td>https://api.cscs.ch/hpc/firecrest/v2</td></tr>
-<tr><td>ML Platform</td><td>v1</td><td>https://api.cscs.ch/ml/firecrest/v1</td><td style="vertical-align: middle;"><a href="../../clusters/bristen">Bristen</a>, <a href="../../clusters/clariden">Clariden</a></td></tr>
+<tr><td style="vertical-align: middle;" rowspan="2">ML Platform</td><td>v1</td><td>https://api.cscs.ch/ml/firecrest/v1</td><td style="vertical-align: middle;" rowspan="2"><a href="../../clusters/bristen">Bristen</a>, <a href="../../clusters/clariden">Clariden</a></td></tr>
+<tr> <td>v2</td><td>https://api.cscs.ch/ml/firecrest/v2</td></tr>
 <tr><td style="vertical-align: middle;" rowspan="2">CW Platform</td><td>v1</td><td>https://api.cscs.ch/cw/firecrest/v1</td><td style="vertical-align: middle;" rowspan="2"><a href="../../clusters/santis">Santis</a></td></tr>
 <tr><td>v2</td><td>https://api.cscs.ch/cw/firecrest/v2</td></tr>
 </table>
```
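Calls against these endpoints carry an OAuth bearer token. A minimal sketch, assuming a valid access token in `$TOKEN`; the `/status/systems` path is an assumption based on FirecREST v2 conventions and does not come from the table above:

```shell
# Query cluster status through the HPC Platform's FirecREST v2 endpoint (hypothetical path).
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://api.cscs.ch/hpc/firecrest/v2/status/systems"
```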

docs/software/communication/cray-mpich.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -58,12 +58,14 @@ See [this page][ref-slurm-gh200] for more information on configuring SLURM to us
 
 Alternatively, if you do not wish to use GPU-aware MPI, either unset `MPICH_GPU_SUPPORT_ENABLED` or explicitly set it to `0` in your launch scripts.
 
+[](){#ref-communication-cray-mpich-known-issues}
 ## Known issues
 
 This section documents known issues related to Cray MPICH on Alps. Resolved issues are also listed for reference.
 
 ### Existing Issues
 
+[](){#ref-communication-cray-mpich-cache-monitor-disable}
 #### Cray MPICH hangs
 
 Cray MPICH may sometimes hang on larger runs.
```
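Opting out of GPU-aware MPI, as described in the diff above, is a one-line change in a launch script:

```shell
# Opt out of GPU-aware MPI: set the variable to 0 (or unset it entirely, per the text above).
export MPICH_GPU_SUPPORT_ENABLED=0
```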

docs/software/communication/libfabric.md

Lines changed: 17 additions & 0 deletions

```diff
@@ -4,4 +4,21 @@
 [Libfabric](https://ofiwg.github.io/libfabric/), or Open Fabrics Interfaces (OFI), is a low-level networking library that abstracts away various networking backends.
 It is used by Cray MPICH, and can be used together with OpenMPI, NCCL, and RCCL to make use of the [Slingshot network on Alps][ref-alps-hsn].
 
+## Using libfabric
+
+If you are using a uenv provided by CSCS, such as [prgenv-gnu][ref-uenv-prgenv-gnu], [Cray MPICH][ref-communication-cray-mpich] is linked against libfabric and the high-speed network will be used.
+No changes are required in applications.
+
+If you are using containers, the system libfabric can be loaded into your container using the [CXI hook provided by the container engine][ref-ce-cxi-hook].
+Using the hook is essential to make full use of the Alps network.
+
+## Tuning libfabric
+
+Tuning libfabric (particularly together with [Cray MPICH][ref-communication-cray-mpich], [OpenMPI][ref-communication-openmpi], [NCCL][ref-communication-nccl], and [RCCL][ref-communication-rccl]) depends on many factors, including the application, workload, and system.
+For a comprehensive overview of libfabric options for the CXI provider (the provider for the Slingshot network), see the [`fi_cxi` man pages](https://ofiwg.github.io/libfabric/v2.1.0/man/fi_cxi.7.html).
+Note that the exact version deployed on Alps may differ, and not all options may be applicable on Alps.
+
+See the [Cray MPICH known issues page][ref-communication-cray-mpich-known-issues] for issues when using Cray MPICH together with libfabric.
+
 !!! todo
+    More options?
```
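Libfabric options, including the CXI provider options documented in the man page referenced above, are set as environment variables before launching the application. A small sketch using libfabric's standard logging variables (the values are illustrative; which options are useful on Alps should be checked against the `fi_cxi` man page):

```shell
# Enable libfabric diagnostic logging, restricted to the Slingshot (CXI) provider.
export FI_LOG_LEVEL=warn   # "info" or "debug" are progressively more verbose
export FI_LOG_PROV=cxi
```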
