Commit 39c2d7c

Merge remote-tracking branch 'upstream/main' into pytorch/uenv

2 parents 03fdbaa + 2b71c27
File tree

8 files changed: +806 −10 lines changed

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions

@@ -4,4 +4,5 @@ docs/software/communication @msimberg
 docs/software/devtools/linaro @jgphpc
 docs/software/prgenv/linalg.md @finkandreas @msimberg
 docs/software/sciapps/cp2k.md @abussy @RMeli
+docs/software/sciapps/lammps.md @nickjbrowning
 docs/software/ml @boeschf

docs/alps/clusters.md

Lines changed: 1 addition & 1 deletion

@@ -14,7 +14,7 @@ Clusters on Alps are provided as part of different [platforms][ref-alps-platform

 [:octicons-arrow-right-24: Clariden][ref-cluster-clariden]

-    Bristen is a small system with a100 nodes, used for **todo**
+    Bristen is a small system with A100 nodes used for data processing, development, x86 workloads, and ML inference services.

 [:octicons-arrow-right-24: Bristen][ref-cluster-bristen]
 </div>

docs/clusters/bristen.md

Lines changed: 86 additions & 2 deletions

@@ -1,6 +1,90 @@
 [](){#ref-cluster-bristen}
 # Bristen

-!!! todo
-    use the [clariden][clariden] as template.
+Bristen is an Alps cluster that provides GPU accelerators and file systems designed to meet the needs of machine learning workloads in the [MLP][ref-platform-mlp].

+## Cluster Specification
+
+### Compute Nodes
+
+Bristen consists of 32 [NVIDIA A100 nodes][ref-alps-a100-node]. The number of nodes can change when nodes are added to or removed from other clusters on Alps.
+
+| node type | number of nodes | total CPU sockets | total GPUs |
+|-----------|-----------------|-------------------|------------|
+| [a100][ref-alps-a100-node] | 32 | 32 | 128 |
+
+Nodes are in the [`normal` Slurm partition][ref-slurm-partition-normal].
+
+### Storage and file systems
+
+Bristen uses the [MLP filesystems and storage policies][ref-mlp-storage].
+
+## Getting started
+
+### Logging into Bristen
+
+To connect to Bristen via SSH, first refer to the [ssh guide][ref-ssh].
+
+!!! example "`~/.ssh/config`"
+    Add the following to your [SSH configuration][ref-ssh-config] to enable you to connect directly to Bristen using `ssh bristen`:
+    ```
+    Host bristen
+        HostName bristen.alps.cscs.ch
+        ProxyJump ela
+        User cscsusername
+        IdentityFile ~/.ssh/cscs-key
+        IdentitiesOnly yes
+    ```
+
+### Software
+
+Users are encouraged to use containers on Bristen.
+
+* Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine].
+* To build images, see the [guide to building container images on Alps][ref-build-containers].
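The container-engine workflow above can be sketched with an environment definition file (EDF). This is a hypothetical example: the image name, mount path, and EDF location are assumptions rather than details from this page, so check the container engine documentation for the exact format.

```shell
# Hypothetical EDF for the container engine; image, mount, and path are placeholders.
mkdir -p "$HOME/.edf"
cat > "$HOME/.edf/pytorch.toml" <<'EOF'
image = "nvcr.io#nvidia/pytorch:24.01-py3"
mounts = ["/capstor/scratch/cscs/user:/scratch"]
workdir = "/scratch"
EOF

# On the cluster, a job could then be launched inside the container with:
# srun --environment=pytorch nvidia-smi
```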
44+
45+
## Running Jobs on Bristen
46+
47+
### SLURM
48+
49+
Bristen uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
50+
51+
There is currently a single slurm partition on the system:
52+
53+
* the `normal` partition is for all production workloads.
54+
+ nodes in this partition are not shared.
55+
56+
| name | nodes | max nodes per job | time limit |
57+
| -- | -- | -- | -- |
58+
| `normal` | 32 | - | 24 hours |
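A minimal batch script for the `normal` partition might look like the following. This is a sketch: the job name and command are placeholders, and options such as an account flag will depend on your project.

```shell
# Hypothetical Slurm batch script for the `normal` partition on Bristen.
cat > submit.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=a100-example
#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --time=01:00:00

srun nvidia-smi
EOF

# Submit on the cluster with:
# sbatch submit.sh
```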
59+
60+
<!--
61+
See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
62+
63+
??? example "how to check the number of nodes on the system"
64+
You can check the size of the system by running the following command in the terminal:
65+
```console
66+
$ sinfo --format "| %20R | %10D | %10s | %10l | %10A |"
67+
| PARTITION | NODES | JOB_SIZE | TIMELIMIT | NODES(A/I) |
68+
| debug | 32 | 1-2 | 30:00 | 3/29 |
69+
| normal | 1266 | 1-infinite | 1-00:00:00 | 812/371 |
70+
| xfer | 2 | 1 | 1-00:00:00 | 1/1 |
71+
```
72+
The last column shows the number of nodes that have been allocated in currently running jobs (`A`) and the number of jobs that are idle (`I`).
73+
-->
74+
75+
### FirecREST
76+
77+
Bristen can also be accessed using [FircREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v1` API endpoint.
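As a sketch, a request against that endpoint could look like the following. The `/status/systems` path and the token handling are assumptions based on typical FirecREST v1 usage, not details from this page.

```shell
# Base URL from the documentation above; the endpoint path below is an assumption.
FIRECREST_URL="https://api.cscs.ch/ml/firecrest/v1"

# With a valid OIDC access token in $TOKEN, one could query system status:
# curl -H "Authorization: Bearer $TOKEN" "$FIRECREST_URL/status/systems"
echo "$FIRECREST_URL/status/systems"
```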
78+
79+
### Scheduled Maintenance
80+
81+
Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window.
82+
83+
Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch).
84+
85+
### Change log
86+
87+
!!! change "2025-03-05 container engine updated"
88+
now supports better containers that go faster. Users do not to change their workflow to take advantage of these updates.
89+
90+
### Known issues

docs/clusters/clariden.md

Lines changed: 3 additions & 2 deletions

@@ -76,12 +76,13 @@ There are two slurm partitions on the system:

 | name | nodes | max nodes per job | time limit |
 | -- | -- | -- | -- |
-| `normal` | 1266 | - | 24 hours |
-| `debug` | 32 | 2 | 30 minutes |
+| `normal` | 1204 | - | 24 hours |
+| `debug` | 24 | 2 | 1.5 node-hours |
 | `xfer` | 2 | 1 | 24 hours |

 * nodes in the `normal` and `debug` partitions are not shared
 * nodes in the `xfer` partition can be shared
+* nodes in the `debug` queue have a 1.5 node-hour time limit. This means you could, for example, request 2 nodes for 45 minutes each, or 1 node for the full 90 minutes.
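The node-hour budget is simply nodes multiplied by time. A quick check, assuming the 1.5 node-hour (90 node-minute) limit described above:

```shell
# debug-partition budget: nodes * minutes must not exceed 90 node-minutes (1.5 node-hours)
nodes=2
minutes=45
echo $(( nodes * minutes ))   # prints 90: exactly at the limit
```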

 See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].

docs/platforms/mlp/index.md

Lines changed: 1 addition & 1 deletion

@@ -25,7 +25,7 @@ The main cluster provided by the MLP is Clariden, a large Grace-Hopper GPU syste
 <div class="grid cards" markdown>
 - :fontawesome-solid-mountain: [__Bristen__][ref-cluster-bristen]

-    Bristen is a smaller system with [A100 GPU nodes][ref-alps-a100-node] for **todo**
+    Bristen is a smaller system with [A100 GPU nodes][ref-alps-a100-node] for data processing, development, x86 workloads, and inference services.
 </div>

 [](){#ref-mlp-storage}
