
Commit c7703c7

Merge branch 'main' into expand-communication
2 parents: b489566 + d94f4ce

8 files changed, +487 -10 lines changed


docs/alps/clusters.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ Clusters on Alps are provided as part of different [platforms][ref-alps-platform
[:octicons-arrow-right-24: Clariden][ref-cluster-clariden]

- Bristen is a small system with a100 nodes, used for **todo**
+ Bristen is a small system with A100 nodes used for data processing, development, x86 workloads and ML inference services.

[:octicons-arrow-right-24: Bristen][ref-cluster-bristen]
</div>

docs/clusters/bristen.md

Lines changed: 86 additions & 2 deletions
@@ -1,6 +1,90 @@
[](){#ref-cluster-bristen}
# Bristen

- !!! todo
-     use the [clariden][clariden] as template.

Bristen is an Alps cluster that provides GPU accelerators and filesystems designed to meet the needs of machine learning workloads in the [MLP][ref-platform-mlp].

## Cluster Specification

### Compute Nodes

Bristen consists of 32 [NVIDIA A100 nodes][ref-alps-a100-node]. The number of nodes can change when nodes are added to or removed from other clusters on Alps.

| node type | number of nodes | total CPU sockets | total GPUs |
|----------------------------|----|----|-----|
| [a100][ref-alps-a100-node] | 32 | 32 | 128 |

Nodes are in the [`normal` slurm partition][ref-slurm-partition-normal].

### Storage and file systems

Bristen uses the [MLP filesystems and storage policies][ref-mlp-storage].

## Getting started

### Logging into Bristen

To connect to Bristen via SSH, first refer to the [ssh guide][ref-ssh].

!!! example "`~/.ssh/config`"
    Add the following to your [SSH configuration][ref-ssh-config] so that you can connect directly to Bristen with `ssh bristen`.
    ```
    Host bristen
        HostName bristen.alps.cscs.ch
        ProxyJump ela
        User cscsusername
        IdentityFile ~/.ssh/cscs-key
        IdentitiesOnly yes
    ```
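
With this entry in place you can log in with a single command; without it, you can jump through `ela.cscs.ch` explicitly. A minimal sketch, assuming the placeholder username `cscsusername` from the example above:

```console
$ ssh bristen
# equivalent, without the ~/.ssh/config entry:
$ ssh -J cscsusername@ela.cscs.ch cscsusername@bristen.alps.cscs.ch
```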

### Software

Users are encouraged to use containers on Bristen.

* Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine]; a sketch of a typical workflow is shown below.
* To build images, see the [guide to building container images on Alps][ref-build-containers].
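
For illustration, a minimal sketch of a containerized job submission. The image, EDF file name, and mount path below are placeholder examples rather than recommendations; refer to the [container engine][ref-container-engine] documentation for the supported EDF fields.

```bash
# Hypothetical environment definition file (EDF) describing the container.
mkdir -p $HOME/.edf
cat > $HOME/.edf/pytorch-example.toml << 'EOF'
image = "nvcr.io#nvidia/pytorch:24.01-py3"
mounts = ["/capstor/scratch/cscs/cscsusername:/workspace"]
workdir = "/workspace"
EOF

# Run a command inside the container on one node of the normal partition.
srun --partition=normal --nodes=1 --environment=pytorch-example \
    python -c "import torch; print(torch.cuda.device_count())"
```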

## Running Jobs on Bristen

### SLURM

Bristen uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads such as training runs.

There is currently a single slurm partition on the system:

* the `normal` partition is for all production workloads.
    * nodes in this partition are not shared.

| name | nodes | max nodes per job | time limit |
| -- | -- | -- | -- |
| `normal` | 32 | - | 24 hours |
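
For example, a minimal batch script for the `normal` partition might look as follows; the account, job name, and application are placeholders to adapt to your project and workload.

```bash title="a100-example.sbatch"
#!/bin/bash
#SBATCH --job-name=a100-example
#SBATCH --partition=normal
#SBATCH --account=<project>       # replace with your project account
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4       # one rank per A100 GPU
#SBATCH --gpus-per-node=4
#SBATCH --time=01:00:00

# Launch one rank per GPU; replace ./my_app with your application.
srun ./my_app
```

Submit it with `sbatch a100-example.sbatch` and monitor it with `squeue --me`.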

<!--
See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].

??? example "how to check the number of nodes on the system"
    You can check the size of the system by running the following command in the terminal:
    ```console
    $ sinfo --format "| %20R | %10D | %10s | %10l | %10A |"
    | PARTITION | NODES | JOB_SIZE | TIMELIMIT | NODES(A/I) |
    | debug | 32 | 1-2 | 30:00 | 3/29 |
    | normal | 1266 | 1-infinite | 1-00:00:00 | 812/371 |
    | xfer | 2 | 1 | 1-00:00:00 | 1/1 |
    ```
    The last column shows the number of nodes that have been allocated to currently running jobs (`A`) and the number of nodes that are idle (`I`).
-->

### FirecREST

Bristen can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v1` API endpoint.
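
As a minimal sketch, the endpoint can be queried with `curl`, assuming you already have a valid access token in `$TOKEN`; the available routes and required headers are documented on the [FirecREST][ref-firecrest] service page.

```console
$ curl -H "Authorization: Bearer $TOKEN" \
       -H "X-Machine-Name: bristen" \
       "https://api.cscs.ch/ml/firecrest/v1/utilities/ls?targetPath=/users/cscsusername"
```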

### Scheduled Maintenance

Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this time frame. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc.), then a Slurm reservation will be in place that prevents jobs from running into the maintenance window.

Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the user mailing list and on the [CSCS status page](https://status.cscs.ch).

### Change log

!!! change "2025-03-05 container engine updated"
    The updated container engine improves container performance; users do not need to change their workflow to take advantage of these updates.

### Known issues

docs/clusters/clariden.md

Lines changed: 3 additions & 2 deletions
@@ -73,12 +73,13 @@ There are two slurm partitions on the system:
| name | nodes | max nodes per job | time limit |
| -- | -- | -- | -- |
- | `normal` | 1266 | - | 24 hours |
- | `debug` | 32 | 2 | 30 minutes |
+ | `normal` | 1204 | - | 24 hours |
+ | `debug` | 24 | 2 | 1.5 node-hours |
| `xfer` | 2 | 1 | 24 hours |

* nodes in the `normal` and `debug` partitions are not shared
* nodes in the `xfer` partition can be shared
+ * nodes in the `debug` queue have a 1.5 node-hour time limit: you could, for example, request 2 nodes for 45 minutes, or a single node for the full 1.5 hours.

See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].

docs/platforms/mlp/index.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ The main cluster provided by the MLP is Clariden, a large Grace-Hopper GPU syste
<div class="grid cards" markdown>

- :fontawesome-solid-mountain: [__Bristen__][ref-cluster-bristen]

-     Bristen is a smaller system with [A100 GPU nodes][ref-alps-a100-node] for **todo**
+     Bristen is a smaller system with [A100 GPU nodes][ref-alps-a100-node] for data processing, development, x86 workloads and inference services.

</div>

[](){#ref-mlp-storage}

docs/running/slurm.md

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ In these cases SLURM jobs must be configured to assign multiple ranks to a singl
This is best done using [NVIDIA's Multi-Process Service (MPS)].
To use MPS, launch your application using the following wrapper script, which will start MPS on one rank per node and assign GPUs to ranks according to the CPU mask of a rank, ensuring the closest GPU is used:

- ```bash
+ ```bash title="mps-wrapper.sh"
#!/bin/bash
# Example mps-wrapper.sh usage:
# > srun [srun args] mps-wrapper.sh [cmd] [cmd args]

docs/services/firecrest.md

Lines changed: 2 additions & 1 deletion
@@ -45,7 +45,8 @@ FirecREST is available for all three major [Alps platforms][ref-alps-platforms],
<tr><th>Platform</th><th>Version</th><th>API Endpoint</th><th>Clusters</th></tr>
<tr><td style="vertical-align: middle;" rowspan="2">HPC Platform</td><td>v1</td><td>https://api.cscs.ch/hpc/firecrest/v1</td><td style="vertical-align: middle;" rowspan="2"><a href="../../clusters/daint">Daint</a>, <a href="../../clusters/eiger">Eiger</a></td></tr>
<tr> <td>v2</td><td>https://api.cscs.ch/hpc/firecrest/v2</td></tr>
- <tr><td>ML Platform</td><td>v1</td><td>https://api.cscs.ch/ml/firecrest/v1</td><td style="vertical-align: middle;"><a href="../../clusters/bristen">Bristen</a>, <a href="../../clusters/clariden">Clariden</a></td></tr>
+ <tr><td style="vertical-align: middle;" rowspan="2">ML Platform</td><td>v1</td><td>https://api.cscs.ch/ml/firecrest/v1</td><td style="vertical-align: middle;" rowspan="2"><a href="../../clusters/bristen">Bristen</a>, <a href="../../clusters/clariden">Clariden</a></td></tr>
+ <tr> <td>v2</td><td>https://api.cscs.ch/ml/firecrest/v2</td></tr>
<tr><td style="vertical-align: middle;" rowspan="2">CW Platform</td><td>v1</td><td>https://api.cscs.ch/cw/firecrest/v1</td><td style="vertical-align: middle;" rowspan="2"><a href="../../clusters/santis">Santis</a></td></tr>
<tr><td>v2</td><td>https://api.cscs.ch/cw/firecrest/v2</td></tr>
</table>
