
Commit c001790

Bristen docs (#81)
* added initial lammps docs.
* update to codeowners
* added bristen docs
* changed firecrest ref to v2
1 parent 0d2f0bb commit c001790

4 files changed: 85 additions, 5 deletions

docs/alps/clusters.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ Clusters on Alps are provided as part of different [platforms][ref-alps-platform

[:octicons-arrow-right-24: Clariden][ref-cluster-clariden]

- Bristen is a small system with a100 nodes, used for **todo**
+ Bristen is a small system with A100 nodes used for data processing, development, x86 workloads and ML inference services.

[:octicons-arrow-right-24: Bristen][ref-cluster-bristen]
</div>

docs/clusters/bristen.md

Lines changed: 82 additions & 2 deletions
@@ -1,6 +1,86 @@

[](){#ref-cluster-bristen}
# Bristen

- !!! todo
-     use the [clariden][clariden] as template.

Bristen is an Alps cluster that provides GPU accelerators and filesystems designed to meet the needs of machine learning workloads in the [MLP][ref-platform-mlp].

## Cluster Specification

### Compute Nodes

Bristen consists of 32 [NVIDIA A100 nodes][ref-alps-a100-node]. The number of nodes can change when nodes are added to or removed from other clusters on Alps.

| node type                  | number of nodes | total CPU sockets | total GPUs |
|----------------------------|-----------------|-------------------|------------|
| [a100][ref-alps-a100-node] | 32              | 32                | 128        |

Nodes are in the [`normal` Slurm partition][ref-slurm-partition-normal].

### Storage and file systems

Bristen uses the [MLP filesystems and storage policies][ref-mlp-storage].

## Getting started

### Logging into Bristen

To connect to Bristen via SSH, first refer to the [SSH guide][ref-ssh].

!!! example "`~/.ssh/config`"
    Add the following to your [SSH configuration][ref-ssh-config] to connect directly to Bristen using `ssh bristen`.
    ```
    Host bristen
        HostName bristen.alps.cscs.ch
        ProxyJump ela
        User cscsusername
        IdentityFile ~/.ssh/cscs-key
        IdentitiesOnly yes
    ```
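
With this entry in place, `ssh bristen` is routed through the `ela` jump host automatically. A quick check from your local machine, assuming your key has been added to an SSH agent:

```console
$ ssh bristen
```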

### Software

Users are encouraged to use containers on Bristen.

* Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine], as sketched below.
* To build images, see the [guide to building container images on Alps][ref-build-containers].
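
The sketch below shows one way a containerized job might look. It assumes a TOML environment definition file (EDF) stored under `~/.edf/`, an NGC PyTorch image, and the `--environment` flag for `srun`; the image, paths, and names are illustrative, so refer to the container engine documentation for the authoritative workflow.

```console
$ cat ~/.edf/pytorch.toml
image = "nvcr.io#nvidia/pytorch:24.11-py3"              # container image to run (illustrative)
mounts = ["/capstor/scratch/cscs/<username>:/scratch"]  # bind-mount scratch into the container
workdir = "/scratch"                                    # working directory inside the container

$ srun --environment=pytorch --partition=normal --nodes=1 \
    python -c 'import torch; print(torch.cuda.device_count())'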

## Running Jobs on Bristen

### SLURM

Bristen uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads such as training runs.

There is currently a single Slurm partition on the system:

* the `normal` partition is for all production workloads (a minimal batch script is sketched after this list).
    * nodes in this partition are not shared.
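
A minimal batch-script sketch for the `normal` partition; the job name, time limit, and `<account>` placeholder are illustrative and should be adapted to your project:

```bash
#!/bin/bash
#SBATCH --job-name=a100-check
#SBATCH --partition=normal
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --account=<account>   # replace with your project account

# list the GPUs visible to the job
srun nvidia-smi -L
```

Submit the script with `sbatch` and monitor it with `squeue -u $USER`.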

<!--
See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].

??? example "how to check the number of nodes on the system"
    You can check the size of the system by running the following command in the terminal:
    ```console
    $ sinfo --format "| %20R | %10D | %10s | %10l | %10A |"
    | PARTITION            | NODES      | JOB_SIZE   | TIMELIMIT  | NODES(A/I) |
    | debug                | 32         | 1-2        | 30:00      | 3/29       |
    | normal               | 1266       | 1-infinite | 1-00:00:00 | 812/371    |
    | xfer                 | 2          | 1          | 1-00:00:00 | 1/1        |
    ```
    The last column shows the number of nodes that have been allocated in currently running jobs (`A`) and the number of nodes that are idle (`I`).
-->

### FirecREST

Bristen can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v2` API endpoint.
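
A minimal sketch of querying the endpoint, assuming a valid OAuth access token in `$TOKEN` and that the deployment exposes the FirecREST v2 `status/systems` route (the route is an assumption; only the base URL above comes from this page):

```console
$ curl -s -H "Authorization: Bearer $TOKEN" \
    "https://api.cscs.ch/ml/firecrest/v2/status/systems"
```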

### Scheduled Maintenance

Wednesday morning 8-12 CET is reserved for periodic updates, and services may be unavailable during this time frame. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc.), a Slurm reservation will be in place to prevent jobs from running into the maintenance window.

Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list and on the [CSCS status page](https://status.cscs.ch).

### Change log

!!! change "2025-03-05 container engine updated"
    The container engine has been updated to improve container performance. Users do not need to change their workflows to take advantage of these updates.

### Known issues

docs/clusters/clariden.md

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@ See the SLURM documentation for instructions on how to run jobs on the [Grace-Ho

### FirecREST

- Clariden can also be accessed using [FircREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v1` API endpoint.
+ Clariden can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v2` API endpoint.

## Maintenance and status

docs/platforms/mlp/index.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ The main cluster provided by the MLP is Clariden, a large Grace-Hopper GPU syste
<div class="grid cards" markdown>

- :fontawesome-solid-mountain: [__Bristen__][ref-cluster-bristen]

-     Bristen is a smaller system with [A100 GPU nodes][ref-alps-a100-node] for **todo**
+     Bristen is a smaller system with [A100 GPU nodes][ref-alps-a100-node] for data processing, development, x86 workloads and inference services.

</div>

[](){#ref-mlp-storage}