1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -4,3 +4,4 @@ docs/software/communication @msimberg
docs/software/devtools/linaro @jgphpc
docs/software/prgenv/linalg.md @finkandreas @msimberg
docs/software/sciapps/cp2k.md @abussy @RMeli
docs/software/sciapps/lammps.md @nickjbrowning
2 changes: 1 addition & 1 deletion docs/alps/clusters.md
@@ -14,7 +14,7 @@ Clusters on Alps are provided as part of different [platforms][ref-alps-platform

[:octicons-arrow-right-24: Clariden][ref-cluster-clariden]

Bristen is a small system with a100 nodes, used for **todo**
Bristen is a small system with A100 nodes used for data processing, development, x86 workloads and ML inference services.

[:octicons-arrow-right-24: Bristen][ref-cluster-bristen]
</div>
84 changes: 82 additions & 2 deletions docs/clusters/bristen.md
@@ -1,6 +1,86 @@
[](){#ref-cluster-bristen}
# Bristen

!!! todo
    use the [clariden][clariden] as template.
Bristen is an Alps cluster that provides GPU accelerators and filesystems designed to meet the needs of machine learning workloads in the [MLP][ref-platform-mlp].

## Cluster Specification

### Compute Nodes
Bristen consists of 32 [NVIDIA A100 nodes][ref-alps-a100-node]. The number of nodes can change when nodes are added to or removed from other clusters on Alps.

| node type                  | number of nodes | total CPU sockets | total GPUs |
|----------------------------|-----------------|-------------------|------------|
| [a100][ref-alps-a100-node] | 32              | 32                | 128        |

Nodes are in the [`normal` Slurm partition][ref-slurm-partition-normal].

### Storage and file systems

Bristen uses the [MLP filesystems and storage policies][ref-mlp-storage].

## Getting started

### Logging into Bristen

To connect to Bristen via SSH, first refer to the [SSH guide][ref-ssh].

!!! example "`~/.ssh/config`"
    Add the following to your [SSH configuration][ref-ssh-config] to connect directly to Bristen using `ssh bristen`.
    ```
    Host bristen
        HostName bristen.alps.cscs.ch
        ProxyJump ela
        User cscsusername
        IdentityFile ~/.ssh/cscs-key
        IdentitiesOnly yes
    ```
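
With this configuration in place, and assuming `cscsusername` has been replaced with your actual CSCS username, logging in is a single command:

```console
$ ssh bristen
```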

### Software

Users are encouraged to use containers on Bristen.

* Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine]; a minimal example is sketched after this list.
* To build images, see the [guide to building container images on Alps][ref-build-containers].
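
As a sketch of what this can look like in practice, a container environment is described in a TOML environment definition file (EDF) and selected at job submission with `--environment`. The file name, image, and mount path below are illustrative assumptions, not site defaults; see the container engine documentation for the authoritative options.

```
# ~/.edf/ngc-pytorch.toml -- hypothetical EDF; adjust the image and mounts for your project
image = "nvcr.io#nvidia/pytorch:24.01-py3"
mounts = ["/capstor/scratch/cscs/<username>:/scratch"]
workdir = "/scratch"
```

The environment can then be requested when launching a job:

```console
$ srun --environment=ngc-pytorch nvidia-smi
```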

## Running Jobs on Bristen

### SLURM

Bristen uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.

There is currently a single Slurm partition on the system (a sample batch script is sketched below the list):

* the `normal` partition is for all production workloads.
* nodes in this partition are not shared.
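
The following batch script is a minimal sketch, assuming GPU requests via `--gpus-per-node` are enabled on the system; the job name, time limit, and application launch line are illustrative placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=example          # illustrative name
#SBATCH --partition=normal          # the single production partition
#SBATCH --nodes=1                   # nodes are allocated exclusively
#SBATCH --gpus-per-node=4           # each A100 node provides 4 GPUs
#SBATCH --time=01:00:00             # adjust for your workload

# one task per GPU; replace ./my_app with your application
srun --ntasks-per-node=4 ./my_app
```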

<!--
See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].

??? example "how to check the number of nodes on the system"
You can check the size of the system by running the following command in the terminal:
```console
$ sinfo --format "| %20R | %10D | %10s | %10l | %10A |"
| PARTITION | NODES | JOB_SIZE | TIMELIMIT | NODES(A/I) |
| debug | 32 | 1-2 | 30:00 | 3/29 |
| normal | 1266 | 1-infinite | 1-00:00:00 | 812/371 |
| xfer | 2 | 1 | 1-00:00:00 | 1/1 |
```
The last column shows the number of nodes that have been allocated to currently running jobs (`A`) and the number of nodes that are idle (`I`).
-->

### FirecREST

Bristen can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v2` API endpoint.
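
As a sketch, assuming you have already obtained an access token for the CSCS API (the `/status/systems` resource path is an assumption to check against the FirecREST v2 documentation), a query against the endpoint could look like:

```console
$ export TOKEN="<access token>"
$ curl -H "Authorization: Bearer $TOKEN" \
    "https://api.cscs.ch/ml/firecrest/v2/status/systems"
```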

### Scheduled Maintenance

Wednesday mornings from 8:00 to 12:00 CET are reserved for periodic updates, and services may be unavailable during this time frame. If the queues must be drained (for example, to redeploy node images or reboot compute nodes), a Slurm reservation will be put in place to prevent jobs from running into the maintenance window.

Exceptional and non-disruptive updates may happen outside this time frame; they will be announced on the user mailing list and on the [CSCS status page](https://status.cscs.ch).

### Change log

!!! change "2025-03-05 container engine updated"
    The container engine was updated with performance improvements for containerized workloads. Users do not need to change their workflows to take advantage of these updates.

### Known issues
2 changes: 1 addition & 1 deletion docs/clusters/clariden.md
@@ -95,7 +95,7 @@ See the SLURM documentation for instructions on how to run jobs on the [Grace-Ho

### FirecREST

Clariden can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v1` API endpoint.
Clariden can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v2` API endpoint.

## Maintenance and status

2 changes: 1 addition & 1 deletion docs/platforms/mlp/index.md
@@ -25,7 +25,7 @@ The main cluster provided by the MLP is Clariden, a large Grace-Hopper GPU syste
<div class="grid cards" markdown>
- :fontawesome-solid-mountain: [__Bristen__][ref-cluster-bristen]

Bristen is a smaller system with [A100 GPU nodes][ref-alps-a100-node] for **todo**
Bristen is a smaller system with [A100 GPU nodes][ref-alps-a100-node] for data processing, development, x86 workloads and inference services.
</div>

[](){#ref-mlp-storage}