|
1 | 1 | [](){#ref-cluster-bristen} |
2 | 2 | # Bristen |
3 | 3 |
|
4 | | -!!! todo |
5 | | - use the [clariden][clariden] as template. |
| 4 | +Bristen is an Alps cluster that provides GPU accelerators and filesystems designed to meet the needs of machine learning workloads in the [MLP][ref-platform-mlp]. |
6 | 5 |
|
| 6 | +## Cluster Specification |
| 7 | + |
| 8 | +### Compute Nodes |
| 9 | +Bristen consists of 32 [NVIDIA A100 nodes][ref-alps-a100-node]. The number of nodes can change as nodes are moved to or from other clusters on Alps.
| 10 | + |
| 11 | +| node type | number of nodes | total CPU sockets | total GPUs | |
| 12 | +|-----------|--------| ----------------- | ---------- | |
| 13 | +| [a100][ref-alps-a100-node] | 32 | 32 | 128 | |
| 14 | + |
| 15 | +Nodes are in the [`normal` Slurm partition][ref-slurm-partition-normal].
| 16 | + |
| 17 | +### Storage and file systems |
| 18 | + |
| 19 | +Bristen uses the [MLP filesystems and storage policies][ref-mlp-storage].
| 20 | + |
| 21 | +## Getting started |
| 22 | + |
| 23 | +### Logging into Bristen |
| 24 | + |
| 25 | +To connect to Bristen via SSH, first refer to the [ssh guide][ref-ssh]. |
| 26 | + |
| 27 | +!!! example "`~/.ssh/config`" |
| 28 | + Add the following to your [SSH configuration][ref-ssh-config] to connect directly to Bristen using `ssh bristen`.
| 29 | + ``` |
| 30 | + Host bristen |
| 31 | + HostName bristen.alps.cscs.ch |
| 32 | + ProxyJump ela |
| 33 | + User cscsusername |
| 34 | + IdentityFile ~/.ssh/cscs-key |
| 35 | + IdentitiesOnly yes |
| 36 | + ``` |
| 37 | + |
| 38 | +### Software |
| 39 | + |
| 40 | +Users are encouraged to use containers on Bristen. |
| 41 | + |
| 42 | +* Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine]. |
| 43 | +* To build images, see the [guide to building container images on Alps][ref-build-containers]. |
| 44 | + |
| 45 | +## Running Jobs on Bristen |
| 46 | + |
| 47 | +### SLURM |
| 48 | + |
| 49 | +Bristen uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs. |
| 50 | + |
| 51 | +There is currently a single Slurm partition on the system:
| 52 | + |
| 53 | +* the `normal` partition is for all production workloads. |
| 54 | + + nodes in this partition are not shared. |
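
As a rough illustration of a job targeting the `normal` partition, the sketch below writes a minimal batch script to a file. The file name `bristen-job.sh`, the job name, the node count, and the time limit are hypothetical. The four tasks per node assume one task per GPU, which is inferred from the table above (128 GPUs across 32 nodes) and not stated by this page.

```shell
# Write a minimal Slurm batch script to bristen-job.sh (hypothetical file name).
cat > bristen-job.sh <<'EOF'
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=bristen-test
# The single partition listed above.
#SBATCH --partition=normal
# Nodes in this partition are not shared, so whole nodes are requested.
#SBATCH --nodes=2
# Assumption: one task per GPU, with 4 GPUs per node.
#SBATCH --ntasks-per-node=4
# Assumption: a short time limit for a smoke test.
#SBATCH --time=00:10:00

# Print the host name once per task to confirm the allocation.
srun hostname
EOF
```

The script could then be submitted from a login node with `sbatch bristen-job.sh` and monitored with `squeue --me`. Project or account options that a given allocation may require are not shown here.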
| 55 | + |
| 56 | +<!-- |
| 57 | +See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200]. |
| 58 | +
|
| 59 | +??? example "how to check the number of nodes on the system" |
| 60 | + You can check the size of the system by running the following command in the terminal: |
| 61 | + ```console |
| 62 | + $ sinfo --format "| %20R | %10D | %10s | %10l | %10A |" |
| 63 | + | PARTITION | NODES | JOB_SIZE | TIMELIMIT | NODES(A/I) | |
| 64 | + | debug | 32 | 1-2 | 30:00 | 3/29 | |
| 65 | + | normal | 1266 | 1-infinite | 1-00:00:00 | 812/371 | |
| 66 | + | xfer | 2 | 1 | 1-00:00:00 | 1/1 | |
| 67 | + ``` |
| 68 | + The last column shows the number of nodes that have been allocated in currently running jobs (`A`) and the number of nodes that are idle (`I`).
| 69 | +--> |
| 70 | + |
| 71 | +### FirecREST |
| 72 | + |
| 73 | +Bristen can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v2` API endpoint.
| 74 | + |
| 75 | +### Scheduled Maintenance |
| 76 | + |
| 77 | +Wednesday mornings, 8:00-12:00 CET, are reserved for periodic updates, and services may be unavailable during this time frame. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc.), a Slurm reservation will be in place to prevent jobs from running into the maintenance window.
| 78 | + |
| 79 | +Exceptional and non-disruptive updates may happen outside this time frame and will be announced on the users mailing list and on the [CSCS status page](https://status.cscs.ch).
| 80 | + |
| 81 | +### Change log |
| 82 | + |
| 83 | +!!! change "2025-03-05 container engine updated" |
| 84 | + The container engine now supports improved, faster containers. Users do not need to change their workflow to take advantage of these updates.
| 85 | + |
| 86 | +### Known issues |