[](){#ref-cluster-santis}
# Santis

Santis is an Alps cluster that provides GPU accelerators and file systems designed to meet the needs of climate and weather models for the [CWp][ref-platform-cwp].

## Cluster specification

### Compute nodes

Santis consists of around ??? [Grace-Hopper nodes][ref-alps-gh200-node].
The number of nodes can change when nodes are moved to or from other clusters on Alps.

There are four login nodes, labelled `santis-ln00[1-4]`.
You will be assigned to one of the four login nodes when you log in via SSH, from where you can edit files, compile applications, and start simulation jobs.

| node type | number of nodes | total CPU sockets | total GPUs |
|-----------|-----------------|-------------------|------------|
| [gh200][ref-alps-gh200-node] | 1,200 | 4,800 | 4,800 |

### Storage and file systems

Santis uses the [CWp filesystems and storage policies][ref-cwp-storage].

## Getting started

### Logging into Santis

To connect to Santis via SSH, first refer to the [ssh guide][ref-ssh].

!!! example "`~/.ssh/config`"
    Add the following to your [SSH configuration][ref-ssh-config] to connect directly to Santis using `ssh santis`.
    ```
    Host santis
        HostName santis.alps.cscs.ch
        ProxyJump ela
        User cscsusername
        IdentityFile ~/.ssh/cscs-key
        IdentitiesOnly yes
    ```

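With this configuration in place, `ssh santis` will take you through the `ela` jump host to one of the login nodes. A minimal sketch of a first session is shown below; the login node in the output is only an example, since you will be assigned one of `santis-ln00[1-4]`.

```terminal
# connect through the jump host using the configuration above
> ssh santis

# check which login node you were assigned
> hostname
santis-ln002
```
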
### Software

CSCS and the user community provide software environments, packaged as [uenv][ref-uenv], on Santis.

Currently, the following uenv are provided for the climate and weather community:

* `icon/25.1`
* `climana/25.1`

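For example, a minimal sketch of downloading and starting the `icon/25.1` uenv using the standard [uenv][ref-uenv] workflow; run `uenv image find` first to see the exact image names and versions that are currently available.

```bash
# list the uenv images available for this cluster
uenv image find

# download the icon uenv to your local repository
uenv image pull icon/25.1

# start a new shell with the uenv mounted
uenv start icon/25.1
```
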
In addition to the climate and weather uenv, all of the uenv built for other Alps clusters can also be used on Santis.

??? example "using uenv provided for other clusters"
    You can run uenv that were built for other Alps clusters using the `@` notation.
    For example, to use uenv images for [daint][ref-cluster-daint]:
    ```bash
    # list all images available for daint
    uenv image find @daint

    # download an image for daint
    uenv image pull namd/3.0:v3@daint

    # start the uenv
    uenv start namd/3.0:v3@daint
    ```

It is also possible to use HPC containers on Santis:

* Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine]; a minimal sketch is shown after this list.
* To build images, see the [guide to building container images on Alps][ref-build-containers].
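
As a rough sketch of the container engine workflow, a container is described by an environment definition file (EDF) that is referenced by name when launching a job. The image, file name, and command below are illustrative placeholders only; refer to the [container engine][ref-container-engine] documentation for the authoritative syntax.

```bash
# create a minimal environment definition file (EDF);
# the image reference here is a hypothetical example
mkdir -p $HOME/.edf
cat > $HOME/.edf/ubuntu.toml << 'EOF'
image = "library/ubuntu:24.04"
EOF

# run a command inside the container on a compute node
srun --environment=ubuntu cat /etc/os-release
```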


## Running jobs on Santis

### SLURM

Santis uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as large multi-node simulations.

There are three Slurm partitions on the system:

* the `normal` partition is for all production workloads.
* the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes.
* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal] at CSCS.

| name | nodes | max nodes per job | time limit |
| -- | -- | -- | -- |
| `normal` | 1266 | - | 24 hours |
| `debug` | 32 | 2 | 30 minutes |
| `xfer` | 2 | 1 | 24 hours |

* nodes in the `normal` and `debug` partitions are not shared
* nodes in the `xfer` partition can be shared

See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
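
As a rough sketch, a batch script for the `normal` partition might look like the following; the account and executable names are placeholders, and one MPI rank per GPU (four per node) is a common starting point on the GH200 nodes.

```bash
#!/bin/bash
#SBATCH --job-name=my-simulation   # placeholder job name
#SBATCH --partition=normal         # use "debug" for short test runs
#SBATCH --nodes=2                  # number of GH200 nodes
#SBATCH --ntasks-per-node=4        # one rank per GPU (4 GPUs per node)
#SBATCH --time=01:00:00            # must fit within the partition time limit
#SBATCH --account=<project>        # replace with your project account

# launch the (placeholder) application on all allocated nodes
srun ./my_simulation
```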

??? example "how to check the number of nodes on the system"
    You can check the size of the system by running the following command in the terminal:
    ```terminal
    > sinfo --format "| %20R | %10D | %10s | %10l | %10A |"
    | PARTITION            | NODES      | JOB_SIZE   | TIMELIMIT  | NODES(A/I) |
    | debug                | 32         | 1-2        | 30:00      | 3/29       |
    | normal               | 1266       | 1-infinite | 1-00:00:00 | 812/371    |
    | xfer                 | 2          | 1          | 1-00:00:00 | 1/1        |
    ```
    The last column shows the number of nodes that are allocated to currently running jobs (`A`) and the number of nodes that are idle (`I`).

### FirecREST

Santis can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v1` API endpoint.
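
For example, a rough sketch of querying the API with `curl`, assuming you have already obtained an access token; the `/status/systems` route is an assumption based on the FirecREST v1 API, so check the [FirecREST][ref-firecrest] documentation for the routes available on this deployment.

```bash
# TOKEN must hold a valid FirecREST access token
# query the systems exposed by this API endpoint (route is an assumption)
curl -H "Authorization: Bearer $TOKEN" \
     https://api.cscs.ch/ml/firecrest/v1/status/systems
```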

## Maintenance and status

### Scheduled maintenance

Wednesday mornings from 8:00 to 12:00 CET are reserved for periodic updates, and services may be unavailable during this time frame. If the queues need to be drained (for redeployment of node images, rebooting of compute nodes, etc.), a Slurm reservation will be put in place to prevent jobs from running into the maintenance window.

Exceptional and non-disruptive updates may happen outside this time frame and will be announced on the user mailing list and on the [CSCS status page](https://status.cscs.ch).

### Change log

!!! change "2025-03-05 container engine updated"
    The container engine now runs containers with improved performance. Users do not need to change their workflow to take advantage of these updates.

??? change "2024-10-07 old event"
    This is an old update. Use `???` to automatically fold the update.

### Known issues