From 0158dac36a414f0a7399a42cf43bdb878af8e834 Mon Sep 17 00:00:00 2001 From: bcumming Date: Tue, 27 May 2025 11:22:47 +0200 Subject: [PATCH 01/14] copy hpcp landing page --- .gitignore | 2 ++ docs/platforms/cwp/index.md | 4 +-- docs/platforms/hpcp/index.md | 69 ++++++++++++++++++++++++++++++++++-- docs/storage/filesystems.md | 4 ++- 4 files changed, 74 insertions(+), 5 deletions(-) diff --git a/.gitignore b/.gitignore index 89a9d0db..798ffe9c 100644 --- a/.gitignore +++ b/.gitignore @@ -1,2 +1,4 @@ # path that contains html generated by `mkdocs build` site + +*.sw[nopq] diff --git a/docs/platforms/cwp/index.md b/docs/platforms/cwp/index.md index 64ea33b7..9f36b83a 100644 --- a/docs/platforms/cwp/index.md +++ b/docs/platforms/cwp/index.md @@ -1,5 +1,5 @@ [](){#ref-platform-cwp} -# Climate and weather platform +# Climate and Weather Platform The Climate and Weather Platform (CWP) provides compute, storage and support to the climate and weather modeling community in Switzerland. @@ -9,7 +9,7 @@ The Climate and Weather Platform (CWP) provides compute, storage and support to Project administrators (PIs and deputy PIs) of projects on the CWP can to invite users to join their project, before they can use the project's resources on Alps. -This is currently performed using the [project management tool][ref-account-ump]. +This is currently performed using the [account and resource management tool][ref-account-ump]. Once invited to a project, you will receive an email, which you can need to create an account and configure [multi-factor authentication][ref-mfa] (MFA). diff --git a/docs/platforms/hpcp/index.md b/docs/platforms/hpcp/index.md index 98a9b733..ecbdfbbd 100644 --- a/docs/platforms/hpcp/index.md +++ b/docs/platforms/hpcp/index.md @@ -1,5 +1,70 @@ [](){#ref-platform-hpcp} # HPC Platform -!!! todo - follow the template of the [MLp][ref-platform-mlp] +The HPCP (HPCP) provides compute, storage and support to the CSCS User Lab projects. 
+ +## Getting Started + +### Getting access + +Project administrators (PIs and deputy PIs) of projects on the HPCP can to invite users to join their project, before they can use the project's resources on Alps. + +This is currently performed using the [account and resource management tool][ref-account-ump]. + +Once invited to a project, you will receive an email, which you can need to create an account and configure [multi-factor authentication][ref-mfa] (MFA). + +## Systems + +
+- :fontawesome-solid-mountain: [__Daint__][ref-cluster-daint] + + Daint is a large [Grace-Hopper][ref-alps-gh200-node] cluster for GPU-enabled workloads. +
+ +
+- :fontawesome-solid-mountain: [__Eiger__][ref-cluster-eiger] + + Eiger is an [AMD Epyc][ref-alps-zen2-node] cluster for CPU-only workloads. +
+ +[](){#ref-hpcp-storage} +## File systems and storage + +There are three main file systems mounted on the HPCP clusters. + +| type |mount | filesystem | +| -- | -- | -- | +| [Home][ref-storage-home] | /users/$USER | [VAST][ref-alps-vast] | +| [Scratch][ref-storage-scratch] | `/capstor/scratch/cscs/$USER` | [Capstor][ref-alps-capstor] | +| [Store][ref-storage-store] | `/capstor/store/cscs/userlab/` | [Capstor][ref-alps-capstor] | + +### Home + +Every user has a [home][ref-storage-home] path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] filesystem. +The home directory has 50 GB of capacity, and is intended for configuration, small software packages and scripts. + +### Scratch + +The Scratch filesystem provides temporary storage for high-performance I/O for executing jobs. + +See the [Scratch][ref-storage-scratch] documentation for more information. + +The environment variable `SCRATCH=/capstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch. + +!!! warning "scratch cleanup policy" + Files that have not been accessed in 30 days are automatically deleted. + + **Scratch is not intended for permanent storage**: transfer files back to the [Store][ref-storage-store] after job runs. + +### Project Store + +Project storage is backed up, with no cleaning policy, as intermediate storage space for datasets, shared code or configuration scripts that need to be accessed from different vClusters. + +The environment variable `PROJECT` is set automatically when you log into the system, and can be used as a shortcut to access the Store path for your primary project. + +Hard limits on capacity and inodes prevent users from writing to [Store][ref-storage-store] if the quota is reached. +You can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or ela. + +!!! warning + It is not recommended to write directly to the `$PROJECT` path from jobs. 
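Taken together, these policies suggest a simple end-of-job routine: copy results from scratch into the Store from a login node, verify the copy, and only then clean up. A hedged sketch follows (demo paths stand in for `$SCRATCH` and `$PROJECT`, so the snippet can be tried anywhere):

```bash
# Sketch: archive job output from scratch into the backed-up Store.
# Demo paths are used so this runs anywhere; on the cluster you would
# use "$SCRATCH" and "$PROJECT" instead.
scratch=/tmp/demo-scratch
project=/tmp/demo-project

mkdir -p "$scratch/run42" "$project/archive"
printf 'result data\n' > "$scratch/run42/out.dat"

# Copy the run directory, then verify the copy before deleting anything
# from scratch.
cp -r "$scratch/run42" "$project/archive/"
cmp "$scratch/run42/out.dat" "$project/archive/run42/out.dat" && echo "archived"
```

On the cluster, `rsync -a` is a common alternative to `cp -r`, since it can be re-run safely if a transfer is interrupted.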
+ diff --git a/docs/storage/filesystems.md b/docs/storage/filesystems.md index a61b3c7e..9768a940 100644 --- a/docs/storage/filesystems.md +++ b/docs/storage/filesystems.md @@ -57,7 +57,7 @@ The command reports both disk space and the number of files for each filesystem/ ## Cleaning Policy and Data Retention - +[](){#ref-storage-scratch} ## Scratch The scratch file system is designed for performance rather than reliability, as a fast workspace for temporary storage. @@ -85,6 +85,7 @@ Keep also in mind that data on scratch are not backed up, therefore users are ad !!! note Do not use the `touch` command to prevent the cleaning policy from removing files, because this behaviour would deprive the community of a shared resource. +[](){#ref-storage-home} ## Users Users are not supposed to run jobs from this filesystem because of the low performance. In fact the emphasis on the `/users` filesystem is reliability over performance: all home directories are backed up with GPFS snapshots and no cleaning policy is applied. @@ -97,6 +98,7 @@ Expiration !!! warning All data will be deleted 3 months after the closure of the user account without further warning. +[](){#ref-storage-store} ## Store on Capstor The `/capstor/store` mount point of the Lustre file system `capstor` is intended for high-performance per-project storage on Alps. The mount point is accessible from the User Access Nodes (UANs) of Alps vClusters. From 53bc631d81d46206afbb1e9e3e35ac8000b2e3f6 Mon Sep 17 00:00:00 2001 From: bcumming Date: Tue, 27 May 2025 11:44:45 +0200 Subject: [PATCH 02/14] harmonise the CWP and HPCP storage docs --- docs/platforms/cwp/index.md | 29 ++++++++++++----------------- 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/docs/platforms/cwp/index.md b/docs/platforms/cwp/index.md index 9f36b83a..4db40e47 100644 --- a/docs/platforms/cwp/index.md +++ b/docs/platforms/cwp/index.md @@ -37,35 +37,30 @@ There are three main file systems mounted on the CWP system Santis. 
### Home -Every user has a home path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] filesystem. +Every user has a [home][ref-storage-home] path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] filesystem. The home directory has 50 GB of capacity, and is intended for configuration, small software packages and scripts. ### Scratch The Scratch filesystem provides temporary storage for high-performance I/O for executing jobs. -Use scratch to store datasets that will be accessed by jobs, and for job output. -Scratch is per user - each user gets separate scratch path and quota. -!!! info - A quota of 150 TB and 1 million inodes (files and folders) is applied to your scratch path. +See the [Scratch][ref-storage-scratch] documentation for more information. - These are implemented as soft quotas: upon reaching either limit there is a grace period of 1 week before write access to `$SCRATCH` is blocked. - - You can check your quota at any time from Ela or one of the login nodes, using the [`quota` command][ref-storage-quota]. - -!!! info - The environment variable `SCRATCH=/capstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch. +The environment variable `SCRATCH=/capstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch. !!! warning "scratch cleanup policy" Files that have not been accessed in 30 days are automatically deleted. - **Scratch is not intended for permanent storage**: transfer files back to the capstor project storage after job runs. + **Scratch is not intended for permanent storage**: transfer files back to the [Store][ref-storage-store] after job runs. + +### Project Store -### Project +Project storage is backed up, with no cleaning policy, as intermediate storage space for datasets, shared code or configuration scripts that need to be accessed from different vClusters. 
-Project storage is backed up, with no cleaning policy: it provides intermediate storage space for datasets, shared code or configuration scripts that need to be accessed from different vClusters. -Project is per project - each project gets a project folder with project-specific quota. +The environment variable `PROJECT` is set automatically when you log into the system, and can be used as a shortcut to access the Store path for your primary project. -* hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or ela. -* it is not recommended to write directly to the project path from jobs. +Hard limits on capacity and inodes prevent users from writing to [Store][ref-storage-store] if the quota is reached. +You can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or ela. +!!! warning + It is not recommended to write directly to the `$PROJECT` path from jobs. From cd2e37a1942209f4474ff073582b39db99aac288 Mon Sep 17 00:00:00 2001 From: twrobinson Date: Tue, 27 May 2025 14:11:52 +0200 Subject: [PATCH 03/14] Update index.md Tidied up language --- docs/platforms/hpcp/index.md | 34 ++++++++++++++++------------------ 1 file changed, 16 insertions(+), 18 deletions(-) diff --git a/docs/platforms/hpcp/index.md b/docs/platforms/hpcp/index.md index ecbdfbbd..5c1c445c 100644 --- a/docs/platforms/hpcp/index.md +++ b/docs/platforms/hpcp/index.md @@ -1,17 +1,15 @@ [](){#ref-platform-hpcp} # HPC Platform -The HPCP (HPCP) provides compute, storage and support to the CSCS User Lab projects. +The HPC Platform (HPCP) provides compute, storage, and related services for the HPC community in Switzerland and abroad. The majority of compute cycles are provided to the [User Lab](https://www.cscs.ch/user-lab/overview) via peer-reviewed allocation schemes. 
## Getting Started ### Getting access -Project administrators (PIs and deputy PIs) of projects on the HPCP can to invite users to join their project, before they can use the project's resources on Alps. +Principal Investigators (PIs) and Deputy PIs can invite users to join their projects using the [account and resource management tool][ref-account-ump]. -This is currently performed using the [account and resource management tool][ref-account-ump]. - -Once invited to a project, you will receive an email, which you can need to create an account and configure [multi-factor authentication][ref-mfa] (MFA). +Once invited to a project you will receive an email with information on how to create an account and configure [multi-factor authentication][ref-mfa] (MFA). ## Systems @@ -32,39 +30,39 @@ Once invited to a project, you will receive an email, which you can need to crea There are three main file systems mounted on the HPCP clusters. -| type |mount | filesystem | +| type |mount | file system | | -- | -- | -- | | [Home][ref-storage-home] | /users/$USER | [VAST][ref-alps-vast] | | [Scratch][ref-storage-scratch] | `/capstor/scratch/cscs/$USER` | [Capstor][ref-alps-capstor] | -| [Store][ref-storage-store] | `/capstor/store/cscs/userlab/` | [Capstor][ref-alps-capstor] | +| [Store][ref-storage-store] | `/capstor/store/cscs//` | [Capstor][ref-alps-capstor] | ### Home -Every user has a [home][ref-storage-home] path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] filesystem. -The home directory has 50 GB of capacity, and is intended for configuration, small software packages and scripts. +Every user has a [home][ref-storage-home] path (`$HOME`) mounted at `/users/$USER` on the [VAST][ref-alps-vast] file system. +Home directories have 50 GB of capacity and are intended for keeping configuration files, small software packages, and scripts. ### Scratch -The Scratch filesystem provides temporary storage for high-performance I/O for executing jobs. 
+The Scratch file system is a large, temporary storage system designed for high-performance I/O. It is not backed up. See the [Scratch][ref-storage-scratch] documentation for more information. -The environment variable `SCRATCH=/capstor/scratch/cscs/$USER` is set automatically when you log into the system, and can be used as a shortcut to access scratch. +The environment variable `$SCRATCH` points to `/capstor/scratch/cscs/$USER`, and can be used as a shortcut to access your scratch folder. !!! warning "scratch cleanup policy" Files that have not been accessed in 30 days are automatically deleted. - **Scratch is not intended for permanent storage**: transfer files back to the [Store][ref-storage-store] after job runs. + **Scratch is not intended for permanent storage**: transfer files back to the [Store][ref-storage-store] after batch job completion. -### Project Store +### Store -Project storage is backed up, with no cleaning policy, as intermediate storage space for datasets, shared code or configuration scripts that need to be accessed from different vClusters. +The Store (or Project) file system is provided as a space to store datasets, code, or configuration scripts that can be accessed from different clusters. The file system is backed up and there is no automated deletion policy. -The environment variable `PROJECT` is set automatically when you log into the system, and can be used as a shortcut to access the Store path for your primary project. +The environment variable `$STORE` can be used as a shortcut to access the Store folder of your primary project. -Hard limits on capacity and inodes prevent users from writing to [Store][ref-storage-store] if the quota is reached. -You can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or ela. +Hard limits on the amount of data and number of files (inodes) will prevent you from writing to [Store][ref-storage-store] if your quotas are exceeded. 
+You can check how much data and inodes you are consuming -- and their respective quotas -- by running the [`quota`][ref-storage-quota] command on a login node. !!! warning - It is not recommended to write directly to the `$PROJECT` path from jobs. + It is not recommended to write directly to the `$STORE` path from batch jobs. From 023e05c093e2a9968d31ba60cac4840797243aab Mon Sep 17 00:00:00 2001 From: bcumming Date: Tue, 27 May 2025 14:38:57 +0200 Subject: [PATCH 04/14] first pass at eiger docs --- docs/clusters/eiger.md | 193 ++++++++++++++++++++++++++++++++++++++++ docs/clusters/santis.md | 17 +--- docs/running/slurm.md | 12 +++ 3 files changed, 209 insertions(+), 13 deletions(-) diff --git a/docs/clusters/eiger.md b/docs/clusters/eiger.md index 05a1fc6e..2a21a4e7 100644 --- a/docs/clusters/eiger.md +++ b/docs/clusters/eiger.md @@ -1,3 +1,196 @@ [](){#ref-cluster-eiger} # Eiger +Eiger is an Alps cluster that provides compute nodes and file systems designed to meet the needs of CPU-only workloads for the [HPC Platform][ref-platform-hpcp]. + +!!! under-construction + This documentation is for `eiger.alps.cscs.ch` - an updated version of Eiger that will replace the existing `eiger.cscs.ch` cluster. + For help using the existing Eiger, see the [Eiger User Guide](https://confluence.cscs.ch/spaces/KB/pages/284426490/Alps+Eiger+User+Guide) on the old KB documentation site. + + The target date for full deployment of the new Eiger is **July 1, 2025**. + +!!! change "Important changes for `eiger.alps`" + The redeployment of `eiger.cscs.ch` as `eiger.alps.cscs.ch` introduces some chanages that may affect some users. + + ### Breaking changes + + !!! warning "Sarus is replaced with the container engine" + The Sarus container runtime is replaced with the [container engine][ref-container-engine]. + + If you are using Sarus to run containers on Eiger, you will have to [rebuild][ref-build-containers] and adapt your containers for container engine. + + !!! 
warning "Cray modules and EasyBuild are no longer supported" + The Cray Programming Environment (the `cray` module) is no longer supported by CSCS, along with software that CSCS provided using EasyBuild. + + The same version of the Cray modules is still available, along with software that was installed using them, however they will not receive updates or support from CSCS. + + You are strongly encouraged to start using [uenv][ref-cluster-eiger-uenv] to access supported applications and rebuild their applications. + + * The versions of compilers, `cray-mpich`, Python and libraries in uenv are up to date. + * The scientific application uenv have up to date versions of the supported applications. + + ### Unimplemented features + + !!! under-construction "FirecREST is not available yet" + [FirecREST][ref-firecrest] has not been configured on `eiger.alps` - it is still running on the old Eiger. + + **It will be deployed, and this documentation updated when it is.** + + ### Minor changes + + !!! change "SLURM was updated from version 23.02.6 to 24.05.4" + +## Cluster specification + +### Compute nodes + +!!! under-construction + Currently there are 19 nodes for projects to test and port workflows to the new Eiger deployment. + Nodes will be moved from `eiger.cscs.ch` to `eiger.alps.cscs.ch` at a later date. + +Eiger consists of 19 [AND Epyc Rome][ref-alps-zen2-node] compute nodes. + +There is one login node, labelled `eiger-ln010`. +You will be assigned to one of the four login nodes when you ssh onto the system, from where you can edit files, compile applications and start simulation jobs. + +| node type | number of nodes | total CPU sockets | total GPUs | +|-----------|-----------------| ----------------- | ---------- | +| [zen2][ref-alps-zen2-node] | 19 | 38 | - | + +### Storage and file systems + +Eiger uses the [HPCP filesystems and storage policies][ref-hpcp-storage]. 
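Because the scratch cleanup policy applies on Eiger as well, it can be useful to list files whose last access time puts them close to the 30-day deletion window. A small sketch (a demo directory stands in for `$SCRATCH`, and the 20-day threshold is an arbitrary early-warning value):

```bash
# Sketch: find files not accessed for more than 20 days, i.e. files
# approaching the 30-day scratch cleanup window.
# A demo directory stands in for "$SCRATCH".
dir=/tmp/demo-scan
mkdir -p "$dir"
touch "$dir/fresh.dat"
touch -a -d "40 days ago" "$dir/stale.dat"  # backdate the access time (GNU touch)

# -atime +20 matches files last accessed more than 20 days ago
find "$dir" -type f -atime +20              # prints /tmp/demo-scan/stale.dat
```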
+ +## Getting started + +### Logging into Eiger + +To connect to Eiger via SSH, first refer to the [ssh guide][ref-ssh]. + +!!! example "`~/.ssh/config`" + Add the following to your [SSH configuration][ref-ssh-config] to enable you to directly connect to eiger using `ssh eiger.alps`. + ``` + Host eiger.alps + HostName eiger.alps.cscs.ch + ProxyJump ela + User cscsusername + IdentityFile ~/.ssh/cscs-key + IdentitiesOnly yes + ``` + +### Software + +[](){#ref-cluster-eiger-uenv} +#### uenv + +CSCS and the user community provide [uenv][ref-uenv] software environments on Eiger. + + +
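A typical uenv session looks roughly like the following; the image name and version tag below are placeholders, so list what is actually available with `uenv image find` first:

```console
$ uenv image find                      # list uenv images available on this cluster
$ uenv image pull prgenv-gnu/24.11:v1  # pull an image (name and tag assumed)
$ uenv start prgenv-gnu/24.11:v1 --view=default
```

See the [uenv][ref-uenv] documentation for the full command set.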
+ +- :fontawesome-solid-layer-group: __Scientific Applications__ + + Provide the latest versions of scientific applications, tuned for Eiger, and the tools required to build your own version of the applications. + + * [CP2K][ref-uenv-cp2k] + * [GROMACS][ref-uenv-gromacs] + * [LAMMPS][ref-uenv-lammps] + * [NAMD][ref-uenv-namd] + * [Quantumespresso][ref-uenv-quantumespresso] + * [VASP][ref-uenv-vasp] + +
+ +
+ +- :fontawesome-solid-layer-group: __Programming Environments__ + + Provide compilers, MPI, Python, common libraries and tools used to build your own applications. + + * [prgenv-gnu][ref-uenv-prgenv-gnu] + * [linalg][ref-uenv-linalg] + * [julia][ref-uenv-julia] +
+ +
+ +- :fontawesome-solid-layer-group: __Tools__ + + Provide tools like + + * [Linaro Forge][ref-uenv-linaro] +
+ +[](){#ref-cluster-eiger-containers} +#### Containers + +Eiger supports container workloads using the [container engine][ref-container-engine]. + +To build images, see the [guide to building container images on Alps][ref-build-containers]. + +!!! warning "Sarus is not available on Eiger.alps" + A key change on the new Eiger deployment is that the Sarus container runtime is replaced with the [container engine][ref-container-engine]. + + If you are using Sarus to run containers on Eiger, you will have to rebuild and adapt your containers for container engine. + +#### Cray Modules + +!!! warning + The Cray Programming Environment (CPE), loaded using `module load cray`, is no longer supported by CSCS. + + CSCS will continue to support and update uenv and container engine, and users are encouraged to update their workflows to use these methods at the first opportunity. + + The CPE is still installed on Eiger, however it will recieve no support or updates, and will be removed completely at a future date. + +## Running jobs on Eiger + +### SLURM + +Eiger uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads. + +There are two [SLURM partitions][ref-slurm-partitions] on the system: + +* the `normal` partition is for all production workloads. +* the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes. +* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal] at CSCS. + +| name | nodes | max nodes per job | time limit | +| -- | -- | -- | -- | +| `normal` | 1266 | - | 24 hours | +| `debug` | 32 | 1 | 30 minutes | +| `xfer` | 2 | 1 | 24 hours | + +* nodes in the `normal` and `debug` partitions are not shared +* nodes in the `xfer` partition can be shared + +See the SLURM documentation for instructions on how to run jobs on the [AMD CPU nodes][ref-slurm-amdcpu]. + +### FirecREST + +!!! 
under-construction "FirecREST is not available yet" + [FirecREST][ref-firecrest] has not been configured on `eiger.alps` - it is still running on the old Eiger. + + **It will be deployed, and this documentation updated when it is.** + +## Maintenance and status + +### Scheduled maintenance + +Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window. + +Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch). + +### Change log + +!!! todo + Feedback on hosting the changelog in the docs here, as opposed to our status page, as the long term solution. + +!!! change "2025-03-05 container engine updated" + now supports better containers that go faster. Users do not to change their workflow to take advantage of these updates. + +??? change "2024-10-07 old event" + this is an old update. Use `???` to automatically fold the update. + +### Known issues + + diff --git a/docs/clusters/santis.md b/docs/clusters/santis.md index b0366f0d..afbc8497 100644 --- a/docs/clusters/santis.md +++ b/docs/clusters/santis.md @@ -76,7 +76,7 @@ It is also possible to use HPC containers on Santis: Santis uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs. -There are two slurm partitions on the system: +There are two [SLURM partitions][ref-slurm-partitions] on the system: * the `normal` partition is for all production workloads. * the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes. 
@@ -93,20 +93,11 @@ There are two slurm partitions on the system: See the SLURM documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200]. -??? example "how to check the number of nodes on the system" - You can check the size of the system by running the following command in the terminal: - ```console - $ sinfo --format "| %20R | %10D | %10s | %10l | %10A |" - | PARTITION | NODES | JOB_SIZE | TIMELIMIT | NODES(A/I) | - | debug | 32 | 1-2 | 30:00 | 3/29 | - | normal | 1266 | 1-infinite | 1-00:00:00 | 812/371 | - | xfer | 2 | 1 | 1-00:00:00 | 1/1 | - ``` - The last column shows the number of nodes that have been allocated in currently running jobs (`A`) and the number of jobs that are idle (`I`). - ### FirecREST -Santis can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v1` API endpoint. +Santis can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v2` API endpoint. + +!!! warning "The FirecREST v1 API is still available, but deprecated" ## Maintenance and status diff --git a/docs/running/slurm.md b/docs/running/slurm.md index 3245fdad..07841c81 100644 --- a/docs/running/slurm.md +++ b/docs/running/slurm.md @@ -16,6 +16,18 @@ At CSCS, SLURM is configured to accommodate the diverse range of node types avai Each type of node has different resource constraints and capabilities, which SLURM takes into account when scheduling jobs. For example, CPU-only nodes may have configurations optimized for multi-threaded CPU workloads, while GPU nodes require additional parameters to allocate GPU resources efficiently. SLURM ensures that user jobs request and receive the appropriate resources while preventing conflicts or inefficient utilization. +!!! example "How to check the partitions and number of nodes therein?" 
+ You can check the size of the system by running the following command in the terminal: + ```console + $ sinfo --format "| %20R | %10D | %10s | %10l | %10A |" + | PARTITION | NODES | JOB_SIZE | TIMELIMIT | NODES(A/I) | + | debug | 32 | 1-2 | 30:00 | 3/29 | + | normal | 1266 | 1-infinite | 1-00:00:00 | 812/371 | + | xfer | 2 | 1 | 1-00:00:00 | 1/1 | + ``` + The last column shows the number of nodes that have been allocated in currently running jobs (`A`) and the number of jobs that are idle (`I`). + + [](){#ref-slurm-partition-debug} ### Debug partition The SLURM `debug` partition is useful for quick turnaround workflows. The partition has a short maximum time (timelimit can be seen with `sinfo -p debug`), and a low number of maximum nodes (the `MaxNodes` can be seen with `scontrol show partition=debug`). From bfd6b0b36568f6498ec48a6b4a7f67568b0bee67 Mon Sep 17 00:00:00 2001 From: twrobinson Date: Tue, 27 May 2025 16:54:44 +0200 Subject: [PATCH 05/14] Update eiger.md minor changes --- docs/clusters/eiger.md | 68 ++++++++++++++++++++---------------------- 1 file changed, 33 insertions(+), 35 deletions(-) diff --git a/docs/clusters/eiger.md b/docs/clusters/eiger.md index 2a21a4e7..7c92c3f3 100644 --- a/docs/clusters/eiger.md +++ b/docs/clusters/eiger.md @@ -5,53 +5,53 @@ Eiger is an Alps cluster that provides compute nodes and file systems designed t !!! under-construction This documentation is for `eiger.alps.cscs.ch` - an updated version of Eiger that will replace the existing `eiger.cscs.ch` cluster. - For help using the existing Eiger, see the [Eiger User Guide](https://confluence.cscs.ch/spaces/KB/pages/284426490/Alps+Eiger+User+Guide) on the old KB documentation site. + For help using the existing Eiger, see the [Eiger User Guide](https://confluence.cscs.ch/spaces/KB/pages/284426490/Alps+Eiger+User+Guide) on the legacy KB documentation site. The target date for full deployment of the new Eiger is **July 1, 2025**. -!!! 
change "Important changes for `eiger.alps`" - The redeployment of `eiger.cscs.ch` as `eiger.alps.cscs.ch` introduces some chanages that may affect some users. +!!! change "Important changes" + The redeployment of `eiger.cscs.ch` as `eiger.alps.cscs.ch` introduces changes that may affect some users. ### Breaking changes - !!! warning "Sarus is replaced with the container engine" - The Sarus container runtime is replaced with the [container engine][ref-container-engine]. + !!! warning "Sarus is replaced with the Container Engine" + The Sarus container runtime is replaced with the [Container Engine][ref-container-engine]. - If you are using Sarus to run containers on Eiger, you will have to [rebuild][ref-build-containers] and adapt your containers for container engine. + If you are using Sarus to run containers on Eiger, you will have to [rebuild][ref-build-containers] and adapt your containers for the Container Engine. !!! warning "Cray modules and EasyBuild are no longer supported" - The Cray Programming Environment (the `cray` module) is no longer supported by CSCS, along with software that CSCS provided using EasyBuild. + The Cray Programming Environment (accessed via the `cray` module) is no longer supported by CSCS, along with software that CSCS provided using EasyBuild. The same version of the Cray modules is still available, along with software that was installed using them, however they will not receive updates or support from CSCS. - You are strongly encouraged to start using [uenv][ref-cluster-eiger-uenv] to access supported applications and rebuild their applications. + You are strongly encouraged to start using [uenv][ref-cluster-eiger-uenv] to access supported applications and to rebuild your own applications. * The versions of compilers, `cray-mpich`, Python and libraries in uenv are up to date. * The scientific application uenv have up to date versions of the supported applications. ### Unimplemented features - !!! 
under-construction "FirecREST is not available yet" + !!! under-construction "FirecREST is not yet available" [FirecREST][ref-firecrest] has not been configured on `eiger.alps` - it is still running on the old Eiger. **It will be deployed, and this documentation updated when it is.** ### Minor changes - !!! change "SLURM was updated from version 23.02.6 to 24.05.4" + !!! change "Slurm is updated from version 23.02.6 to 24.05.4" ## Cluster specification ### Compute nodes !!! under-construction - Currently there are 19 nodes for projects to test and port workflows to the new Eiger deployment. - Nodes will be moved from `eiger.cscs.ch` to `eiger.alps.cscs.ch` at a later date. + During this Early Access phase, there are 19 compute nodes for you to test and port your workflows to the new Eiger deployment. There is one compute node in the `debug` partition and one in the `xfer` partition for internal data transfer. The remaining compute nodes will be moved from `eiger.cscs.ch` to `eiger.alps.cscs.ch` at a later date (provisionally, 1 July 2025). -Eiger consists of 19 [AND Epyc Rome][ref-alps-zen2-node] compute nodes. +Eiger consists of 19 [AMD Epyc Rome][ref-alps-zen2-node] compute nodes. -There is one login node, labelled `eiger-ln010`. -You will be assigned to one of the four login nodes when you ssh onto the system, from where you can edit files, compile applications and start simulation jobs. +There is one login node, `eiger-ln010`. + +[//]: # (You will be assigned to one of the four login nodes when you ssh onto the system, from where you can edit files, compile applications and start simulation jobs.) 
| node type | number of nodes | total CPU sockets | total GPUs | |-----------|-----------------| ----------------- | ---------- | @@ -124,35 +124,36 @@ CSCS and the user community provide [uenv][ref-uenv] software environments on Ei [](){#ref-cluster-eiger-containers} #### Containers -Eiger supports container workloads using the [container engine][ref-container-engine]. +Eiger supports container workloads using the [Container Engine][ref-container-engine]. To build images, see the [guide to building container images on Alps][ref-build-containers]. -!!! warning "Sarus is not available on Eiger.alps" - A key change on the new Eiger deployment is that the Sarus container runtime is replaced with the [container engine][ref-container-engine]. +!!! warning "Sarus is not available" + A key change with the new Eiger deployment is that the Sarus container runtime is replaced with the [Container Engine][ref-container-engine]. - If you are using Sarus to run containers on Eiger, you will have to rebuild and adapt your containers for container engine. + If you are using Sarus to run containers on Eiger, you will have to rebuild and adapt your containers for the Container Engine. #### Cray Modules !!! warning The Cray Programming Environment (CPE), loaded using `module load cray`, is no longer supported by CSCS. - CSCS will continue to support and update uenv and container engine, and users are encouraged to update their workflows to use these methods at the first opportunity. + CSCS will continue to support and update uenv and the Container Engine, and users are encouraged to update their workflows to use these methods at the first opportunity. - The CPE is still installed on Eiger, however it will recieve no support or updates, and will be removed completely at a future date. + The CPE is deprecated and will be removed completely at a future date. 
## Running jobs on Eiger -### SLURM +### Slurm -Eiger uses [SLURM][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads. +Eiger uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor workloads on compute nodes. -There are two [SLURM partitions][ref-slurm-partitions] on the system: +There are four [Slurm partitions][ref-slurm-partitions] on the system: * the `normal` partition is for all production workloads. * the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes. -* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal] at CSCS. +* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal]. +* the `low` partition is a low-priority partition, enabled for specific projects at specific times. | name | nodes | max nodes per job | time limit | | -- | -- | -- | -- | @@ -163,11 +164,11 @@ There are two [SLURM partitions][ref-slurm-partitions] on the system: * nodes in the `normal` and `debug` partitions are not shared * nodes in the `xfer` partition can be shared -See the SLURM documentation for instructions on how to run jobs on the [AMD CPU nodes][ref-slurm-amdcpu]. +See the Slurm documentation for instructions on how to run jobs on the [AMD CPU nodes][ref-slurm-amdcpu]. ### FirecREST -!!! under-construction "FirecREST is not available yet" +!!! under-construction "FirecREST is not yet available" [FirecREST][ref-firecrest] has not been configured on `eiger.alps` - it is still running on the old Eiger. **It will be deployed, and this documentation updated when it is.** @@ -176,20 +177,17 @@ See the SLURM documentation for instructions on how to run jobs on the [AMD CPU ### Scheduled maintenance -Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. 
If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window. +Wednesday mornings 8:00-12:00 CET are reserved for periodic updates, with services potentially unavailable during this time frame. If the batch queues must be drained (for redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window. Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch). ### Change log -!!! todo - Feedback on hosting the changelog in the docs here, as opposed to our status page, as the long term solution. - -!!! change "2025-03-05 container engine updated" - now supports better containers that go faster. Users do not to change their workflow to take advantage of these updates. +!!! change "2025-06-02 Early access phase" + Early access phase is open -??? change "2024-10-07 old event" - this is an old update. Use `???` to automatically fold the update. +??? change "2025-23-05 Creation of Eiger on Alps" + Eiger is deployed as a vServices-enabled cluster ### Known issues From 24229ba1250ea248bc0c8486a317b83d08df0ae1 Mon Sep 17 00:00:00 2001 From: bcumming Date: Tue, 27 May 2025 16:56:34 +0200 Subject: [PATCH 06/14] add daint first draft --- docs/clusters/daint.md | 188 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 188 insertions(+) diff --git a/docs/clusters/daint.md b/docs/clusters/daint.md index b61ab7a5..42d2a22f 100644 --- a/docs/clusters/daint.md +++ b/docs/clusters/daint.md @@ -1,2 +1,190 @@ [](){#ref-cluster-daint} # Daint +Daint is the main [HPC Platform][ref-platform-hpcp] cluster that provides compute nodes and file systems for GPU-enabled workloads.
+ +## Cluster specification + +### Compute nodes + +Daint consists of around 793 [Grace-Hopper nodes][ref-alps-gh200-node]. + +The number of nodes can change when nodes are added or removed from other clusters on Alps. + +There are four login nodes, labelled `daint-ln00[1-4]`. +You will be assigned to one of the four login nodes when you ssh onto the system, from where you can edit files, compile applications and start simulation jobs. + +| node type | number of nodes | total CPU sockets | total GPUs | +|-----------|-----------------| ----------------- | ---------- | +| [gh200][ref-alps-gh200-node] | 1,022 | 4,088 | 4,088 | + +### Storage and file systems + +Daint uses the [HPCP filesystems and storage policies][ref-hpcp-storage]. + +## Getting started + +### Logging into Daint + +To connect to Daint via SSH, first refer to the [ssh guide][ref-ssh]. + +!!! example "`~/.ssh/config`" + Add the following to your [SSH configuration][ref-ssh-config] to enable you to directly connect to Daint using `ssh daint`. + ``` + Host daint + HostName daint.alps.cscs.ch + ProxyJump ela + User cscsusername + IdentityFile ~/.ssh/cscs-key + IdentitiesOnly yes + ``` + +### Software + +[](){#ref-cluster-daint-uenv} +#### uenv + +Daint provides uenv to deliver programming environments and application software. +Please refer to the [uenv documentation][ref-uenv] for detailed information on how to use the uenv tools on the system. + +
+ +- :fontawesome-solid-layer-group: __Scientific Applications__ + + Provide the latest versions of scientific applications, tuned for Daint, and the tools required to build your own version of the applications. + + * [CP2K][ref-uenv-cp2k] + * [GROMACS][ref-uenv-gromacs] + * [LAMMPS][ref-uenv-lammps] + * [NAMD][ref-uenv-namd] + * [QuantumESPRESSO][ref-uenv-quantumespresso] + * [VASP][ref-uenv-vasp] +
+ +
+ +- :fontawesome-solid-layer-group: __Programming Environments__ + + Provide compilers, MPI, Python, common libraries and tools used to build your own applications. + + * [prgenv-gnu][ref-uenv-prgenv-gnu] + * [prgenv-nvfortran][ref-uenv-prgenv-nvfortran] + * [linalg][ref-uenv-linalg] + * [julia][ref-uenv-julia] +
+ +
+ +- :fontawesome-solid-layer-group: __Tools__ + + Provide tools such as + + * [Linaro Forge][ref-uenv-linaro] +
+ +[](){#ref-cluster-eiger-containers} +#### Containers + +Daint supports container workloads using the [container engine][ref-container-engine]. + +To build images, see the [guide to building container images on Alps][ref-build-containers]. + +#### Cray Modules + +!!! warning + The Cray Programming Environment (CPE), loaded using `module load cray`, is no longer supported by CSCS. + + CSCS will continue to support and update uenv and container engine, and users are encouraged to update their workflows to use these methods at the first opportunity. + + The CPE is still installed on Daint, however it will recieve no support or updates, and will be replaced with a container in a future update. + +## Running jobs on Daint + +### Slurm + +Santis uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs. + +There are two [Slurm partitions][ref-slurm-partitions] on the system: + +* the `normal` partition is for all production workloads. +* the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes. +* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal] at CSCS. + +!!! todo "Timmy: add definition of the low partition" + +| name | nodes | max nodes per job | time limit | +| -- | -- | -- | -- | +| `normal` | 994 | - | 24 hours | +| `low` | 994 | 2 | 24 hours | +| `debug` | 24 | 2 | 30 minutes | +| `xfer` | 2 | 1 | 24 hours | + +* nodes in the `normal` and `debug` partitions are not shared +* nodes in the `xfer` partition can be shared + +See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200]. + +### FirecREST + +Daint can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v2` API endpoint. + +!!! warning "The FirecREST v1 API is still available, but deprecated" + +## Maintenance and status + +### Scheduled maintenance + +!!! 
todo "move this to HPCP top level docs" + Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window. + + Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch). + +### Change log + +!!! change "2025-05-21" + Minor enhancements to system configuration have been applied. + These changes should reduce the frequency of compute nodes being marked as `NOT_RESPONDING` by the workload manager, while we continue to investigate the issue + +!!! change "2025-05-14" + ??? note "Performance hotfix" + The [access-counter-based memory migration feature](https://developer.nvidia.com/blog/cuda-toolkit-12-4-enhances-support-for-nvidia-grace-hopper-and-confidential-computing/#access-counter-based_migration_for_nvidia_grace_hopper_memory) in the NVIDIA driver for Grace Hopper is disabled to address performance issues affecting NCCL-based workloads (e.g. LLM training) + + ??? note "NVIDIA boost slider" + Added an option to enable the NVIDIA boost slider (vboost) via Slurm using the `-C nvidia_vboost_enabled` flag. + This feature, disabled by default, may increase GPU frequency and performance while staying within the power budget + + ??? note "Enroot update" + The container runtime is upgraded from version 2.12.0 to 2.13.0. This update includes libfabric version 1.22.0 (previously 1.15.2.0), which has demonstrated improved performance during LLM checkpointing + +!!! change "2025-04-30" + ??? 
note "uenv is updated from v7.0.1 to v8.1.0" + * improved uenv view management + * automatic generation of default uenv repository the first time uenv is called + * configuration files + * bash completion + * relative paths can be used for referring to squashfs images + * support for `SLURM_UENV` and `SLURM_UENV_VIEW` environment variables (useful for using inside CI/CD pipelines) + * better error messages and small bug fixes + + ??? note "Pyxis is upgraded from v24.5.0 to v24.5.3" + * Added image caching for Enroot + * Added support for environment variable expansion in EDFs + * Added support for relative paths expansion in EDFs + * Print a message about the experimental status of the --environment option when used outside of the srun command + * Merged small features and bug fixes from upstream Pyxis releases v0.16.0 to v0.20.0 + * Internal changes: various bug fixes and refactoring + +??? change "2025-03-12" + 1. The number of compute nodes has been increased to 1018 + 1. The restriction on the number of running jobs per project has been lifted. + 1. A "low" priority partition has been added, which allows some project types to consume up to 130% of the project's quarterly allocation + 1. We have increased the power cap for the GH module from 624 to 660 W. You might see increased application performance as a consequence + 1. Small changes in kernel tuning parameters + +### Known issues + +!!! todo + Most of these issues (see original [KB docs](https://confluence.cscs.ch/spaces/KB/pages/868811400/Daint.Alps#Daint.Alps-Knownissues)) should be consolidated in a location where they can be linked to by all clusters. + + We have some "known issues" documented under [communication libraries][ref-communication-cray-mpich], however these might be a bit too dispersed for centralised linking.
From 7c0558190d43b51eec482c75e8d88c4d2f288747 Mon Sep 17 00:00:00 2001 From: twrobinson Date: Tue, 27 May 2025 17:02:11 +0200 Subject: [PATCH 07/14] Update eiger.md partitions --- docs/clusters/eiger.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/clusters/eiger.md b/docs/clusters/eiger.md index 7c92c3f3..3cfab086 100644 --- a/docs/clusters/eiger.md +++ b/docs/clusters/eiger.md @@ -153,13 +153,14 @@ There are four [Slurm partitions][ref-slurm-partitions] on the system: * the `normal` partition is for all production workloads. * the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes. * the `xfer` partition is for [internal data transfer][ref-data-xfer-internal]. -* the `low` partition is a low-priority partition, enabled for specific projects at specific times. +* the `low` partition is a low-priority partition, which may be enabled for specific projects at specific times. | name | nodes | max nodes per job | time limit | | -- | -- | -- | -- | | `normal` | 1266 | - | 24 hours | | `debug` | 32 | 1 | 30 minutes | | `xfer` | 2 | 1 | 24 hours | +| `low` | 1266 | - | 24 hours | * nodes in the `normal` and `debug` partitions are not shared * nodes in the `xfer` partition can be shared From 8bedc5443169ff7bb32aad33c44c43296000e339 Mon Sep 17 00:00:00 2001 From: bcumming Date: Tue, 27 May 2025 17:02:37 +0200 Subject: [PATCH 08/14] upper case for cluster names in index --- mkdocs.yml | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mkdocs.yml b/mkdocs.yml index 04503088..61a9262e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -23,15 +23,15 @@ nav: - 'Storage': alps/storage.md - 'Machine Learning Platform': - platforms/mlp/index.md - - 'clariden': clusters/clariden.md - - 'bristen': clusters/bristen.md + - 'Clariden': clusters/clariden.md + - 'Bristen': clusters/bristen.md - 'HPC Platform': - platforms/hpcp/index.md - - 'daint': clusters/daint.md - - 'eiger': 
clusters/eiger.md - 'Climate and Weather Platform': - platforms/cwp/index.md - - 'santis': clusters/santis.md + - 'Santis': clusters/santis.md From ade1a7d3ea73eda5e465578dec275b846b64d629 Mon Sep 17 00:00:00 2001 From: twrobinson Date: Tue, 27 May 2025 17:04:56 +0200 Subject: [PATCH 09/14] Update eiger.md date format --- docs/clusters/eiger.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/clusters/eiger.md b/docs/clusters/eiger.md index 3cfab086..ce5ad357 100644 --- a/docs/clusters/eiger.md +++ b/docs/clusters/eiger.md @@ -187,7 +187,7 @@ Exceptional and non-disruptive updates may happen outside this time frame and wi !!! change "2025-06-02 Early access phase" Early access phase is open -??? change "2025-23-05 Creation of Eiger on Alps" +??? change "2025-05-23 Creation of Eiger on Alps" Eiger is deployed as a vServices-enabled cluster ### Known issues From 187ed202fe56cc581c9ad1a2beeaf84c4704b546 Mon Sep 17 00:00:00 2001 From: twrobinson Date: Tue, 27 May 2025 17:20:47 +0200 Subject: [PATCH 10/14] Update daint.md minor updates --- docs/clusters/daint.md | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/docs/clusters/daint.md b/docs/clusters/daint.md index 42d2a22f..eab3ce43 100644 --- a/docs/clusters/daint.md +++ b/docs/clusters/daint.md @@ -7,12 +7,12 @@ Daint is the main [HPC Platform][ref-platform-hpcp] cluster that provides comput ### Compute nodes -Daint consists of around 793 [Grace-Hopper nodes][ref-alps-gh200-node]. +Daint consists of around 800-1000 [Grace-Hopper nodes][ref-alps-gh200-node]. -The number of nodes can change when nodes are added or removed from other clusters on Alps. +The number of nodes can vary as nodes are added or removed from other clusters on Alps.
-There are four login nodes, labelled `daint-ln00[1-4]`. -You will be assigned to one of the four login nodes when you ssh onto the system, from where you can edit files, compile applications and start simulation jobs. +There are four login nodes, `daint-ln00[1-4]`. +You will be assigned to one of the four login nodes when you ssh onto the system, from where you can edit files, compile applications and launch batch jobs. | node type | number of nodes | total CPU sockets | total GPUs | |-----------|-----------------| ----------------- | ---------- | @@ -51,7 +51,7 @@ Please refer to the [uenv documentation][ref-uenv] for detailed information on h - :fontawesome-solid-layer-group: __Scientific Applications__ - Provide the latest versions of scientific applications, tuned for Daint, and the tools required to build your own version of the applications. + Provide the latest versions of scientific applications, tuned for Daint, and the tools required to build your own versions of the applications. * [CP2K][ref-uenv-cp2k] * [GROMACS][ref-uenv-gromacs] @@ -97,30 +97,31 @@ To build images, see the [guide to building container images on Alps][ref-build- CSCS will continue to support and update uenv and container engine, and users are encouraged to update their workflows to use these methods at the first opportunity. - The CPE is still installed on Daint, however it will recieve no support or updates, and will be replaced with a container in a future update. + The CPE is still installed on Daint, however it will receive no support or updates, and will be replaced with a container in a future update. ## Running jobs on Daint ### Slurm -Santis uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs. +Daint uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor compute-intensive workloads. 
-There are two [Slurm partitions][ref-slurm-partitions] on the system: +There are four [Slurm partitions][ref-slurm-partitions] on the system: * the `normal` partition is for all production workloads. * the `debug` partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes. -* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal] at CSCS. +* the `xfer` partition is for [internal data transfer][ref-data-xfer-internal]. +* the `low` partition is a low-priority partition, which may be enabled for specific projects at specific times. + -!!! todo "Timmy: add definition of the low partition" | name | nodes | max nodes per job | time limit | | -- | -- | -- | -- | -| `normal` | 994 | - | 24 hours | -| `low` | 994 | 2 | 24 hours | +| `normal` | unlim | - | 24 hours | | `debug` | 24 | 2 | 30 minutes | | `xfer` | 2 | 1 | 24 hours | +| `low` | unlim | - | 24 hours | -* nodes in the `normal` and `debug` partitions are not shared +* nodes in the `normal`, `debug`, and `low` partitions are not shared * nodes in the `xfer` partition can be shared See the Slurm documentation for instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200]. ### FirecREST Daint can also be accessed using [FirecREST][ref-firecrest] at the `https://api.cscs.ch/ml/firecrest/v2` API endpoint. !!! warning "The FirecREST v1 API is still available, but deprecated" ## Maintenance and status ### Scheduled maintenance !!! todo "move this to HPCP top level docs" - Wednesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window. + Wednesday mornings 8:00-12:00 CET are reserved for periodic updates, with services potentially unavailable during this time frame.
If the batch queues must be drained (for redeployment of node images, rebooting of compute nodes, etc) then a Slurm reservation will be in place that will prevent jobs from running into the maintenance window. Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the [CSCS status page](https://status.cscs.ch). From 0e8cb9f4aca1d1b160a0549dd135bf2ab8b6c636 Mon Sep 17 00:00:00 2001 From: twrobinson Date: Tue, 27 May 2025 17:33:32 +0200 Subject: [PATCH 11/14] Update daint.md --- docs/clusters/daint.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/clusters/daint.md b/docs/clusters/daint.md index eab3ce43..fce57233 100644 --- a/docs/clusters/daint.md +++ b/docs/clusters/daint.md @@ -83,7 +83,7 @@ Please refer to the [uenv documentation][ref-uenv] for detailed information on h * [Linaro Forge][ref-uenv-linaro] -[](){#ref-cluster-eiger-containers} +[](){#ref-cluster-daint-containers} #### Containers Daint supports container workloads using the [container engine][ref-container-engine]. 
From 6235253634dd033b5d263ad14db1e47c73f72b60 Mon Sep 17 00:00:00 2001 From: twrobinson Date: Tue, 27 May 2025 17:34:03 +0200 Subject: [PATCH 12/14] Update eiger.md --- docs/clusters/eiger.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/clusters/eiger.md b/docs/clusters/eiger.md index ce5ad357..08ce29ab 100644 --- a/docs/clusters/eiger.md +++ b/docs/clusters/eiger.md @@ -157,10 +157,10 @@ There are four [Slurm partitions][ref-slurm-partitions] on the system: | name | nodes | max nodes per job | time limit | | -- | -- | -- | -- | -| `normal` | 1266 | - | 24 hours | +| `normal` | unlim | - | 24 hours | | `debug` | 32 | 1 | 30 minutes | | `xfer` | 2 | 1 | 24 hours | -| `low` | 1266 | - | 24 hours | +| `low` | unlim | - | 24 hours | * nodes in the `normal` and `debug` partitions are not shared * nodes in the `xfer` partition can be shared From 0350794baa57d6a172dee61b0785f9793df94f9b Mon Sep 17 00:00:00 2001 From: twrobinson Date: Tue, 27 May 2025 17:45:49 +0200 Subject: [PATCH 13/14] Update eiger.md --- docs/clusters/eiger.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/clusters/eiger.md b/docs/clusters/eiger.md index 08ce29ab..58eab865 100644 --- a/docs/clusters/eiger.md +++ b/docs/clusters/eiger.md @@ -51,7 +51,7 @@ Eiger consists of 19 [AMD Epyc Rome][ref-alps-zen2-node] compute nodes. There is one login node, `eiger-ln010`. -[//]: # (You will be assigned to one of the four login nodes when you ssh onto the system, from where you can edit files, compile applications and start simulation jobs.) +[//]: # (TODO: You will be assigned to one of the four login nodes when you ssh onto the system, from where you can edit files, compile applications and start simulation jobs.) 
| node type | number of nodes | total CPU sockets | total GPUs | |-----------|-----------------| ----------------- | ---------- | From 2e40c7a63a521dea01ff3f83aed979a92f6220dc Mon Sep 17 00:00:00 2001 From: bcumming Date: Wed, 28 May 2025 17:20:38 +0200 Subject: [PATCH 14/14] remove todo for HPCP from the platforms landing page --- docs/alps/platforms.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/alps/platforms.md b/docs/alps/platforms.md index e62ce969..c2c7611c 100644 --- a/docs/alps/platforms.md +++ b/docs/alps/platforms.md @@ -7,17 +7,17 @@ A platform can consist of one or multiple [clusters][ref-alps-clusters], and its
-- :fontawesome-solid-mountain: __Machine Learning Platform__ +- :fontawesome-solid-mountain: __HPC Platform__ - The Machine Learning Platform (MLP) hosts ML and AI researchers. + The HPC Platform (HPCP) provides services for the HPC community in Switzerland and abroad. The majority of compute cycles are provided to the [User Lab](https://www.cscs.ch/user-lab/overview) via peer-reviewed allocation schemes. - [:octicons-arrow-right-24: MLP][ref-platform-mlp] + [:octicons-arrow-right-24: HPCP][ref-platform-hpcp] -- :fontawesome-solid-mountain: __HPC Platform__ +- :fontawesome-solid-mountain: __Machine Learning Platform__ - !!! todo + The Machine Learning Platform (MLP) hosts ML and AI researchers, particularly the SwissAI initiative. - [:octicons-arrow-right-24: HPCP][ref-platform-hpcp] + [:octicons-arrow-right-24: MLP][ref-platform-mlp] - :fontawesome-solid-mountain: __Climate and Weather Platform__