diff --git a/docs/access/ump.md b/docs/access/ump.md index 8b2b97b6..403cffbc 100644 --- a/docs/access/ump.md +++ b/docs/access/ump.md @@ -1,6 +1,50 @@ -# User Management Portal +[](){#ump} +# Account and Resources Management Tool -!!! todo - copy over docs from [confluence](https://confluence.cscs.ch/display/KB/Account+and+Resources+Management+Tool) +The Swiss National Supercomputing Centre (CSCS) offers a web-based tool for users to manage their accounts and projects at [account.cscs.ch](https://account.cscs.ch). - 15 minute job +With this tool, users can: + +- Access their profile, manage institutional details, or reset their password. +- List the projects they belong to, including closed ones. +- Check the details of each project, including quotas and current utilization. +- Get an overview of where their files are stored at CSCS (including home directories, scratch, etc.). + +For group leaders (or PIs), the tool allows: + +- Managing user membership and access control. +- Inviting users to their projects via email. Existing users can accept immediately, while new users will receive instructions to create an account and join the project. +- Removing users from their projects. +- Selecting which users can access a system (and submit jobs) and which ones can only access project data. +- Defining one or more deputies to perform such tasks. + **Note:** Responsibility for what happens within the project remains with the group leader or PI. + +A short guideline on how to perform these tasks is provided below. + +## Usage + +The tool is designed to be intuitive and comprises the following main areas: + +- **A) Account selector**: For users with multiple accounts (e.g., service accounts). +- **B) Profile management**: To view and edit the account's institutional details and change the password. +- **C) Project membership**: To show the selected project in detail. +- **D) Storage**: Where users can see where they have stored their files (home, scratch, and project areas).
+- **E) Main view** + +![Screenshot](../images/access/ump.png) + +### Membership Management (for Group Leaders and Deputies Only) + +To invite users to a selected project, group leaders or their deputies need to: + +1. Select the project in the left menu. +2. Click the "Members" tab. +3. Scroll down to the "Users" (or "Deputies" to manage deputies) section. +4. Use the "+" (plus) button on the right of the section and enter the given and family names and email address of the invitee. + The invitee will receive instructions on how to join the project. The group leader will be notified whether the invitee has accepted or rejected the invitation. + If the invitee does not have an account, they will also receive instructions on how to create one, which needs to be verified by CSCS administration staff. + +To remove users from a selected project, group leaders or their deputies need to: + +1. Repeat steps 1 to 3 above. +2. Use the icon with three horizontal lines to the right of the user and select "Remove user". diff --git a/docs/alps/hardware.md b/docs/alps/hardware.md index ec2d431d..df6d8b77 100644 --- a/docs/alps/hardware.md +++ b/docs/alps/hardware.md @@ -1,3 +1,4 @@ +[](){#alps-hardware} # Alps Hardware Alps is a HPE Cray EX3000 system, a liquid cooled blade-based, high-density system. @@ -20,7 +21,19 @@ This approach to cooling provides greater efficiency for the rack-level cooling, * Maximum of 64 quad-blade compute blades * Maximum of 64 Slingshot switch blades -## Alps Blades +## Alps High Speed Network + +!!! todo + information about the network. + + * Details about Slingshot 11. + * how many NICs per node + * raw feeds and speeds + * Some OSU benchmark results.
+ * GPU-aware communication + * **Slingshot is not InfiniBand; there is no NVSwitch** + +## Alps Nodes Alps was installed in phases, starting with the installation of 1024 AMD Rome dual socket CPU nodes in 2020, through to the main installation of 2,688 Grace-Hopper nodes in 2024. @@ -34,26 +47,31 @@ There are currently four node types in Alps, with another becoming available in | AMD MI250x | 12 | 24 | 24 | 96 | | AMD MI300A | 64 | 128 | 512 | 512 | +[](){#gh200-node} ### NVIDIA GH200 GPU Nodes -[](){#gh200-hardware-description} Perry Peak +[](){#zen2-node} ### AMD Rome CPU Nodes EX425 +[](){#a100-node} ### NVIDIA A100 GPU Nodes Grizzly Peak +[](){#mi200-node} ### AMD MI250x GPU Nodes Bard Peak +[](){#mi300-node} ### AMD MI300A GPU Nodes Parry Peak !!! info "coming soon" H1 2025 + diff --git a/docs/alps/index.md b/docs/alps/index.md index 30a0262c..13049a31 100644 --- a/docs/alps/index.md +++ b/docs/alps/index.md @@ -1,22 +1,32 @@ # Alps Infrastructure -Alps is a general-purpose compute and data Research Infrastructure (RI) open to the broad community of researchers in Switzerland and the rest of the world. Alps will provide a high impact, challenging and innovative RI that will allow Switzerland to advance science and impact society. +Alps is a general-purpose compute and data Research Infrastructure (RI) open to the broad community of researchers in Switzerland and the rest of the world. +Alps provides a high-impact, challenging and innovative RI that allows Switzerland to advance science and impact society. -Alps enables the creation of versatile clusters (vClusters) that can be tailored to the specific needs of users while maintaining confidentiality. For example, a vCluster will be dedicated to MeteoSwiss’ numerical weather forecasts, another one to the User Lab and another one to Machine Learning and Artificial Intelligence.
+Alps enables the creation of versatile clusters (vClusters) that can be tailored to the specific needs of users while maintaining confidentiality. +For example, one vCluster is dedicated to MeteoSwiss’ numerical weather forecasts, another to the User Lab, and another to Machine Learning and Artificial Intelligence. + +A key feature of Alps is multi-tenancy: tenants are organizations, typically research institutions, that deploy, operate, or manage their own platforms on the Alps infrastructure. +Tenants have privileged access to resource nodes, enabling them to deploy their own services and resource configurations. +Additionally, network segregation ensures secure and isolated communication, with the option to connect to the tenant's private network.
-- :fontawesome-solid-signs-post: __Hardware__ +- :fontawesome-solid-signs-post: __Platforms__ - Learn about the node types and networking infrastructure in Alps. + [:octicons-arrow-right-24: Alps Platforms][platforms] - [:octicons-arrow-right-24: Alps Hardware](hardware.md) +- :fontawesome-solid-signs-post: __Clusters__ -- :fontawesome-solid-signs-post: __Network__ + The resources on Alps are partitioned and configured into versatile software defined clusters (vClusters). - Learn about the Slingshot 11 network on Alps. + [:octicons-arrow-right-24: Alps vClusters][clusters] - [:octicons-arrow-right-24: Alps Network](network.md) +- :fontawesome-solid-signs-post: __Hardware__ + + Learn about the node types and networking infrastructure in Alps. + + [:octicons-arrow-right-24: Alps Hardware](hardware.md) - :fontawesome-solid-signs-post: __Storage__ @@ -24,16 +34,4 @@ Alps enables the creation of versatile clusters (vClusters) that can be tailored [:octicons-arrow-right-24: Alps Storage](storage.md) -- :fontawesome-solid-signs-post: __vClusters__ - - The resources on Alps are partitioned and configured into versatile software defined clusters (vClusters). - - [:octicons-arrow-right-24: Alps vClusters](vclusters.md) - -- :fontawesome-solid-signs-post: __Tenants__ - - Alps is a multi-tenant system. - - [:octicons-arrow-right-24: Alps Tenants](tenants.md) -
diff --git a/docs/alps/platforms.md b/docs/alps/platforms.md new file mode 100644 index 00000000..e0a9337a --- /dev/null +++ b/docs/alps/platforms.md @@ -0,0 +1,28 @@ +[](){#platforms} +# Platforms on Alps + +A platform represents a set of scientific services along with compute and data resources hosted on the Alps research infrastructure, provided to a specific scientific community. +Each platform addresses particular research needs and domains, such as climate and weather modeling, machine learning, or high-performance computing applications. +A platform can consist of one or multiple [clusters][clusters], and its services can be managed either by CSCS or by the scientific community itself, including access control, usage policies, and support. + +
+ +- :fontawesome-solid-mountain: __Machine Learning Platform__ + + The Machine Learning Platform (MLp) hosts ML and AI researchers. + + [:octicons-arrow-right-24: MLp][mlp] + +- :fontawesome-solid-mountain: __HPC Platform__ + + !!! todo + + [:octicons-arrow-right-24: HPCp][hpcp] + +- :fontawesome-solid-mountain: __Climate and Weather Platform__ + + !!! todo + + [:octicons-arrow-right-24: CWp][cwp] + +
diff --git a/docs/alps/tenants.md b/docs/alps/tenants.md deleted file mode 100644 index c81b4044..00000000 --- a/docs/alps/tenants.md +++ /dev/null @@ -1,7 +0,0 @@ -# Alps Tenants - -!!! todo - This page answeres the question "what is a tenant" - - * why/how is Alps multi tenant - * who are the tenants diff --git a/docs/alps/vclusters.md b/docs/alps/vclusters.md index 6b4d920b..09418659 100644 --- a/docs/alps/vclusters.md +++ b/docs/alps/vclusters.md @@ -1,10 +1,34 @@ -# Alps vClusters +[](){#clusters} +# Alps Clusters -!!! todo - this page answers the question "what is a vCluster"? +A vCluster (versatile software-defined cluster) is a logical partition of the supercomputing resources where platform services are deployed. It serves as a dedicated environment supporting a specific platform. The composition of resources and services for each vCluster is defined in a configuration file used by an automated pipeline for deployment. Once deployed by CSCS, the vCluster becomes immutable. - * What is a vCluster? - * Examples of vClusters +## Clusters on Alps + +Clusters on Alps are provided as part of different [platforms][platforms]. + +
+- :fontawesome-solid-mountain: __Machine Learning Platform__ + + Clariden is the main Grace-Hopper cluster. + + [:octicons-arrow-right-24: Clariden][clariden] + + Bristen is a small system with A100 nodes, used for **todo** + + [:octicons-arrow-right-24: Bristen][bristen]
+ +
+- :fontawesome-solid-mountain: __HPC Platform__ { .col-span-12 } + + !!! todo +
+ +
+- :fontawesome-solid-mountain: __Climate and Weather Platform__ + + !!! todo +
- We don't document individual vClusters here - these are documented under each platform. diff --git a/docs/images/access/ump.png b/docs/images/access/ump.png new file mode 100644 index 00000000..7fd5ab0f Binary files /dev/null and b/docs/images/access/ump.png differ diff --git a/docs/index.md b/docs/index.md index e724dbec..a3cb328d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -8,7 +8,7 @@ Welcome to the techincal documentation for Alps. Once you have a project at CSCS, start here to find your platform: - [:octicons-arrow-right-24: Platforms overview](platforms/index.md) + [:octicons-arrow-right-24: Platforms overview][platforms] Go straight to the documentation for the platform that hosts your project: diff --git a/docs/platforms/index.md b/docs/platforms/index.md deleted file mode 100644 index 2573e24b..00000000 --- a/docs/platforms/index.md +++ /dev/null @@ -1,26 +0,0 @@ -# Platforms on Alps - -!!! todo - A high level paragraph that describes what platforms are - -
- -- :fontawesome-solid-mountain: __Machine Learning Platform__ - - The Machine Learning Platform (MLp) hosts ML and AI researchers - - [:octicons-arrow-right-24: MLp][mlp] - -- :fontawesome-solid-mountain: __HPC Platform__ - - !!! todo - - [:octicons-arrow-right-24: HPCp][hpcp] - -- :fontawesome-solid-mountain: __Climate and Weather Platform__ - - !!! todo - - [:octicons-arrow-right-24: CWp][cwp] - -
diff --git a/docs/platforms/mlp/index.md b/docs/platforms/mlp/index.md index f2174f76..47217876 100644 --- a/docs/platforms/mlp/index.md +++ b/docs/platforms/mlp/index.md @@ -21,8 +21,17 @@ Once invited to a project, you will receive an email, which you can need to crea The main cluster provided by the MLp is Clariden, a large Grace-Hopper GPU system on Alps. -!!! todo - introduction paragraph and cards that link to Clariden and Bristen +
+- :fontawesome-solid-mountain: [__Clariden__][clariden] + + Clariden is the main [Grace-Hopper][gh200-node] cluster used for **todo** +
+ +
+- :fontawesome-solid-mountain: [__Bristen__][bristen] + + Bristen is a smaller system with [A100 GPU nodes][a100-node] for **todo** +
## Guides and Tutorials diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md index acd82b5f..757cfe8c 100644 --- a/docs/tools/slurm.md +++ b/docs/tools/slurm.md @@ -29,7 +29,7 @@ The following sections will provide detailed guidance on how to use SLURM to req [using slurm on Grace-Hopper][gh200-slurm] ``` -Link to the [Grace-Hopper overview][gh200-hardware-description]. +Link to the [Grace-Hopper overview][gh200-node]. An example of using tabs to show srun and sbatch useage to get one GPU per MPI rank: diff --git a/docs/vclusters/bristen.md b/docs/vclusters/bristen.md new file mode 100644 index 00000000..3a7f2e36 --- /dev/null +++ b/docs/vclusters/bristen.md @@ -0,0 +1,6 @@ +[](){#bristen} +# Bristen + +!!! todo + use the [Clariden][clariden] page as a template. + diff --git a/mkdocs.yml b/mkdocs.yml index c77e9611..3b659487 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -15,24 +15,6 @@ plugins: - autorefs nav: - Welcome: index.md - 'Alps': - alps/index.md - 'Hardware': alps/hardware.md - 'Network': alps/network.md - 'Storage': alps/storage.md - 'vClusters': alps/vclusters.md - 'Tenants': alps/tenants.md - 'Platforms': - platforms/index.md - 'HPC Platform': - platforms/hpcp/index.md - 'Machine Learning Platform': - platforms/mlp/index.md - # we could move all vcluster descriptions to a vcluster/name.md - # then link them into the respective platform - 'clariden': vclusters/clariden.md - 'Climate and Weather Platform': - platforms/cwp/index.md - 'Access': - access/index.md - 'MFA': @@ -40,6 +22,20 @@ nav: - 'using windows': access/mfa/windows.md - 'UMP': access/ump.md - 'Waldur': access/waldur.md + - 'Alps': + - alps/index.md + - 'Platforms': alps/platforms.md + - 'Clusters': alps/vclusters.md + - 'Hardware': alps/hardware.md + - 'Storage': alps/storage.md + - 'Machine Learning Platform': + - platforms/mlp/index.md + - 'clariden': vclusters/clariden.md + - 'bristen': vclusters/bristen.md + - 'HPC Platform': + - platforms/hpcp/index.md + -
'Climate and Weather Platform': + - platforms/cwp/index.md - 'Tools': - tools/index.md - 'slurm': tools/slurm.md