diff --git a/docs/alps/hardware.md b/docs/alps/hardware.md
index 27f4d3e6..9c87213f 100644
--- a/docs/alps/hardware.md
+++ b/docs/alps/hardware.md
@@ -21,6 +21,7 @@ This approach to cooling provides greater efficiency for the rack-level cooling,
 * Maximum of 64 quad-blade compute blades
 * Maximum of 64 Slingshot switch blades
 
+[](){#ref-alps-slingshot-network}
 ## Alps High Speed Network
 
 !!! todo
diff --git a/docs/alps/storage.md b/docs/alps/storage.md
index a1b9448f..e8ec560a 100644
--- a/docs/alps/storage.md
+++ b/docs/alps/storage.md
@@ -16,16 +16,19 @@ These separate clusters are on the same Slingshot 11 network as the Alps.
 | IOPs | 1.5M | 8.6M read, 24M write | 200k read, 768k write |
 | file create/s| 374k | 214k | 97k |
 
+[](){#ref-storage-capstor}
 ## capstor
 
 Capstor is the largest file system, for storing large amounts of input and output data.
 It is used to provide SCRATCH and STORE for different clusters - the precise details are platform-specific.
 
+[](){#ref-storage-iopstor}
 ## iopstor
 
 !!! todo
     small text explaining what iopstor is designed to be used for.
 
+[](){#ref-storage-vast}
 ## vast
 
 The Vast storage is smaller capacity system that is designed for use as home folders.
diff --git a/docs/tools/container-engine.md b/docs/tools/container-engine.md
index ac5c1432..49b74b63 100644
--- a/docs/tools/container-engine.md
+++ b/docs/tools/container-engine.md
@@ -1,4 +1,4 @@
-[](){#container-engine}
+[](){#ref-container-engine}
 # Container Engine
 
 The Container Engine (CE) toolset is designed to enable computing jobs to seamlessly run inside Linux application containers, thus providing support for containerized user environments.
diff --git a/docs/tools/slurm.md b/docs/tools/slurm.md
index 4ccf419b..876a43c1 100644
--- a/docs/tools/slurm.md
+++ b/docs/tools/slurm.md
@@ -9,6 +9,12 @@ SLURM is an open-source, highly scalable job scheduler that allocates computing
 
 !!! todo
     document `--account`, `--constrant` and other generic flags.
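+As a stop-gap until these flags are documented, a minimal `sbatch` sketch using the generic flags (the account name and constraint value below are hypothetical placeholders, not real project or feature names):
+
+```bash title="submit.sh"
+#!/bin/bash
+#SBATCH --account=g123        # project to charge node hours to (hypothetical name)
+#SBATCH --constraint=gpu      # request a node feature; valid values are system-specific
+#SBATCH --time=00:10:00       # wall-clock time limit
+#SBATCH --nodes=1
+
+srun hostname                 # run one task on the allocated node
+```
+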
+[](){#ref-slurm-running-jobs}
+## Running jobs
+
+!!! todo
+    document `srun --pty`, `sbatch`, `squeue`, `scontrol`, `salloc` with a few common usage examples
+
 [](){#ref-slurm-partitions}
 ## Partitions
diff --git a/docs/vclusters/clariden.md b/docs/vclusters/clariden.md
index 5a253464..757a2bc8 100644
--- a/docs/vclusters/clariden.md
+++ b/docs/vclusters/clariden.md
@@ -9,63 +9,73 @@
 ## Cluster Specification
 
 ### Hardware
 
 Clariden consists of ~1200 [Grace-Hopper nodes][ref-alps-gh200-node].
 Most nodes are in the [`normal` slurm partition][ref-slurm-partition-normal], while a few nodes are in the [`debug` partition][ref-slurm-partition-debug].
+The nodes are interconnected with the [Slingshot high-speed network][ref-alps-slingshot-network].
+As usual, the login nodes have direct internet connections, while the compute nodes access the internet through a [proxy server][ref-alps-slingshot-network].
+!!! todo
+    Document the proxy and its implications (normally transparent, but git needs modifications).
+    Fix the ref.
+
+### File systems and storage
+The scratch file system is hosted on [iopstor][ref-storage-iopstor]; in addition, the capacity storage [capstor][ref-storage-capstor] is mounted at `/capstor/scratch/cscs`.
+The variables `STORE` and `PROJECT` are not set on Clariden.
+The home directory is hosted on [VAST][ref-storage-vast].
-!!! todo
-    a standardised table with information about
+As usual, an overview of your quotas on the different file systems can be obtained with the `quota` command.
-    * number and type of nodes
+## Getting started
+Projects and resources are managed with [Waldur][ref-account-waldur].
-    and any special notes
+### Connect to Clariden
+You can connect to Clariden via [SSH][ref-ssh-config], after adding the following settings to the file `~/.ssh/config` (replace `cscsusername` with your username):
-## Logging into Clariden
+```title="$HOME/.ssh/config"
+Host ela
+    HostName ela.cscs.ch
+    User cscsusername
+    IdentityFile ~/.ssh/cscs-key
-!!! todo
-    how to log in, i.e. `ssh clariden.cscs.ch` via `ela.cscs.ch`
+Host clariden
+    HostName clariden.alps.cscs.ch
+    ProxyJump ela
+    User cscsusername
+    IdentityFile ~/.ssh/cscs-key
+    IdentitiesOnly yes
+```
+You can then use `ssh clariden` to log in to Clariden.
-    provide the snippet to add to your `~/.ssh/config`, and link to where we document this (docs not currently available)
+### Available programming environments
-## Software and services
+#### Container engine
+The recommended way of working on Clariden is through containerized workflows using the [container engine][ref-container-engine].
-!!! todo
-    information about CSCS services/tools available
+#### UENV
+Besides containerized workflows, it is also possible to run your jobs with a [UENV][ref-tool-uenv].
-    * container engine
-    * uenv
-    * CPE
-    * ... etc
+#### CPE
+Unlike on other platforms, the Cray Programming Environment (CPE) is not supported on Clariden.
-## Running Jobs on Clariden
+### Running Jobs on Clariden
 
 Clariden uses [SLURM][slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.
 See detailed instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].
 
-## Storage
-
-!!! todo
-    describe the file systems that are attached, and where.
-
-    This is where `$SCRATCH`, `$PROJECT` etc are defined for this cluster.
-
-    Refer to the specific file systems that these map onto (capstor, iopstor, waldur), and link to the storage docs for these.
-
-    Also discuss any specific storage policies. You might want to discuss storage policies for MLp one level up, in the [MLp docs][ref-platform-mlp].
-
-* attached storage and policies
+The `--account` / `-A` flag is mandatory when submitting jobs, and node-hour accounting is charged to the account specified with this flag.
 
-## Calendar and key events
+## Maintenance
+### Calendar and key events
 
-The system is updated every Tuesday, between 9 am and 12 pm.
-...
+- The system is updated every Wednesday, between 08:00 and 12:00 Zurich local time.
+- Access to the system might be closed during the maintenance window.
+- There is a SLURM maintenance reservation that prevents jobs from starting if their `StartTime+TimeLimit` extends beyond the start of the maintenance window.
 
 !!!todo notifications
     a calendar widget would be useful, particularly if we can have a central calendar, and a way to filter events for specific instances
 
-## Change log
+### Change log
 
 !!! change "special text boxes for updates"
     they can be opened and closed.
@@ -82,7 +92,7 @@ The system is updated every Tuesday, between 9 am and 12 pm.
 ??? change "2024-09-18 Daint users"
     In order to complete the preparatory work necessary to deliver Alps in production, as of September 18 2024 the vCluster Daint on Alps will no longer be accessible until further notice: the early access will still be granted on Tödi using the Slurm reservation option `--reservation=daint`
 
-## Known issues
+### Known issues
 
 __TODO__ list of know issues
 
 - include links to known issues page