1 change: 1 addition & 0 deletions docs/alps/hardware.md
@@ -21,6 +21,7 @@ This approach to cooling provides greater efficiency for the rack-level cooling,
* Maximum of 64 quad-blade compute blades
* Maximum of 64 Slingshot switch blades

[](){#ref-alps-slingshot-network}
## Alps High Speed Network

!!! todo
3 changes: 3 additions & 0 deletions docs/alps/storage.md
@@ -16,16 +16,19 @@ These separate clusters are on the same Slingshot 11 network as the Alps.
| IOPs | 1.5M | 8.6M read, 24M write | 200k read, 768k write |
| file create/s| 374k | 214k | 97k |

[](){#ref-storage-capstor}
## capstor

Capstor is the largest file system, used for storing large amounts of input and output data.
It is used to provide SCRATCH and STORE for different clusters - the precise details are platform-specific.

[](){#ref-storage-iopstor}
## iopstor

!!! todo
small text explaining what iopstor is designed to be used for.

[](){#ref-storage-vast}
## vast

The Vast storage is a smaller-capacity system that is designed for use as home folders.
2 changes: 1 addition & 1 deletion docs/tools/container-engine.md
@@ -1,4 +1,4 @@
[](){#container-engine}
[](){#ref-container-engine}
# Container Engine

The Container Engine (CE) toolset is designed to enable computing jobs to seamlessly run inside Linux application containers, thus providing support for containerized user environments.
6 changes: 6 additions & 0 deletions docs/tools/slurm.md
@@ -9,6 +9,12 @@ SLURM is an open-source, highly scalable job scheduler that allocates computing
!!! todo
document `--account`, `--constraint` and other generic flags.

[](){#ref-slurm-running-jobs}
## Running jobs

!!! todo
document `srun --pty`, `sbatch`, `squeue`, `scontrol`, `salloc` with a few common usage examples
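
Until that documentation is written, here is a minimal sketch of typical usage; `g123`, `job.sh`, the partition name, and the job id are placeholders rather than documented defaults.

```bash
# Interactive shell on one compute node (placeholder account/partition).
srun --account=g123 --partition=debug --nodes=1 --time=00:30:00 --pty bash

# Submit a batch job, then inspect the queue and a specific job.
sbatch --account=g123 job.sh
squeue --me
scontrol show job 123456

# Allocate nodes interactively and launch job steps inside the allocation.
salloc --account=g123 --nodes=2 --time=01:00:00
```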

[](){#ref-slurm-partitions}
## Partitions

76 changes: 43 additions & 33 deletions docs/vclusters/clariden.md
@@ -9,63 +9,73 @@
## Cluster Specification
### Hardware
Clariden consists of ~1200 [Grace-Hopper nodes][ref-alps-gh200-node]. Most nodes are in the [`normal` slurm partition][ref-slurm-partition-normal], while a few nodes are in the [`debug` partition][ref-slurm-partition-debug].
The nodes are interconnected with the [Slingshot high speed network][ref-alps-slingshot-network].

As usual, the login nodes have direct internet connections, while the compute nodes use a [proxy server][ref-alps-slingshot-network] to access the internet.
!!! todo
Document the proxy and its implications (normally transparent, but git needs modifications).
Fix the ref

### File systems and storage

The scratch filesystem is hosted on [IOPStor][ref-storage-iopstor], but the capacity storage [Capstor][ref-storage-capstor] is also mounted at `/capstor/scratch/cscs`.
The variables `STORE` and `PROJECT` are not set on Clariden.
The home directory is hosted on [VAST][ref-storage-vast].

!!! todo
a standardised table with information about
As usual, an overview of your quota on the different filesystems can be obtained with the `quota` command.

* number and type of nodes
## Getting started

The project and resources are managed by [this tool][ref-account-waldur].

and any special notes
### Connect to Clariden

You can connect to Clariden via [ssh][ref-ssh-config], ensuring that the file `~/.ssh/config` has these settings (replace `cscsusername` with your username).

## Logging into Clariden
```title="$HOME/.ssh/config"
Host ela
HostName ela.cscs.ch
User cscsusername
IdentityFile ~/.ssh/cscs-key

!!! todo
how to log in, i.e. `ssh clariden.cscs.ch` via `ela.cscs.ch`
Host clariden
HostName clariden.alps.cscs.ch
ProxyJump ela
User cscsusername
IdentityFile ~/.ssh/cscs-key
IdentitiesOnly yes
```

You can then use `ssh clariden` to login to Clariden.

provide the snippet to add to your `~/.ssh/config`, and link to where we document this (docs not currently available)
### Available programming environments

## Software and services
#### Container engine

The recommended way of working on Clariden is containerized workflows leveraging the [container engine][ref-container-engine].
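
As a rough sketch of what this can look like (the EDF location, file name, and image are illustrative assumptions; see the container engine documentation for the exact workflow):

```bash
# Hypothetical example: write a minimal environment definition file (EDF)
# and start an interactive job inside the corresponding container.
mkdir -p ~/.edf
cat > ~/.edf/pytorch.toml <<'EOF'
image = "nvcr.io#nvidia/pytorch:24.07-py3"
EOF

# The --environment flag asks the container engine to run the job step
# inside the container described by the named EDF.
srun --environment=pytorch --pty bash
```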

!!! todo
information about CSCS services/tools available
#### uenv

Besides running containerized workflows, it is possible to run your jobs with a [uenv][ref-tool-uenv].
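
A minimal sketch of the uenv workflow (the image name is a placeholder; see the uenv documentation for what is actually provided on Clariden):

```bash
# Sketch: discover, pull, and start a uenv image.
uenv image find
uenv image pull prgenv-gnu/24.7:v1
uenv start prgenv-gnu/24.7:v1
```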

* container engine
* uenv
* CPE
* ... etc
#### CPE

Unlike on other platforms, the Cray Programming Environment (CPE) is not supported on Clariden.

## Running Jobs on Clariden
### Running Jobs on Clariden

Clariden uses [SLURM][slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs.

See detailed instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200].

## Storage

!!! todo
describe the file systems that are attached, and where.

This is where `$SCRATCH`, `$PROJECT` etc are defined for this cluster.

Refer to the specific file systems that these map onto (capstor, iopstor, waldur), and link to the storage docs for these.

Also discuss any specific storage policies. You might want to discuss storage policies for MLp one level up, in the [MLp docs][ref-platform-mlp].

* attached storage and policies
The flag `--account=<account>` / `-A <account>` is mandatory when submitting jobs to SLURM, and node-hour usage is charged to the account specified with this flag.
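
For example, a minimal batch script could look like the following sketch, where `g123` is a placeholder account name:

```bash
#!/bin/bash
#SBATCH --account=g123      # node hours are charged to this account
#SBATCH --nodes=1
#SBATCH --time=00:30:00

srun hostname
```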

## Calendar and key events
## Maintenance
### Calendar and key events

The system is updated every Tuesday, between 9 am and 12 pm.
...
- The system is updated every Wednesday, between 08:00 and 12:00 Zurich local time.
- Access to the system might be closed during the maintenance window.
- There is a SLURM maintenance reservation that prevents jobs from starting if their `StartTime+TimeLimit` extends beyond the maintenance start (see the sketch below).
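
The reservation can be inspected, and a job sized to fit before the window, roughly as follows (`job.sh` is a placeholder):

```bash
# List reservations, then request a time limit short enough that
# StartTime + TimeLimit stays before the maintenance start.
scontrol show reservation
sbatch --time=01:00:00 job.sh
```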

!!! todo
notifications

a calendar widget would be useful, particularly if we can have a central calendar, and a way to filter events for specific instances

## Change log
### Change log

!!! change "special text boxes for updates"
they can be opened and closed.
@@ -82,7 +92,7 @@ The system is updated every Tuesday, between 9 am and 12 pm.
??? change "2024-09-18 Daint users"
In order to complete the preparatory work necessary to deliver Alps in production, as of September 18 2024 the vCluster Daint on Alps will no longer be accessible until further notice: the early access will still be granted on Tödi using the Slurm reservation option `--reservation=daint`

## Known issues
### Known issues

__TODO__ list of known issues - include links to known issues page
