Clariden docs #31
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -9,63 +9,73 @@ | |||||||||||||
| ## Cluster Specification | ||||||||||||||
| ### Hardware | ||||||||||||||
| Clariden consists of ~1200 [Grace-Hopper nodes][ref-alps-gh200-node]. Most nodes are in the [`normal` slurm partition][ref-slurm-partition-normal], while a few nodes are in the [`debug` partition][ref-slurm-partition-debug]. | ||||||||||||||
| The nodes are interconnected with the [Slingshot high-speed network][ref-alps-slingshot-network]. | ||||||||||||||
|
|
||||||||||||||
| As usual, the login nodes have direct internet connections, while the compute nodes use a [proxy server][ref-alps-slingshot-network] to access the internet. | ||||||||||||||
| !!! todo | ||||||||||||||
| Document proxy and the implications (normally transparent, but git needs modifications) | ||||||||||||||
| Fix the ref | ||||||||||||||
|
Comment on lines
+14
to
+17
Collaborator
|
||||||||||||||
|
|
||||||||||||||
| ### File systems and storage | ||||||||||||||
|
Collaborator
Suggested change
Collaborator
I feel like the list of filesystems could maybe be presented more clearly in a table? |
||||||||||||||
| The scratch filesystem is hosted on [IOPStore][ref-storage-iopstor]; the capacity storage [Capstor][ref-storage-capstor] is also mounted at `/capstor/scratch/cscs`. | ||||||||||||||
|
Collaborator
Suggested change
? Not sure how "iopstor" should be capitalized... could be "Iopstor" as well to mirror "Capstor"? |
||||||||||||||
| The variables `STORE` and `PROJECT` are not set on Clariden. | ||||||||||||||
| The home directory is hosted on [VAST][ref-storage-vast]. | ||||||||||||||
|
|
||||||||||||||
| !!! todo | ||||||||||||||
| a standardised table with information about | ||||||||||||||
| As usual, an overview of your quota on the different filesystems can be obtained with the `quota` command. | ||||||||||||||
|
|
||||||||||||||
| * number and type of nodes | ||||||||||||||
| ## Getting started | ||||||||||||||
|
Collaborator
Suggested change
|
||||||||||||||
| The project and resources are managed by [this tool][ref-account-waldur]. | ||||||||||||||
|
|
||||||||||||||
| and any special notes | ||||||||||||||
| ### Connect to Clariden | ||||||||||||||
|
Collaborator
Suggested change
|
||||||||||||||
| To connect to Clariden via [ssh][ref-ssh-config], make sure that the file `~/.ssh/config` contains the following settings (replace `cscsusername` with your username): | ||||||||||||||
|
|
||||||||||||||
| ## Logging into Clariden | ||||||||||||||
| ```title="$HOME/.ssh/config" | ||||||||||||||
| Host ela | ||||||||||||||
| HostName ela.cscs.ch | ||||||||||||||
| User cscsusername | ||||||||||||||
| IdentityFile ~/.ssh/cscs-key | ||||||||||||||
|
|
||||||||||||||
| !!! todo | ||||||||||||||
| how to log in, i.e. `ssh clariden.cscs.ch` via `ela.cscs.ch` | ||||||||||||||
| Host clariden | ||||||||||||||
| HostName clariden.alps.cscs.ch | ||||||||||||||
| ProxyJump ela | ||||||||||||||
| User cscsusername | ||||||||||||||
| IdentityFile ~/.ssh/cscs-key | ||||||||||||||
| IdentitiesOnly yes | ||||||||||||||
| ``` | ||||||||||||||
|
Comment on lines
31
to
+44
Collaborator
Link to https://eth-cscs.github.io/cscs-docs/access/ssh/ or more specifically https://eth-cscs.github.io/cscs-docs/access/ssh/#logging-in? |
||||||||||||||
| You can then use `ssh clariden` to log in to Clariden. | ||||||||||||||
|
|
||||||||||||||
| provide the snippet to add to your `~/.ssh/config`, and link to where we document this (docs not currently available) | ||||||||||||||
| ### Available programming environments | ||||||||||||||
|
|
||||||||||||||
| ## Software and services | ||||||||||||||
| #### Container engine | ||||||||||||||
|
Collaborator
Suggested change
|
||||||||||||||
| The recommended way of working on Clariden is to use containerized workflows that leverage the [container engine][ref-container-engine]. | ||||||||||||||
|
|
||||||||||||||
| !!! todo | ||||||||||||||
| information about CSCS services/tools available | ||||||||||||||
| #### UENV | ||||||||||||||
|
Collaborator
Suggested change
|
||||||||||||||
| Besides running containerized workflows, it is possible to run your jobs with a [UENV][ref-tool-uenv]. | ||||||||||||||
|
Collaborator
Suggested change
|
||||||||||||||
|
|
||||||||||||||
| * container engine | ||||||||||||||
| * uenv | ||||||||||||||
| * CPE | ||||||||||||||
| * ... etc | ||||||||||||||
| #### CPE | ||||||||||||||
|
Collaborator
Suggested change
|
||||||||||||||
| Unlike on other platforms, the Cray Programming Environment (CPE) is not supported on Clariden. | ||||||||||||||
|
|
||||||||||||||
| ## Running Jobs on Clariden | ||||||||||||||
| ### Running Jobs on Clariden | ||||||||||||||
|
|
||||||||||||||
| Clariden uses [SLURM][slurm] as the workload manager, which is used to launch and monitor distributed workloads, such as training runs. | ||||||||||||||
|
|
||||||||||||||
| See detailed instructions on how to run jobs on the [Grace-Hopper nodes][ref-slurm-gh200]. | ||||||||||||||
|
|
||||||||||||||
| ## Storage | ||||||||||||||
|
|
||||||||||||||
| !!! todo | ||||||||||||||
| describe the file systems that are attached, and where. | ||||||||||||||
|
|
||||||||||||||
| This is where `$SCRATCH`, `$PROJECT` etc are defined for this cluster. | ||||||||||||||
|
|
||||||||||||||
| Refer to the specific file systems that these map onto (capstor, iopstor, waldur), and link to the storage docs for these. | ||||||||||||||
|
|
||||||||||||||
| Also discuss any specific storage policies. You might want to discuss storage policies for MLp one level up, in the [MLp docs][ref-platform-mlp]. | ||||||||||||||
|
|
||||||||||||||
| * attached storage and policies | ||||||||||||||
| The flag `--account=<account>` / `-A <account>` is mandatory when submitting jobs to SLURM, and node hours are charged to the account specified with this flag. | ||||||||||||||
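As a minimal sketch of how the mandatory account flag could be enforced in a submission helper, the Python snippet below builds an `sbatch` command line. The helper name `sbatch_command`, the account `a-proj01` and the script `job.sh` are hypothetical placeholders; only `--account` and `--partition` are actual `sbatch` options, and the `debug` partition name is taken from the hardware section.

```python
import shlex
from typing import List, Optional


def sbatch_command(account: str, script: str, partition: Optional[str] = None) -> List[str]:
    """Build an sbatch argument list that always carries --account.

    `account` and `script` are placeholders chosen by the caller; this helper
    is illustrative and not part of any CSCS tooling.
    """
    if not account:
        # Jobs without an account are rejected, so fail early on the client side.
        raise ValueError("an account is required to submit jobs")
    cmd = ["sbatch", f"--account={account}"]
    if partition is not None:
        cmd.append(f"--partition={partition}")
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without SLURM.
    print(shlex.join(sbatch_command("a-proj01", "job.sh", partition="debug")))
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting issues.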
|
|
||||||||||||||
| ## Calendar and key events | ||||||||||||||
| ## Maintenance | ||||||||||||||
| ### Calendar and key events | ||||||||||||||
|
Comment on lines
+66
to
+67
Collaborator
Suggested change
|
||||||||||||||
|
|
||||||||||||||
| The system is updated every Tuesday, between 9 am and 12 pm. | ||||||||||||||
| ... | ||||||||||||||
| - The system is updated every Wednesday, between 08:00 and 12:00 Zurich local time. | ||||||||||||||
| - Access to the system might be closed during the maintenance window. | ||||||||||||||
| - There is a maintenance reservation for SLURM, which prevents jobs from starting if their `StartTime+TimeLimit` is beyond the maintenance start. | ||||||||||||||
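The `StartTime+TimeLimit` rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not how SLURM implements the reservation: the function names are hypothetical, and all datetimes are naive values assumed to already be in Zurich local time.

```python
from datetime import datetime, timedelta


def next_maintenance_start(now: datetime) -> datetime:
    """Return the next Wednesday 08:00 at or after `now` (naive, Zurich local time assumed)."""
    days_ahead = (2 - now.weekday()) % 7  # Monday is 0, so Wednesday is 2
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=8, minute=0, second=0, microsecond=0
    )
    if candidate < now:
        # Today is Wednesday and 08:00 has already passed: use next week's window.
        candidate += timedelta(days=7)
    return candidate


def can_start_before_maintenance(
    start: datetime, time_limit: timedelta, maintenance_start: datetime
) -> bool:
    """A job may start only if StartTime + TimeLimit is not beyond the maintenance start."""
    return start + time_limit <= maintenance_start


if __name__ == "__main__":
    now = datetime(2025, 1, 7, 10, 0)  # a Tuesday
    window = next_maintenance_start(now)
    for hours in (12, 24):
        ok = can_start_before_maintenance(now, timedelta(hours=hours), window)
        print(f"{hours}h job starting {now}: can start = {ok}")
```

In practice this means that a job queued shortly before the window only starts if its requested time limit is short enough to finish before the maintenance begins.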
|
|
||||||||||||||
| !!!todo | ||||||||||||||
| notifications | ||||||||||||||
|
|
||||||||||||||
| a calendar widget would be useful, particularly if we can have a central calendar, and a way to filter events for specific instances | ||||||||||||||
|
|
||||||||||||||
| ## Change log | ||||||||||||||
| ### Change log | ||||||||||||||
|
|
||||||||||||||
| !!! change "special text boxes for updates" | ||||||||||||||
| they can be opened and closed. | ||||||||||||||
|
|
@@ -82,7 +92,7 @@ The system is updated every Tuesday, between 9 am and 12 pm. | |||||||||||||
| ??? change "2024-09-18 Daint users" | ||||||||||||||
| In order to complete the preparatory work necessary to deliver Alps in production, as of September 18 2024 the vCluster Daint on Alps will no longer be accessible until further notice: the early access will still be granted on Tödi using the Slurm reservation option `--reservation=daint` | ||||||||||||||
|
|
||||||||||||||
| ## Known issues | ||||||||||||||
| ### Known issues | ||||||||||||||
|
|
||||||||||||||
| __TODO__ list of known issues - include links to known issues page | ||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||