
Commit e605fd1

Merge branch 'main' into patch-1
2 parents 1014229 + b9aa92c

File tree

12 files changed (+181, -26 lines)

.github/actions/spelling/allow.txt

Lines changed: 3 additions & 0 deletions
@@ -4,6 +4,7 @@ AMD
 Alpstein
 Balfrin
 Besard
+Besso
 Broyden
 CFLAGS
 CHARMM
@@ -121,6 +122,7 @@ artifactory
 autodetection
 aws
 baremetal
+besso
 biomolecular
 blaspp
 blt
@@ -326,6 +328,7 @@ uenv
 uenvs
 uids
 ultrasoft
+unsquashfs
 utkin
 vCluster
 vClusters

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+name: Delete PR preview
+on:
+  pull_request_target:
+    branches: ['main']
+    types: ['closed']
+
+jobs:
+  preview_delete:
+    name: Delete preview
+    runs-on: ubuntu-latest
+    steps:
+      - name: delete-preview
+        run: |
+          curl --fail -X DELETE -H "Authorization: Bearer ${{ secrets.UPLOAD_TOKEN }}" https://docs.tds.cscs.ch/upload?path=${{ github.event.pull_request.number }}

.github/workflows/welcome.yaml

Lines changed: 0 additions & 21 deletions
This file was deleted.

docs/clusters/besso.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+[](){#ref-cluster-besso}
+# Besso
+
+Besso is a small Alps cluster that provides development resources for porting software for selected customers.
+It is provided as is, without the same level of support as the main platform clusters.
+
+### Storage and file systems
+
+Besso uses the [HPCP filesystems and storage policies][ref-hpcp-storage].
+
+## Getting started
+
+### Logging into Besso
+
+To connect to Besso via SSH, first refer to the [ssh guide][ref-ssh].
+
+!!! example "`~/.ssh/config`"
+    Add the following to your [SSH configuration][ref-ssh-config] to enable you to connect directly to Besso using `ssh besso`.
+    ```
+    Host besso
+        HostName besso.vc.cscs.ch
+        ProxyJump ela
+        User cscsusername
+        IdentityFile ~/.ssh/cscs-key
+        IdentitiesOnly yes
+    ```
+
+### Software
+
+[](){#ref-cluster-besso-uenv}
+#### uenv
+
+Besso is a development and testing system, for which CSCS does not provide supported applications.
+
+Instead, the [prgenv-gnu][ref-uenv-prgenv-gnu] programming environment is provided for both the [a100][ref-alps-a100-node] and [mi200][ref-alps-mi200-node] node types.
+
+[](){#ref-cluster-besso-containers}
+#### Containers
+
+Besso supports container workloads using the [Container Engine][ref-container-engine].
+
+To build images, see the [guide to building container images on Alps][ref-build-containers].
+
+#### Cray Modules
+
+!!! warning
+    The Cray Programming Environment (CPE), loaded using `module load cray`, is no longer supported by CSCS.
+
+    CSCS will continue to support and update uenv and the Container Engine, and users are encouraged to update their workflows to use these methods at the first opportunity.
+
+    The CPE is still installed on Besso; however, it will receive no support or updates, and will be [replaced with a container][ref-cpe] in a future update.
+
+## Running jobs on Besso
+
+### Slurm
+
+Besso uses [Slurm][ref-slurm] as the workload manager, which is used to launch and monitor workloads on compute nodes.
+
+There are multiple [Slurm partitions][ref-slurm-partitions] on the system:
+
+* the `a100` partition contains [NVIDIA A100 GPU][ref-alps-a100-node] nodes
+* the `mi200` partition contains [AMD MI250x GPU][ref-alps-mi200-node] nodes
+* the `normal` partition contains all of the nodes in the system.
+
+| name | max nodes per job | time limit |
+| -- | -- | -- |
+| `a100` | 2 | 24 hours |
+| `mi200` | 2 | 24 hours |
+| `normal` | 4 | 24 hours |
+
+See the Slurm documentation for instructions on how to [run jobs][ref-slurm].
+
+### FirecREST
+
+!!! under-construction
+    Besso will have support for [FirecREST][ref-firecrest] access.
+
+## Maintenance and status
+
+There is no regular scheduled maintenance for this system.
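
To make the uenv and Slurm sections above concrete, here is a rough sketch (not taken from the page itself) of pulling `prgenv-gnu` and requesting nodes from the `a100` partition. The image version tag, the `default` view name, and the `--uenv`/`--view` Slurm options are assumptions to verify against the linked uenv and Slurm pages.

```console
# Hypothetical sketch: discover and pull a prgenv-gnu image provided on the system
# (the version tag below is an assumption — use whatever `uenv image find` reports).
$ uenv image find prgenv-gnu
$ uenv image pull prgenv-gnu/24.11:v1

# Request a single A100 node interactively, within the 2-node / 24-hour limits
# listed in the partition table above.
$ srun -p a100 -N 1 --uenv=prgenv-gnu/24.11:v1 --view=default --pty bash
```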

docs/alps/clusters.md renamed to docs/clusters/index.md

Lines changed: 10 additions & 0 deletions
@@ -43,4 +43,14 @@ The following clusters are part of the platforms that are fully operated by CSCS
     [:octicons-arrow-right-24: Santis][ref-cluster-santis]
 </div>
 
+## Other systems
+
+<div class="grid cards" markdown>
+- :fontawesome-solid-mountain: __Porting and Development__
+
+    Besso is a small system used by some partners for development and porting with AMD and NVIDIA GPUs.
+
+    [:octicons-arrow-right-24: Besso][ref-cluster-besso]
+</div>
+

docs/guides/storage.md

Lines changed: 2 additions & 0 deletions
@@ -206,6 +206,8 @@ The first step is to create the virtual environment using the usual workflow.
 # create and activate a new relocatable venv using uv
 # in this case we explicitly select python 3.12
 uv venv -p 3.12 --relocatable --link-mode=copy /dev/shm/sqfs-demo/.venv
+# You can also point to the uenv python with `uv venv -p $(which python) ...`
+# which, among other things, enables user portability of the venv
 cd /dev/shm/sqfs-demo
 source .venv/bin/activate
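
As a follow-up to the new comments in this hunk, a hedged sketch of the alternative they describe — building the venv against the python provided by an active uenv rather than a uv-managed interpreter. The uenv name, tag, and view below are placeholders, not part of the diff.

```console
# Inside a started uenv (name/tag are placeholders), `which python` resolves to the
# uenv-provided interpreter, so the relocatable venv is built against it.
$ uenv start prgenv-gnu/24.11:v1 --view=default
$ uv venv -p $(which python) --relocatable --link-mode=copy /dev/shm/sqfs-demo/.venv
$ source /dev/shm/sqfs-demo/.venv/bin/activate
```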

docs/index.md

Lines changed: 4 additions & 4 deletions
@@ -30,15 +30,15 @@ Find out more about Alps...
 
     Learn more about the Alps research infrastructure
 
-    [:octicons-arrow-right-24: Alps Overview](alps/index.md)
+    [:octicons-arrow-right-24: Alps Overview][ref-alps]
 
     Get detailed information about the main components of the infrastructure
 
-    [:octicons-arrow-right-24: Alps Clusters](alps/clusters.md)
+    [:octicons-arrow-right-24: Alps Clusters][ref-alps-clusters]
 
-    [:octicons-arrow-right-24: Alps Hardware](alps/hardware.md)
+    [:octicons-arrow-right-24: Alps Hardware][ref-alps-hardware]
 
-    [:octicons-arrow-right-24: Alps Storage](alps/storage.md)
+    [:octicons-arrow-right-24: Alps Storage][ref-alps-storage]
 
 - :fontawesome-solid-key: __Logging In__

docs/software/container-engine/known-issue.md

Lines changed: 29 additions & 0 deletions
@@ -79,3 +79,32 @@ The use of `--environment` as `#SBATCH` is known to cause **unexpected behaviors
 - **Nested use of `--environment`**: running `srun --environment` in `#SBATCH --environment` results in double-entering EDF containers, causing unexpected errors in the underlying container runtime.
 
 To avoid any unexpected confusion, users are advised **not** to use `--environment` as `#SBATCH`. If users encounter a problem while using this, it's recommended to move `--environment` from `#SBATCH` to each `srun` and see if the problem disappears.
+
+[](){#ref-ce-no-user-id}
+## Container start fails with `id: cannot find name for user ID`
+
+If your Slurm job using a container fails to start with an error message similar to:
+```console
+slurmstepd: error: pyxis: container start failed with error code: 1
+slurmstepd: error: pyxis: container exited too soon
+slurmstepd: error: pyxis: printing engine log file:
+slurmstepd: error: pyxis: id: cannot find name for user ID 42
+slurmstepd: error: pyxis: id: cannot find name for user ID 42
+slurmstepd: error: pyxis: id: cannot find name for user ID 42
+slurmstepd: error: pyxis: mkdir: cannot create directory ‘/iopsstor/scratch/cscs/42’: Permission denied
+slurmstepd: error: pyxis: couldn't start container
+slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
+slurmstepd: error: Failed to invoke spank plugin stack
+srun: error: nid001234: task 0: Exited with exit code 1
+srun: Terminating StepId=12345.0
+```
+this does not indicate an issue with your container; instead it means that one or more of the compute nodes have user databases that are not fully synchronized.
+If the problematic node is not automatically drained, please [let us know][ref-get-in-touch] so that we can ensure the node is in a good state.
+You can check the state of a node using `sinfo --nodes=<node>`, e.g.:
+```console
+$ sinfo --nodes=nid006886
+PARTITION AVAIL  TIMELIMIT   NODES  STATE   NODELIST
+debug     up     1:30:00         0  n/a
+normal*   up     12:00:00        1  drain$  nid006886
+xfer      up     1-00:00:00      0  n/a
+```
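
One possible stop-gap, not stated in the hunk above, while waiting for the affected node to be drained or repaired: resubmit and exclude that node explicitly. The node name is the one from the example output; the job script and EDF name are placeholders.

```console
# Resubmit while skipping the node that reported the user-ID error
# (job.sh and my-edf are placeholders).
$ sbatch --exclude=nid006886 job.sh
# The same flag works for interactive steps; `id` also confirms the user resolves.
$ srun --exclude=nid006886 --environment=my-edf id
```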

docs/software/container-engine/run.md

Lines changed: 4 additions & 0 deletions
@@ -24,6 +24,10 @@ There are three ways to do so:
 !!! note "Shared container at the node-level"
     For memory efficiency reasons, all Slurm tasks on an individual compute node share the same container, including its filesystem. As a consequence, any write operation to the container filesystem by one task will eventually become visible to all other tasks on the same node.
 
+!!! warning "Container start failure with `id: cannot find name for user ID`"
+    Containers may fail to start due to user database issues on compute nodes.
+    See [this section][ref-ce-no-user-id] for more details.
+
 ### Use from batch scripts
 
 Use `--environment` with the Slurm command (e.g., `srun` or `salloc`):
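
For context on the last line of the hunk, a minimal hedged sketch of what passing `--environment` to `srun` or `salloc` looks like in practice; the EDF name, image, and paths are placeholders, not something defined in this diff.

```console
# Assume an EDF at ~/.edf/ubuntu-demo.toml containing (roughly) an entry like
#   image = "ubuntu:24.04"
# The EDF name is then passed straight to the Slurm commands:
$ srun --environment=ubuntu-demo cat /etc/os-release
$ salloc --environment=ubuntu-demo
```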

docs/software/sciviz/paraview.md

Lines changed: 2 additions & 0 deletions
@@ -136,5 +136,7 @@ You will need to add the corresponding XML code to your local ParaView installation
           <Argument value="6000"/>
         </Arguments>
       </Command>
+    </CommandStartup>
+  </Server>
 </Servers>
 ```
