Merged
4 changes: 2 additions & 2 deletions docs/software/container-engine/index.md
@@ -34,8 +34,8 @@ A more detailed explanation of each entry for the EDF can be seen in the [EDF r

```toml
image = "library/ubuntu:24.04"
mounts = ["/capstor/scratch/cscs/${USER}:/capstor/scratch/cscs/${USER}"]
workdir = "/capstor/scratch/cscs/${USER}"
mounts = ["${SCRATCH}:${SCRATCH}"]
workdir = "${SCRATCH}"
```

### Step 2. Launch a program
7 changes: 4 additions & 3 deletions docs/software/container-engine/resource-hook.md
@@ -239,14 +239,15 @@ com.hooks.ssh.enabled = "true"
com.hooks.ssh.authorize_ssh_key = "<public-key>" # (1)
```

1. Replace `<public-key>` with your SSH public key.
1. Replace `<public-key>` with the path to your SSH public key file.

The SSH hook runs a lightweight, statically-linked SSH server (a build of [Dropbear](https://matt.ucc.asn.au/dropbear/dropbear.html)) inside the container.
While the container is running, it's possible to connect to it from a remote host using a private key matching the public one authorized in the EDF annotation.
It can be useful to add SSH connectivity to containers (for example, enabling remote debugging) without bundling an SSH server into the container image or creating ad-hoc image variants for such purposes.

The `com.hooks.ssh.authorize_ssh_key` annotation allows the authorization of a custom public SSH key for remote connections.
The annotation value must be the absolute path to a text file containing the public key (just the public key without any extra signature/certificate).
The annotation value must be the absolute path to a *text file* containing the public key (just the public key without any extra signature/certificate).
The annotation value should not be the public SSH key itself.
After the container starts, it is possible to get a remote shell inside the container by connecting with SSH to the listening port.

By default, the server started by the SSH hook listens on port 15263, but this setting can be controlled through the `com.hooks.ssh.port` annotation in the EDF.
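
As a minimal sketch of the SSH hook annotations described above, an EDF could look like the following. The key file path is a hypothetical placeholder, the `[annotations]` table header is assumed to be where EDF annotations are declared, and quoting the port as a string is an assumption that mirrors the other annotation values; the port shown is simply the default.

```toml
image = "library/ubuntu:24.04"

[annotations]
com.hooks.ssh.enabled = "true"
# Absolute path to a text file holding the public key (hypothetical path),
# not the key string itself.
com.hooks.ssh.authorize_ssh_key = "/path/to/ssh-keys/id_ed25519.pub"
# Optional: 15263 is the default listening port; change it if needed.
com.hooks.ssh.port = "15263"
```

With such an EDF, a client holding the matching private key would connect to the node running the container on the configured port.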
@@ -312,7 +313,7 @@ The hook can be activated by setting the `com.hooks.nvidia_cuda_mps.enabled` to
8
```

??? example "Available GPUs and oversubscription error"
??? example "Available GPUs and oversubscription error *without* the CUDA MPS hook"
```toml title="EDF: vectoradd-cuda.toml"
image = "nvcr.io#nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04" # (1)
```
25 changes: 20 additions & 5 deletions docs/software/container-engine/run.md
@@ -84,14 +84,14 @@ If an EDF is located in the search path, its name can be used in the `--environm
...
```

## Using container images
## Managing container images

By default, images defined in the EDF as remote registry references (e.g. a Docker reference) are automatically pulled and locally cached.
On later runs, the cached image is used instead of pulling the image again.

An image cache is automatically created at `.edf_imagestore` in the user's scratch folder (i.e., `${SCRATCH}/.edf_imagestore`). Cached images are stored with the corresponding CPU architecture suffix (e.g., `x86` and `aarch64`). Remove the cached image to force re-pull.

An alternative image store path can be specified by defining the environment variable `EDF_IMAGESTORE`. `EDF_IMAGESTORE` must be an absolute path to an existing folder. Image caching may also be disabled by setting `EDF_IMAGESTORE` to `void` (currently only available on Daint and Santis).
An alternative image store path can be specified by defining the environment variable `EDF_IMAGESTORE`. `EDF_IMAGESTORE` must be an absolute path to an existing folder. Image caching may also be disabled by setting `EDF_IMAGESTORE` to `void`.

!!! note
* If the CE cannot create a directory for the image cache, it operates in cache-free mode, meaning that it pulls an ephemeral image before every container launch and discards it upon termination.
@@ -227,9 +227,6 @@ See [the EDF reference][ref-ce-edf-reference] for the full specification of the
[](){#ref-ce-run-mounting-squashfs}
### Mounting a SquashFS image

!!! warning
This feature is only available on some vClusters (Daint and Santis, as of 17.06.2025).

A SquashFS image, essentially being a compressed data archive, can also be mounted _as a directory_ so that the image contents are readable inside the container. For this, `:sqsh` should be appended after the destination.

!!! example "Mounting a SquashFS image `${SCRATCH}/data.sqsh` to `/data`"
@@ -238,3 +235,21 @@ A SquashFS image, essentially being a compressed data archive, can also be mount
```

This is particularly useful if a job should read _multiple_ data files _frequently_, which may cause severe file access overheads. Instead, it is recommended to pack data files into one data SquashFS image and mount it inside a container. See the *"magic phrase"* in [this documentation](https://tldp.org/HOWTO/SquashFS-HOWTO/creatingandusing.html) for creating a SquashFS image.


## Differences from upstream Pyxis

The Container Engine currently uses a customized version of [NVIDIA Pyxis](https://github.com/NVIDIA/pyxis) to integrate containers with Slurm.

Compared to the original, upstream Pyxis code, the following user-facing differences should be noted:

!!! note
As of September 10th, 2025, these items apply only to the Clariden and Santis vClusters.

* **Disabled remapping of PyTorch-related variables:** upstream Pyxis automatically remaps the `RANK` and `LOCAL_RANK` environment variables used by PyTorch to match the `SLURM_PROCID` and `SLURM_LOCALID` variables, respectively, if the `PYTORCH_VERSION` variable is detected in the container's environment.
This behavior has been **disabled** by default.
The remapping can be reactivated by setting the [annotation][ref-ce-annotations] `com.pyxis.pytorch_remap_vars="true"` in the EDF.

* **Logging container entrypoint output through EDF annotation:** by default, Pyxis hides the output of the container's entrypoint, if the latter is used.
To make the entrypoint output printed on the stdout stream of the Slurm job, upstream Pyxis provides the `--container-entrypoint-log` CLI option for `srun`.
In the Pyxis version used by the Container Engine, entrypoint output printing can also be enabled by setting the [annotation][ref-ce-annotations] `com.pyxis.entrypoint_log="true"` in the EDF.
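
The two annotations above can be combined in a single EDF. The following is a minimal sketch, assuming that annotations are declared under an `[annotations]` table; the image is only a placeholder reused from the EDF example elsewhere in these docs, and a PyTorch image would be the realistic choice when the remapping matters.

```toml
image = "library/ubuntu:24.04"  # placeholder image, not PyTorch-specific

[annotations]
# Re-enable the upstream remapping of RANK/LOCAL_RANK from SLURM_PROCID/SLURM_LOCALID.
com.pyxis.pytorch_remap_vars = "true"
# Print the container entrypoint output on the stdout stream of the Slurm job.
com.pyxis.entrypoint_log = "true"
```

Either annotation can be set on its own; leaving both out keeps the Container Engine defaults described above.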
4 changes: 2 additions & 2 deletions mkdocs.yml
@@ -55,8 +55,8 @@ nav:
- 'Release notes': software/uenv/release-notes.md
- 'Container Engine':
- software/container-engine/index.md
- 'Using container engine': software/container-engine/run.md
- 'Resource and hooks': software/container-engine/resource-hook.md
- 'Using the Container Engine': software/container-engine/run.md
- 'Hooks and native resources': software/container-engine/resource-hook.md
- 'EDF reference': software/container-engine/edf.md
- 'Known issues': software/container-engine/known-issue.md
- 'Building and Installing Software':