Commit bae93b8

Madeeks, bcumming, and msimberg authored
Updates to CE documentation (#261)
- Added section on user-facing feature differences from upstream Pyxis
- Miscellaneous adjustments and corrections

---------

Co-authored-by: Ben Cumming <[email protected]>
Co-authored-by: Mikael Simberg <[email protected]>
1 parent 1d3d15a commit bae93b8

4 files changed

+28
-12
lines changed


docs/software/container-engine/index.md

Lines changed: 2 additions & 2 deletions
@@ -34,8 +34,8 @@ A more detailed explanation of each entry for the EDF can be seen in the [EDF r

 ```toml
 image = "library/ubuntu:24.04"
-mounts = ["/capstor/scratch/cscs/${USER}:/capstor/scratch/cscs/${USER}"]
-workdir = "/capstor/scratch/cscs/${USER}"
+mounts = ["${SCRATCH}:${SCRATCH}"]
+workdir = "${SCRATCH}"
 ```

 ### Step 2. Launch a program

docs/software/container-engine/resource-hook.md

Lines changed: 4 additions & 3 deletions
@@ -239,14 +239,15 @@ com.hooks.ssh.enabled = "true"
 com.hooks.ssh.authorize_ssh_key = "<public-key>" # (1)
 ```

-1. Replace `<public-key>` with your SSH public key.
+1. Replace `<public-key>` with the path to your SSH public key file.

 The SSH hook runs a lightweight, statically-linked SSH server (a build of [Dropbear](https://matt.ucc.asn.au/dropbear/dropbear.html)) inside the container.
 While the container is running, it's possible to connect to it from a remote host using a private key matching the public one authorized in the EDF annotation.
 It can be useful to add SSH connectivity to containers (for example, enabling remote debugging) without bundling an SSH server into the container image or creating ad-hoc image variants for such purposes.

 The `com.hooks.ssh.authorize_ssh_key` annotation allows the authorization of a custom public SSH key for remote connections.
-The annotation value must be the absolute path to a text file containing the public key (just the public key without any extra signature/certificate).
+The annotation value must be the absolute path to a *text file* containing the public key (just the public key without any extra signature/certificate).
+The annotation value should not be the public SSH key itself.
 After the container starts, it is possible to get a remote shell inside the container by connecting with SSH to the listening port.

 By default, the server started by the SSH hook listens to port 15263, but this setting can be controlled through the `com.hooks.ssh.port` annotation in the EDF.
@@ -312,7 +313,7 @@ The hook can be activated by setting the `com.hooks.nvidia_cuda_mps.enabled` to
 8
 ```

-??? example "Available GPUs and oversubscription error"
+??? example "Available GPUs and oversubscription error *without* the CUDA MPS hook"
 ```toml title="EDF: vectoradd-cuda.toml"
 image = "nvcr.io#nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04" # (1)
 ```
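As an illustration of the SSH hook clarification in the first hunk of this file, here is a minimal EDF sketch in which `com.hooks.ssh.authorize_ssh_key` holds the path to a public key file rather than the key text. The image is borrowed from the `index.md` example in this commit. The key path is a hypothetical placeholder, and the `[annotations]` table layout is an assumption not shown in the diff.

```toml
image = "library/ubuntu:24.04"

# Assumed layout: hook annotations grouped under an [annotations] table.
[annotations]
# Enable the SSH hook (key name taken from the hunk header above).
com.hooks.ssh.enabled = "true"
# Absolute path to a text file containing the public key, not the key string itself.
# Hypothetical path; replace it with the location of your own .pub file.
com.hooks.ssh.authorize_ssh_key = "/users/<username>/.ssh/id_ed25519.pub"
```

Pasting the key text itself (for example, a string starting with `ssh-ed25519`) into this annotation is the mistake the reworded documentation warns against.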

docs/software/container-engine/run.md

Lines changed: 20 additions & 5 deletions
@@ -84,14 +84,14 @@ If an EDF is located in the search path, its name can be used in the `--environm
 ...
 ```

-## Using container images
+## Managing container images

 By default, images defined in the EDF as remote registry references (e.g. a Docker reference) are automatically pulled and locally cached.
 A cached image would be preferred to pulling the image again in later usage.

 An image cache is automatically created at `.edf_imagestore` in the user's scratch folder (i.e., `${SCRATCH}/.edf_imagestore`). Cached images are stored with the corresponding CPU architecture suffix (e.g., `x86` and `aarch64`). Remove the cached image to force re-pull.

-An alternative image store path can be specify by defining the environment variable `EDF_IMAGESTORE`. `EDF_IMAGESTORE` must be an absolute path to an existing folder. Image caching may also be disable by setting `EDF_IMAGESTORE` to `void` (currently only available on Daint and Santis).
+An alternative image store path can be specify by defining the environment variable `EDF_IMAGESTORE`. `EDF_IMAGESTORE` must be an absolute path to an existing folder. Image caching may also be disable by setting `EDF_IMAGESTORE` to `void`.

 !!! note
 * If the CE cannot create a directory for the image cache, it operates in cache-free mode, meaning that it pulls an ephemeral image before every container launch and discards it upon termination.
@@ -227,9 +227,6 @@ See [the EDF reference][ref-ce-edf-reference] for the full specification of the
 [](){#ref-ce-run-mounting-squashfs}
 ### Mounting a SquashFS image

-!!! warning
-This feature is only available on some vClusters (Daint and Santis, as of 17.06.2025).
-
 A SquashFS image, essentially being a compressed data archive, can also be mounted _as a directory_ so that the image contents are readable inside the container. For this, `:sqsh` should be appended after the destination.

 !!! example "Mounting a SquashFS image `${SCRATCH}/data.sqsh` to `/data`"
@@ -238,3 +235,21 @@ A SquashFS image, essentially being a compressed data archive, can also be mount
 ```

 This is particularly useful if a job should read _multiple_ data files _frequently_, which may cause severe file access overheads. Instead, it is recommended to pack data files into one data SquashFS image and mount it inside a container. See the *"magic phrase"* in [this documentation](https://tldp.org/HOWTO/SquashFS-HOWTO/creatingandusing.html) for creating a SquashFS image.
+
+
+## Differences from upstream Pyxis
+
+The Container Engine currently uses a customized version of [NVIDIA Pyxis](https://github.com/NVIDIA/pyxis) to integrate containers with Slurm.
+
+Compared to the original, upstream Pyxis code, the following user-facing differences should be noted:
+
+!!! note
+As of September 10th, 2025, these items apply only to the Clariden and Santis vClusters.
+
+* **Disabled remapping of PyTorch-related variables:** upstream Pyxis automatically remaps the `RANK` and `LOCAL_RANK` environment variables used by PyTorch to match the `SLURM_PROCID` and `SLURM_LOCALID` variables, respectively, if the `PYTORCH_VERSION` variable is detected in the container's environment.
+This behavior has been **disabled** by default.
+The remapping can be reactivated by setting the [annotation][ref-ce-annotations] `com.pyxis.pytorch_remap_vars="true"` in the EDF.
+
+* **Logging container entrypoint output through EDF annotation:** by default, Pyxis hides the output of the container's entrypoint, if the latter is used.
+To make the entrypoint output printed on the stdout stream of the Slurm job, upstream Pyxis provides the `--container-entrypoint-log` CLI option for `srun`.
+In the Pyxis version used by the Container Engine, entrypoint output printing can also be enabled by setting the [annotation][ref-ce-annotations] `com.pyxis.entrypoint_log="true"` in the EDF.
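The two annotations introduced in the new "Differences from upstream Pyxis" section could be combined in one EDF roughly as follows. This is a sketch under assumptions: the `[annotations]` table layout is assumed rather than shown in the diff, and the image is only a placeholder borrowed from the `index.md` example. Going by the upstream behavior described above, the remapping would presumably only matter for an image whose environment defines `PYTORCH_VERSION`.

```toml
# Placeholder image; a PyTorch image would be the realistic choice here.
image = "library/ubuntu:24.04"

# Assumed layout: annotations grouped under an [annotations] table.
[annotations]
# Reactivate the remapping of RANK and LOCAL_RANK to SLURM_PROCID and SLURM_LOCALID,
# which the customized Pyxis disables by default.
com.pyxis.pytorch_remap_vars = "true"
# Print the container entrypoint output on the stdout stream of the Slurm job,
# as an alternative to passing --container-entrypoint-log to srun.
com.pyxis.entrypoint_log = "true"
```

Either annotation can be set on its own; the section describes them as independent settings.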

mkdocs.yml

Lines changed: 2 additions & 2 deletions
@@ -55,8 +55,8 @@ nav:
 - 'Release notes': software/uenv/release-notes.md
 - 'Container Engine':
 - software/container-engine/index.md
-- 'Using container engine': software/container-engine/run.md
-- 'Resource and hooks': software/container-engine/resource-hook.md
+- 'Using the Container Engine': software/container-engine/run.md
+- 'Hooks and native resources': software/container-engine/resource-hook.md
 - 'EDF reference': software/container-engine/edf.md
 - 'Known issues': software/container-engine/known-issue.md
 - 'Building and Installing Software':
