Commit 2cf1aea ("review comments")
1 parent c6f9e9c

3 files changed (+15, -6 lines)

docs/clusters/clariden.md

Lines changed: 3 additions & 0 deletions

```diff
@@ -42,6 +42,9 @@ Users are encouraged to use containers on Clariden.
 
 * Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine].
 * To build images, see the [guide to building container images on Alps][ref-build-containers].
+* Base images which include the necessary libraries and compilers are for example available from the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers):
+    * [HPC NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc)
+    * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
 
 Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deployed on Clariden:
 
```
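As an illustration of the container-engine workflow these bullets describe, an environment definition file (EDF) can point at one of the NGC base images. The file location, image tag, and mount paths below are hypothetical examples, not taken from the commit; consult the container-engine documentation for the authoritative key names:

```toml
# ~/.edf/ngc-pytorch.toml -- hypothetical example EDF
# Image tag and mount paths are placeholders.
image = "nvcr.io#nvidia/pytorch:24.01-py3"
mounts = ["/capstor/scratch/cscs/myuser:/scratch"]
workdir = "/scratch"
```

A job could then be submitted with something along the lines of `srun --environment=ngc-pytorch ...`, assuming the EDF is picked up from the default search path.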

docs/software/ml/index.md

Lines changed: 10 additions & 5 deletions

```diff
@@ -10,8 +10,11 @@ Users can choose between running containers, using provided uenv software stacks
 
 Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.
 
-* CSCS does not provide ready-to-use ML container images
-* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers)
+* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
+  Examples include:
+    * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
+    * [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
+* For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container.
 
 Helpful references:
 
@@ -35,7 +38,7 @@ See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
 
 ## Building custom Python environments
 
-Users may also choose to build entirely custom software stacks using Python package managers such as `pip` or `conda`.
+Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`.
 Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).
 
 To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:
@@ -46,6 +49,8 @@ To ensure optimal performance on CSCS systems, we recommend starting from an env
 
 This can be achieved either by:
 
-* Building a [custom container image][ref-build-containers] based on a suitable ML-ready base image.
-* Starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]) and extending it with a virtual environment.
+* building a [custom container image][ref-build-containers] based on a suitable ML-ready base image,
+* or starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]),
+
+and extending it with a virtual environment.
 
```
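The "venv mounted into the container" pattern added above can be sketched with the standard library's `venv` module; the directory names are illustrative, and the bind-mount itself is done by the container runtime, not by Python:

```python
import tempfile
import venv
from pathlib import Path

# Create a virtual environment on the host file system (a temp dir here
# stands in for a scratch path). The directory can then be bind-mounted
# into a container so that fast-changing Python dependencies live outside
# the image and survive image rebuilds.
host_dir = Path(tempfile.mkdtemp())
venv_dir = host_dir / "my-venv"
venv.create(venv_dir, with_pip=False)  # with_pip=False keeps this offline-friendly

# The container runtime would mount host_dir; here we only verify that the
# venv skeleton (pyvenv.cfg) was created.
print((venv_dir / "pyvenv.cfg").exists())
```

Inside the container, activating the mounted venv then layers the mutable packages on top of the image's fixed software stack.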

docs/software/ml/pytorch.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -307,6 +307,7 @@ $ exit # (6)!
    This will restore the original Python executable provided by the uenv.
 6. The uenv can be exited using the `exit` command or by typing `ctrl-d`.
 
+
 !!! note "Squashing the virtual environment"
     Python virtual environments can be slow on the parallel Lustre file system due to the amount of small files and potentially many processes accessing it.
     If this becomes a bottleneck, consider [squashing the venv][ref-guides-storage-venv] into its own memory-mapped, read-only file system to enhance scalability and reduce load times.
@@ -380,7 +381,7 @@ srun bash -c "
    The optimal number depends on the workload and should be determined by testing.
    Consider for example that typical workloads using PyTorch may fork the processes, so the number of threads should be around the number of cores per task divided by the number of processes.
 3. These variables are used by PyTorch to initialize the distributed backend.
-   The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node.
+   The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node, and `WORLD_SIZE` gives the total number of processes.
    Additionally we also need `RANK` and `LOCAL_RANK`, but these must be set per-process, see below.
 4. Enable more graceful exception handling, see the [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html).
 5. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
```
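The per-process rank logic in note 3 can be sketched in plain Python: `MASTER_ADDR`, `MASTER_PORT`, and `WORLD_SIZE` are shared across the job (normally exported once in the batch script), while `RANK` and `LOCAL_RANK` are derived per process, here from the usual Slurm per-task variables. The helper name is hypothetical, not part of the documented setup:

```python
import os

def per_process_env(environ=os.environ):
    """Hypothetical helper: derive the per-process variables that
    PyTorch's distributed backend expects from Slurm's per-task
    variables. The job-wide variables (MASTER_ADDR, MASTER_PORT,
    WORLD_SIZE) are assumed to be exported by the batch script."""
    return {
        "RANK": environ["SLURM_PROCID"],         # global rank of this task
        "LOCAL_RANK": environ["SLURM_LOCALID"],  # rank within this node
    }

# Example: the 4th task overall, 2nd task on its node.
env = per_process_env({"SLURM_PROCID": "3", "SLURM_LOCALID": "1"})
print(env["RANK"], env["LOCAL_RANK"])  # 3 1
```

These values would typically be exported in the per-process part of the `srun bash -c` wrapper before launching the training script.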
