Commit 2cf1aea ("review comments")
1 parent c6f9e9c

3 files changed (+15, -6 lines)

docs/clusters/clariden.md

Lines changed: 3 additions & 0 deletions

```diff
@@ -42,6 +42,9 @@ Users are encouraged to use containers on Clariden.
 
 * Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine].
 * To build images, see the [guide to building container images on Alps][ref-build-containers].
+* Base images which include the necessary libraries and compilers are for example available from the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers):
+    * [HPC NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc)
+    * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
 
 Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deployed on Clariden:
 
```
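As an illustration of the container-engine workflow these bullets describe, an environment definition file (EDF) can point at one of the NGC base images. The file location, image tag, and mount paths below are hypothetical examples, not taken from the commit; consult the container-engine documentation for the authoritative key names:

```toml
# ~/.edf/ngc-pytorch.toml -- hypothetical example EDF
# Image tag and mount paths are placeholders.
image = "nvcr.io#nvidia/pytorch:24.01-py3"
mounts = ["/capstor/scratch/cscs/myuser:/scratch"]
workdir = "/scratch"
```

A job could then be submitted with something along the lines of `srun --environment=ngc-pytorch ...`, assuming the EDF is picked up from the default search path.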

docs/software/ml/index.md

Lines changed: 10 additions & 5 deletions

```diff
@@ -10,8 +10,11 @@ Users can choose between running containers, using provided uenv software stacks
 
 Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.
 
-* CSCS does not provide ready-to-use ML container images
-* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers)
+* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
+  Examples include:
+    * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
+    * [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
+* For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container.
 
 Helpful references:
 
@@ -35,7 +38,7 @@ See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
 
 ## Building custom Python environments
 
-Users may also choose to build entirely custom software stacks using Python package managers such as `pip` or `conda`.
+Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`.
 Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).
 
 To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:
@@ -46,6 +49,8 @@ To ensure optimal performance on CSCS systems, we recommend starting from an env
 
 This can be achieved either by:
 
-* Building a [custom container image][ref-build-containers] based on a suitable ML-ready base image.
-* Starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]) and extending it with a virtual environment.
+* building a [custom container image][ref-build-containers] based on a suitable ML-ready base image,
+* or starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]),
+
+and extending it with a virtual environment.
 
```
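The "venv mounted into the container" pattern added above can be sketched with the standard library's `venv` module; the directory names are illustrative, and the bind-mount itself is done by the container runtime, not by Python:

```python
import tempfile
import venv
from pathlib import Path

# Create a virtual environment on the host file system (a temp dir here
# stands in for a scratch path). The directory can then be bind-mounted
# into a container so that fast-changing Python dependencies live outside
# the image and survive image rebuilds.
host_dir = Path(tempfile.mkdtemp())
venv_dir = host_dir / "my-venv"
venv.create(venv_dir, with_pip=False)  # with_pip=False keeps this offline-friendly

# The container runtime would mount host_dir; here we only verify that the
# venv skeleton (pyvenv.cfg) was created.
print((venv_dir / "pyvenv.cfg").exists())
```

Inside the container, activating the mounted venv then layers the mutable packages on top of the image's fixed software stack.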

docs/software/ml/pytorch.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -307,6 +307,7 @@ $ exit # (6)!
    This will restore the original Python executable provided by the uenv.
 6. The uenv can be exited using the `exit` command or by typing `ctrl-d`.
 
+
 !!! note "Squashing the virtual environment"
     Python virtual environments can be slow on the parallel Lustre file system due to the amount of small files and potentially many processes accessing it.
     If this becomes a bottleneck, consider [squashing the venv][ref-guides-storage-venv] into its own memory-mapped, read-only file system to enhance scalability and reduce load times.
@@ -380,7 +381,7 @@ srun bash -c "
    The optimal number depends on the workload and should be determined by testing.
    Consider for example that typical workloads using PyTorch may fork the processes, so the number of threads should be around the number of cores per task divided by the number of processes.
 3. These variables are used by PyTorch to initialize the distributed backend.
-   The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node.
+   The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node, and `WORLD_SIZE` gives the total number of processes.
    Additionally we also need `RANK` and `LOCAL_RANK`, but these must be set per-process, see below.
 4. Enable more graceful exception handling, see the [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html).
 5. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
```
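The per-process rank logic in note 3 can be sketched in plain Python: `MASTER_ADDR`, `MASTER_PORT`, and `WORLD_SIZE` are shared across the job (normally exported once in the batch script), while `RANK` and `LOCAL_RANK` are derived per process, here from the usual Slurm per-task variables. The helper name is hypothetical, not part of the documented setup:

```python
import os

def per_process_env(environ=os.environ):
    """Hypothetical helper: derive the per-process variables that
    PyTorch's distributed backend expects from Slurm's per-task
    variables. The job-wide variables (MASTER_ADDR, MASTER_PORT,
    WORLD_SIZE) are assumed to be exported by the batch script."""
    return {
        "RANK": environ["SLURM_PROCID"],         # global rank of this task
        "LOCAL_RANK": environ["SLURM_LOCALID"],  # rank within this node
    }

# Example: the 4th task overall, 2nd task on its node.
env = per_process_env({"SLURM_PROCID": "3", "SLURM_LOCALID": "1"})
print(env["RANK"], env["LOCAL_RANK"])  # 3 1
```

These values would typically be exported in the per-process part of the `srun bash -c` wrapper before launching the training script.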
