`docs/clusters/clariden.md` — 3 additions & 0 deletions
@@ -42,6 +42,9 @@ Users are encouraged to use containers on Clariden.
  * Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine].
  * To build images, see the [guide to building container images on Alps][ref-build-containers].
+ * Base images which include the necessary libraries and compilers are available, for example, from the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers):
+     * [HPC NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc)
+     * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)

  Alternatively, [uenv][ref-uenv] are also available. Currently deployed on Clariden:
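The NGC base-image workflow above can be sketched as follows. This is a minimal illustration, not taken from the docs: the EDF field names, file location, and image tag are assumptions to be checked against the container engine documentation.

```shell
# Hypothetical environment definition file (EDF) for the container engine.
# Field names (image, mounts, workdir) and the image tag are illustrative.
mkdir -p "$HOME/.edf"
cat > "$HOME/.edf/ngc-pytorch.toml" <<'EOF'
image = "nvcr.io#nvidia/pytorch:24.01-py3"
mounts = ["/capstor/scratch/cscs:/capstor/scratch/cscs"]
workdir = "/capstor/scratch/cscs"
EOF
# A job could then be submitted against this environment, e.g.:
#   srun --environment=ngc-pytorch python -c 'import torch; print(torch.__version__)'
echo "wrote $HOME/.edf/ngc-pytorch.toml"
```

Keeping the EDF under a fixed per-user directory lets `--environment` refer to it by name rather than by full path.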
`docs/software/ml/index.md` — 10 additions & 5 deletions
@@ -10,8 +10,11 @@ Users can choose between running containers, using provided uenv software stacks
  Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.
- * CSCS does not provide ready-to-use ML container images
- * Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers)
+ * Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
+   Examples include:
+     * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
+     * [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
+ * For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container.

  Helpful references:
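The venv-mounted-into-the-container approach above can be sketched as follows. The paths and the environment name are illustrative assumptions, not from the docs; on Alps the venv would typically live on scratch so it survives across jobs and can be bind-mounted at run time.

```shell
# Create the venv once, outside the job, on a shared file system
# (path is illustrative; on Alps this would be somewhere under $SCRATCH).
VENV="${SCRATCH:-/tmp}/demo-venv"
python3 -m venv "$VENV"
echo "created venv at $VENV"
# Inside the job, activate it from within the container, e.g. (hypothetical
# environment name):
#   srun --environment=ngc-pytorch bash -c "source $VENV/bin/activate && python train.py"
```

Because the venv lives outside the image, dependencies can be updated with `pip install` without rebuilding the container.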
@@ -35,7 +38,7 @@ See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
  ## Building custom Python environments
- Users may also choose to build entirely custom software stacks using Python package managers such as `pip` or `conda`.
+ Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`.
  Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).

  To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:
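A custom stack built with `uv`, as suggested above, might look like the following sketch. The directory and package names are illustrative; the snippet falls back to the stdlib `venv` module when `uv` is not on `PATH`.

```shell
# Build a custom Python environment with uv (directory name is illustrative).
ENV_DIR="${SCRATCH:-/tmp}/ml-env"
if command -v uv >/dev/null 2>&1; then
    uv venv "$ENV_DIR"
    # uv resolves and installs PyPI packages much faster than pip, e.g.:
    #   uv pip install --python "$ENV_DIR/bin/python" torch numpy
else
    # Fallback when uv is not installed: the stdlib venv module.
    python3 -m venv "$ENV_DIR"
fi
echo "environment ready: $ENV_DIR"
```

Either way, the result is a standard venv that can be activated with `source "$ENV_DIR/bin/activate"`.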
@@ -46,6 +49,8 @@ To ensure optimal performance on CSCS systems, we recommend starting from an env
  This can be achieved either by:
- * Building a [custom container image][ref-build-containers] based on a suitable ML-ready base image.
- * Starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]) and extending it with a virtual environment.
+ * building a [custom container image][ref-build-containers] based on a suitable ML-ready base image,
+ * or starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]),
`docs/software/ml/pytorch.md` — 2 additions & 1 deletion
@@ -307,6 +307,7 @@ $ exit # (6)!
      This will restore the original Python executable provided by the uenv.
  6. The uenv can be exited using the `exit` command or by typing `ctrl-d`.
+
  !!! note "Squashing the virtual environment"
      Python virtual environments can be slow on the parallel Lustre file system due to the number of small files and the potentially many processes accessing them.
      If this becomes a bottleneck, consider [squashing the venv][ref-guides-storage-venv] into its own memory-mapped, read-only file system to enhance scalability and reduce load times.
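The squashing idea in the note above can be sketched with standard `squashfs-tools`; this is a generic illustration under assumed paths, not the CSCS-specific workflow described in the storage guide.

```shell
# Pack a venv directory tree into a single read-only squashfs image, so
# thousands of small files on Lustre become one memory-mapped file.
# Paths are illustrative.
VENV="${SCRATCH:-/tmp}/demo-venv"
mkdir -p "$VENV/bin"          # stand-in for a real venv tree
SQSH="$VENV.squashfs"
if command -v mksquashfs >/dev/null 2>&1; then
    mksquashfs "$VENV" "$SQSH" -noappend
else
    echo "mksquashfs not available; see the storage guide for alternatives"
fi
echo "target image: $SQSH"
```

The resulting image would then be mounted read-only inside the job, while the original venv directory is no longer touched at scale.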
@@ -380,7 +381,7 @@ srun bash -c "
      The optimal number depends on the workload and should be determined by testing.
      Consider, for example, that typical workloads using PyTorch may fork processes, so the number of threads should be around the number of cores per task divided by the number of processes.
  3. These variables are used by PyTorch to initialize the distributed backend.
-     The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node.
+     The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node, while `WORLD_SIZE` sets the total number of processes.
      Additionally, we also need `RANK` and `LOCAL_RANK`, but these must be set per-process; see below.
  4. Enable more graceful exception handling; see the [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html).
  5. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system.
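The annotations above can be pulled together in a job-script sketch that derives the rendezvous variables from Slurm. The fallback values and the worker count are assumptions for illustration, so the snippet also runs outside an allocation.

```shell
# Derive PyTorch distributed rendezvous variables from Slurm (fallbacks are
# illustrative, for running outside an allocation).
export MASTER_ADDR="${SLURM_NODELIST:+$(scontrol show hostnames "$SLURM_NODELIST" | head -n1)}"
export MASTER_ADDR="${MASTER_ADDR:-localhost}"
export MASTER_PORT=29500
export WORLD_SIZE="${SLURM_NTASKS:-1}"
# RANK and LOCAL_RANK must be set per process, e.g. inside `srun bash -c`:
#   export RANK=$SLURM_PROCID LOCAL_RANK=$SLURM_LOCALID
# Threads per process: cores per task divided by the number of forked
# workers (NUM_WORKERS=4 is an assumed example).
CORES_PER_TASK="${SLURM_CPUS_PER_TASK:-8}"
NUM_WORKERS=4
export OMP_NUM_THREADS=$(( CORES_PER_TASK / NUM_WORKERS ))
echo "rendezvous $MASTER_ADDR:$MASTER_PORT world=$WORLD_SIZE omp=$OMP_NUM_THREADS"
```

Keeping the rank-independent exports in the batch script and only the per-process `RANK`/`LOCAL_RANK` inside the `srun` command mirrors the split described in point 3.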