`docs/software/ml/index.md`
## Overview

CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems.
Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across environments.

Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs.

## Running Machine Learning Applications with Containers

Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.

* CSCS does not provide ready-to-use ML container images
* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers)
Helpful references:
* Running containers on Alps: [Container Engine Guide][ref-container-engine]
* Building custom container images: [Container Build Guide][ref-build-containers]
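As a sketch of the container route (the NGC base-image tag and the extra pip packages below are illustrative assumptions, not CSCS recommendations), a minimal container definition might look like:

```shell
# Write a minimal Containerfile based on an NGC PyTorch image.
# The tag `24.01-py3` and the extra packages are assumptions; check the
# NGC catalog for current tags.
cat > Containerfile <<'EOF'
FROM nvcr.io/nvidia/pytorch:24.01-py3
RUN pip install --no-cache-dir transformers datasets
EOF
# Build it where a container builder such as Podman is available:
#   podman build -t my-ml-image:latest .
```

See the [Container Build Guide][ref-build-containers] above for the CSCS-specific build and import workflow.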

Alternatively, CSCS provides pre-configured software stacks ([uenvs][ref-uenv]) that can serve as a starting point for machine learning projects.
These environments provide optimized compilers, libraries, and selected ML frameworks.
Available ML-related uenvs:

* [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden] and [Daint][ref-cluster-daint]

To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv).
See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
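As a rough sketch of that workflow (the uenv image name and view below are assumptions; the linked example is authoritative):

```shell
# Start the uenv first, e.g. `uenv start pytorch/v2.6.0:v1 --view=default`
# (image name and view are assumptions -- list available images on the system).
# Then create a venv that inherits the uenv's Python packages:
python3 -m venv --system-site-packages ./venv-ml
. ./venv-ml/bin/activate
# Additional packages now install into the venv, not the read-only uenv:
#   pip install <your-package>
python3 -c 'import sys; print(sys.prefix)'
```

The `--system-site-packages` flag lets the venv see the packages already provided by the uenv, so only your additions are installed on top.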
!!! note
    While many Python packages provide pre-built binaries for common architectures, some may require building from source.
## Building Custom Python Environments

Users may also choose to build entirely custom software stacks using Python package managers such as `pip` or `conda`.
Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).

To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:

* CUDA, cuDNN
* MPI, NCCL
* C/C++ compilers

This can be achieved either by:

* Building a [custom container image][ref-build-containers] based on a suitable ML-ready base image.
* Starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]) and extending it with a virtual environment.
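Whichever route you choose, a quick sanity check is to confirm the expected toolchain binaries are on `PATH` (the tool names below are the usual ones for the components listed above; adjust for your stack):

```shell
# Report which of the recommended toolchain pieces are visible in the
# current environment (inside the container, uenv, or venv).
for tool in nvcc mpicc gcc g++; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done
```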

The PyTorch software stack was designed with the intention of being able to run [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training workloads out of the box.
Thus, it comes with batteries included and does not just provide the bare [PyTorch framework](https://github.com/pytorch/pytorch).
!!! note "uenv"

There are two ways to access the software provided by the uenv, once it has been started:

=== "the default view"

    The simplest way to get started is to use the `default` file system view, which automatically loads all of the packages when the uenv is started.

    !!! example "Test MPI, compilers, and Python provided by pytorch/v2.6.0"

=== "Spack"

    The pytorch uenv can also be used as a base for building software with Spack, because it provides compilers, MPI, Python, and common packages like HDF5.

    [Check out the guide for using Spack with uenv][ref-building-uenv-spack].

Uenvs are read-only, and cannot be modified. However, it is possible to add Python packages on top of a uenv in a virtual environment:

6. The uenv can be exited using the `exit` command or by typing `ctrl-d`.
!!! note
    Python virtual environments can be slow on the parallel Lustre file system due to the large number of small files and potentially many processes accessing it.
    If this becomes a bottleneck, consider [squashing the venv][ref-guides-storage-venv] into its own memory-mapped, read-only file system to enhance scalability and reduce load times.

    Alternatively, one can use the uenv as an [upstream Spack instance][ref-building-uenv-spack] to add both Python and non-Python packages.
    However, this workflow is more involved and intended for advanced Spack users.

3. These variables are used by PyTorch to initialize the distributed backend.
   The `MASTER_ADDR` and `MASTER_PORT` variables determine the address and port of the master node.
   Additionally, we also need `RANK` and `LOCAL_RANK`, but these must be set per-process; see below.
4. Enable more graceful exception handling; see the [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html).
5. Set the Triton home to a local path (e.g., `/dev/shm`) to avoid writing to the (distributed) file system.
   This is important for performance, as writing to the Lustre file system can be slow due to the large number of small files and potentially many processes accessing it.
6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when used together with NCCL.
7. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
8. These variables should always be set for correctness and optimal performance when using NCCL; see [the detailed explanation][ref-communication-nccl].
9. `RANK` and `LOCAL_RANK` are set per-process by the SLURM job launcher.
10. Activate the virtual environment created on top of the uenv (if any).
    Please follow the guidelines for [Python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times.
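The per-process variables mentioned above (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`) can be derived from what SLURM sets for each task; a minimal sketch, using the standard SLURM and PyTorch variable names (the fallback defaults are only there so the snippet also runs outside a job):

```shell
# In the per-task environment that `srun` creates, map SLURM's task
# variables to the names torch.distributed reads.
export RANK="${SLURM_PROCID:-0}"
export LOCAL_RANK="${SLURM_LOCALID:-0}"
export WORLD_SIZE="${SLURM_NTASKS:-1}"
echo "RANK=$RANK LOCAL_RANK=$LOCAL_RANK WORLD_SIZE=$WORLD_SIZE"
```

Because SLURM sets `SLURM_PROCID` and `SLURM_LOCALID` differently in every task's environment, this mapping must run inside the launched task (e.g., in the wrapper script passed to `srun`), not once in the batch script.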