Commit 4d9b9c0

pytorch: uenv (#84)
1 parent 95fbc40 commit 4d9b9c0

6 files changed: +463 -1 lines changed

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -4,3 +4,4 @@ docs/software/communication @Madeeks @msimberg
 docs/software/devtools/linaro @jgphpc
 docs/software/prgenv/linalg.md @finkandreas @msimberg
 docs/software/sciapps/cp2k.md @abussy @RMeli
+docs/software/ml @boeschf

docs/clusters/clariden.md

Lines changed: 7 additions & 1 deletion
@@ -42,8 +42,14 @@ Users are encouraged to use containers on Clariden.
 
 * Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine].
 * To build images, see the [guide to building container images on Alps][ref-build-containers].
+* Base images which include the necessary libraries and compilers are for example available from the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers):
+    * [HPC NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc)
+    * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
 
-Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently the only uenv that is deployed on Clariden is [prgenv-gnu][ref-uenv-prgenv-gnu].
+Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deployed on Clariden:
+
+* [prgenv-gnu][ref-uenv-prgenv-gnu]
+* [pytorch][ref-uenv-pytorch]
 
 ??? example "using uenv provided for other clusters"
     You can run uenv that were built for other Alps clusters using the `@` notation.

docs/guides/storage.md

Lines changed: 1 addition & 0 deletions
@@ -126,6 +126,7 @@ At first it can seem strange that a "high-performance" file system is significan
 
 Meta data lookups on Lustre are expensive compared to your laptop, where the local file system is able to aggressively cache meta data.
 
+[](){#ref-guides-storage-venv}
 ### Python virtual environments with uenv
 
 Python virtual environments can be very slow on Lustre, for example a simple `import numpy` command run on Lustre might take seconds, compared to milliseconds on your laptop.

docs/software/ml/index.md

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+[](){#ref-software-ml}
+# Machine learning applications and frameworks
+
+CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems.
+Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across environments.
+
+Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs.
+
+## Running machine learning applications with containers
+
+Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.
+
+* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
+    Examples include:
+    * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
+    * [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
+* For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container.
+
+Helpful references:
+
+* Running containers on Alps: [Container Engine Guide][ref-container-engine]
+* Building custom container images: [Container Build Guide][ref-build-containers]
+
+## Using provided uenv software stacks
+
+Alternatively, CSCS provides pre-configured software stacks ([uenvs][ref-uenv]) that can serve as a starting point for machine learning projects.
+These environments provide optimized compilers, libraries, and selected ML frameworks.
+
+Available ML-related uenvs:
+
+* [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden] and [Daint][ref-cluster-daint]
+
+To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv).
+See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
+
+!!! note
+    While many Python packages provide pre-built binaries for common architectures, some may require building from source.
+
+## Building custom Python environments
+
+Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`.
+Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).
+
+To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:
+
+* CUDA, cuDNN
+* MPI, NCCL
+* C/C++ compilers
+
+This can be achieved either by:
+
+* building a [custom container image][ref-build-containers] based on a suitable ML-ready base image,
+* or starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]),
+
+and extending it with a virtual environment.
+
