## docs/access/jupyterlab.md (+1 −1)

```diff
@@ -199,7 +199,7 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/
 While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment.
 
-A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell
+A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-software-ml-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][software-ml-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][software-ml-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell
```
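The `torchrun` line referenced at the end of the hunk is unchanged context and therefore not shown here. Purely as a hedged sketch of what such a single-node Jupyter cell could look like (the script name, config path, and flags are placeholders, not the tutorial's actual command):

```bash
%%bash
# Hypothetical single-node adaptation of a multi-node torchrun command:
# drive all four GPUs of one GH200 node, keeping the rendezvous local.
torchrun --standalone --nproc_per_node=4 run_train.py --config-file config.yaml
```

The `%%bash` header makes Jupyter hand the cell body to bash, so the same line works unchanged in a terminal on the compute node.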
## docs/build-install/containers.md (+28 −3)

```diff
@@ -4,17 +4,21 @@
 Building OCI container images on Alps vClusters is supported through [Podman](https://podman.io/), an open-source container engine that adheres to OCI standards and supports rootless containers by leveraging Linux [user namespaces](https://www.man7.org/linux/man-pages/man7/user_namespaces.7.html).
 Its command-line interface (CLI) closely mirrors Docker’s, providing a consistent and familiar experience for users of established container tools.
 
+[](){#ref-build-containers-configure-podman}
 ## Preliminary step: configuring Podman's storage
 
-The first step in order to use Podman on Alps is to create a valid Container Storage configuration file at `$HOME/.config/containers/storage.conf` (or `$XDG_CONFIG_HOME/containers/storage.conf`, if you have `$XDG_CONFIG_HOME` set), according to the following minimal template:
+The first step in order to use Podman on Alps is to create a valid Container Storage configuration file in your home according to the following minimal template:
 ...
+If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead.
 !!! warning
     In the above configuration, `/dev/shm` is used to store the container images.
     `/dev/shm` is the mount point of a [tmpfs filesystem](https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html#tmpfs) and is compatible with the user namespaces used by Podman.
```
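The template itself is collapsed in this view (the `...` above). For orientation only, here is a minimal sketch of what a `/dev/shm`-backed configuration could look like, consistent with the warning that follows it; the driver and paths are assumptions, not the diffed content:

```toml
# Hypothetical minimal storage.conf; the actual template is elided above.
[storage]
  driver = "overlay"                  # rootless overlay storage
  runroot = "/dev/shm/$USER/runroot"  # temporary runtime state on tmpfs
  graphroot = "/dev/shm/$USER/root"   # image layers, lost when the job ends
```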
````diff
@@ -43,11 +47,33 @@ podman build -t <image:tag> .
 
 In general, [`podman build`](https://docs.podman.io/en/stable/markdown/podman-build.1.html) follows the Docker options convention.
 
+!!! info "Debugging the container build"
+    If the container build fails, you can run an interactive shell using the image from the last successfully built layer with
+
+    ```bash
+    podman run -it --rm -e NVIDIA_VISIBLE_DEVICES=void <last-layer-hash> bash # (1)!
+    ```
+
+    1. Setting `NVIDIA_VISIBLE_DEVICES` in the environment is required specifically to run NGC containers with podman
+
+    replacing `<last-layer-hash>` with the actual hash output in the build job, and interactively test the failing command.
+
 ## Importing images in the Container Engine
 
 An image built using Podman can be easily imported as a squashfs archive in order to be used with our Container Engine solution.
 It is important to keep in mind that the import has to take place in the same job allocation where the image creation took place, otherwise the image is lost due to the temporary nature of `/dev/shm`.
 
+!!! warning "Preliminary configuration: Lustre settings for container images"
+    Since container images are large files and the filesystem is a shared resource, you need to configure the target directory according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image so it will be properly distributed across storage nodes.
 ...
+    1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB)
````
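The Lustre command that annotation 1 describes is elided above. As an illustrative sketch only, a progressive file layout matching the stated 4 MB and 64 MB thresholds could be set with `lfs setstripe` (the directory is a placeholder):

```bash
# Hypothetical example matching the annotation: first 4 MB on one storage
# node, up to 64 MB on four, everything beyond striped across all of them.
lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 <path to image directory>
```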
````diff
 To import the image:
 
 ```
 ...
@@ -62,7 +88,6 @@ image = "/<path to image directory>/<image_name.sqsh>"
 ...
````
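The import command inside the fence above is elided in this view. As a sketch of the general pattern only, assuming the Container Engine's enroot-based import (image name and tag are placeholders):

```bash
# Hypothetical: export the Podman-built image to a squashfs archive in the
# same job allocation, before the /dev/shm-backed storage disappears.
enroot import -x mount -o <image_name>.sqsh podman://<image:tag>
```

The last hunk's context line suggests the resulting archive is then referenced from an environment definition file via its `image` key.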
## docs/platforms/mlp/index.md (+1 −1)

```diff
@@ -91,4 +91,4 @@ Project is per project - each project gets a project folder with project-specifi
 ## Guides and tutorials
 
-Tutorials for fine-tuning and running inference of LLMs as well as training an LLM with Nanotron can be found in the [MLP Tutorials][ref-guides-mlp-tutorials] page.
+Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials] under machine learning software.
```
## docs/software/communication/nccl.md (+1 −1)

```diff
@@ -14,7 +14,7 @@ When using e.g. the `default` view of `prgenv-gnu` the `aws-ofi-nccl` plugin wil
 Alternatively, loading the `aws-ofi-nccl` module with the `modules` view also makes the plugin available in the environment.
 The environment variables described below must be set to ensure that NCCL uses the plugin.
 
-While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL:
+While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL with uenv:
```
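The variable list itself sits below this hunk and is unchanged by this diff. Purely as an illustration of the kind of setting involved (these names and values are assumptions, not the documented list):

```bash
# Hypothetical examples of plugin-related settings; the authoritative list
# follows this paragraph in the documentation itself.
export NCCL_NET="AWS Libfabric"        # make NCCL pick the aws-ofi-nccl plugin
export FI_CXI_DISABLE_HOST_REGISTER=1  # common libfabric/CXI tuning on Slingshot
```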
## docs/software/ml/index.md (+23 −11)

```diff
@@ -2,22 +2,33 @@
 # Machine learning applications and frameworks
 
 CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems.
-Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across environments.
+Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across machines.
 
 Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs.
 
-## Running machine learning applications with containers
+First-time users are recommended to consult the [LLM tutorials][ref-software-ml-tutorials] to get familiar with the concepts of the Machine Learning platform in a series of hands-on examples.
+
+## Running ML applications with containers (recommended)
 
 Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.
 
-*Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
+Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
 Examples include:
-* [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
-* [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
-* For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container.
+
+* [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) ([Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html))
+* [JAX NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/jax) ([Release Notes](https://docs.nvidia.com/deeplearning/frameworks/jax-release-notes/index.html))
+* [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) (deprecated since 25.02, see [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/index.html))
+
+Documented best practices are available for:
+
+* [PyTorch][ref-ce-pytorch]
+
+!!! note "Extending a container with a virtual environment"
+    For frequently changing Python dependencies during development, consider creating a Virtual Environment (venv) on top of the packages in the container (see [this example][ref-ce-pytorch-venv]).
 
 Helpful references:
 
+* Introduction to concepts of the Machine Learning platform: [LLM tutorials][ref-software-ml-tutorials]
 * Running containers on Alps: [Container Engine Guide][ref-container-engine]
 * Building custom container images: [Container Build Guide][ref-build-containers]
```
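As a rough sketch of the venv-on-top-of-a-container pattern the new note describes (paths and packages here are illustrative; the linked example is the documented procedure):

```bash
# Hypothetical: create a venv that still sees the container's preinstalled
# packages, then layer frequently changing dependencies on top.
python -m venv --system-site-packages /workspace/my-venv
source /workspace/my-venv/bin/activate
pip install peft trl   # example extras, not taken from the diff
```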
```diff
@@ -30,17 +41,18 @@ Available ML-related uenvs:
 
 * [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden] and [Daint][ref-cluster-daint]
 
-To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv).
-See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
-
-!!! note
-    While many Python packages provide pre-built binaries for common architectures, some may require building from source.
+!!! note "Extending a uenv with a virtual environment"
+    To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv) layered on top of the packages in the uenv.
+    See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
 
 ## Building custom Python environments
 
 Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`.
 Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).
 
+!!! note
+    While many Python packages provide pre-built binaries for common architectures, some may require building from source.
+
 To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:
```
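The recommended starting points follow below this hunk and are not shown here. Purely as a sketch of the custom-environment route with `uv` (the Python version and packages are illustrative):

```bash
# Hypothetical workflow: build a custom Python environment with uv.
uv venv --python 3.12 my-env   # create the environment
source my-env/bin/activate
uv pip install torch numpy     # most ML libraries install from PyPI
```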