Commit 8cea92b

Move MLP tutorials under software, add CE section to Pytorch including best practice for large-scale training (#231)

Authored by lukasgd
Co-authored-by: boeschf <[email protected]>
Co-authored-by: Ben Cumming <[email protected]>
Co-authored-by: Theofilos Manitaras <[email protected]>

1 parent 5ab761a, commit 8cea92b

18 files changed: +440 -106 lines

.github/CODEOWNERS

Lines changed: 1 addition & 1 deletion

````diff
@@ -9,6 +9,6 @@ docs/software/prgenv/linalg.md @finkandreas @msimberg
 docs/software/sciapps/cp2k.md @abussy @RMeli
 docs/software/sciapps/lammps.md @nickjbrowning
 docs/software/sciapps/gromacs.md @kanduri
-docs/software/ml @boeschf
+docs/software/ml @boeschf @henrique @lukasgd
 docs/storage @mpasserini
 docs/alps/storage.md @mpasserini
````
Lines changed: 4 additions & 0 deletions

````diff
@@ -0,0 +1,4 @@
+JAX
+nvitop
+NVRTC
+placeholders
````

docs/access/jupyterlab.md

Lines changed: 4 additions & 2 deletions

````diff
@@ -86,7 +86,7 @@ If the default base images do not meet your requirements, you can specify a cust
 3. Currently only required on Daint and Santis, not on Clariden
 4. Set working directory of Jupyter session (file browser root directory)
 5. Use environment settings for optimized communication
-6. Disable CUDA JIT cache
+6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
 7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error
 8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL
@@ -199,7 +199,9 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/
 
 While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment.
 
-A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell
+A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-tutorials-ml].
+In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][software-ml-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash).
+For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][software-ml-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell
 
 ```bash
 !python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ...
````

docs/build-install/containers.md

Lines changed: 23 additions & 3 deletions

````diff
@@ -4,17 +4,22 @@
 Building OCI container images on Alps vClusters is supported through [Podman](https://podman.io/), an open-source container engine that adheres to OCI standards and supports rootless containers by leveraging Linux [user namespaces](https://www.man7.org/linux/man-pages/man7/user_namespaces.7.html).
 Its command-line interface (CLI) closely mirrors Docker’s, providing a consistent and familiar experience for users of established container tools.
 
+[](){#ref-build-containers-configure-podman}
 ## Preliminary step: configuring Podman's storage
 
-The first step in order to use Podman on Alps is to create a valid Container Storage configuration file at `$HOME/.config/containers/storage.conf` (or `$XDG_CONFIG_HOME/containers/storage.conf`, if you have `$XDG_CONFIG_HOME` set), according to the following minimal template:
+The first step in order to use Podman on Alps is to create a valid Container Storage configuration file in your home directory, according to the following minimal template:
 
-```toml
+```toml title="$HOME/.config/containers/storage.conf"
 [storage]
 driver = "overlay"
 runroot = "/dev/shm/$USER/runroot"
 graphroot = "/dev/shm/$USER/root"
 ```
 
+!!! warning
+    If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead.
+    See the [terminal user guide][ref-guides-terminal-arch] for further information about XDG variables.
+
 !!! warning
     In the above configuration, `/dev/shm` is used to store the container images.
     `/dev/shm` is the mount point of a [tmpfs filesystem](https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html#tmpfs) and is compatible with the user namespaces used by Podman.
````
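The minimal template above can be put in place with a short shell snippet; this is a hedged sketch that simply writes the file shown in the diff, honoring `$XDG_CONFIG_HOME` as the warning describes (the quoted heredoc keeps `$USER` literal, as in the template):

```shell
# Write Podman's minimal storage configuration as shown above.
# Honors $XDG_CONFIG_HOME when set, otherwise falls back to $HOME/.config.
CONF_ROOT="${XDG_CONFIG_HOME:-$HOME/.config}"
mkdir -p "$CONF_ROOT/containers"
cat > "$CONF_ROOT/containers/storage.conf" <<'EOF'
[storage]
driver = "overlay"
runroot = "/dev/shm/$USER/runroot"
graphroot = "/dev/shm/$USER/root"
EOF
echo "wrote $CONF_ROOT/containers/storage.conf"
```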
````diff
@@ -43,11 +48,27 @@ podman build -t <image:tag> .
 
 In general, [`podman build`](https://docs.podman.io/en/stable/markdown/podman-build.1.html) follows the Docker options convention.
 
+!!! info "Debugging the container build"
+    If the container build fails, you can run an interactive shell using the image from the last successfully built layer with
+
+    ```bash
+    podman run -it --rm -e NVIDIA_VISIBLE_DEVICES=void <last-layer-hash> bash # (1)!
+    ```
+
+    1. Setting `NVIDIA_VISIBLE_DEVICES` in the environment is required specifically to run NGC containers with podman
+
+    replacing `<last-layer-hash>` with the actual hash printed during the build, and then interactively test the failing command.
+
 ## Importing images in the Container Engine
 
 An image built using Podman can be easily imported as a squashfs archive in order to be used with our Container Engine solution.
 It is important to keep in mind that the import has to take place in the same job allocation where the image creation took place, otherwise the image is lost due to the temporary nature of `/dev/shm`.
 
+!!! info "Preliminary configuration: Lustre settings for container images"
+    Container images are stored in a single [SquashFS]() file that is typically 1-20 GB in size (particularly for large ML containers).
+    To ensure good performance for jobs on multiple nodes, take the time to configure the target directory using `lfs setstripe` according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image, or use `lfs migrate` to fix files that are already imported.
````
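The Lustre preparation described in the note above might look like the following sketch; the stripe count and size here are illustrative assumptions, not CSCS-verified values, so consult the linked Lustre best practices for the real settings:

```shell
# Prepare a target directory for multi-GB SquashFS container images.
# Stripe parameters (-c, -S) are illustrative assumptions; see the Lustre
# best-practice guide for the recommended values on Alps.
IMG_DIR="${IMG_DIR:-${TMPDIR:-/tmp}/container-images}"
mkdir -p "$IMG_DIR"
if command -v lfs >/dev/null 2>&1; then
    # Stripe new files in this directory across several OSTs.
    lfs setstripe -c 4 -S 4M "$IMG_DIR" || echo "setstripe failed (not a Lustre mount?)"
else
    echo "lfs not available; run this on a Lustre file system"
fi
```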
````diff
 To import the image:
 
 ```
@@ -62,7 +83,6 @@ image = "/<path to image directory>/<image_name.sqsh>"
 mounts = ["/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>"]
 workdir = "/capstor/scratch/cscs/<username>"
 ```
-
 ## Pushing Images to a Container Registry
 
 In order to push an image to a container registry, you first need to follow three steps:
````

docs/clusters/clariden.md

Lines changed: 2 additions & 0 deletions

````diff
@@ -65,6 +65,8 @@ Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deploy
 uenv start namd/3.0:v3@daint
 ```
 
+For detailed instructions and best practices with ML frameworks, please refer to the dedicated pages under [ML software][ref-software-ml].
+
 ## Running Jobs on Clariden
 
 ### Slurm
````

docs/guides/mlp_tutorials/index.md

Lines changed: 0 additions & 10 deletions
This file was deleted.

docs/index.md

Lines changed: 39 additions & 22 deletions

````diff
@@ -1,9 +1,3 @@
-!!! info ""
-    This is the new CSCS documentation site, which replaces the [CSCS Knowledge Base](https://confluence.cscs.ch/display/KB).
-
-    The migration of old documentation is still not fully complete.
-    If you find documentation that is missing, please create a ticket on the documentation's [GitHub issue tracker](https://github.com/eth-cscs/cscs-docs/issues).
-
 # CSCS Documentation
 
 <div class="grid cards" markdown>
@@ -66,32 +60,26 @@ The Alps Research infrastructure hosts multiple platforms and clusters targeting
 
 </div>
 
-[](){#ref-get-in-touch}
-## Get in Touch
+## Tutorials and Guides
 
-If you cannot find the information that you need in the documentation, help is available.
+Learn by doing with our guides and tutorials.
 
 <div class="grid cards" markdown>
+- :fontawesome-solid-layer-group: __Tutorials__
 
-- :fontawesome-solid-headset: __Get Help__
-
-    Contact the CSCS Service Desk for help.
-
-    [:octicons-arrow-right-24: Service Desk](https://jira.cscs.ch/plugins/servlet/desk)
+    Hands on tutorials that show how to implement workflows on Alps.
 
-- :fontawesome-regular-comments: __Chat__
+    [:octicons-arrow-right-24: Machine Learning][ref-tutorials-ml]
 
-    Discuss Alps with other users and CSCS staff on Slack.
+- :fontawesome-solid-mountain-sun: __Guides__
 
-    [:octicons-arrow-right-24: CSCS User Slack](https://cscs-users.slack.com/)
+    Guides with practical advice, hints and tips for key topics.
 
-<div class="grid cards" markdown>
-- :fontawesome-solid-hammer: __Contribute__
+    [:octicons-arrow-right-24: Using storage effectively][ref-guides-storage]
 
-    The source for the documentation is hosted on GitHub.
+    [:octicons-arrow-right-24: Accessing internet and external services][ref-guides-internet-access]
 
-    [:octicons-arrow-right-24: Contribute to the docs ](contributing/index.md)
-</div>
+    [:octicons-arrow-right-24: Using and configuring the terminal][ref-guides-terminal]
 
 </div>
 
@@ -142,3 +130,32 @@ If you cannot find the information that you need in the documentation, help is a
 
 </div>
 
+[](){#ref-get-in-touch}
+## Get in Touch
+
+If you cannot find the information that you need in the documentation, help is available.
+
+<div class="grid cards" markdown>
+
+- :fontawesome-solid-headset: __Get Help__
+
+    Contact the CSCS Service Desk for help.
+
+    [:octicons-arrow-right-24: Service Desk](https://jira.cscs.ch/plugins/servlet/desk)
+
+- :fontawesome-regular-comments: __Chat__
+
+    Discuss Alps with other users and CSCS staff on Slack.
+
+    [:octicons-arrow-right-24: CSCS User Slack](https://cscs-users.slack.com/)
+
+<div class="grid cards" markdown>
+- :fontawesome-solid-hammer: __Contribute__
+
+    The source for the documentation is hosted on GitHub.
+
+    [:octicons-arrow-right-24: Contribute to the docs ](contributing/index.md)
+</div>
+
+</div>
````

docs/platforms/mlp/index.md

Lines changed: 9 additions & 3 deletions

````diff
@@ -3,6 +3,15 @@
 
 The Machine Learning Platform (MLP) provides compute, storage and expertise to the machine learning and AI community in Switzerland, with the main user being the [Swiss AI Initiative](https://www.swiss-ai.org/).
 
+<div class="grid cards" markdown>
+- :fontawesome-solid-mountain: [__Tutorials__][ref-tutorials-ml]
+
+    Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-tutorials-ml].
+
+    Also check out the [PyTorch documentation][ref-software-ml-pytorch] for information about how to run PyTorch.
+
+</div>
+
 ## Getting started
 
 ### Getting access
@@ -89,6 +98,3 @@ Project is per project - each project gets a project folder with project-specifi
 * hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or ela
 * it is not recommended to write directly to the project path from jobs.
 
-## Guides and tutorials
-
-Tutorials for fine-tuning and running inference of LLMs as well as training an LLM with Nanotron can be found in the [MLP Tutorials][ref-guides-mlp-tutorials] page.
````

docs/software/communication/nccl.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -14,7 +14,7 @@ When using e.g. the `default` view of `prgenv-gnu` the `aws-ofi-nccl` plugin wil
 Alternatively, loading the `aws-ofi-nccl` module with the `modules` view also makes the plugin available in the environment.
 The environment variables described below must be set to ensure that NCCL uses the plugin.
 
-While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL:
+While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL with uenv:
 
 ```bash
 --8<-- "docs/software/communication/nccl_env_vars"
````

docs/software/ml/index.md

Lines changed: 23 additions & 11 deletions

````diff
@@ -2,22 +2,33 @@
 # Machine learning applications and frameworks
 
 CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems.
-Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across environments.
+Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across systems.
 
 Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs.
 
-## Running machine learning applications with containers
+First time users are recommended to consult the [LLM tutorials][ref-tutorials-ml] to get familiar with the concepts of the Machine Learning platform in a series of hands-on examples.
+
+## Running ML applications with containers (recommended)
 
 Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.
 
-* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
+Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
 Examples include:
-    * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
-    * [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
-* For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container.
+
+* [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) ([Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html))
+* [JAX NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/jax) ([Release Notes](https://docs.nvidia.com/deeplearning/frameworks/jax-release-notes/index.html))
+* [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) (deprecated since 25.02, see [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/index.html))
+
+Documented best practices are available for:
+
+* [PyTorch][ref-ce-pytorch]
+
+!!! note "Extending a container with a virtual environment"
+    For frequently changing Python dependencies during development, consider creating a Virtual Environment (venv) on top of the packages in the container (see [this example][ref-ce-pytorch-venv]).
 
 Helpful references:
 
+* Introduction to concepts of the Machine Learning platform: [LLM tutorials][ref-tutorials-ml]
 * Running containers on Alps: [Container Engine Guide][ref-container-engine]
 * Building custom container images: [Container Build Guide][ref-build-containers]
````
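The note above about layering a venv on top of a container's packages can be sketched as follows; this is run inside the container, and the `--system-site-packages` flag is what makes the container's preinstalled libraries visible from the venv (the paths are illustrative):

```shell
# Create a venv that reuses the container's site-packages, so only the
# frequently changing development dependencies need to be installed into it.
VENV_DIR="${VENV_DIR:-${TMPDIR:-/tmp}/venv-demo}"
python3 -m venv --system-site-packages "$VENV_DIR"
. "$VENV_DIR/bin/activate"
# The interpreter prefix now points into the venv, while packages from the
# container remain importable.
python -c 'import sys; print(sys.prefix)'
```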

````diff
@@ -30,17 +41,18 @@ Available ML-related uenvs:
 
 * [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden] and [Daint][ref-cluster-daint]
 
-To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv).
-See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
-
-!!! note
-    While many Python packages provide pre-built binaries for common architectures, some may require building from source.
+!!! note "Extending a uenv with a virtual environment"
+    To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv) layered on top of the packages in the uenv.
+    See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
 
 ## Building custom Python environments
 
 Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`.
 Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).
 
+!!! note
+    While many Python packages provide pre-built binaries for common architectures, some may require building from source.
+
 To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:
 
 * CUDA, cuDNN
````
