Merged
Commits (26)
1418931
Moved MLP tutorials under software, added CE section to Pytorch
lukasgd Aug 19, 2025
a0fc65a
Merge branch 'main' into mlp-tutorials-update-iii
lukasgd Aug 19, 2025
5109d87
Update docs/build-install/containers.md
lukasgd Aug 20, 2025
34bbcf8
Update docs/build-install/containers.md
lukasgd Aug 20, 2025
5a62e5c
Update docs/platforms/mlp/index.md
lukasgd Aug 20, 2025
9c1c90a
Update docs/software/ml/pytorch.md
lukasgd Aug 20, 2025
2c4381f
Update docs/software/ml/pytorch.md
lukasgd Aug 20, 2025
6c9b1a5
Update docs/software/ml/pytorch.md
lukasgd Aug 20, 2025
7c580b8
Update docs/software/ml/pytorch.md
lukasgd Aug 20, 2025
42082ce
Update docs/software/ml/pytorch.md
lukasgd Aug 20, 2025
c4e9cb6
Update docs/software/ml/pytorch.md
lukasgd Aug 20, 2025
61b0ffe
Update docs/software/ml/pytorch.md
lukasgd Aug 20, 2025
6d22e8d
Update docs/software/ml/pytorch.md
lukasgd Aug 20, 2025
d438dcd
Integrating Fabian's feedback, updating code owners
lukasgd Aug 20, 2025
205d4b5
Merge branch 'main' into mlp-tutorials-update-iii
lukasgd Aug 20, 2025
2c7904c
Updating known issues
lukasgd Aug 21, 2025
ef0fdf6
Update check-spelling metadata
lukasgd Aug 21, 2025
f099e3c
Update docs/build-install/containers.md
bcumming Aug 22, 2025
9862ec3
Update docs/software/ml/index.md
bcumming Aug 22, 2025
48ccb03
Update docs/software/ml/tutorials/index.md
bcumming Aug 22, 2025
c1c72e7
Update docs/software/ml/pytorch.md
bcumming Aug 22, 2025
bedceb6
move ml tutorials into a tutorials section
bcumming Aug 22, 2025
c5383db
small cleanup
bcumming Aug 22, 2025
cfe4fe8
Merge branch 'mlp-tutorials-update-iii' of github.com:lukasgd/cscs-do…
bcumming Aug 22, 2025
998eefd
fix links; add more links to ML tutorials/pytorch; add guides/tutoria…
bcumming Aug 22, 2025
17e0eff
Merge branch 'main' into mlp-tutorials-update-iii
bcumming Aug 22, 2025
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -9,6 +9,6 @@ docs/software/prgenv/linalg.md @finkandreas @msimberg
docs/software/sciapps/cp2k.md @abussy @RMeli
docs/software/sciapps/lammps.md @nickjbrowning
docs/software/sciapps/gromacs.md @kanduri
-docs/software/ml @boeschf
+docs/software/ml @boeschf @henrique @lukasgd
docs/storage @mpasserini
docs/alps/storage.md @mpasserini
4 changes: 4 additions & 0 deletions .github/actions/spelling/expect.txt
@@ -0,0 +1,4 @@
JAX
nvitop
NVRTC
placeholders
4 changes: 2 additions & 2 deletions docs/access/jupyterlab.md
@@ -86,7 +86,7 @@ If the default base images do not meet your requirements, you can specify a cust
3. Currently only required on Daint and Santis, not on Clariden
4. Set working directory of Jupyter session (file browser root directory)
5. Use environment settings for optimized communication
-6. Disable CUDA JIT cache
+6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error
8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL

@@ -199,7 +199,7 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/

While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment.

-A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell
+A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-software-ml-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][software-ml-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][software-ml-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell

```bash
!python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ...
```
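For the `accelerate` case, such a cell might look like the following sketch (the script name and flag values are illustrative placeholders, not taken from the tutorial):

```bash
%%bash
# Contents of the cell run under bash; launch a data-parallel job on all
# 4 GPUs of a GH200 node (train.py is a placeholder script name)
accelerate launch --multi_gpu --num_processes=4 train.py
```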
31 changes: 28 additions & 3 deletions docs/build-install/containers.md
@@ -4,17 +4,21 @@
Building OCI container images on Alps vClusters is supported through [Podman](https://podman.io/), an open-source container engine that adheres to OCI standards and supports rootless containers by leveraging Linux [user namespaces](https://www.man7.org/linux/man-pages/man7/user_namespaces.7.html).
Its command-line interface (CLI) closely mirrors Docker’s, providing a consistent and familiar experience for users of established container tools.

[](){#ref-build-containers-configure-podman}
## Preliminary step: configuring Podman's storage

-The first step in order to use Podman on Alps is to create a valid Container Storage configuration file at `$HOME/.config/containers/storage.conf` (or `$XDG_CONFIG_HOME/containers/storage.conf`, if you have `$XDG_CONFIG_HOME` set), according to the following minimal template:
+The first step in order to use Podman on Alps is to create a valid Container Storage configuration file in your home directory, according to the following minimal template:

-```toml
+```toml title="$HOME/.config/containers/storage.conf"
[storage]
driver = "overlay"
runroot = "/dev/shm/$USER/runroot"
graphroot = "/dev/shm/$USER/root"
```

!!! warning
If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. See also [this guide][ref-guides-terminal-arch] for further information about XDG variables.

!!! warning
In the above configuration, `/dev/shm` is used to store the container images.
`/dev/shm` is the mount point of a [tmpfs filesystem](https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html#tmpfs) and is compatible with the user namespaces used by Podman.
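Once the configuration file is in place, a quick sanity check is to ask Podman which storage settings it actually picked up, for example (a sketch; template fields may vary across Podman versions):

```bash
# Print the storage driver and graph root resolved from storage.conf
podman info --format '{{.Store.GraphDriverName}} {{.Store.GraphRoot}}'
```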
@@ -43,11 +47,33 @@ podman build -t <image:tag> .

In general, [`podman build`](https://docs.podman.io/en/stable/markdown/podman-build.1.html) follows the Docker options convention.
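For instance, Docker-style options such as build arguments and stage targets work as expected (the image name and argument below are illustrative):

```bash
# Build a specific stage of a multi-stage Containerfile, passing a build argument
podman build -t my-image:latest --build-arg BASE_IMAGE=nvcr.io/nvidia/pytorch:25.01-py3 --target runtime .
```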

!!! info "Debugging the container build"
If the container build fails, you can run an interactive shell using the image from the last successfully built layer with

```bash
podman run -it --rm -e NVIDIA_VISIBLE_DEVICES=void <last-layer-hash> bash # (1)!
```

1. Setting `NVIDIA_VISIBLE_DEVICES` in the environment is required specifically to run NGC containers with podman

replacing `<last-layer-hash>` with the actual hash reported in the build output, to interactively test the failing command.


## Importing images in the Container Engine

An image built using Podman can be easily imported as a squashfs archive in order to be used with our Container Engine solution.
It is important to keep in mind that the import has to take place in the same job allocation where the image creation took place, otherwise the image is lost due to the temporary nature of `/dev/shm`.
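As a sketch of this step (the exact command appears in the elided part of the docs below; the paths and image name here are illustrative), assuming `enroot` performs the import as in the Container Engine workflow:

```bash
# Import the locally built Podman image as a squashfs archive on scratch,
# from within the same job allocation in which the image was built
enroot import -x mount -o /capstor/scratch/cscs/$USER/my-image.sqsh podman://my-image:latest
```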

!!! warning "Preliminary configuration: Lustre settings for container images"
Since container images are large files and the filesystem is a shared resource, you need to configure the target directory according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image so it will be properly distributed across storage nodes.

```bash
lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M <path to image directory> # (1)!
```

1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB), or are striped across all storage nodes (above 64 MB).

> **Review thread on the `lfs setstripe` command above:**
>
> **Contributor:** I think it is not a good idea to duplicate the command as you already linked it above. Also, 64 MB still seems a bit little for full striping, doesn't it?
>
> **Contributor (author):** Good observation that this keeps reappearing. Since it seems largely ignored by users and has caused job interference previously, I think repeating it doesn't harm. But probably we should think about a new default, as this is complicated to remember for the average user.
>
> **Contributor:** Could you remove the repetition for now please? So we only have one place to change afterwards, and we can add it back once that is done...
>
> **Member:** I agree with Henrique. I have updated the docs to:
>
>> To ensure good performance for jobs on multiple nodes, take the time to configure the target directory using `lfs setstripe` according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image, or using `lfs migrate` to fix files that are already imported.
>
> This makes the commands explicit, but lets us provide guidance on specific flags in one location.
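As noted in the thread above, files imported before the directory was configured can be re-striped afterwards; `lfs migrate` accepts the same layout options as `lfs setstripe` (the path below is illustrative):

```bash
# Re-stripe an already-imported image in place, using the same progressive layout
lfs migrate -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M /capstor/scratch/cscs/$USER/my-image.sqsh
```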


To import the image:

```
@@ -62,7 +88,6 @@
image = "/<path to image directory>/<image_name.sqsh>"
mounts = ["/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>"]
workdir = "/capstor/scratch/cscs/<username>"
```

## Pushing Images to a Container Registry

In order to push an image to a container registry, you first need to follow three steps:
2 changes: 2 additions & 0 deletions docs/clusters/clariden.md
@@ -65,6 +65,8 @@ Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deploy
uenv start namd/3.0:v3@daint
```

For detailed instructions and best practices with ML frameworks, please refer to the dedicated pages under [ML software][ref-software-ml].

## Running Jobs on Clariden

### Slurm
10 changes: 0 additions & 10 deletions docs/guides/mlp_tutorials/index.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/platforms/mlp/index.md
@@ -91,4 +91,4 @@ Project is per project - each project gets a project folder with project-specifi

## Guides and tutorials

-Tutorials for fine-tuning and running inference of LLMs as well as training an LLM with Nanotron can be found in the [MLP Tutorials][ref-guides-mlp-tutorials] page.
+Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials].
> **Review comment (Member):** I move this to the top of the platform page.

2 changes: 1 addition & 1 deletion docs/software/communication/nccl.md
@@ -14,7 +14,7 @@ When using e.g. the `default` view of `prgenv-gnu` the `aws-ofi-nccl` plugin wil
Alternatively, loading the `aws-ofi-nccl` module with the `modules` view also makes the plugin available in the environment.
The environment variables described below must be set to ensure that NCCL uses the plugin.

-While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL:
+While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL with uenv:

```bash
--8<-- "docs/software/communication/nccl_env_vars"
```
34 changes: 23 additions & 11 deletions docs/software/ml/index.md
@@ -2,22 +2,33 @@
# Machine learning applications and frameworks

CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems.
-Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across environments.
+Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across systems.

Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs.

-## Running machine learning applications with containers
+First-time users are recommended to consult the [LLM tutorials][ref-software-ml-tutorials] to get familiar with the concepts of the Machine Learning platform in a series of hands-on examples.
+
+## Running ML applications with containers (recommended)

Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.

-* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
+Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
Examples include:
-* [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
-* [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
-* For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container.

+* [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) ([Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html))
+* [JAX NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/jax) ([Release Notes](https://docs.nvidia.com/deeplearning/frameworks/jax-release-notes/index.html))
+* [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) (deprecated since 25.02, see [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/index.html))

Documented best practices are available for:

* [PyTorch][ref-ce-pytorch]

!!! note "Extending a container with a virtual environment"
For frequently changing Python dependencies during development, consider creating a Virtual Environment (venv) on top of the packages in the container (see [this example][ref-ce-pytorch-venv]).
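A minimal sketch of that pattern, run inside the container (paths and package names are illustrative):

```bash
# Create a venv that can still see the container's pre-installed packages
python -m venv --system-site-packages $SCRATCH/venvs/my-venv
source $SCRATCH/venvs/my-venv/bin/activate
pip install some-extra-package  # placeholder for frequently changing dependencies
```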

Helpful references:

* Introduction to concepts of the Machine Learning platform: [LLM tutorials][ref-software-ml-tutorials]
* Running containers on Alps: [Container Engine Guide][ref-container-engine]
* Building custom container images: [Container Build Guide][ref-build-containers]

@@ -30,17 +41,18 @@ Available ML-related uenvs:

* [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden] and [Daint][ref-cluster-daint]
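Starting such a uenv typically looks like the following sketch (the image label is illustrative; query the registry for what is currently deployed):

```bash
uenv image find pytorch                       # list available PyTorch uenv images
uenv start pytorch/v2.6.0:v1 --view=default   # start a session with the default view
```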

-To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv).
-See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
-
-!!! note
-    While many Python packages provide pre-built binaries for common architectures, some may require building from source.
+!!! note "Extending a uenv with a virtual environment"
+    To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv) layered on top of the packages in the uenv.
+    See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.

## Building custom Python environments

Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`.
Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).

!!! note
While many Python packages provide pre-built binaries for common architectures, some may require building from source.
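As one possible starting point (tool choice and packages are illustrative), such an environment can be bootstrapped with `uv`:

```bash
# Create and activate a project-local environment, then install from PyPI
uv venv .venv
source .venv/bin/activate
uv pip install torch numpy  # pre-built wheels are used where available
```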

To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:

* CUDA, cuDNN