From 141893191766b14ae2b6a7feb8ed9ec7f277aa22 Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Tue, 19 Aug 2025 18:12:05 +0200 Subject: [PATCH 01/22] Moved MLP tutorials under software, added CE section to Pytorch --- docs/access/jupyterlab.md | 2 +- docs/build-install/containers.md | 31 +- docs/clusters/clariden.md | 2 + docs/guides/mlp_tutorials/index.md | 10 - docs/platforms/mlp/index.md | 2 +- docs/software/communication/nccl.md | 2 +- docs/software/ml/index.md | 34 ++- docs/software/ml/pytorch.md | 281 +++++++++++++++--- docs/software/ml/torch_distributed_env_vars | 5 + docs/software/ml/tutorials/index.md | 13 + .../ml/tutorials}/llm-fine-tuning.md | 8 +- .../ml/tutorials}/llm-inference.md | 2 +- .../ml/tutorials}/llm-nanotron-training.md | 15 +- mkdocs.yml | 10 +- 14 files changed, 336 insertions(+), 81 deletions(-) delete mode 100644 docs/guides/mlp_tutorials/index.md create mode 100644 docs/software/ml/torch_distributed_env_vars create mode 100644 docs/software/ml/tutorials/index.md rename docs/{guides/mlp_tutorials => software/ml/tutorials}/llm-fine-tuning.md (95%) rename docs/{guides/mlp_tutorials => software/ml/tutorials}/llm-inference.md (99%) rename docs/{guides/mlp_tutorials => software/ml/tutorials}/llm-nanotron-training.md (92%) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 8c233781..89fe1e79 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -199,7 +199,7 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/ While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment. -A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell +A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-software-ml-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][software-ml-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][software-ml-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell ```bash !python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ... 
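# Note: `torchrun` is the console-script equivalent of `python -m torch.distributed.run`,
# so the same cell could equivalently read (sketch, assuming 4 GPUs on the GH200 node):
# !torchrun --standalone --nproc_per_node=4 run_train.py ...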
diff --git a/docs/build-install/containers.md b/docs/build-install/containers.md index 5d6405c0..3b6f35f8 100644 --- a/docs/build-install/containers.md +++ b/docs/build-install/containers.md @@ -4,17 +4,21 @@ Building OCI container images on Alps vClusters is supported through [Podman](https://podman.io/), an open-source container engine that adheres to OCI standards and supports rootless containers by leveraging Linux [user namespaces](https://www.man7.org/linux/man-pages/man7/user_namespaces.7.html). Its command-line interface (CLI) closely mirrors Docker’s, providing a consistent and familiar experience for users of established container tools. +[](){#ref-build-containers-configure-podman} ## Preliminary step: configuring Podman's storage -The first step in order to use Podman on Alps is to create a valid Container Storage configuration file at `$HOME/.config/containers/storage.conf` (or `$XDG_CONFIG_HOME/containers/storage.conf`, if you have `$XDG_CONFIG_HOME` set), according to the following minimal template: +The first step in order to use Podman on Alps is to create a valid Container Storage configuration file in your home according to the following minimal template: -```toml +```toml title="$HOME/.config/containers/storage.conf" [storage] driver = "overlay" runroot = "/dev/shm/$USER/runroot" graphroot = "/dev/shm/$USER/root" ``` +!!! warning + If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. + !!! warning In the above configuration, `/dev/shm` is used to store the container images. `/dev/shm` is the mount point of a [tmpfs filesystem](https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html#tmpfs) and is compatible with the user namespaces used by Podman. @@ -43,11 +47,33 @@ podman build -t . In general, [`podman build`](https://docs.podman.io/en/stable/markdown/podman-build.1.html) follows the Docker options convention. +!!! info "Debugging the container build" + If the container build fails, you can run an interactive shell using the image from the last successfully built layer with + + ```bash + podman run -it --rm -e NVIDIA_VISIBLE_DEVICES=void bash # (1)! + ``` + + 1. Setting `NVIDIA_VISIBLE_DEVICES` in the environment is required specifically to run NGC containers with podman + + replacing `` by the actual hash output in the build job and interactively test the failing command. + + ## Importing images in the Container Engine An image built using Podman can be easily imported as a squashfs archive in order to be used with our Container Engine solution. It is important to keep in mind that the import has to take place in the same job allocation where the image creation took place, otherwise the image is lost due to the temporary nature of `/dev/shm`. +!!! warning "Preliminary configuration: Lustre settings for container images" + Since container images are large files and the filesystem is a shared resource, you need to configure the target directory according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image so it will be properly distributed across storage nodes. + + ```bash + lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M # (1)! + ``` + + 1. 
This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB) + + To import the image: ``` @@ -62,7 +88,6 @@ image = "//" mounts = ["/capstor/scratch/cscs/:/capstor/scratch/cscs/"] workdir = "/capstor/scratch/cscs/" ``` - ## Pushing Images to a Container Registry In order to push an image to a container registry, you first need to follow three steps: diff --git a/docs/clusters/clariden.md b/docs/clusters/clariden.md index 62401236..e5564770 100644 --- a/docs/clusters/clariden.md +++ b/docs/clusters/clariden.md @@ -65,6 +65,8 @@ Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deploy uenv start namd/3.0:v3@daint ``` +For detailed instructions and best practices with ML frameworks, please refer to the dedicated pages under [ML software][ref-software-ml]. + ## Running Jobs on Clariden ### Slurm diff --git a/docs/guides/mlp_tutorials/index.md b/docs/guides/mlp_tutorials/index.md deleted file mode 100644 index da7cb242..00000000 --- a/docs/guides/mlp_tutorials/index.md +++ /dev/null @@ -1,10 +0,0 @@ -[](){#ref-guides-mlp-tutorials} -# Machine Learning Platform Tutorials - -These tutorials gradually introduce key concepts of the Machine Learning Platform. A particular focus is on the [Container Engine][ref-container-engine] for managing the runtime environment. - -In a [first tutorial][ref-mlp-llm-inference-tutorial], you will learn how to run inference with a LLM on a single node using a container from the NVIDIA GPU Cloud (NGC). Concepts such as container environment description, layering a thin virtual environment on top of the container image, and job launching and monitoring will be introduced. - -Building on the first tutorial, in the [second tutorial][ref-mlp-llm-fine-tuning-tutorial] you will learn how to train (fine-tune) a LLM on multiple GPUs on a single node. For this purpose, you will use HuggingFace's `accelerate` and see best practices for dataset management. - -In the [third tutorial][ref-mlp-llm-nanotron-tutorial], you will apply the techniques from the previous tutorials to enable distributed (pre-)training of a model in `nanotron` on multiple nodes. In particular, this tutorial makes use of model-parallelism and introduces the usage of `torchrun` to manage jobs on individual nodes. diff --git a/docs/platforms/mlp/index.md b/docs/platforms/mlp/index.md index eb1f6f7c..5ea7fdb9 100644 --- a/docs/platforms/mlp/index.md +++ b/docs/platforms/mlp/index.md @@ -91,4 +91,4 @@ Project is per project - each project gets a project folder with project-specifi ## Guides and tutorials -Tutorials for fine-tuning and running inference of LLMs as well as training an LLM with Nanotron can be found in the [MLP Tutorials][ref-guides-mlp-tutorials] page. +Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials] under machine learning software. diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index 6a0068ad..5a16c39c 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -14,7 +14,7 @@ When using e.g. the `default` view of `prgenv-gnu` the `aws-ofi-nccl` plugin wil Alternatively, loading the `aws-ofi-nccl` module with the `modules` view also makes the plugin available in the environment. 
The environment variables described below must be set to ensure that NCCL uses the plugin. -While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL: +While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL with uenv: ```bash --8<-- "docs/software/communication/nccl_env_vars" diff --git a/docs/software/ml/index.md b/docs/software/ml/index.md index ba101c0b..ac136472 100644 --- a/docs/software/ml/index.md +++ b/docs/software/ml/index.md @@ -2,22 +2,33 @@ # Machine learning applications and frameworks CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems. -Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across environments. +Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across machines. Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs. -## Running machine learning applications with containers +First time users are recommended to consult the [LLM tutorials][ref-software-ml-tutorials] to get familiar with the concepts of the Machine Learning platform in a series of hands-on examples. + +## Running ML applications with containers (recommended) Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems. -* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads. +Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads. Examples include: - * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) - * [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) -* For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container. + +* [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) ([Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html)) +* [JAX NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/jax) ([Release Notes](https://docs.nvidia.com/deeplearning/frameworks/jax-release-notes/index.html)) +* [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) (deprecated since 25.02, see [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/index.html)) + +Documented best practices are available for: + +* [PyTorch][ref-ce-pytorch] + +!!! note "Extending a container with a virtual environment" + For frequently changing Python dependencies during development, consider creating a Virtual Environment (venv) on top of the packages in the container (see [this example][ref-ce-pytorch-venv]). 
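For orientation, the container workflow condenses to a few commands. The sketch below assumes a Dockerfile based on an NGC image in the current directory and a target directory `$SCRATCH/ce-images` already prepared with the recommended Lustre settings; the image and file names are placeholders, and the full procedure is described in the guides linked below:

```bash
# Build a custom image from the Dockerfile in the current directory (inside a job allocation)
podman build -t my-pytorch:latest .

# Import it as a squashfs archive for the Container Engine in the same allocation,
# since the Podman storage under /dev/shm is temporary
enroot import -x mount -o $SCRATCH/ce-images/my-pytorch.sqsh podman://my-pytorch:latest

# Reference the .sqsh file in an EDF and run it with the Container Engine
# (account and other Slurm options omitted)
srun --environment=$HOME/my-pytorch.toml python -c 'import torch; print(torch.__version__)'
```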
Helpful references: +* Introduction to concepts of the Machine Learning platform: [LLM tutorials][ref-software-ml-tutorials] * Running containers on Alps: [Container Engine Guide][ref-container-engine] * Building custom container images: [Container Build Guide][ref-build-containers] @@ -30,17 +41,18 @@ Available ML-related uenvs: * [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden] and [Daint][ref-cluster-daint] -To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv). -See this [PyTorch venv example][ref-uenv-pytorch-venv] for details. - -!!! note - While many Python packages provide pre-built binaries for common architectures, some may require building from source. +!!! note "Extending a uenv with a virtual environment" + To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv) layered on top of the packages in the uenv. + See this [PyTorch venv example][ref-uenv-pytorch-venv] for details. ## Building custom Python environments Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`. Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/). +!!! note + While many Python packages provide pre-built binaries for common architectures, some may require building from source. + To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes: * CUDA, cuDNN diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index 84aa68db..7ee77524 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -1,6 +1,207 @@ -[](){#ref-uenv-pytorch} +[](){#ref-software-ml-pytorch} # PyTorch +PyTorch is available both as a container with the [Container Engine (CE)][ref-container-engine] and a [uenv][ref-uenv] software stack. The best choice for your use case depends on the amount of control required over the lower level libraries. + +While NGC provides an optimized build of PyTorch with many dependencies included, uenv allows a more flexible choice of lower level libraries and represents a thinner layer over the host system. Both options can be customized - a container via a Dockerfile and a uenv (in advanced use cases) via its recipe and both, additionally, via Python virtual environments built on top. Due to the simplicity and reproducible performance, containers are generally the recommended default for most users. + +[](){#ref-ce-pytorch} +## Running PyTorch with the Container Engine (recommended) + +Running PyTorch from a container ensures maximum portability, reproducibility, and ease of use across machines. This is achieved by + +1. selecting an appropriate base image and customizing it in a Dockerfile +2. define the container runtime environment in an EDF +3. (optionally) extending by a virtual environment +4. submitting jobs with CE in SLURM + +These steps are illustrated in the [machine learning platform tutorials][ref-software-ml-tutorials] and the instructions detailed in the [podman build guide][ref-build-containers]. + +!!! info "Preliminary steps" + Before proceeding with the next steps, make sure you have storage for podman configured as in the [build guide][ref-build-containers-configure-podman] and make sure to apply [recommended Lustre settings][ref-guides-storage-lustre] to every directory (e.g. 
`$SCRATCH/ce-images`) dedicated to container images before importing them with enroot. This is necessary to guarantee good filesystem performance. + + ```bash + lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images # (1)! + ``` + + 1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB) + + +### Selecting the base image + +For most applications, the [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) is a good base image as PyTorch comes pre-installed with an optimized build including many dependencies. The [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) give an overview on installed packages and compatibility. This image can be further customized in a Dockerfile and built with podman as detailed in the [podman build guide][ref-build-containers]. + +### Define Container Runtime Environment + +Having built and imported a container image with podman and enroot, the next step is to configure the runtime environment with an environment definition file (EDF). In particular, this includes specifying the image, any directories mounted and a working directory to for the processes in the container to start in as in the [quickstart examples for CE][ref-container-engine]. + +Besides this, there are specific features relevant for machine learning available through [annotations][ref-ce-annotations], which customize the container at runtime. + +* When using NCCL inside the container, you want to include the [aws-ofi-nccl][ref-ce-aws-ofi-hook] plugin which enables the container to interface with the host's libfabric and, thus, make use of Alps Slingshot high-speed interconnect. This is crucial for multi-node communication performance. +* An [SSH annotation][ref-ce-ssh-hook] allows adding a light-weight SSH server to the container without the need to modify the container image + +A resulting example TOML file following best practices may look like + +```toml title="$HOME/my-app/ngc-pytorch-my-app-25.06.toml" +image = "${SCRATCH}/ce-images/ngc-pytorch-my-app+25.06.sqsh" # (1)! + +mounts = [ + "/capstor", + "/iopsstor", + "/users/${USER}/my-app" +] # (2)! + +workdir = "${HOME}/my-app" # (3)! + +[annotations] +com.hooks.aws_ofi_nccl.enabled = "true" # (4)! +com.hooks.aws_ofi_nccl.variant = "cuda12" + +[env] +NCCL_DEBUG = "INFO" # (5)! +CUDA_CACHE_DISABLE = "1" # (6)! +TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (7)! +MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! +``` + +1. It is important to use curly braces for environment variables used in the EDF +2. The path `/users` is not mounted as a whole since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed. +3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started +4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] with libfabric. While not strictly needed for single node workloads, it is good practice to keep it always on. +5. 
This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`. +6. Disable CUDA JIT cache +7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error +8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL + +??? note "Access to SLURM from inside the container" + In case access to SLURM is required from inside the container, you can add the following lines to the mounts above: + + ```toml + ... + + mounts = [ + "/capstor", + "/iopsstor", + "/users/${USER}/my-app", + "/etc/slurm", # (1)! + "/usr/lib64/libslurm-uenv-mount.so", + "/etc/container_engine_pyxis.conf" + ] + + ... + ``` + + 1. Enable Slurm commands (together with two subsequent mounts) + +!!! note "Best practice for large-scale jobs" + + For stability and reproducibility, use self-contained containers for large scale jobs. Using code mounted from the distributed filesystem may leave compiled artefacts behind that can result in unintentional runtime errors when e.g. swapping the container image. In particular, it is recommended to avoid mounting all of `$HOME`, so that environments are properly isolated and e.g. the Triton cache (that by default ends up in `$HOME/.triton`) resides in an ephemeral location of the filesystem. + +!!! note "Collaborating in Git" + + For reproducibility, it is recommended to always track the Dockerfile, EDF and an optional virtual environment specification alongside your application code in a Git repository. + +### (Optionally) extend container with virtual environment + +While production jobs should include as many dependencies as possible in the container image, during development it can be convenient to manage frequently changing packages in a virtual environment built on top of the container image. This can include both dependencies and actively developed packages (that can be installed in editable mode with `pip install -e .`). + +To create such a virtual environment, _inside the container_ use the Python `venv` module with the option `--system-site-packages` to ensure that packages are installed _in addition_ to the existing packages. Without this option, packages may accidentally be re-installed shadowing a version that is already present in the container. +A workflow installing additional packages in a virtual environment may look like this: + +```console +[clariden-lnXXX]$ srun -A \ + --environment=./ngc-pytorch-my-app-25.06.toml --pty bash # (1)! +user@nidYYYYYY$ python -m venv --system-site-packages venv-ngc-pt-25.06 # (2)! +user@nidYYYYYY$ source venv-ngc-pt-25.06/bin/activate # (3)! +(venv-ngc-pt-25.06) user@nidYYYYYY$ pip install # (3)! +(venv-ngc-pt-25.06) user@nidYYYYYY$ exit +``` + +1. Allocate an interactive session on a compute node +2. Create a virtual environment on top of the existing Python installation in the container (only necessary the first time) +3. Activate the newly created virtual environment (always necessary when running a Slurm job) +4. Install additional packages (only run this from a single process to avoid race conditions) + +The changes made to the virtual environment will outlive the container as they are persisted on the distributed filesystem. + +!!! 
note + Keep in mind that + + * this virtual environment is _specific_ to this particular container and won't actually work unless you are using it from inside this container - it relies on the resources packaged inside the container. + * every Slurm job making use of this virtual environment will need to activate it first (_inside_ the `srun` command). + + +### Submit jobs with the Container Engine in Slurm + +A general template for a Pytorch distributed training job with Slurm in analogy to the [last tutorial][software-ml-llm-nanotron-tutorial] may look like + +```bash title="$HOME/my-app/submit-dist-train.sh" +#!/bin/bash +#SBATCH --account= +#SBATCH --job-name=dist-train-ddp +#SBATCH --time=01:00:00 +#SBATCH --nodes=2 +#SBATCH --ntasks-per-node=4 +#SBATCH --output=logs/slurm-%x-%j.log +# (1)! + +set -x + +ulimit -c 0 # (2)! + + # (3)! + # (4)! +srun -ul --environment=./ngc-pytorch-my-app-25.06.toml bash -c " + . venv-ngc-pt-25.06/bin/activate # activate (optional) venv + +--8<-- "docs/software/ml/torch_distributed_env_vars" + python dist-train.py +" +``` + +1. If `#SBATCH --error=...` is not specified, `#SBATCH --output` will also contain stderr (error messages) +2. In case the application crashes, it may leave behind large core dump files that contain an image of the process memory at the time of the crash. While these can be useful for debugging the reason of a specific crash (by e.g. loading them with `cuda-gdb` and looking at the stack trace with `bt`), they may accumulate over time and occupy a large space on the filesystem. For this reason, it is recommended to disable their creation (unless needed) by adding this line. +3. Loading the virtual environment is mandatory within every `srun` command if it is used to manage packages. +4. The environment variables are set to initialize PyTorch's distributed module through the environment (cf. [docs](https://docs.pytorch.org/docs/stable/distributed.html#environment-variable-initialization)). + + +For further details on execution logic, job monitoring and data management, please refer to the [nanotron tutorial][software-ml-llm-nanotron-tutorial] (which in particular also explains the usage of `torchrun` with Slurm). Make sure to apply [recommended Lustre settings][ref-guides-storage-lustre] to datasets, models and container images persisted to the distributed filesystem. + +!!! warning "#SBATCH --environment" + The operations performed before the `srun` command are executed in the host environment of a single compute node in the allocation. If you need to perform these steps in the container environment as well, you can alternatively use the `#SBATCH --environment=path/to/ngc-pytorch-my-app-25.06.toml` option _instead of_ using `--environment` with `srun`. + + Use of the `--environment` option for `sbatch` is still considered experimental and could result in unexpected behavior. In particular, avoid mixing `#SBATCH --environment` and `srun --environment` in the same job. + + Use of `--environment` is currently only recommended for the `srun` command. + +!!! note "Optimizing large-scale training jobs" + The following settings were established to **improve compute throughput** of LLM training in `Megatron-LM`: + + * Extensively evaluate all possible parallelization dimensions, including data-, tensor- and pipeline parallelism (including virtual pipeline parallelism) and more, when available. In `Megatron-LM`, avoid using the option '--defer-embedding-wgrad-compute` to defer the embedding gradient computation. 
Identify storage-related bottlenecks by isolating data loading/generation operations into a separate benchmark. + + * Disabling transparent huge pages and enabling the Nvidia [vboost](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#gpu-core-clock-optimization) feature has been observed to improve performance in large-scale LLM training in `Megatron-LM`. This can be achieved by adding these constraints to the sbatch script: + ```bash + #SBATCH -C thp_never&nvidia_vboost_enabled + ``` + + * The argument `--ddp-bucket-size` controls the level of grouping of many small data-parallel communications into bigger ones and setting it to a high value such as can improve throughput (model-dependent, e.g. `10000000000`). + + * If in doubt about communication performance with NCCL at scale, use [nccl-tests](https://github.com/NVIDIA/nccl-tests) with the relevant communication patterns to check if scaling behavior can be reproduced. + + Additionally, consider the **best practice for checkpointing and data management**: + + * Following the advice on [filesystems][ref-storage-fs], write checkpoints (sequential write) to `/capstor/scratch` and place randomly accessed training data (many small random reads) on `/iopsstor/scratch`. Use the [data transfer instructions][ref-data-xfer] to move data to/from `/capstor/store`. Make sure to apply recommended [Lustre settings][ref-guides-storage-lustre] on all directories containing significant amount of data, including those containing container images and managed by other tools (e.g. the HuggingFace cache, see [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) in the [this tutorial][software-ml-llm-inference-tutorial]). + + * Regularly adjust checkpoint writing intervals to the overhead induced by writing a checkpoint ($T_1$) and the mean time between job failures ($T_2$). As a first order approximation use a checkpointing interval of $\sqrt{2 T_1 T_2}$ (derived by [Young](https://doi.org/10.1145/361147.361115) and [Daly](https://doi.org/10.1016/j.future.2004.11.016)). + + Adjust for **cluster availability**: + + * Submit your jobs with a Slurm time limit compatible with reservations (such as maintenance windows, cf. `scontrol show res`) to be able to get scheduled. + + +[](){#ref-uenv-pytorch} +## Running PyTorch with a uenv + The PyTorch software stack was designed with the intention of being able to run [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training workloads out of the box. Thus, it comes with batteries included and does not just provide the bare [PyTorch framework](https://github.com/pytorch/pytorch). @@ -9,7 +210,7 @@ Thus, it comes with batteries included and does not just provide the bare [PyTor [PyTorch][ref-uenv-pytorch] is provided via [uenv][ref-uenv]. Please have a look at the [uenv documentation][ref-uenv] for more information about uenvs and how to use them. -## Versioning +### Versioning The PyTorch uenv is versioned according to the PyTorch version it provides. @@ -241,7 +442,7 @@ The PyTorch uenv is versioned according to the PyTorch version it provides. [](){#ref-uenv-pytorch-how-to-use} -## How to use +### How to use There are two ways to access the software provided by the uenv, once it has been started. @@ -279,20 +480,20 @@ There are two ways to access the software provided by the uenv, once it has been [Check out the guide for using Spack with uenv][ref-building-uenv-spack]. 
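As a quick sanity check (a minimal sketch; run the Python line on a compute node so that `torch.cuda.is_available()` can actually see the GPUs), start the uenv and confirm that its PyTorch build is picked up:

```bash
uenv start pytorch/v2.6.0:v1 --view=default
python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
```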
[](){#ref-uenv-pytorch-venv} -## Adding Python packages on top of the uenv +### Adding Python packages on top of the uenv -Uenvs are read-only, and cannot be modified. However, it is possible to add Python packages on top of the uenv using virtual environments. +Uenvs are read-only, and cannot be modified. However, it is possible to add Python packages on top of the uenv using virtual environments analogous to the setup with containers. ```console title="Creating a virtual environment on top of the uenv" $ uenv start pytorch/v2.6.0:v1 --view=default # (1)! -$ python -m venv --system-site-packages ./my-venv # (2)! +$ python -m venv --system-site-packages venv-uenv-pt2.6-v1 # (2)! -$ source ./my-venv/bin/activate # (3)! +$ source venv-uenv-pt2.6-v1/bin/activate # (3)! -(my-venv) $ pip install # (4)! +(venv-uenv-pt2.6-v1) $ pip install # (4)! -(my-venv) $ deactivate # (5)! +(venv-uenv-pt2.6-v1) $ deactivate # (5)! $ exit # (6)! ``` @@ -312,18 +513,26 @@ $ exit # (6)! Python virtual environments can be slow on the parallel Lustre file system due to the amount of small files and potentially many processes accessing it. If this becomes a bottleneck, consider [squashing the venv][ref-guides-storage-venv] into its own memory-mapped, read-only file system to enhance scalability and reduce load times. +??? bug "Python packages from uenv shadowing those in a virtual environment" + When using uenv with a virtual environment on top, the site-packages under `/user-environment` currently take precedence over those in the activated virtual environment. This is due to the uenv paths being included in the `PYTHONPATH` environment variable. As a consequence, despite installing a different version of a package in the virtual environment from what is available in the uenv, the uenv version will still be imported at runtime. A possible workaround is to prepend the virtual environment's site-packages to `PYTHONPATH` whenever activating the virtual environment. + ```bash + export PYTHONPATH="$(python -c 'import site; print(site.getsitepackages()[0])'):$PYTHONPATH" + ``` + It is recommended to apply this workaround if you are constrained by a Python package version installed in the uenv that you need to change for your application. + Alternatively one can use the uenv as [upstream Spack instance][ref-building-uenv-spack] to to add both Python and non-Python packages. However, this workflow is more involved and intended for advanced Spack users. -## Running PyTorch jobs with Slurm +### Running PyTorch jobs with Slurm ```bash title="Slurm sbatch script" #!/bin/bash -#SBATCH --job-name=myjob -#SBATCH --nodes=1 +#SBATCH --account= +#SBATCH --job-name=dist-train-ddp +#SBATCH --time=01:00:00 +#SBATCH --nodes=2 #SBATCH --ntasks-per-node=4 -#SBATCH --cpus-per-task=72 -#SBATCH --time=00:30:00 +#SBATCH --output=logs/slurm-%x-%j.log # (1)! #SBATCH --uenv=pytorch/v2.6.0:/user-environment #SBATCH --view=default @@ -336,35 +545,32 @@ export OMP_NUM_THREADS=8 # (2)! ################################# # PyTorch environment variables # ################################# -export MASTER_ADDR=$(hostname) # (3)! -export MASTER_PORT=29500 -export WORLD_SIZE=$SLURM_NPROCS -export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (4)! -export TRITON_HOME=/dev/shm/ # (5)! +export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (3)! +export TRITON_HOME=/dev/shm/ # (4)! ################################# # MPICH environment variables # ################################# -export MPICH_GPU_SUPPORT_ENABLED=0 # (6)! +export MPICH_GPU_SUPPORT_ENABLED=0 # (5)! 
################################# # CUDA environment variables # ################################# -export CUDA_CACHE_DISABLE=1 # (7)! +export CUDA_CACHE_DISABLE=1 # (6)! ############################################ # NCCL and Fabric environment variables # ############################################ -# (8)! +# (7)! --8<-- "docs/software/communication/nccl_env_vars" +# (8)! # (9)! -# (10)! srun bash -c " - export RANK=\$SLURM_PROCID - export LOCAL_RANK=\$SLURM_LOCALID - . ./my-venv/bin/activate - python myscript.py + . ./venv-uenv-pt2.6-v1/bin/activate + +--8<-- "docs/software/ml/torch_distributed_env_vars" + python dist-train.py " ``` @@ -374,16 +580,15 @@ srun bash -c " The number of threads should be not greater than the number of cores per task (`$SLURM_CPUS_PER_TASK`). The optimal number depends on the workload and should be determined by testing. Consider for example that typical workloads using PyTorch may fork the processes, so the number of threads should be around the number of cores per task divided by the number of processes. -3. These variables are used by PyTorch to initialize the distributed backend. - The `MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` variables are used to determine the address and port of the master node. - Additionally we also need `RANK` and `LOCAL_RANK` but these must be set per-process, see below. -4. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html) -5. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system. - This is important for performance, as writing to the Lustre file system can be slow due to the amount of small files and potentially many processes accessing it. -6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when using together with nccl. -7. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. -8. These variables should always be set for correctness and optimal performance when using NCCL, see [the detailed explanation][ref-communication-nccl]. -9. `RANK` and `LOCAL_RANK` are set per-process by the Slurm job launcher. -10. Activate the virtual environment created on top of the uenv (if any). +3. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html) +4. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system. + This is important for performance, as writing to the Lustre file system can be slow due to the amount of small files and potentially many processes accessing it. Avoid this setting with the container engine as it may lead to errors related to mount settings of `/dev/shm` (use a filesystem path inside the container instead). +5. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when using together with nccl. +6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. +7. These variables should always be set for correctness and optimal performance when using NCCL with uenv, see [the detailed explanation][ref-communication-nccl]. +8. Activate the virtual environment created on top of the uenv (if any). 
Please follow the guidelines for [python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times. +9. The environment variables are used by PyTorch to initialize the distributed backend. + The `MASTER_ADDR`, `MASTER_PORT` variables are used to determine the address and port of the master node. + Additionally we also need `RANK` and `LOCAL_RANK` and `WORLD_SIZE` to identify the position of each rank within the Slurm step and node, respectively. diff --git a/docs/software/ml/torch_distributed_env_vars b/docs/software/ml/torch_distributed_env_vars new file mode 100644 index 00000000..6d7692a0 --- /dev/null +++ b/docs/software/ml/torch_distributed_env_vars @@ -0,0 +1,5 @@ + MASTER_ADDR=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1) \ + MASTER_PORT=29500 \ + RANK=\${SLURM_PROCID} \ + LOCAL_RANK=\${SLURM_LOCALID} \ + WORLD_SIZE=\${SLURM_NTASKS} \ \ No newline at end of file diff --git a/docs/software/ml/tutorials/index.md b/docs/software/ml/tutorials/index.md new file mode 100644 index 00000000..a3ee90c7 --- /dev/null +++ b/docs/software/ml/tutorials/index.md @@ -0,0 +1,13 @@ +[](){#ref-software-ml-tutorials} +# Machine Learning Platform Tutorials + +The LLM tutorials gradually introduce key concepts of the Machine Learning Platform in a series of hands-on examples. A particular focus is on the [Container Engine][ref-container-engine] for managing the runtime environment. + +In a [first tutorial][software-ml-llm-inference-tutorial], you will learn how to run inference with a LLM on a single node using a container from the NVIDIA GPU Cloud (NGC). Concepts such as container environment description, layering a thin virtual environment on top of the container image, and job launching and monitoring will be introduced. + +Building on the first tutorial, in the [second tutorial][software-ml-llm-fine-tuning-tutorial] you will learn how to train (fine-tune) a LLM on multiple GPUs on a single node. For this purpose, you will use HuggingFace's `accelerate` and see best practices for dataset management. + +In the [third tutorial][software-ml-llm-nanotron-tutorial], you will apply the techniques from the previous tutorials to enable distributed (pre-)training of a model in `nanotron` on multiple nodes. In particular, this tutorial makes use of model-parallelism and introduces the usage of `torchrun` to manage jobs on individual nodes. + +!!! note + The focus for these tutorials is on introducing concepts of the Machine Learning Platform. As such, they do not necessarily discuss the latest advancements or steps required to obtain maximum performance. For this purpose, consult the framework-specific pages, such as the one for [PyTorch][ref-software-ml-pytorch]. \ No newline at end of file diff --git a/docs/guides/mlp_tutorials/llm-fine-tuning.md b/docs/software/ml/tutorials/llm-fine-tuning.md similarity index 95% rename from docs/guides/mlp_tutorials/llm-fine-tuning.md rename to docs/software/ml/tutorials/llm-fine-tuning.md index 5d72cfd1..e27a5cbc 100644 --- a/docs/guides/mlp_tutorials/llm-fine-tuning.md +++ b/docs/software/ml/tutorials/llm-fine-tuning.md @@ -1,8 +1,8 @@ -[](){#ref-mlp-llm-fine-tuning-tutorial} +[](){#software-ml-llm-fine-tuning-tutorial} # LLM Fine-tuning Tutorial -This tutorial will take the model from the [LLM Inference][ref-mlp-llm-inference-tutorial] tutorial and show you how to perform fine-tuning. 
+This tutorial will take the model from the [LLM Inference][software-ml-llm-inference-tutorial] tutorial and show you how to perform fine-tuning. This means that we take the model and train it on some new custom data to change its behavior. To complete the tutorial, we set up some extra libraries that will help us to update the state of the machine learning model. @@ -12,7 +12,7 @@ We also write a script that will allow us to unlock more of the performance offe ### Prerequisites -This tutorial assumes you've already successfully completed the [LLM Inference][ref-mlp-llm-inference-tutorial] tutorial. +This tutorial assumes you've already successfully completed the [LLM Inference][software-ml-llm-inference-tutorial] tutorial. For fine-tuning Gemma, we will rely on the NGC PyTorch container and the libraries we've already installed in the Python virtual environment used previously. ### Set up TRL @@ -97,7 +97,7 @@ The first four lines of the launch line are used to configure `accelerate`. Everything after that configures the `trl/examples/scripts/sft.py` Python script, which we use to train Gemma. !!! note "Dataset management and sharing" - For datasets, recommended LUSTRE settings should be used as illustrated in the tutorial on [LLM Inference][ref-mlp-llm-inference-tutorial]. As they have been set there for `HF_HOME`, which `huggingface_hub` uses for its dataset cache, they don't need to be re-applied here. + For datasets, recommended LUSTRE settings should be used as illustrated in the tutorial on [LLM Inference][software-ml-llm-inference-tutorial]. As they have been set there for `HF_HOME`, which `huggingface_hub` uses for its dataset cache, they don't need to be re-applied here. To enable your colleagues to use also use your datasets, please refer to the [storage guide][ref-guides-storage-sharing]. diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/software/ml/tutorials/llm-inference.md similarity index 99% rename from docs/guides/mlp_tutorials/llm-inference.md rename to docs/software/ml/tutorials/llm-inference.md index 15af5ed0..2495687b 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/software/ml/tutorials/llm-inference.md @@ -1,4 +1,4 @@ -[](){#ref-mlp-llm-inference-tutorial} +[](){#software-ml-llm-inference-tutorial} # LLM Inference Tutorial diff --git a/docs/guides/mlp_tutorials/llm-nanotron-training.md b/docs/software/ml/tutorials/llm-nanotron-training.md similarity index 92% rename from docs/guides/mlp_tutorials/llm-nanotron-training.md rename to docs/software/ml/tutorials/llm-nanotron-training.md index 10194a20..f8a9235e 100644 --- a/docs/guides/mlp_tutorials/llm-nanotron-training.md +++ b/docs/software/ml/tutorials/llm-nanotron-training.md @@ -1,17 +1,20 @@ -[](){#ref-mlp-llm-nanotron-tutorial} +[](){#software-ml-llm-nanotron-tutorial} # LLM Nanotron Pre-training Tutorial In this tutorial, we will build a container image to run multi-node training jobs with [nanotron](https://github.com/huggingface/nanotron). We will train a 109M parameter model with ~100M wikitext tokens as a proof of concept. +!!! info + Note that while the concepts taught here for multi-node training with PyTorch are generally portable across frameworks, the current (August 2025) recommendation for users with a need for large-scale model-parallel training is to use `Megatron-LM` instead of `nanotron` due to significant performance advantages at scale. 
+ ### Prerequisites -It is recommended to follow the previous two tutorials on [LLM Inference][ref-mlp-llm-inference-tutorial] and [LLM Fine-tuning][ref-mlp-llm-fine-tuning-tutorial] first, as this will build upon them. +It is recommended to follow the previous two tutorials on [LLM Inference][software-ml-llm-inference-tutorial] and [LLM Fine-tuning][software-ml-llm-fine-tuning-tutorial] first, as this will build upon them. ### Set up Podman -If not already done as part of the [LLM Inference tutorial][ref-mlp-llm-inference-tutorial], edit your podman configuration in `$HOME/.config/containers/storage.conf` as follows: +If not already done as part of the [LLM Inference tutorial][software-ml-llm-inference-tutorial], edit your podman configuration in `$HOME/.config/containers/storage.conf` as follows: ```toml title="$HOME/.config/containers/storage.conf" [storage] @@ -64,7 +67,7 @@ RUN pip install \ ``` !!! note "More recent NGC releases" - As discussed in the [LLM Inference tutorial][ref-mlp-llm-inference-tutorial], starting with the 24.11 release, NGC PyTorch no longer requires the installation of the Python venv module. Furthermore, FlashAttention and several other packages were integrated into the hosted image. However, as `nanotron` as of June 2025 still requires Python 3.10 (cf. this [issue](https://github.com/huggingface/nanotron/issues/217)), this example is restricted to NGC releases up to `24.10`. + As discussed in the [LLM Inference tutorial][software-ml-llm-inference-tutorial], starting with the 24.11 release, NGC PyTorch no longer requires the installation of the Python venv module. Furthermore, FlashAttention and several other packages were integrated into the hosted image. However, as `nanotron` as of June 2025 still requires Python 3.10 (cf. this [issue](https://github.com/huggingface/nanotron/issues/217)), this example is restricted to NGC releases up to `24.10`. ```dockerfile title="$SCRATCH/tutorials/nanotron-pretrain/Dockerfile" FROM nvcr.io/nvidia/pytorch:24.10-py3 @@ -165,7 +168,7 @@ In the login node run: 1. This ensures the compatibility of nanotron with the following example. For general usage, there is no reason to stick to an outdated version of nanotron, though. -We will install nanotron in a thin virtual environment on top of the container image built above. This proceeds as in the [LLM Inference][ref-mlp-llm-inference-tutorial]. +We will install nanotron in a thin virtual environment on top of the container image built above. This proceeds as in the [LLM Inference][software-ml-llm-inference-tutorial]. ```console [clariden-lnXXX]$ srun -A --environment=./ngc-nanotron-24.04.toml --pty bash @@ -336,7 +339,7 @@ srun -ul --environment=./ngc-nanotron-24.04.toml bash -c " !!! note "A few comments" - The parts outside the srun command will be run on the first node of the Slurm allocation for this job. srun commands without further specifiers execute with the settings of the sbatch script (i.e. using all nodes allocated to the job). - - Note that we are setting `HF_HOME` to a directory in scratch. This is done to place the dataset downloaded from `huggingface_hub` in your scratch, instead of your home directory. The same applies to your HuggingFace token as well as any models/spaces unless `HF_HUB_CACHE` is set (cf. [HuggingFace docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome)). 
As discussed in the tutorial on [LLM Inference][ref-mlp-llm-inference-tutorial], it is good practice to apply the [recommended LUSTRE settings][ref-guides-storage-lustre] there. + - Note that we are setting `HF_HOME` to a directory in scratch. This is done to place the dataset downloaded from `huggingface_hub` in your scratch, instead of your home directory. The same applies to your HuggingFace token as well as any models/spaces unless `HF_HUB_CACHE` is set (cf. [HuggingFace docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome)). As discussed in the tutorial on [LLM Inference][software-ml-llm-inference-tutorial], it is good practice to apply the [recommended LUSTRE settings][ref-guides-storage-lustre] there. - If instead of downloading a dataset from HuggingFace you want to re-use one managed by a colleague, please refer to the [storage guide][ref-guides-storage-sharing] for instructions on dataset sharing. - If you have a [wandb API key](https://docs.wandb.ai/guides/track/environment-variables/) and want to synchronize the training run, be sure to set the `WANDB_API_KEY` variable. Alternatively, `wandb` can write log data to the distributed filesystem with `WANDB_MODE=of​f​line` so that it can be uploaded with `wandb sync` (cf. [Weights & Biases docs](https://docs.wandb.ai/support/run_wandb_offline/)) after the training run has finished. diff --git a/mkdocs.yml b/mkdocs.yml index 56777a14..0f6139b3 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -62,6 +62,11 @@ nav: - 'Cray modules (CPE)': software/prgenv/cpe.md - 'Machine Learning': - software/ml/index.md + - 'Tutorials': + - software/ml/tutorials/index.md + - 'LLM Inference': software/ml/tutorials/llm-inference.md + - 'LLM Fine-tuning': software/ml/tutorials/llm-fine-tuning.md + - 'LLM Pre-training': software/ml/tutorials/llm-nanotron-training.md - 'PyTorch': software/ml/pytorch.md - 'Communication Libraries': - software/communication/index.md @@ -125,11 +130,6 @@ nav: - 'Internet Access on Alps': guides/internet-access.md - 'Storage': guides/storage.md - 'Using the terminal': guides/terminal.md - - 'MLP Tutorials': - - guides/mlp_tutorials/index.md - - 'LLM Inference': guides/mlp_tutorials/llm-inference.md - - 'LLM Fine-tuning': guides/mlp_tutorials/llm-fine-tuning.md - - 'LLM Pre-training': guides/mlp_tutorials/llm-nanotron-training.md - 'Policies': - policies/index.md - 'User Regulations': policies/regulations.md From 5109d87ca1fef36af494bb32ee4ccd7aa6bee001 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Wed, 20 Aug 2025 15:02:48 +0200 Subject: [PATCH 02/22] Update docs/build-install/containers.md Co-authored-by: boeschf <48126478+boeschf@users.noreply.github.com> --- docs/build-install/containers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/build-install/containers.md b/docs/build-install/containers.md index 3b6f35f8..c687a75a 100644 --- a/docs/build-install/containers.md +++ b/docs/build-install/containers.md @@ -17,7 +17,7 @@ graphroot = "/dev/shm/$USER/root" ``` !!! warning - If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. + If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. See also [this guide][ref-guides-terminal-arch] for further information about XDG variables. !!! warning In the above configuration, `/dev/shm` is used to store the container images. 
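Building on the XDG note above, the configuration can be written to the correct location whether or not `$XDG_CONFIG_HOME` is set, for example with a small shell sketch that reuses the minimal template shown earlier:

```bash
# Falls back to ~/.config when XDG_CONFIG_HOME is unset
conf_dir="${XDG_CONFIG_HOME:-$HOME/.config}/containers"
mkdir -p "$conf_dir"
cat > "$conf_dir/storage.conf" <<'EOF'
[storage]
  driver = "overlay"
  runroot = "/dev/shm/$USER/runroot"
  graphroot = "/dev/shm/$USER/root"
EOF
```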
From 34bbcf884d0da5a37da1ba826c9eccc449e51877 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Wed, 20 Aug 2025 18:16:18 +0200 Subject: [PATCH 03/22] Update docs/build-install/containers.md Co-authored-by: boeschf <48126478+boeschf@users.noreply.github.com> --- docs/build-install/containers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/build-install/containers.md b/docs/build-install/containers.md index c687a75a..a2d2128e 100644 --- a/docs/build-install/containers.md +++ b/docs/build-install/containers.md @@ -56,7 +56,7 @@ In general, [`podman build`](https://docs.podman.io/en/stable/markdown/podman-bu 1. Setting `NVIDIA_VISIBLE_DEVICES` in the environment is required specifically to run NGC containers with podman - replacing `` by the actual hash output in the build job and interactively test the failing command. + replacing `` with the actual hash output in the build job and interactively test the failing command. ## Importing images in the Container Engine From 5a62e5cf245f66aab44a43bf8df3940d706d2c17 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Wed, 20 Aug 2025 18:16:45 +0200 Subject: [PATCH 04/22] Update docs/platforms/mlp/index.md Co-authored-by: boeschf <48126478+boeschf@users.noreply.github.com> --- docs/platforms/mlp/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/platforms/mlp/index.md b/docs/platforms/mlp/index.md index 5ea7fdb9..3454f062 100644 --- a/docs/platforms/mlp/index.md +++ b/docs/platforms/mlp/index.md @@ -91,4 +91,4 @@ Project is per project - each project gets a project folder with project-specifi ## Guides and tutorials -Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials] under machine learning software. +Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials]. From 9c1c90a206ae87b91ff3d08fe65efe61a60d2bcb Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Wed, 20 Aug 2025 18:16:56 +0200 Subject: [PATCH 05/22] Update docs/software/ml/pytorch.md Co-authored-by: boeschf <48126478+boeschf@users.noreply.github.com> --- docs/software/ml/pytorch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index 7ee77524..fe5e21c4 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -11,7 +11,7 @@ While NGC provides an optimized build of PyTorch with many dependencies included Running PyTorch from a container ensures maximum portability, reproducibility, and ease of use across machines. This is achieved by 1. selecting an appropriate base image and customizing it in a Dockerfile -2. define the container runtime environment in an EDF +2. defining the container runtime environment in an EDF 3. (optionally) extending by a virtual environment 4. 
submitting jobs with CE in SLURM From 2c4381f1d7e5fe191dbff2bbf8fc26d4395f155c Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Wed, 20 Aug 2025 18:17:07 +0200 Subject: [PATCH 06/22] Update docs/software/ml/pytorch.md Co-authored-by: boeschf <48126478+boeschf@users.noreply.github.com> --- docs/software/ml/pytorch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index fe5e21c4..6daa8953 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -12,7 +12,7 @@ Running PyTorch from a container ensures maximum portability, reproducibility, a 1. selecting an appropriate base image and customizing it in a Dockerfile 2. defining the container runtime environment in an EDF -3. (optionally) extending by a virtual environment +3. (optionally) extending with a virtual environment 4. submitting jobs with CE in SLURM These steps are illustrated in the [machine learning platform tutorials][ref-software-ml-tutorials] and the instructions detailed in the [podman build guide][ref-build-containers]. From 6c9b1a57b4cc90b515f58051794dbecfe0de2993 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Wed, 20 Aug 2025 18:17:20 +0200 Subject: [PATCH 07/22] Update docs/software/ml/pytorch.md Co-authored-by: boeschf <48126478+boeschf@users.noreply.github.com> --- docs/software/ml/pytorch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index 6daa8953..e3040ba3 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -27,7 +27,7 @@ These steps are illustrated in the [machine learning platform tutorials][ref-sof 1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB) -### Selecting the base image +### Select the base image For most applications, the [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) is a good base image as PyTorch comes pre-installed with an optimized build including many dependencies. The [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) give an overview on installed packages and compatibility. This image can be further customized in a Dockerfile and built with podman as detailed in the [podman build guide][ref-build-containers]. From 7c580b8f8edd83435f9406d2b350549cfd8fa796 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Wed, 20 Aug 2025 18:17:49 +0200 Subject: [PATCH 08/22] Update docs/software/ml/pytorch.md Co-authored-by: boeschf <48126478+boeschf@users.noreply.github.com> --- docs/software/ml/pytorch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index e3040ba3..872d2b2f 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -29,7 +29,7 @@ These steps are illustrated in the [machine learning platform tutorials][ref-sof ### Select the base image -For most applications, the [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) is a good base image as PyTorch comes pre-installed with an optimized build including many dependencies. 
The [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) give an overview on installed packages and compatibility. This image can be further customized in a Dockerfile and built with podman as detailed in the [podman build guide][ref-build-containers]. +For most applications, the [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) is a good base image as PyTorch comes pre-installed with an optimized build including many dependencies. The [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) give an overview of installed packages and compatibility. This image can be further customized in a Dockerfile and built with podman as detailed in the [podman build guide][ref-build-containers]. ### Define Container Runtime Environment From 42082ce5a954028557d8ab5f580881518e1aba67 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Wed, 20 Aug 2025 18:25:54 +0200 Subject: [PATCH 09/22] Update docs/software/ml/pytorch.md Co-authored-by: boeschf <48126478+boeschf@users.noreply.github.com> --- docs/software/ml/pytorch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index 872d2b2f..1fcf1b81 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -35,7 +35,7 @@ For most applications, the [PyTorch NGC container](https://catalog.ngc.nvidia.co Having built and imported a container image with podman and enroot, the next step is to configure the runtime environment with an environment definition file (EDF). In particular, this includes specifying the image, any directories mounted and a working directory to for the processes in the container to start in as in the [quickstart examples for CE][ref-container-engine]. -Besides this, there are specific features relevant for machine learning available through [annotations][ref-ce-annotations], which customize the container at runtime. +Apart from this, there are specific features relevant for machine learning made available through [annotations][ref-ce-annotations], which customize the container at runtime. * When using NCCL inside the container, you want to include the [aws-ofi-nccl][ref-ce-aws-ofi-hook] plugin which enables the container to interface with the host's libfabric and, thus, make use of Alps Slingshot high-speed interconnect. This is crucial for multi-node communication performance. 
* An [SSH annotation][ref-ce-ssh-hook] allows adding a light-weight SSH server to the container without the need to modify the container image From c4e9cb6ae1b6d89ec3fa0752bfa51c0f9cf4f5cc Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Wed, 20 Aug 2025 18:36:16 +0200 Subject: [PATCH 10/22] Update docs/software/ml/pytorch.md Co-authored-by: boeschf <48126478+boeschf@users.noreply.github.com> --- docs/software/ml/pytorch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index 1fcf1b81..cc447c0a 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -184,7 +184,7 @@ For further details on execution logic, job monitoring and data management, plea #SBATCH -C thp_never&nvidia_vboost_enabled ``` - * The argument `--ddp-bucket-size` controls the level of grouping of many small data-parallel communications into bigger ones and setting it to a high value such as can improve throughput (model-dependent, e.g. `10000000000`). + * The argument `--ddp-bucket-size` controls the level of grouping of many small data-parallel communications into bigger ones and setting it to a high value can improve throughput (model-dependent, e.g. `10000000000`). * If in doubt about communication performance with NCCL at scale, use [nccl-tests](https://github.com/NVIDIA/nccl-tests) with the relevant communication patterns to check if scaling behavior can be reproduced. From 61b0ffedce1e092922d63ce9ddb04bbc5ae0e6a6 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Wed, 20 Aug 2025 18:57:08 +0200 Subject: [PATCH 11/22] Update docs/software/ml/pytorch.md Co-authored-by: boeschf <48126478+boeschf@users.noreply.github.com> --- docs/software/ml/pytorch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index cc447c0a..ae8d6fc6 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -37,7 +37,7 @@ Having built and imported a container image with podman and enroot, the next ste Apart from this, there are specific features relevant for machine learning made available through [annotations][ref-ce-annotations], which customize the container at runtime. -* When using NCCL inside the container, you want to include the [aws-ofi-nccl][ref-ce-aws-ofi-hook] plugin which enables the container to interface with the host's libfabric and, thus, make use of Alps Slingshot high-speed interconnect. This is crucial for multi-node communication performance. +* When using NCCL inside the container, include the [aws-ofi-nccl][ref-ce-aws-ofi-hook] plugin which enables the container to interface with the host's libfabric and, thus, use the Slingshot high-speed interconnect. This is crucial for multi-node communication performance. 
* An [SSH annotation][ref-ce-ssh-hook] allows adding a light-weight SSH server to the container without the need to modify the container image A resulting example TOML file following best practices may look like From 6d22e8d1248f616e6401f510fdf69e84166f0a27 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Wed, 20 Aug 2025 18:58:48 +0200 Subject: [PATCH 12/22] Update docs/software/ml/pytorch.md Co-authored-by: boeschf <48126478+boeschf@users.noreply.github.com> --- docs/software/ml/pytorch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index ae8d6fc6..69fb0f4f 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -67,7 +67,7 @@ MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! 1. It is important to use curly braces for environment variables used in the EDF 2. The path `/users` is not mounted as a whole since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed. 3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started -4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] with libfabric. While not strictly needed for single node workloads, it is good practice to keep it always on. +4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook]. While not strictly needed for single node workloads, it is good practice to keep it always on. 5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`. 6. Disable CUDA JIT cache 7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error From d438dcd759ae94a677210bafcc1df1ef7630a44d Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Wed, 20 Aug 2025 18:55:56 +0200 Subject: [PATCH 13/22] Integrating Fabian's feedback, updating code owners --- .github/CODEOWNERS | 2 +- docs/access/jupyterlab.md | 2 +- docs/software/ml/pytorch.md | 93 +++++++++++++------ docs/software/ml/tutorials/llm-inference.md | 4 +- .../ml/tutorials/llm-nanotron-training.md | 9 +- 5 files changed, 77 insertions(+), 33 deletions(-) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 8d1af372..33fea112 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -9,6 +9,6 @@ docs/software/prgenv/linalg.md @finkandreas @msimberg docs/software/sciapps/cp2k.md @abussy @RMeli docs/software/sciapps/lammps.md @nickjbrowning docs/software/sciapps/gromacs.md @kanduri -docs/software/ml @boeschf +docs/software/ml @boeschf @henrique @lukasgd docs/storage @mpasserini docs/alps/storage.md @mpasserini diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 89fe1e79..28780702 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -86,7 +86,7 @@ If the default base images do not meet your requirements, you can specify a cust 3. 
Currently only required on Daint and Santis, not on Clariden 4. Set working directory of Jupyter session (file browser root directory) 5. Use environment settings for optimized communication - 6. Disable CUDA JIT cache + 6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. 7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error 8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index 69fb0f4f..9ebae511 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -33,7 +33,7 @@ For most applications, the [PyTorch NGC container](https://catalog.ngc.nvidia.co ### Define Container Runtime Environment -Having built and imported a container image with podman and enroot, the next step is to configure the runtime environment with an environment definition file (EDF). In particular, this includes specifying the image, any directories mounted and a working directory to for the processes in the container to start in as in the [quickstart examples for CE][ref-container-engine]. +Having built and imported a container image with podman and enroot, the next step is to configure the runtime environment with an environment definition file (EDF). In particular, this includes specifying the image, any directories mounted from the host and a working directory for the process in the container to start in as in the [quickstart examples for CE][ref-container-engine]. Apart from this, there are specific features relevant for machine learning made available through [annotations][ref-ce-annotations], which customize the container at runtime. @@ -68,8 +68,8 @@ MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! 2. The path `/users` is not mounted as a whole since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed. 3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started 4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook]. While not strictly needed for single node workloads, it is good practice to keep it always on. -5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`. -6. Disable CUDA JIT cache +5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario. Subsystems with debug log can be configured with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys). +6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. 7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error 8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL @@ -93,14 +93,15 @@ MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! 1. 
Enable Slurm commands (together with two subsequent mounts) -!!! note "Best practice for large-scale jobs" +!!! note "Best practice for production jobs" - For stability and reproducibility, use self-contained containers for large scale jobs. Using code mounted from the distributed filesystem may leave compiled artefacts behind that can result in unintentional runtime errors when e.g. swapping the container image. In particular, it is recommended to avoid mounting all of `$HOME`, so that environments are properly isolated and e.g. the Triton cache (that by default ends up in `$HOME/.triton`) resides in an ephemeral location of the filesystem. + For stability and reproducibility, use self-contained containers for production jobs. Using code mounted from the distributed filesystem may leave compiled artefacts behind that can result in unintentional runtime errors when e.g. swapping the container image. In particular, it is recommended to avoid mounting all of `$HOME`, so that environments are properly isolated and e.g. the Triton cache (that by default ends up in `$HOME/.triton`) resides in an ephemeral location of the filesystem. !!! note "Collaborating in Git" For reproducibility, it is recommended to always track the Dockerfile, EDF and an optional virtual environment specification alongside your application code in a Git repository. +[](){#ref-ce-pytorch-venv} ### (Optionally) extend container with virtual environment While production jobs should include as many dependencies as possible in the container image, during development it can be convenient to manage frequently changing packages in a virtual environment built on top of the container image. This can include both dependencies and actively developed packages (that can be installed in editable mode with `pip install -e .`). @@ -186,28 +187,57 @@ For further details on execution logic, job monitoring and data management, plea * The argument `--ddp-bucket-size` controls the level of grouping of many small data-parallel communications into bigger ones and setting it to a high value can improve throughput (model-dependent, e.g. `10000000000`). - * If in doubt about communication performance with NCCL at scale, use [nccl-tests](https://github.com/NVIDIA/nccl-tests) with the relevant communication patterns to check if scaling behavior can be reproduced. + * If in doubt about communication performance with NCCL at scale, use the [`NCCL_DEBUG`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug) environment variable to validate that the aws-ofi-nccl plugin has been properly initialized and libfabric was recognized (further subsystems can be monitored with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys)). If the issue persists, use [nccl-tests](https://github.com/NVIDIA/nccl-tests) with the relevant communication patterns to check if the scaling behavior can be reproduced and contact CSCS support. Additionally, consider the **best practice for checkpointing and data management**: - * Following the advice on [filesystems][ref-storage-fs], write checkpoints (sequential write) to `/capstor/scratch` and place randomly accessed training data (many small random reads) on `/iopsstor/scratch`. Use the [data transfer instructions][ref-data-xfer] to move data to/from `/capstor/store`. 
Make sure to apply recommended [Lustre settings][ref-guides-storage-lustre] on all directories containing significant amount of data, including those containing container images and managed by other tools (e.g. the HuggingFace cache, see [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) in the [this tutorial][software-ml-llm-inference-tutorial]).
+    * Following the advice on [filesystems][ref-storage-fs], write checkpoints (sequential write) to `/capstor/scratch` and place randomly accessed training data (many small random reads) on `/iopsstor/scratch`. Use the [data transfer instructions][ref-data-xfer] to move data to/from `/capstor/store`. Make sure to apply recommended [Lustre settings][ref-guides-storage-lustre] on all directories containing significant amounts of data, including those containing container images and those managed by other tools (e.g. the HuggingFace cache, see [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) in [this tutorial][software-ml-llm-inference-tutorial]). In case your workload continues to be limited by filesystem performance, contact CSCS support.
+
+    * Regularly adjust checkpoint writing intervals to the current overhead induced by writing a checkpoint ($T_1$) and mean time between job failures ($T_2$). As a first order approximation use a checkpointing interval of $\sqrt{2 T_1 T_2}$ (derived by [Young](https://doi.org/10.1145/361147.361115) and [Daly](https://doi.org/10.1016/j.future.2004.11.016)). For example, a checkpointing overhead of $T_1 = 2$ minutes and a mean time between failures of $T_2 = 12$ hours (720 minutes) suggest an interval of roughly $\sqrt{2 \cdot 2 \cdot 720} \approx 54$ minutes.
 
-    * Regularly adjust checkpoint writing intervals to the overhead induced by writing a checkpoint ($T_1$) and the mean time between job failures ($T_2$). As a first order approximation use a checkpointing interval of $\sqrt{2 T_1 T_2}$ (derived by [Young](https://doi.org/10.1145/361147.361115) and [Daly](https://doi.org/10.1016/j.future.2004.11.016)).
+    * Avoid activities that put excessive load on third party services (such as web scraping or bulk downloads) in line with the [guidelines on Internet Access on Alps][ref-guides-internet-access-ext].
 
     Adjust for **cluster availability**:
 
     * Submit your jobs with a Slurm time limit compatible with reservations (such as maintenance windows, cf. `scontrol show res`) to be able to get scheduled.
 
+??? info "Debugging segmentation faults"
+    Application crashes with segmentation faults can be investigated by inspecting core dump files that contain an image of the process memory at the time of the crash. For this purpose, you can load the core dump file with `cuda-gdb` installed in the container and look at the stack trace with `bt`. Note that in order to generate core dump files the line `ulimit -c 0` must be commented out in the above sbatch script.
+
+### Known Issues
+
+??? info "Errors hidden by failures in UCX signal handler"
+    Application errors may trigger the UCX signal handler in the NGC container, which has caused secondary failures in the past, shadowing the initial error trace. These secondary failures may be significantly harder to fix than the initial problem.
+ + An example is the following trace from the NGC PyTorch 25.01 with Megatron-LM: + ```console + 640: [nid007306:244443:0:244443] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x455) + 640: ==== backtrace (tid: 244443) ==== + 640: 0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2cc) [0x4000d2b214dc] + 640: 1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3168c) [0x4000d2b2168c] + 640: 2 /opt/hpcx/ucx/lib/libucs.so.0(+0x319b8) [0x4000d2b219b8] + 640: 3 linux-vdso.so.1(__kernel_rt_sigreturn+0) [0x4000347707dc] + 640: 4 /usr/local/cuda/lib64/libnvrtc.so.12.8.61(+0x935000) [0x400140a25000] + 640: 5 [0x3d5c5e58] + 640: ================================= + srun: error: nid007306: task 640: Segmentation fault + srun: Terminating StepId=348680.1 + ``` + In this case, the segmentation fault in the UCX signal handler (`ucs_handle_error`) was due to a broken NVRTC in the container. However, to obtain the trace of the initial error (which was unrelated), it was necessary to disable the UCX signal handler by setting the following environment variable in the sbatch script: + ```bash + export UCX_HANDLE_ERRORS=none + ``` + [](){#ref-uenv-pytorch} ## Running PyTorch with a uenv -The PyTorch software stack was designed with the intention of being able to run [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training workloads out of the box. +The PyTorch uenv software stack was designed with the intention of being able to run [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training workloads out of the box. Thus, it comes with batteries included and does not just provide the bare [PyTorch framework](https://github.com/pytorch/pytorch). !!! note "uenv" - [PyTorch][ref-uenv-pytorch] is provided via [uenv][ref-uenv]. + The [PyTorch uenv][ref-uenv-pytorch] is provided via the tool [uenv][ref-uenv]. Please have a look at the [uenv documentation][ref-uenv] for more information about uenvs and how to use them. ### Versioning @@ -520,6 +550,12 @@ $ exit # (6)! ``` It is recommended to apply this workaround if you are constrained by a Python package version installed in the uenv that you need to change for your application. +!!! note + Keep in mind that + + * this virtual environment is _specific_ to this particular uenv and won't actually work unless you are using it from inside this uenv - it relies on the resources packaged inside the uenv. + * every Slurm job making use of this virtual environment will need to activate it first (_inside_ the `srun` command). + Alternatively one can use the uenv as [upstream Spack instance][ref-building-uenv-spack] to to add both Python and non-Python packages. However, this workflow is more involved and intended for advanced Spack users. @@ -537,36 +573,40 @@ However, this workflow is more involved and intended for advanced Spack users. #SBATCH --uenv=pytorch/v2.6.0:/user-environment #SBATCH --view=default +set -x + +ulimit -c 0 # (2)! + ################################# # OpenMP environment variables # ################################# -export OMP_NUM_THREADS=8 # (2)! +export OMP_NUM_THREADS=8 # (3)! ################################# # PyTorch environment variables # ################################# -export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (3)! -export TRITON_HOME=/dev/shm/ # (4)! +export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (4)! +export TRITON_HOME=/dev/shm/ # (5)! ################################# # MPICH environment variables # ################################# -export MPICH_GPU_SUPPORT_ENABLED=0 # (5)! 
+export MPICH_GPU_SUPPORT_ENABLED=0 # (6)! ################################# # CUDA environment variables # ################################# -export CUDA_CACHE_DISABLE=1 # (6)! +export CUDA_CACHE_DISABLE=1 # (7)! ############################################ # NCCL and Fabric environment variables # ############################################ -# (7)! +# (8)! --8<-- "docs/software/communication/nccl_env_vars" -# (8)! # (9)! -srun bash -c " +# (10)! +srun -ul bash -c " . ./venv-uenv-pt2.6-v1/bin/activate --8<-- "docs/software/ml/torch_distributed_env_vars" @@ -576,19 +616,20 @@ srun bash -c " 1. The `--uenv` option is used to specify the uenv to use for the job. The `--view=default` option is used to load all the packages provided by the uenv. -2. Set `OMP_NUM_THREADS` if you are using OpenMP in your code. +2. In case the application crashes, it may leave behind large core dump files that contain an image of the process memory at the time of the crash. While these can be useful for debugging the reason of a specific crash (by e.g. loading them with `cuda-gdb` and looking at the stack trace with `bt`), they may accumulate over time and occupy a large space on the filesystem. For this reason, it is recommended to disable their creation (unless needed) by adding this line. +3. Set `OMP_NUM_THREADS` if you are using OpenMP in your code. The number of threads should be not greater than the number of cores per task (`$SLURM_CPUS_PER_TASK`). The optimal number depends on the workload and should be determined by testing. Consider for example that typical workloads using PyTorch may fork the processes, so the number of threads should be around the number of cores per task divided by the number of processes. -3. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html) -4. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system. +4. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html) +5. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system. This is important for performance, as writing to the Lustre file system can be slow due to the amount of small files and potentially many processes accessing it. Avoid this setting with the container engine as it may lead to errors related to mount settings of `/dev/shm` (use a filesystem path inside the container instead). -5. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when using together with nccl. -6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. -7. These variables should always be set for correctness and optimal performance when using NCCL with uenv, see [the detailed explanation][ref-communication-nccl]. -8. Activate the virtual environment created on top of the uenv (if any). +6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when using together with nccl. +7. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. +8. 
These variables should always be set for correctness and optimal performance when using NCCL with uenv, see [the detailed explanation][ref-communication-nccl]. +9. Activate the virtual environment created on top of the uenv (if any). Please follow the guidelines for [python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times. -9. The environment variables are used by PyTorch to initialize the distributed backend. +10. The environment variables are used by PyTorch to initialize the distributed backend. The `MASTER_ADDR`, `MASTER_PORT` variables are used to determine the address and port of the master node. Additionally we also need `RANK` and `LOCAL_RANK` and `WORLD_SIZE` to identify the position of each rank within the Slurm step and node, respectively. diff --git a/docs/software/ml/tutorials/llm-inference.md b/docs/software/ml/tutorials/llm-inference.md index 2495687b..af7a8fc8 100644 --- a/docs/software/ml/tutorials/llm-inference.md +++ b/docs/software/ml/tutorials/llm-inference.md @@ -165,8 +165,8 @@ MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! 2. The path `/users` is not mounted since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed. 3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started 4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] with libfabric. While not strictly needed for single node workloads, it is good practice to keep it always on. -5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`. -6. Disable CUDA JIT cache +5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys). +6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. 7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error 8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL diff --git a/docs/software/ml/tutorials/llm-nanotron-training.md b/docs/software/ml/tutorials/llm-nanotron-training.md index f8a9235e..267ecf42 100644 --- a/docs/software/ml/tutorials/llm-nanotron-training.md +++ b/docs/software/ml/tutorials/llm-nanotron-training.md @@ -6,7 +6,7 @@ In this tutorial, we will build a container image to run multi-node training job We will train a 109M parameter model with ~100M wikitext tokens as a proof of concept. !!! info - Note that while the concepts taught here for multi-node training with PyTorch are generally portable across frameworks, the current (August 2025) recommendation for users with a need for large-scale model-parallel training is to use `Megatron-LM` instead of `nanotron` due to significant performance advantages at scale. 
+ While the concepts taught here for multi-node training with PyTorch are generally portable across training frameworks, the current (August 2025) recommendation for users with a need for large-scale model-parallel training is to use `Megatron-LM` instead of `nanotron` due to significant performance advantages at scale. ### Prerequisites @@ -26,6 +26,9 @@ If not already done as part of the [LLM Inference tutorial][software-ml-llm-infe mount_program = "/usr/bin/fuse-overlayfs-1.13" ``` +!!! warning + If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. + Create a directory to store container images used with CE and configure it with [recommended LUSTRE settings][ref-guides-storage-lustre]: ```console title="Container image directory with recommended LUSTRE settings" @@ -147,8 +150,8 @@ MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! 2. The path `/users` is not mounted since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed. 3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started 4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] with libfabric. While not strictly needed for single node workloads, it is good practice to keep it always on. -5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`. -6. Disable CUDA JIT cache +5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys). +6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. 7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error 8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL From 2c7904c3596b3afaf37273c138bb9ed5b8dbac1b Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Thu, 21 Aug 2025 11:37:57 +0200 Subject: [PATCH 14/22] Updating known issues --- docs/software/ml/pytorch.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index 9ebae511..8894e8d6 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -176,11 +176,11 @@ For further details on execution logic, job monitoring and data management, plea Use of `--environment` is currently only recommended for the `srun` command. !!! 
note "Optimizing large-scale training jobs" - The following settings were established to **improve compute throughput** of LLM training in `Megatron-LM`: + The following settings were established to **improve compute throughput** of LLM training in [Megatron-LM](https://github.com/NVIDIA/Megatron-LM): - * Extensively evaluate all possible parallelization dimensions, including data-, tensor- and pipeline parallelism (including virtual pipeline parallelism) and more, when available. In `Megatron-LM`, avoid using the option '--defer-embedding-wgrad-compute` to defer the embedding gradient computation. Identify storage-related bottlenecks by isolating data loading/generation operations into a separate benchmark. + * Extensively evaluate all possible parallelization dimensions, including data-, tensor- and pipeline parallelism (including virtual pipeline parallelism) and more, when available. Identify storage-related bottlenecks by isolating data loading/generation operations into a separate benchmark. - * Disabling transparent huge pages and enabling the Nvidia [vboost](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#gpu-core-clock-optimization) feature has been observed to improve performance in large-scale LLM training in `Megatron-LM`. This can be achieved by adding these constraints to the sbatch script: + * Disabling transparent huge pages and enabling the Nvidia [vboost](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#gpu-core-clock-optimization) feature has been observed to improve performance in large-scale LLM training in Megatron-LM. This can be achieved by adding these constraints to the sbatch script: ```bash #SBATCH -C thp_never&nvidia_vboost_enabled ``` @@ -206,6 +206,8 @@ For further details on execution logic, job monitoring and data management, plea ### Known Issues +The [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) of every NGC PyTorch container contain a selected list of known issues. + ??? info "Errors hidden by failures in UCX signal handler" Application errors may trigger the UCX signal handler in the NGC container, which has caused secondary failures in the past, shadowing the initial error trace. These secondary failures may be significantly harder to fix than the initial problem. @@ -228,6 +230,8 @@ For further details on execution logic, job monitoring and data management, plea export UCX_HANDLE_ERRORS=none ``` +??? info "Avoid `--defer-embedding-wgrad-compute` in Megatron-LM" + In Megatron-LM, avoid using the option `--defer-embedding-wgrad-compute` to delay the embedding gradient computation as it can lead to an incorrect gradient norm that changes upon resuming at different scale. 
[](){#ref-uenv-pytorch} ## Running PyTorch with a uenv From ef0fdf6dfddb22cee1b86a8c17cf0845d349c380 Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Thu, 21 Aug 2025 16:10:25 +0200 Subject: [PATCH 15/22] Update check-spelling metadata --- .github/actions/spelling/expect.txt | 4 ++++ 1 file changed, 4 insertions(+) create mode 100644 .github/actions/spelling/expect.txt diff --git a/.github/actions/spelling/expect.txt b/.github/actions/spelling/expect.txt new file mode 100644 index 00000000..e19c78b1 --- /dev/null +++ b/.github/actions/spelling/expect.txt @@ -0,0 +1,4 @@ +JAX +nvitop +NVRTC +placeholders From f099e3cf9361759a558a038c25db9dba78dedfe3 Mon Sep 17 00:00:00 2001 From: Ben Cumming Date: Fri, 22 Aug 2025 12:59:50 +0200 Subject: [PATCH 16/22] Update docs/build-install/containers.md Co-authored-by: Theofilos Manitaras --- docs/build-install/containers.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/build-install/containers.md b/docs/build-install/containers.md index a2d2128e..a99f8251 100644 --- a/docs/build-install/containers.md +++ b/docs/build-install/containers.md @@ -7,7 +7,7 @@ Its command-line interface (CLI) closely mirrors Docker’s, providing a consist [](){#ref-build-containers-configure-podman} ## Preliminary step: configuring Podman's storage -The first step in order to use Podman on Alps is to create a valid Container Storage configuration file in your home according to the following minimal template: +The first step in order to use Podman on Alps is to create a valid Container Storage configuration file in your home directory, according to the following minimal template: ```toml title="$HOME/.config/containers/storage.conf" [storage] From 9862ec3bd76d4e00bc489b15c7e664fc36004a02 Mon Sep 17 00:00:00 2001 From: Ben Cumming Date: Fri, 22 Aug 2025 13:00:01 +0200 Subject: [PATCH 17/22] Update docs/software/ml/index.md Co-authored-by: Theofilos Manitaras --- docs/software/ml/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/ml/index.md b/docs/software/ml/index.md index ac136472..d4e5a72d 100644 --- a/docs/software/ml/index.md +++ b/docs/software/ml/index.md @@ -2,7 +2,7 @@ # Machine learning applications and frameworks CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems. -Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across machines. +Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across systems. Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs. From 48ccb037d4a55693adfdf068a1293e45b4fa2e3c Mon Sep 17 00:00:00 2001 From: Ben Cumming Date: Fri, 22 Aug 2025 13:00:27 +0200 Subject: [PATCH 18/22] Update docs/software/ml/tutorials/index.md Co-authored-by: Theofilos Manitaras --- docs/software/ml/tutorials/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/ml/tutorials/index.md b/docs/software/ml/tutorials/index.md index a3ee90c7..88bda260 100644 --- a/docs/software/ml/tutorials/index.md +++ b/docs/software/ml/tutorials/index.md @@ -3,7 +3,7 @@ The LLM tutorials gradually introduce key concepts of the Machine Learning Platform in a series of hands-on examples. A particular focus is on the [Container Engine][ref-container-engine] for managing the runtime environment. 
-In a [first tutorial][software-ml-llm-inference-tutorial], you will learn how to run inference with a LLM on a single node using a container from the NVIDIA GPU Cloud (NGC). Concepts such as container environment description, layering a thin virtual environment on top of the container image, and job launching and monitoring will be introduced. +In the [first tutorial][software-ml-llm-inference-tutorial], you will learn how to run inference with a LLM on a single node using a container from the NVIDIA GPU Cloud (NGC). Concepts such as container environment description, layering a thin virtual environment on top of the container image, and job launching/monitoring will be introduced. Building on the first tutorial, in the [second tutorial][software-ml-llm-fine-tuning-tutorial] you will learn how to train (fine-tune) a LLM on multiple GPUs on a single node. For this purpose, you will use HuggingFace's `accelerate` and see best practices for dataset management. From c1c72e72b2573d4754aa860d17dcc9c7d9413043 Mon Sep 17 00:00:00 2001 From: Ben Cumming Date: Fri, 22 Aug 2025 13:00:41 +0200 Subject: [PATCH 19/22] Update docs/software/ml/pytorch.md Co-authored-by: Theofilos Manitaras --- docs/software/ml/pytorch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index 8894e8d6..94ab6bcc 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -126,7 +126,7 @@ user@nidYYYYYY$ source venv-ngc-pt-25.06/bin/activate # (3)! The changes made to the virtual environment will outlive the container as they are persisted on the distributed filesystem. !!! note - Keep in mind that + Keep in mind that: * this virtual environment is _specific_ to this particular container and won't actually work unless you are using it from inside this container - it relies on the resources packaged inside the container. * every Slurm job making use of this virtual environment will need to activate it first (_inside_ the `srun` command). 
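To make the second point concrete, the following sketch shows how a job step would activate such a virtual environment inside the container before launching the application. The EDF file name and the training script are placeholders; the virtual environment name follows the console example above.

```bash
# Sketch: the venv layered on top of the container image only exists inside
# that container, so it must be activated within the srun command itself.
srun --environment=./ngc-pt-25.06.toml bash -c "
    source venv-ngc-pt-25.06/bin/activate
    python train.py
"
```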
From bedceb6b7c9434e6eb88710d2c9a4e3ec1c2c1b7 Mon Sep 17 00:00:00 2001 From: bcumming Date: Fri, 22 Aug 2025 13:12:38 +0200 Subject: [PATCH 20/22] move ml tutorials into a tutorials section --- docs/tutorials/index.md | 2 ++ .../{software/ml/tutorials => tutorials/ml}/index.md | 0 .../ml/tutorials => tutorials/ml}/llm-fine-tuning.md | 0 .../ml/tutorials => tutorials/ml}/llm-inference.md | 0 .../ml}/llm-nanotron-training.md | 0 mkdocs.yml | 12 +++++++----- 6 files changed, 9 insertions(+), 5 deletions(-) create mode 100644 docs/tutorials/index.md rename docs/{software/ml/tutorials => tutorials/ml}/index.md (100%) rename docs/{software/ml/tutorials => tutorials/ml}/llm-fine-tuning.md (100%) rename docs/{software/ml/tutorials => tutorials/ml}/llm-inference.md (100%) rename docs/{software/ml/tutorials => tutorials/ml}/llm-nanotron-training.md (100%) diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md new file mode 100644 index 00000000..cde382a7 --- /dev/null +++ b/docs/tutorials/index.md @@ -0,0 +1,2 @@ +[](){#ref-tutorials} +# Tutorials diff --git a/docs/software/ml/tutorials/index.md b/docs/tutorials/ml/index.md similarity index 100% rename from docs/software/ml/tutorials/index.md rename to docs/tutorials/ml/index.md diff --git a/docs/software/ml/tutorials/llm-fine-tuning.md b/docs/tutorials/ml/llm-fine-tuning.md similarity index 100% rename from docs/software/ml/tutorials/llm-fine-tuning.md rename to docs/tutorials/ml/llm-fine-tuning.md diff --git a/docs/software/ml/tutorials/llm-inference.md b/docs/tutorials/ml/llm-inference.md similarity index 100% rename from docs/software/ml/tutorials/llm-inference.md rename to docs/tutorials/ml/llm-inference.md diff --git a/docs/software/ml/tutorials/llm-nanotron-training.md b/docs/tutorials/ml/llm-nanotron-training.md similarity index 100% rename from docs/software/ml/tutorials/llm-nanotron-training.md rename to docs/tutorials/ml/llm-nanotron-training.md diff --git a/mkdocs.yml b/mkdocs.yml index dcfb1a37..05102b95 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -62,11 +62,6 @@ nav: - 'Cray modules (CPE)': software/prgenv/cpe.md - 'Machine Learning': - software/ml/index.md - - 'Tutorials': - - software/ml/tutorials/index.md - - 'LLM Inference': software/ml/tutorials/llm-inference.md - - 'LLM Fine-tuning': software/ml/tutorials/llm-fine-tuning.md - - 'LLM Pre-training': software/ml/tutorials/llm-nanotron-training.md - 'PyTorch': software/ml/pytorch.md - 'Communication Libraries': - software/communication/index.md @@ -131,6 +126,13 @@ nav: - 'Storage': guides/storage.md - 'Using the terminal': guides/terminal.md - 'Gordon Bell 2025': guides/gb25.md + - 'Tutorials': + - tutorials/index.md + - 'Machine Learning': + - tutorials/ml/index.md + - 'LLM Inference': tutorials/ml/llm-inference.md + - 'LLM Fine-tuning': tutorials/ml/llm-fine-tuning.md + - 'LLM Pre-training': tutorials/ml/llm-nanotron-training.md - 'Policies': - policies/index.md - 'User Regulations': policies/regulations.md From c5383db71101ca2d8672684ff0e28e394ce33fdf Mon Sep 17 00:00:00 2001 From: bcumming Date: Fri, 22 Aug 2025 13:43:12 +0200 Subject: [PATCH 21/22] small cleanup --- docs/access/jupyterlab.md | 4 +++- docs/build-install/containers.md | 15 +++++---------- docs/platforms/mlp/index.md | 10 +++++++--- 3 files changed, 15 insertions(+), 14 deletions(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 28780702..43d7048c 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -199,7 +199,9 @@ Examples of notebooks with `ipcmagic` can be 
found [here](https://github.com/ While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment. -A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-software-ml-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][software-ml-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][software-ml-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell +A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-software-ml-tutorials]. +In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][software-ml-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). +For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][software-ml-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell ```bash !python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ... diff --git a/docs/build-install/containers.md b/docs/build-install/containers.md index a2d2128e..1cb85454 100644 --- a/docs/build-install/containers.md +++ b/docs/build-install/containers.md @@ -17,7 +17,8 @@ graphroot = "/dev/shm/$USER/root" ``` !!! warning - If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. See also [this guide][ref-guides-terminal-arch] for further information about XDG variables. + If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. + See the [terminal user guide][ref-guides-terminal-arch] for further information about XDG variables. !!! warning In the above configuration, `/dev/shm` is used to store the container images. @@ -64,15 +65,9 @@ In general, [`podman build`](https://docs.podman.io/en/stable/markdown/podman-bu An image built using Podman can be easily imported as a squashfs archive in order to be used with our Container Engine solution. It is important to keep in mind that the import has to take place in the same job allocation where the image creation took place, otherwise the image is lost due to the temporary nature of `/dev/shm`. -!!! warning "Preliminary configuration: Lustre settings for container images" - Since container images are large files and the filesystem is a shared resource, you need to configure the target directory according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image so it will be properly distributed across storage nodes. - - ```bash - lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M # (1)! - ``` - - 1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB) - +!!! 
info "Preliminary configuration: Lustre settings for container images" + Container images are stored in a single [SquashFS]() file, that is typically between 1-20 GB in size (particularly for large ML containers). + To ensure good performance for jobs on multiple nodes, take the time to configure the target directory using `lfs setstripe` according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image, or using `lfs migrate` to fix files that are already imported. To import the image: diff --git a/docs/platforms/mlp/index.md b/docs/platforms/mlp/index.md index 27604a56..5e3b09f0 100644 --- a/docs/platforms/mlp/index.md +++ b/docs/platforms/mlp/index.md @@ -3,6 +3,13 @@ The Machine Learning Platform (MLP) provides compute, storage and expertise to the machine learning and AI community in Switzerland, with the main user being the [Swiss AI Initiative](https://www.swiss-ai.org/). +
+- :fontawesome-solid-mountain: [__Tutorials__][ref-software-ml-tutorials] + + Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials]. + +
+ ## Getting started ### Getting access @@ -89,6 +96,3 @@ Project is per project - each project gets a project folder with project-specifi * hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or ela * it is not recommended to write directly to the project path from jobs. -## Guides and tutorials - -Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials]. From 998eefd545d0e33530c80dfbfceb56c372944ef9 Mon Sep 17 00:00:00 2001 From: bcumming Date: Fri, 22 Aug 2025 15:09:36 +0200 Subject: [PATCH 22/22] fix links; add more links to ML tutorials/pytorch; add guides/tutorials section to landing page --- docs/access/jupyterlab.md | 2 +- docs/index.md | 61 ++++++++++++++++++++++++------------- docs/platforms/mlp/index.md | 6 ++-- docs/software/ml/index.md | 4 +-- docs/software/ml/pytorch.md | 3 +- docs/tutorials/index.md | 2 ++ docs/tutorials/ml/index.md | 4 +-- 7 files changed, 52 insertions(+), 30 deletions(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 43d7048c..f5ffe225 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -199,7 +199,7 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/ While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment. -A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-software-ml-tutorials]. +A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-tutorials-ml]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][software-ml-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][software-ml-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell diff --git a/docs/index.md b/docs/index.md index 40c14536..915728c1 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,9 +1,3 @@ -!!! info "" - This is the new CSCS documentation site, which replaces the [CSCS Knowledge Base](https://confluence.cscs.ch/display/KB). - - The migration of old documentation is still not fully complete. - If you find documentation that is missing, please create a ticket on the documentation's [GitHub issue tracker](https://github.com/eth-cscs/cscs-docs/issues). - # CSCS Documentation
@@ -66,32 +60,26 @@ The Alps Research infrastructure hosts multiple platforms and clusters targeting
-[](){#ref-get-in-touch} -## Get in Touch +## Tutorials and Guides -If you cannot find the information that you need in the documentation, help is available. +Learn by doing with our guides and tutorials.
+- :fontawesome-solid-layer-group: __Tutorials__ -- :fontawesome-solid-headset: __Get Help__ - - Contact the CSCS Service Desk for help. - - [:octicons-arrow-right-24: Service Desk](https://jira.cscs.ch/plugins/servlet/desk) + Hands on tutorials that show how to implement workflows on Alps. -- :fontawesome-regular-comments: __Chat__ + [:octicons-arrow-right-24: Machine Learning][ref-tutorials-ml] - Discuss Alps with other users and CSCS staff on Slack. +- :fontawesome-solid-mountain-sun: __Guides__ - [:octicons-arrow-right-24: CSCS User Slack](https://cscs-users.slack.com/) + Guides with practical advice, hints and tips for key topics. -
-- :fontawesome-solid-hammer: __Contribute__ + [:octicons-arrow-right-24: Using storage effectively][ref-guides-storage] - The source for the documentation is hosted on GitHub. + [:octicons-arrow-right-24: Accessing internet and external services][ref-guides-internet-access] - [:octicons-arrow-right-24: Contribute to the docs ](contributing/index.md) -
+ [:octicons-arrow-right-24: Using and configuring the terminal][ref-guides-terminal]
@@ -142,3 +130,32 @@ If you cannot find the information that you need in the documentation, help is a +[](){#ref-get-in-touch} +## Get in Touch + +If you cannot find the information that you need in the documentation, help is available. + +
+ +- :fontawesome-solid-headset: __Get Help__ + + Contact the CSCS Service Desk for help. + + [:octicons-arrow-right-24: Service Desk](https://jira.cscs.ch/plugins/servlet/desk) + +- :fontawesome-regular-comments: __Chat__ + + Discuss Alps with other users and CSCS staff on Slack. + + [:octicons-arrow-right-24: CSCS User Slack](https://cscs-users.slack.com/) + +
+- :fontawesome-solid-hammer: __Contribute__ + + The source for the documentation is hosted on GitHub. + + [:octicons-arrow-right-24: Contribute to the docs ](contributing/index.md) +
+ +
+ diff --git a/docs/platforms/mlp/index.md b/docs/platforms/mlp/index.md index 5e3b09f0..b62ad8d0 100644 --- a/docs/platforms/mlp/index.md +++ b/docs/platforms/mlp/index.md @@ -4,9 +4,11 @@ The Machine Learning Platform (MLP) provides compute, storage and expertise to the machine learning and AI community in Switzerland, with the main user being the [Swiss AI Initiative](https://www.swiss-ai.org/).
-- :fontawesome-solid-mountain: [__Tutorials__][ref-software-ml-tutorials] +- :fontawesome-solid-mountain: [__Tutorials__][ref-tutorials-ml] - Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-software-ml-tutorials]. + Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-tutorials-ml]. + + Also check out the [PyTorch documentation][ref-software-ml-pytorch] for information about how to run PyTorch.
diff --git a/docs/software/ml/index.md b/docs/software/ml/index.md index d4e5a72d..c20a77d4 100644 --- a/docs/software/ml/index.md +++ b/docs/software/ml/index.md @@ -6,7 +6,7 @@ Most ML workloads are containerized to ensure portability, reproducibility, and Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs. -First time users are recommended to consult the [LLM tutorials][ref-software-ml-tutorials] to get familiar with the concepts of the Machine Learning platform in a series of hands-on examples. +First time users are recommended to consult the [LLM tutorials][ref-tutorials-ml] to get familiar with the concepts of the Machine Learning platform in a series of hands-on examples. ## Running ML applications with containers (recommended) @@ -28,7 +28,7 @@ Documented best practices are available for: Helpful references: -* Introduction to concepts of the Machine Learning platform: [LLM tutorials][ref-software-ml-tutorials] +* Introduction to concepts of the Machine Learning platform: [LLM tutorials][ref-tutorials-ml] * Running containers on Alps: [Container Engine Guide][ref-container-engine] * Building custom container images: [Container Build Guide][ref-build-containers] diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index 94ab6bcc..66fd6a77 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -15,7 +15,8 @@ Running PyTorch from a container ensures maximum portability, reproducibility, a 3. (optionally) extending with a virtual environment 4. submitting jobs with CE in SLURM -These steps are illustrated in the [machine learning platform tutorials][ref-software-ml-tutorials] and the instructions detailed in the [podman build guide][ref-build-containers]. +!!! example + These steps are illustrated in the [machine learning tutorials][ref-tutorials-ml] and the instructions detailed in the [podman build guide][ref-build-containers]. !!! info "Preliminary steps" Before proceeding with the next steps, make sure you have storage for podman configured as in the [build guide][ref-build-containers-configure-podman] and make sure to apply [recommended Lustre settings][ref-guides-storage-lustre] to every directory (e.g. `$SCRATCH/ce-images`) dedicated to container images before importing them with enroot. This is necessary to guarantee good filesystem performance. diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index cde382a7..74225d99 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -1,2 +1,4 @@ [](){#ref-tutorials} # Tutorials + +Currently there is one set of tutorials, for [machine learning workflows][ref-tutorials-ml]. diff --git a/docs/tutorials/ml/index.md b/docs/tutorials/ml/index.md index 88bda260..5f60d5e5 100644 --- a/docs/tutorials/ml/index.md +++ b/docs/tutorials/ml/index.md @@ -1,4 +1,4 @@ -[](){#ref-software-ml-tutorials} +[](){#ref-tutorials-ml} # Machine Learning Platform Tutorials The LLM tutorials gradually introduce key concepts of the Machine Learning Platform in a series of hands-on examples. A particular focus is on the [Container Engine][ref-container-engine] for managing the runtime environment. @@ -10,4 +10,4 @@ Building on the first tutorial, in the [second tutorial][software-ml-llm-fine-tu In the [third tutorial][software-ml-llm-nanotron-tutorial], you will apply the techniques from the previous tutorials to enable distributed (pre-)training of a model in `nanotron` on multiple nodes. 
In particular, this tutorial makes use of model-parallelism and introduces the usage of `torchrun` to manage jobs on individual nodes. !!! note - The focus for these tutorials is on introducing concepts of the Machine Learning Platform. As such, they do not necessarily discuss the latest advancements or steps required to obtain maximum performance. For this purpose, consult the framework-specific pages, such as the one for [PyTorch][ref-software-ml-pytorch]. \ No newline at end of file + The focus for these tutorials is on introducing concepts of the Machine Learning Platform. As such, they do not necessarily discuss the latest advancements or steps required to obtain maximum performance. For this purpose, consult the framework-specific pages, such as the one for [PyTorch][ref-software-ml-pytorch].
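For orientation, a multi-node launch of the kind used in the pre-training tutorial typically starts one `torchrun`/`torch.distributed.run` launcher per node, which in turn spawns one worker per GPU. The sketch below is an assumption-laden illustration, not the tutorial's exact command: the EDF name, the rendezvous port and the script arguments are placeholders.

```bash
# Sketch: one launcher per node (--ntasks-per-node=1); torchrun spawns the
# per-GPU workers itself. Assumes 4 GPUs per node (GH200).
srun --ntasks-per-node=1 --environment=./ngc-nanotron.toml bash -c "
    MASTER_ADDR=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1)
    python -m torch.distributed.run \
        --nproc_per_node=4 \
        --nnodes=\${SLURM_NNODES} \
        --node_rank=\${SLURM_NODEID} \
        --master_addr=\${MASTER_ADDR} \
        --master_port=29500 \
        run_train.py --config-file config.yaml
"
```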