diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 8d1af372..33fea112 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -9,6 +9,6 @@ docs/software/prgenv/linalg.md @finkandreas @msimberg docs/software/sciapps/cp2k.md @abussy @RMeli docs/software/sciapps/lammps.md @nickjbrowning docs/software/sciapps/gromacs.md @kanduri -docs/software/ml @boeschf +docs/software/ml @boeschf @henrique @lukasgd docs/storage @mpasserini docs/alps/storage.md @mpasserini diff --git a/.github/actions/spelling/expect.txt b/.github/actions/spelling/expect.txt new file mode 100644 index 00000000..e19c78b1 --- /dev/null +++ b/.github/actions/spelling/expect.txt @@ -0,0 +1,4 @@ +JAX +nvitop +NVRTC +placeholders diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 8c233781..f5ffe225 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -86,7 +86,7 @@ If the default base images do not meet your requirements, you can specify a cust 3. Currently only required on Daint and Santis, not on Clariden 4. Set working directory of Jupyter session (file browser root directory) 5. Use environment settings for optimized communication - 6. Disable CUDA JIT cache + 6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. 7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error 8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL @@ -199,7 +199,9 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/ While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment. -A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell +A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-tutorials-ml]. +In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][software-ml-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). +For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][software-ml-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell ```bash !python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ... 
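+# --standalone sets up a local rendezvous (no MASTER_ADDR/PORT needed); --nproc_per_node=4 starts one worker per GPU of the GH200 node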
diff --git a/docs/build-install/containers.md b/docs/build-install/containers.md index 5d6405c0..365066f6 100644 --- a/docs/build-install/containers.md +++ b/docs/build-install/containers.md @@ -4,17 +4,22 @@ Building OCI container images on Alps vClusters is supported through [Podman](https://podman.io/), an open-source container engine that adheres to OCI standards and supports rootless containers by leveraging Linux [user namespaces](https://www.man7.org/linux/man-pages/man7/user_namespaces.7.html). Its command-line interface (CLI) closely mirrors Docker’s, providing a consistent and familiar experience for users of established container tools. +[](){#ref-build-containers-configure-podman} ## Preliminary step: configuring Podman's storage -The first step in order to use Podman on Alps is to create a valid Container Storage configuration file at `$HOME/.config/containers/storage.conf` (or `$XDG_CONFIG_HOME/containers/storage.conf`, if you have `$XDG_CONFIG_HOME` set), according to the following minimal template: +The first step to using Podman on Alps is to create a valid Container Storage configuration file in your home directory, according to the following minimal template: -```toml +```toml title="$HOME/.config/containers/storage.conf" [storage] driver = "overlay" runroot = "/dev/shm/$USER/runroot" graphroot = "/dev/shm/$USER/root" ``` +!!! warning + If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. + See the [terminal user guide][ref-guides-terminal-arch] for further information about XDG variables. + !!! warning In the above configuration, `/dev/shm` is used to store the container images. `/dev/shm` is the mount point of a [tmpfs filesystem](https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html#tmpfs) and is compatible with the user namespaces used by Podman. @@ -43,11 +48,27 @@ podman build -t . In general, [`podman build`](https://docs.podman.io/en/stable/markdown/podman-build.1.html) follows the Docker options convention. +!!! info "Debugging the container build" + If the container build fails, you can run an interactive shell using the image from the last successfully built layer with + + ```bash + podman run -it --rm -e NVIDIA_VISIBLE_DEVICES=void bash # (1)! + ``` + + 1. Setting `NVIDIA_VISIBLE_DEVICES` in the environment is required specifically to run NGC containers with podman + + replacing `` with the layer hash reported in the build output, and interactively test the failing command. + + ## Importing images in the Container Engine An image built using Podman can be easily imported as a squashfs archive in order to be used with our Container Engine solution. It is important to keep in mind that the import has to take place in the same job allocation where the image creation took place, otherwise the image is lost due to the temporary nature of `/dev/shm`. +!!! info "Preliminary configuration: Lustre settings for container images" + Container images are stored in a single SquashFS file, typically 1-20 GB in size (particularly for large ML containers). + To ensure good performance for jobs on multiple nodes, take the time to configure the target directory using `lfs setstripe` according to [best practices for Lustre][ref-guides-storage-lustre] before importing the container image, or use `lfs migrate` to fix files that have already been imported.
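+    For example, assuming container images are collected in a dedicated directory such as `$SCRATCH/ce-images` (the directory name is only an illustration), the recommended progressive striping can be applied before importing with:
+
+    ```bash
+    lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images
+    ```
+
+    With these settings, files up to 4 MB stay on a single storage target, files between 4 and 64 MB are spread over four targets, and larger files are striped across all targets.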
+ To import the image: ``` @@ -62,7 +83,6 @@ image = "//" mounts = ["/capstor/scratch/cscs/:/capstor/scratch/cscs/"] workdir = "/capstor/scratch/cscs/" ``` - ## Pushing Images to a Container Registry In order to push an image to a container registry, you first need to follow three steps: diff --git a/docs/clusters/clariden.md b/docs/clusters/clariden.md index 62401236..e5564770 100644 --- a/docs/clusters/clariden.md +++ b/docs/clusters/clariden.md @@ -65,6 +65,8 @@ Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deploy uenv start namd/3.0:v3@daint ``` +For detailed instructions and best practices with ML frameworks, please refer to the dedicated pages under [ML software][ref-software-ml]. + ## Running Jobs on Clariden ### Slurm diff --git a/docs/guides/mlp_tutorials/index.md b/docs/guides/mlp_tutorials/index.md deleted file mode 100644 index da7cb242..00000000 --- a/docs/guides/mlp_tutorials/index.md +++ /dev/null @@ -1,10 +0,0 @@ -[](){#ref-guides-mlp-tutorials} -# Machine Learning Platform Tutorials - -These tutorials gradually introduce key concepts of the Machine Learning Platform. A particular focus is on the [Container Engine][ref-container-engine] for managing the runtime environment. - -In a [first tutorial][ref-mlp-llm-inference-tutorial], you will learn how to run inference with a LLM on a single node using a container from the NVIDIA GPU Cloud (NGC). Concepts such as container environment description, layering a thin virtual environment on top of the container image, and job launching and monitoring will be introduced. - -Building on the first tutorial, in the [second tutorial][ref-mlp-llm-fine-tuning-tutorial] you will learn how to train (fine-tune) a LLM on multiple GPUs on a single node. For this purpose, you will use HuggingFace's `accelerate` and see best practices for dataset management. - -In the [third tutorial][ref-mlp-llm-nanotron-tutorial], you will apply the techniques from the previous tutorials to enable distributed (pre-)training of a model in `nanotron` on multiple nodes. In particular, this tutorial makes use of model-parallelism and introduces the usage of `torchrun` to manage jobs on individual nodes. diff --git a/docs/index.md b/docs/index.md index 40c14536..915728c1 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,9 +1,3 @@ -!!! info "" - This is the new CSCS documentation site, which replaces the [CSCS Knowledge Base](https://confluence.cscs.ch/display/KB). - - The migration of old documentation is still not fully complete. - If you find documentation that is missing, please create a ticket on the documentation's [GitHub issue tracker](https://github.com/eth-cscs/cscs-docs/issues). - # CSCS Documentation
@@ -66,32 +60,26 @@ The Alps Research infrastructure hosts multiple platforms and clusters targeting
-[](){#ref-get-in-touch} -## Get in Touch +## Tutorials and Guides -If you cannot find the information that you need in the documentation, help is available. +Learn by doing with our guides and tutorials.
+- :fontawesome-solid-layer-group: __Tutorials__ -- :fontawesome-solid-headset: __Get Help__ - - Contact the CSCS Service Desk for help. - - [:octicons-arrow-right-24: Service Desk](https://jira.cscs.ch/plugins/servlet/desk) + Hands on tutorials that show how to implement workflows on Alps. -- :fontawesome-regular-comments: __Chat__ + [:octicons-arrow-right-24: Machine Learning][ref-tutorials-ml] - Discuss Alps with other users and CSCS staff on Slack. +- :fontawesome-solid-mountain-sun: __Guides__ - [:octicons-arrow-right-24: CSCS User Slack](https://cscs-users.slack.com/) + Guides with practical advice, hints and tips for key topics. -
-- :fontawesome-solid-hammer: __Contribute__ + [:octicons-arrow-right-24: Using storage effectively][ref-guides-storage] - The source for the documentation is hosted on GitHub. + [:octicons-arrow-right-24: Accessing internet and external services][ref-guides-internet-access] - [:octicons-arrow-right-24: Contribute to the docs ](contributing/index.md) -
+ [:octicons-arrow-right-24: Using and configuring the terminal][ref-guides-terminal]
@@ -142,3 +130,32 @@ If you cannot find the information that you need in the documentation, help is a +[](){#ref-get-in-touch} +## Get in Touch + +If you cannot find the information that you need in the documentation, help is available. + +
+ +- :fontawesome-solid-headset: __Get Help__ + + Contact the CSCS Service Desk for help. + + [:octicons-arrow-right-24: Service Desk](https://jira.cscs.ch/plugins/servlet/desk) + +- :fontawesome-regular-comments: __Chat__ + + Discuss Alps with other users and CSCS staff on Slack. + + [:octicons-arrow-right-24: CSCS User Slack](https://cscs-users.slack.com/) + +
+- :fontawesome-solid-hammer: __Contribute__ + + The source for the documentation is hosted on GitHub. + + [:octicons-arrow-right-24: Contribute to the docs ](contributing/index.md) +
+ +
+ diff --git a/docs/platforms/mlp/index.md b/docs/platforms/mlp/index.md index 0fcbc5ca..b62ad8d0 100644 --- a/docs/platforms/mlp/index.md +++ b/docs/platforms/mlp/index.md @@ -3,6 +3,15 @@ The Machine Learning Platform (MLP) provides compute, storage and expertise to the machine learning and AI community in Switzerland, with the main user being the [Swiss AI Initiative](https://www.swiss-ai.org/). +
+- :fontawesome-solid-mountain: [__Tutorials__][ref-tutorials-ml] + + Tutorials on how to set up and configure a machine learning environment in order to run LLM workloads such as inference, fine-tuning and multi-node training can be found in the [tutorials section][ref-tutorials-ml]. + + Also check out the [PyTorch documentation][ref-software-ml-pytorch] for information about how to run PyTorch. + +
+ ## Getting started ### Getting access @@ -89,6 +98,3 @@ Project is per project - each project gets a project folder with project-specifi * hard limits on capacity and inodes prevent users from writing to project if the quota is reached - you can check quota and available space by running the [`quota`][ref-storage-quota] command on a login node or ela * it is not recommended to write directly to the project path from jobs. -## Guides and tutorials - -Tutorials for fine-tuning and running inference of LLMs as well as training an LLM with Nanotron can be found in the [MLP Tutorials][ref-guides-mlp-tutorials] page. diff --git a/docs/software/communication/nccl.md b/docs/software/communication/nccl.md index 6a0068ad..5a16c39c 100644 --- a/docs/software/communication/nccl.md +++ b/docs/software/communication/nccl.md @@ -14,7 +14,7 @@ When using e.g. the `default` view of `prgenv-gnu` the `aws-ofi-nccl` plugin wil Alternatively, loading the `aws-ofi-nccl` module with the `modules` view also makes the plugin available in the environment. The environment variables described below must be set to ensure that NCCL uses the plugin. -While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL: +While the container engine sets these automatically when using the NCCL hook, the following environment variables should always be set for correctness and optimal performance when using NCCL with uenv: ```bash --8<-- "docs/software/communication/nccl_env_vars" diff --git a/docs/software/ml/index.md b/docs/software/ml/index.md index ba101c0b..c20a77d4 100644 --- a/docs/software/ml/index.md +++ b/docs/software/ml/index.md @@ -2,22 +2,33 @@ # Machine learning applications and frameworks CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems. -Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across environments. +Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across systems. Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs. -## Running machine learning applications with containers +First time users are recommended to consult the [LLM tutorials][ref-tutorials-ml] to get familiar with the concepts of the Machine Learning platform in a series of hands-on examples. + +## Running ML applications with containers (recommended) Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems. -* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads. +Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads. Examples include: - * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) - * [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) -* For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container. 
+ +* [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) ([Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html)) +* [JAX NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/jax) ([Release Notes](https://docs.nvidia.com/deeplearning/frameworks/jax-release-notes/index.html)) +* [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) (deprecated since 25.02, see [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/index.html)) + +Documented best practices are available for: + +* [PyTorch][ref-ce-pytorch] + +!!! note "Extending a container with a virtual environment" + For frequently changing Python dependencies during development, consider creating a Virtual Environment (venv) on top of the packages in the container (see [this example][ref-ce-pytorch-venv]). Helpful references: +* Introduction to concepts of the Machine Learning platform: [LLM tutorials][ref-tutorials-ml] * Running containers on Alps: [Container Engine Guide][ref-container-engine] * Building custom container images: [Container Build Guide][ref-build-containers] @@ -30,17 +41,18 @@ Available ML-related uenvs: * [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden] and [Daint][ref-cluster-daint] -To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv). -See this [PyTorch venv example][ref-uenv-pytorch-venv] for details. - -!!! note - While many Python packages provide pre-built binaries for common architectures, some may require building from source. +!!! note "Extending a uenv with a virtual environment" + To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv) layered on top of the packages in the uenv. + See this [PyTorch venv example][ref-uenv-pytorch-venv] for details. ## Building custom Python environments Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`. Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/). +!!! note + While many Python packages provide pre-built binaries for common architectures, some may require building from source. + To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes: * CUDA, cuDNN diff --git a/docs/software/ml/pytorch.md b/docs/software/ml/pytorch.md index 84aa68db..66fd6a77 100644 --- a/docs/software/ml/pytorch.md +++ b/docs/software/ml/pytorch.md @@ -1,15 +1,251 @@ -[](){#ref-uenv-pytorch} +[](){#ref-software-ml-pytorch} # PyTorch -The PyTorch software stack was designed with the intention of being able to run [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training workloads out of the box. +PyTorch is available both as a container with the [Container Engine (CE)][ref-container-engine] and a [uenv][ref-uenv] software stack. The best choice for your use case depends on the amount of control required over the lower level libraries. + +While NGC provides an optimized build of PyTorch with many dependencies included, uenv allows a more flexible choice of lower level libraries and represents a thinner layer over the host system. 
Both options can be customized - a container via a Dockerfile and a uenv (in advanced use cases) via its recipe and both, additionally, via Python virtual environments built on top. Due to the simplicity and reproducible performance, containers are generally the recommended default for most users. + +[](){#ref-ce-pytorch} +## Running PyTorch with the Container Engine (recommended) + +Running PyTorch from a container ensures maximum portability, reproducibility, and ease of use across machines. This is achieved by + +1. selecting an appropriate base image and customizing it in a Dockerfile +2. defining the container runtime environment in an EDF +3. (optionally) extending with a virtual environment +4. submitting jobs with CE in SLURM + +!!! example + These steps are illustrated in the [machine learning tutorials][ref-tutorials-ml] and the instructions detailed in the [podman build guide][ref-build-containers]. + +!!! info "Preliminary steps" + Before proceeding with the next steps, make sure you have storage for podman configured as in the [build guide][ref-build-containers-configure-podman] and make sure to apply [recommended Lustre settings][ref-guides-storage-lustre] to every directory (e.g. `$SCRATCH/ce-images`) dedicated to container images before importing them with enroot. This is necessary to guarantee good filesystem performance. + + ```bash + lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images # (1)! + ``` + + 1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB) + + +### Select the base image + +For most applications, the [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) is a good base image as PyTorch comes pre-installed with an optimized build including many dependencies. The [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) give an overview of installed packages and compatibility. This image can be further customized in a Dockerfile and built with podman as detailed in the [podman build guide][ref-build-containers]. + +### Define Container Runtime Environment + +Having built and imported a container image with podman and enroot, the next step is to configure the runtime environment with an environment definition file (EDF). In particular, this includes specifying the image, any directories mounted from the host and a working directory for the process in the container to start in as in the [quickstart examples for CE][ref-container-engine]. + +Apart from this, there are specific features relevant for machine learning made available through [annotations][ref-ce-annotations], which customize the container at runtime. + +* When using NCCL inside the container, include the [aws-ofi-nccl][ref-ce-aws-ofi-hook] plugin which enables the container to interface with the host's libfabric and, thus, use the Slingshot high-speed interconnect. This is crucial for multi-node communication performance. +* An [SSH annotation][ref-ce-ssh-hook] allows adding a light-weight SSH server to the container without the need to modify the container image + +A resulting example TOML file following best practices may look like + +```toml title="$HOME/my-app/ngc-pytorch-my-app-25.06.toml" +image = "${SCRATCH}/ce-images/ngc-pytorch-my-app+25.06.sqsh" # (1)! + +mounts = [ + "/capstor", + "/iopsstor", + "/users/${USER}/my-app" +] # (2)! 
+ +workdir = "${HOME}/my-app" # (3)! + +[annotations] +com.hooks.aws_ofi_nccl.enabled = "true" # (4)! +com.hooks.aws_ofi_nccl.variant = "cuda12" + +[env] +NCCL_DEBUG = "INFO" # (5)! +CUDA_CACHE_DISABLE = "1" # (6)! +TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (7)! +MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! +``` + +1. It is important to use curly braces for environment variables used in the EDF +2. The path `/users` is not mounted as a whole since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed. +3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started +4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook]. While not strictly needed for single node workloads, it is good practice to keep it always on. +5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario. Subsystems with debug log can be configured with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys). +6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. +7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error +8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL + +??? note "Access to SLURM from inside the container" + In case access to SLURM is required from inside the container, you can add the following lines to the mounts above: + + ```toml + ... + + mounts = [ + "/capstor", + "/iopsstor", + "/users/${USER}/my-app", + "/etc/slurm", # (1)! + "/usr/lib64/libslurm-uenv-mount.so", + "/etc/container_engine_pyxis.conf" + ] + + ... + ``` + + 1. Enable Slurm commands (together with two subsequent mounts) + +!!! note "Best practice for production jobs" + + For stability and reproducibility, use self-contained containers for production jobs. Using code mounted from the distributed filesystem may leave compiled artefacts behind that can result in unintentional runtime errors when e.g. swapping the container image. In particular, it is recommended to avoid mounting all of `$HOME`, so that environments are properly isolated and e.g. the Triton cache (that by default ends up in `$HOME/.triton`) resides in an ephemeral location of the filesystem. + +!!! note "Collaborating in Git" + + For reproducibility, it is recommended to always track the Dockerfile, EDF and an optional virtual environment specification alongside your application code in a Git repository. + +[](){#ref-ce-pytorch-venv} +### (Optionally) extend container with virtual environment + +While production jobs should include as many dependencies as possible in the container image, during development it can be convenient to manage frequently changing packages in a virtual environment built on top of the container image. This can include both dependencies and actively developed packages (that can be installed in editable mode with `pip install -e .`). 
+ +To create such a virtual environment, _inside the container_ use the Python `venv` module with the option `--system-site-packages` to ensure that packages are installed _in addition_ to the existing packages. Without this option, packages may accidentally be re-installed, shadowing a version that is already present in the container. +A workflow installing additional packages in a virtual environment may look like this: + +```console +[clariden-lnXXX]$ srun -A \ + --environment=./ngc-pytorch-my-app-25.06.toml --pty bash # (1)! +user@nidYYYYYY$ python -m venv --system-site-packages venv-ngc-pt-25.06 # (2)! +user@nidYYYYYY$ source venv-ngc-pt-25.06/bin/activate # (3)! +(venv-ngc-pt-25.06) user@nidYYYYYY$ pip install # (4)! +(venv-ngc-pt-25.06) user@nidYYYYYY$ exit +``` + +1. Allocate an interactive session on a compute node +2. Create a virtual environment on top of the existing Python installation in the container (only necessary the first time) +3. Activate the newly created virtual environment (always necessary when running a Slurm job) +4. Install additional packages (only run this from a single process to avoid race conditions) + +The changes made to the virtual environment will outlive the container as they are persisted on the distributed filesystem. + +!!! note + Keep in mind that: + + * this virtual environment is _specific_ to this particular container and won't actually work unless you are using it from inside this container - it relies on the resources packaged inside the container. + * every Slurm job making use of this virtual environment will need to activate it first (_inside_ the `srun` command). + + +### Submit jobs with the Container Engine in Slurm + +A general template for a PyTorch distributed training job with Slurm, in analogy to the [last tutorial][software-ml-llm-nanotron-tutorial], may look like this: + +```bash title="$HOME/my-app/submit-dist-train.sh" +#!/bin/bash +#SBATCH --account= +#SBATCH --job-name=dist-train-ddp +#SBATCH --time=01:00:00 +#SBATCH --nodes=2 +#SBATCH --ntasks-per-node=4 +#SBATCH --output=logs/slurm-%x-%j.log +# (1)! + +set -x + +ulimit -c 0 # (2)! + + # (3)! + # (4)! +srun -ul --environment=./ngc-pytorch-my-app-25.06.toml bash -c " + . venv-ngc-pt-25.06/bin/activate # activate (optional) venv + +--8<-- "docs/software/ml/torch_distributed_env_vars" + python dist-train.py +" +``` + +1. If `#SBATCH --error=...` is not specified, `#SBATCH --output` will also contain stderr (error messages) +2. In case the application crashes, it may leave behind large core dump files that contain an image of the process memory at the time of the crash. While these can be useful for debugging the reason of a specific crash (by e.g. loading them with `cuda-gdb` and looking at the stack trace with `bt`), they may accumulate over time and occupy a large space on the filesystem. For this reason, it is recommended to disable their creation (unless needed) by adding this line. +3. Activating the virtual environment is mandatory within every `srun` command if it is used to manage packages. +4. The environment variables are set to initialize PyTorch's distributed module through the environment (cf. [docs](https://docs.pytorch.org/docs/stable/distributed.html#environment-variable-initialization)). + + +For further details on execution logic, job monitoring and data management, please refer to the [nanotron tutorial][software-ml-llm-nanotron-tutorial] (which in particular also explains the usage of `torchrun` with Slurm).
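+For orientation, a `torchrun`-based launch along the lines of that tutorial might look like the following sketch. It is illustrative rather than a drop-in recipe: it assumes the job is submitted with `#SBATCH --ntasks-per-node=1` (instead of 4 above), since `torchrun` itself spawns the four per-GPU workers on each node, and it reuses the EDF, the optional virtual environment and the placeholder `dist-train.py` from the template; the rendezvous port 29500 is arbitrary.
+
+```bash
+# Sketch: requires one Slurm task per node (#SBATCH --ntasks-per-node=1);
+# torchrun spawns the per-GPU worker processes itself.
+srun -ul --environment=./ngc-pytorch-my-app-25.06.toml bash -c "
+    . venv-ngc-pt-25.06/bin/activate  # optional venv, as above
+    MASTER_ADDR=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1)
+    python -m torch.distributed.run \
+        --nnodes=\$SLURM_NNODES --nproc_per_node=4 \
+        --rdzv_id=\$SLURM_JOB_ID --rdzv_backend=c10d \
+        --rdzv_endpoint=\${MASTER_ADDR}:29500 \
+        dist-train.py
+"
+```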
Make sure to apply [recommended Lustre settings][ref-guides-storage-lustre] to datasets, models and container images persisted to the distributed filesystem. + +!!! warning "#SBATCH --environment" + The operations performed before the `srun` command are executed in the host environment of a single compute node in the allocation. If you need to perform these steps in the container environment as well, you can alternatively use the `#SBATCH --environment=path/to/ngc-pytorch-my-app-25.06.toml` option _instead of_ using `--environment` with `srun`. + + Use of the `--environment` option for `sbatch` is still considered experimental and could result in unexpected behavior. In particular, avoid mixing `#SBATCH --environment` and `srun --environment` in the same job. + + Use of `--environment` is currently only recommended for the `srun` command. + +!!! note "Optimizing large-scale training jobs" + The following settings were established to **improve compute throughput** of LLM training in [Megatron-LM](https://github.com/NVIDIA/Megatron-LM): + + * Extensively evaluate all possible parallelization dimensions, including data-, tensor- and pipeline parallelism (including virtual pipeline parallelism) and more, when available. Identify storage-related bottlenecks by isolating data loading/generation operations into a separate benchmark. + + * Disabling transparent huge pages and enabling the Nvidia [vboost](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#gpu-core-clock-optimization) feature have been observed to improve performance in large-scale LLM training in Megatron-LM. This can be achieved by adding these constraints to the sbatch script: + ```bash + #SBATCH -C thp_never&nvidia_vboost_enabled + ``` + + * The argument `--ddp-bucket-size` controls the level of grouping of many small data-parallel communications into bigger ones, and setting it to a high value can improve throughput (model-dependent, e.g. `10000000000`). + + * If in doubt about communication performance with NCCL at scale, use the [`NCCL_DEBUG`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug) environment variable to validate that the aws-ofi-nccl plugin has been properly initialized and libfabric was recognized (further subsystems can be monitored with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys)). If the issue persists, use [nccl-tests](https://github.com/NVIDIA/nccl-tests) with the relevant communication patterns to check if the scaling behavior can be reproduced and contact CSCS support. + + Additionally, consider the **best practices for checkpointing and data management**: + + * Following the advice on [filesystems][ref-storage-fs], write checkpoints (sequential write) to `/capstor/scratch` and place randomly accessed training data (many small random reads) on `/iopsstor/scratch`. Use the [data transfer instructions][ref-data-xfer] to move data to/from `/capstor/store`. Make sure to apply recommended [Lustre settings][ref-guides-storage-lustre] on all directories containing significant amounts of data, including those containing container images and those managed by other tools (e.g. the HuggingFace cache, see [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) in [this tutorial][software-ml-llm-inference-tutorial]). In case your workload continues to be limited by filesystem performance, contact CSCS support.
+ + * Regularly adjust checkpoint writing intervals to the current overhead induced by writing a checkpoint ($T_1$) and mean time between job failures ($T_2$). As a first order approximation use a checkpointing interval of $\sqrt{2 T_1 T_2}$ (derived by [Young](https://doi.org/10.1145/361147.361115) and [Daly](https://doi.org/10.1016/j.future.2004.11.016)). + + * Avoid activities that put excessive load on third party services (such as web scraping or bulk downloads) in line with the [guidelines on Internet Access on Alps][ref-guides-internet-access-ext]. + + Adjust for **cluster availability**: + + * Submit your jobs with a Slurm time limit compatible with reservations (such as maintenance windows, cf. `scontrol show res`) to be able to get scheduled. + +??? info "Debugging segmentation faults" + Application crashes with segmentation faults can be investigated by inspecting core dump files that contain an image of the process memory at the time of the crash. For this purpose, you can load the core dump file with `cuda-gdb` installed in the container and look at the stack trace with `bt`. Note that in order to generate core dump files the line `ulimit -c 0` must be commented out in the above sbatch script. + +### Known Issues + +The [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) of every NGC PyTorch container contain a selected list of known issues. + +??? info "Errors hidden by failures in UCX signal handler" + Application errors may trigger the UCX signal handler in the NGC container, which has caused secondary failures in the past, shadowing the initial error trace. These secondary failures may be significantly harder to fix than the initial problem. + + An example is the following trace from the NGC PyTorch 25.01 with Megatron-LM: + ```console + 640: [nid007306:244443:0:244443] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x455) + 640: ==== backtrace (tid: 244443) ==== + 640: 0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2cc) [0x4000d2b214dc] + 640: 1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3168c) [0x4000d2b2168c] + 640: 2 /opt/hpcx/ucx/lib/libucs.so.0(+0x319b8) [0x4000d2b219b8] + 640: 3 linux-vdso.so.1(__kernel_rt_sigreturn+0) [0x4000347707dc] + 640: 4 /usr/local/cuda/lib64/libnvrtc.so.12.8.61(+0x935000) [0x400140a25000] + 640: 5 [0x3d5c5e58] + 640: ================================= + srun: error: nid007306: task 640: Segmentation fault + srun: Terminating StepId=348680.1 + ``` + In this case, the segmentation fault in the UCX signal handler (`ucs_handle_error`) was due to a broken NVRTC in the container. However, to obtain the trace of the initial error (which was unrelated), it was necessary to disable the UCX signal handler by setting the following environment variable in the sbatch script: + ```bash + export UCX_HANDLE_ERRORS=none + ``` + +??? info "Avoid `--defer-embedding-wgrad-compute` in Megatron-LM" + In Megatron-LM, avoid using the option `--defer-embedding-wgrad-compute` to delay the embedding gradient computation as it can lead to an incorrect gradient norm that changes upon resuming at different scale. + +[](){#ref-uenv-pytorch} +## Running PyTorch with a uenv + +The PyTorch uenv software stack was designed with the intention of being able to run [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)-based pre-training workloads out of the box. Thus, it comes with batteries included and does not just provide the bare [PyTorch framework](https://github.com/pytorch/pytorch). !!! 
note "uenv" - [PyTorch][ref-uenv-pytorch] is provided via [uenv][ref-uenv]. + The [PyTorch uenv][ref-uenv-pytorch] is provided via the tool [uenv][ref-uenv]. Please have a look at the [uenv documentation][ref-uenv] for more information about uenvs and how to use them. -## Versioning +### Versioning The PyTorch uenv is versioned according to the PyTorch version it provides. @@ -241,7 +477,7 @@ The PyTorch uenv is versioned according to the PyTorch version it provides. [](){#ref-uenv-pytorch-how-to-use} -## How to use +### How to use There are two ways to access the software provided by the uenv, once it has been started. @@ -279,20 +515,20 @@ There are two ways to access the software provided by the uenv, once it has been [Check out the guide for using Spack with uenv][ref-building-uenv-spack]. [](){#ref-uenv-pytorch-venv} -## Adding Python packages on top of the uenv +### Adding Python packages on top of the uenv -Uenvs are read-only, and cannot be modified. However, it is possible to add Python packages on top of the uenv using virtual environments. +Uenvs are read-only, and cannot be modified. However, it is possible to add Python packages on top of the uenv using virtual environments analogous to the setup with containers. ```console title="Creating a virtual environment on top of the uenv" $ uenv start pytorch/v2.6.0:v1 --view=default # (1)! -$ python -m venv --system-site-packages ./my-venv # (2)! +$ python -m venv --system-site-packages venv-uenv-pt2.6-v1 # (2)! -$ source ./my-venv/bin/activate # (3)! +$ source venv-uenv-pt2.6-v1/bin/activate # (3)! -(my-venv) $ pip install # (4)! +(venv-uenv-pt2.6-v1) $ pip install # (4)! -(my-venv) $ deactivate # (5)! +(venv-uenv-pt2.6-v1) $ deactivate # (5)! $ exit # (6)! ``` @@ -312,33 +548,48 @@ $ exit # (6)! Python virtual environments can be slow on the parallel Lustre file system due to the amount of small files and potentially many processes accessing it. If this becomes a bottleneck, consider [squashing the venv][ref-guides-storage-venv] into its own memory-mapped, read-only file system to enhance scalability and reduce load times. +??? bug "Python packages from uenv shadowing those in a virtual environment" + When using uenv with a virtual environment on top, the site-packages under `/user-environment` currently take precedence over those in the activated virtual environment. This is due to the uenv paths being included in the `PYTHONPATH` environment variable. As a consequence, despite installing a different version of a package in the virtual environment from what is available in the uenv, the uenv version will still be imported at runtime. A possible workaround is to prepend the virtual environment's site-packages to `PYTHONPATH` whenever activating the virtual environment. + ```bash + export PYTHONPATH="$(python -c 'import site; print(site.getsitepackages()[0])'):$PYTHONPATH" + ``` + It is recommended to apply this workaround if you are constrained by a Python package version installed in the uenv that you need to change for your application. + +!!! note + Keep in mind that + + * this virtual environment is _specific_ to this particular uenv and won't actually work unless you are using it from inside this uenv - it relies on the resources packaged inside the uenv. + * every Slurm job making use of this virtual environment will need to activate it first (_inside_ the `srun` command). + Alternatively one can use the uenv as [upstream Spack instance][ref-building-uenv-spack] to to add both Python and non-Python packages. 
However, this workflow is more involved and intended for advanced Spack users. -## Running PyTorch jobs with Slurm +### Running PyTorch jobs with Slurm ```bash title="Slurm sbatch script" #!/bin/bash -#SBATCH --job-name=myjob -#SBATCH --nodes=1 +#SBATCH --account= +#SBATCH --job-name=dist-train-ddp +#SBATCH --time=01:00:00 +#SBATCH --nodes=2 #SBATCH --ntasks-per-node=4 -#SBATCH --cpus-per-task=72 -#SBATCH --time=00:30:00 +#SBATCH --output=logs/slurm-%x-%j.log # (1)! #SBATCH --uenv=pytorch/v2.6.0:/user-environment #SBATCH --view=default +set -x + +ulimit -c 0 # (2)! + ################################# # OpenMP environment variables # ################################# -export OMP_NUM_THREADS=8 # (2)! +export OMP_NUM_THREADS=8 # (3)! ################################# # PyTorch environment variables # ################################# -export MASTER_ADDR=$(hostname) # (3)! -export MASTER_PORT=29500 -export WORLD_SIZE=$SLURM_NPROCS export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 # (4)! export TRITON_HOME=/dev/shm/ # (5)! @@ -360,30 +611,30 @@ export CUDA_CACHE_DISABLE=1 # (7)! # (9)! # (10)! -srun bash -c " - export RANK=\$SLURM_PROCID - export LOCAL_RANK=\$SLURM_LOCALID - . ./my-venv/bin/activate - python myscript.py +srun -ul bash -c " + . ./venv-uenv-pt2.6-v1/bin/activate + +--8<-- "docs/software/ml/torch_distributed_env_vars" + python dist-train.py " ``` 1. The `--uenv` option is used to specify the uenv to use for the job. The `--view=default` option is used to load all the packages provided by the uenv. -2. Set `OMP_NUM_THREADS` if you are using OpenMP in your code. +2. In case the application crashes, it may leave behind large core dump files that contain an image of the process memory at the time of the crash. While these can be useful for debugging the reason of a specific crash (by e.g. loading them with `cuda-gdb` and looking at the stack trace with `bt`), they may accumulate over time and occupy a large space on the filesystem. For this reason, it is recommended to disable their creation (unless needed) by adding this line. +3. Set `OMP_NUM_THREADS` if you are using OpenMP in your code. The number of threads should be not greater than the number of cores per task (`$SLURM_CPUS_PER_TASK`). The optimal number depends on the workload and should be determined by testing. Consider for example that typical workloads using PyTorch may fork the processes, so the number of threads should be around the number of cores per task divided by the number of processes. -3. These variables are used by PyTorch to initialize the distributed backend. - The `MASTER_ADDR`, `MASTER_PORT` and `WORLD_SIZE` variables are used to determine the address and port of the master node. - Additionally we also need `RANK` and `LOCAL_RANK` but these must be set per-process, see below. 4. Enable more graceful exception handling, see [PyTorch documentation](https://pytorch.org/docs/stable/torch_nccl_environment_variables.html) 5. Set the Triton home to a local path (e.g. `/dev/shm`) to avoid writing to the (distributed) file system. - This is important for performance, as writing to the Lustre file system can be slow due to the amount of small files and potentially many processes accessing it. + This is important for performance, as writing to the Lustre file system can be slow due to the amount of small files and potentially many processes accessing it. Avoid this setting with the container engine as it may lead to errors related to mount settings of `/dev/shm` (use a filesystem path inside the container instead). 6. 
Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when using together with nccl. 7. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. -8. These variables should always be set for correctness and optimal performance when using NCCL, see [the detailed explanation][ref-communication-nccl]. -9. `RANK` and `LOCAL_RANK` are set per-process by the Slurm job launcher. -10. Activate the virtual environment created on top of the uenv (if any). +8. These variables should always be set for correctness and optimal performance when using NCCL with uenv, see [the detailed explanation][ref-communication-nccl]. +9. Activate the virtual environment created on top of the uenv (if any). Please follow the guidelines for [python virtual environments with uenv][ref-guides-storage-venv] to enhance scalability and reduce load times. +10. The environment variables are used by PyTorch to initialize the distributed backend. + The `MASTER_ADDR`, `MASTER_PORT` variables are used to determine the address and port of the master node. + Additionally we also need `RANK` and `LOCAL_RANK` and `WORLD_SIZE` to identify the position of each rank within the Slurm step and node, respectively. diff --git a/docs/software/ml/torch_distributed_env_vars b/docs/software/ml/torch_distributed_env_vars new file mode 100644 index 00000000..6d7692a0 --- /dev/null +++ b/docs/software/ml/torch_distributed_env_vars @@ -0,0 +1,5 @@ + MASTER_ADDR=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1) \ + MASTER_PORT=29500 \ + RANK=\${SLURM_PROCID} \ + LOCAL_RANK=\${SLURM_LOCALID} \ + WORLD_SIZE=\${SLURM_NTASKS} \ \ No newline at end of file diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md new file mode 100644 index 00000000..74225d99 --- /dev/null +++ b/docs/tutorials/index.md @@ -0,0 +1,4 @@ +[](){#ref-tutorials} +# Tutorials + +Currently there is one set of tutorials, for [machine learning workflows][ref-tutorials-ml]. diff --git a/docs/tutorials/ml/index.md b/docs/tutorials/ml/index.md new file mode 100644 index 00000000..5f60d5e5 --- /dev/null +++ b/docs/tutorials/ml/index.md @@ -0,0 +1,13 @@ +[](){#ref-tutorials-ml} +# Machine Learning Platform Tutorials + +The LLM tutorials gradually introduce key concepts of the Machine Learning Platform in a series of hands-on examples. A particular focus is on the [Container Engine][ref-container-engine] for managing the runtime environment. + +In the [first tutorial][software-ml-llm-inference-tutorial], you will learn how to run inference with a LLM on a single node using a container from the NVIDIA GPU Cloud (NGC). Concepts such as container environment description, layering a thin virtual environment on top of the container image, and job launching/monitoring will be introduced. + +Building on the first tutorial, in the [second tutorial][software-ml-llm-fine-tuning-tutorial] you will learn how to train (fine-tune) a LLM on multiple GPUs on a single node. For this purpose, you will use HuggingFace's `accelerate` and see best practices for dataset management. + +In the [third tutorial][software-ml-llm-nanotron-tutorial], you will apply the techniques from the previous tutorials to enable distributed (pre-)training of a model in `nanotron` on multiple nodes. In particular, this tutorial makes use of model-parallelism and introduces the usage of `torchrun` to manage jobs on individual nodes. + +!!! 
note + The focus for these tutorials is on introducing concepts of the Machine Learning Platform. As such, they do not necessarily discuss the latest advancements or steps required to obtain maximum performance. For this purpose, consult the framework-specific pages, such as the one for [PyTorch][ref-software-ml-pytorch]. diff --git a/docs/guides/mlp_tutorials/llm-fine-tuning.md b/docs/tutorials/ml/llm-fine-tuning.md similarity index 95% rename from docs/guides/mlp_tutorials/llm-fine-tuning.md rename to docs/tutorials/ml/llm-fine-tuning.md index 5d72cfd1..e27a5cbc 100644 --- a/docs/guides/mlp_tutorials/llm-fine-tuning.md +++ b/docs/tutorials/ml/llm-fine-tuning.md @@ -1,8 +1,8 @@ -[](){#ref-mlp-llm-fine-tuning-tutorial} +[](){#software-ml-llm-fine-tuning-tutorial} # LLM Fine-tuning Tutorial -This tutorial will take the model from the [LLM Inference][ref-mlp-llm-inference-tutorial] tutorial and show you how to perform fine-tuning. +This tutorial will take the model from the [LLM Inference][software-ml-llm-inference-tutorial] tutorial and show you how to perform fine-tuning. This means that we take the model and train it on some new custom data to change its behavior. To complete the tutorial, we set up some extra libraries that will help us to update the state of the machine learning model. @@ -12,7 +12,7 @@ We also write a script that will allow us to unlock more of the performance offe ### Prerequisites -This tutorial assumes you've already successfully completed the [LLM Inference][ref-mlp-llm-inference-tutorial] tutorial. +This tutorial assumes you've already successfully completed the [LLM Inference][software-ml-llm-inference-tutorial] tutorial. For fine-tuning Gemma, we will rely on the NGC PyTorch container and the libraries we've already installed in the Python virtual environment used previously. ### Set up TRL @@ -97,7 +97,7 @@ The first four lines of the launch line are used to configure `accelerate`. Everything after that configures the `trl/examples/scripts/sft.py` Python script, which we use to train Gemma. !!! note "Dataset management and sharing" - For datasets, recommended LUSTRE settings should be used as illustrated in the tutorial on [LLM Inference][ref-mlp-llm-inference-tutorial]. As they have been set there for `HF_HOME`, which `huggingface_hub` uses for its dataset cache, they don't need to be re-applied here. + For datasets, recommended LUSTRE settings should be used as illustrated in the tutorial on [LLM Inference][software-ml-llm-inference-tutorial]. As they have been set there for `HF_HOME`, which `huggingface_hub` uses for its dataset cache, they don't need to be re-applied here. To enable your colleagues to use also use your datasets, please refer to the [storage guide][ref-guides-storage-sharing]. diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/tutorials/ml/llm-inference.md similarity index 98% rename from docs/guides/mlp_tutorials/llm-inference.md rename to docs/tutorials/ml/llm-inference.md index 15af5ed0..af7a8fc8 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/tutorials/ml/llm-inference.md @@ -1,4 +1,4 @@ -[](){#ref-mlp-llm-inference-tutorial} +[](){#software-ml-llm-inference-tutorial} # LLM Inference Tutorial @@ -165,8 +165,8 @@ MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! 2. The path `/users` is not mounted since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. 
Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed. 3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started 4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] with libfabric. While not strictly needed for single node workloads, it is good practice to keep it always on. -5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`. -6. Disable CUDA JIT cache +5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys). +6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. 7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error 8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL diff --git a/docs/guides/mlp_tutorials/llm-nanotron-training.md b/docs/tutorials/ml/llm-nanotron-training.md similarity index 91% rename from docs/guides/mlp_tutorials/llm-nanotron-training.md rename to docs/tutorials/ml/llm-nanotron-training.md index 10194a20..267ecf42 100644 --- a/docs/guides/mlp_tutorials/llm-nanotron-training.md +++ b/docs/tutorials/ml/llm-nanotron-training.md @@ -1,17 +1,20 @@ -[](){#ref-mlp-llm-nanotron-tutorial} +[](){#software-ml-llm-nanotron-tutorial} # LLM Nanotron Pre-training Tutorial In this tutorial, we will build a container image to run multi-node training jobs with [nanotron](https://github.com/huggingface/nanotron). We will train a 109M parameter model with ~100M wikitext tokens as a proof of concept. +!!! info + While the concepts taught here for multi-node training with PyTorch are generally portable across training frameworks, the current (August 2025) recommendation for users with a need for large-scale model-parallel training is to use `Megatron-LM` instead of `nanotron` due to significant performance advantages at scale. + ### Prerequisites -It is recommended to follow the previous two tutorials on [LLM Inference][ref-mlp-llm-inference-tutorial] and [LLM Fine-tuning][ref-mlp-llm-fine-tuning-tutorial] first, as this will build upon them. +It is recommended to follow the previous two tutorials on [LLM Inference][software-ml-llm-inference-tutorial] and [LLM Fine-tuning][software-ml-llm-fine-tuning-tutorial] first, as this will build upon them. ### Set up Podman -If not already done as part of the [LLM Inference tutorial][ref-mlp-llm-inference-tutorial], edit your podman configuration in `$HOME/.config/containers/storage.conf` as follows: +If not already done as part of the [LLM Inference tutorial][software-ml-llm-inference-tutorial], edit your podman configuration in `$HOME/.config/containers/storage.conf` as follows: ```toml title="$HOME/.config/containers/storage.conf" [storage] @@ -23,6 +26,9 @@ If not already done as part of the [LLM Inference tutorial][ref-mlp-llm-inferenc mount_program = "/usr/bin/fuse-overlayfs-1.13" ``` +!!! 
warning + If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. + Create a directory to store container images used with CE and configure it with [recommended LUSTRE settings][ref-guides-storage-lustre]: ```console title="Container image directory with recommended LUSTRE settings" @@ -64,7 +70,7 @@ RUN pip install \ ``` !!! note "More recent NGC releases" - As discussed in the [LLM Inference tutorial][ref-mlp-llm-inference-tutorial], starting with the 24.11 release, NGC PyTorch no longer requires the installation of the Python venv module. Furthermore, FlashAttention and several other packages were integrated into the hosted image. However, as `nanotron` as of June 2025 still requires Python 3.10 (cf. this [issue](https://github.com/huggingface/nanotron/issues/217)), this example is restricted to NGC releases up to `24.10`. + As discussed in the [LLM Inference tutorial][software-ml-llm-inference-tutorial], starting with the 24.11 release, NGC PyTorch no longer requires the installation of the Python venv module. Furthermore, FlashAttention and several other packages were integrated into the hosted image. However, as `nanotron` as of June 2025 still requires Python 3.10 (cf. this [issue](https://github.com/huggingface/nanotron/issues/217)), this example is restricted to NGC releases up to `24.10`. ```dockerfile title="$SCRATCH/tutorials/nanotron-pretrain/Dockerfile" FROM nvcr.io/nvidia/pytorch:24.10-py3 @@ -144,8 +150,8 @@ MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! 2. The path `/users` is not mounted since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed. 3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started 4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] with libfabric. While not strictly needed for single node workloads, it is good practice to keep it always on. -5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`. -6. Disable CUDA JIT cache +5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with [`NCCL_DEBUG_SUBSYS`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug-subsys). +6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues. 7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error 8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL @@ -165,7 +171,7 @@ In the login node run: 1. This ensures the compatibility of nanotron with the following example. For general usage, there is no reason to stick to an outdated version of nanotron, though. -We will install nanotron in a thin virtual environment on top of the container image built above. 
This proceeds as in the [LLM Inference][ref-mlp-llm-inference-tutorial]. +We will install nanotron in a thin virtual environment on top of the container image built above. This proceeds as in the [LLM Inference][software-ml-llm-inference-tutorial]. ```console [clariden-lnXXX]$ srun -A --environment=./ngc-nanotron-24.04.toml --pty bash @@ -336,7 +342,7 @@ srun -ul --environment=./ngc-nanotron-24.04.toml bash -c " !!! note "A few comments" - The parts outside the srun command will be run on the first node of the Slurm allocation for this job. srun commands without further specifiers execute with the settings of the sbatch script (i.e. using all nodes allocated to the job). - - Note that we are setting `HF_HOME` to a directory in scratch. This is done to place the dataset downloaded from `huggingface_hub` in your scratch, instead of your home directory. The same applies to your HuggingFace token as well as any models/spaces unless `HF_HUB_CACHE` is set (cf. [HuggingFace docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome)). As discussed in the tutorial on [LLM Inference][ref-mlp-llm-inference-tutorial], it is good practice to apply the [recommended LUSTRE settings][ref-guides-storage-lustre] there. + - Note that we are setting `HF_HOME` to a directory in scratch. This is done to place the dataset downloaded from `huggingface_hub` in your scratch, instead of your home directory. The same applies to your HuggingFace token as well as any models/spaces unless `HF_HUB_CACHE` is set (cf. [HuggingFace docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome)). As discussed in the tutorial on [LLM Inference][software-ml-llm-inference-tutorial], it is good practice to apply the [recommended LUSTRE settings][ref-guides-storage-lustre] there. - If instead of downloading a dataset from HuggingFace you want to re-use one managed by a colleague, please refer to the [storage guide][ref-guides-storage-sharing] for instructions on dataset sharing. - If you have a [wandb API key](https://docs.wandb.ai/guides/track/environment-variables/) and want to synchronize the training run, be sure to set the `WANDB_API_KEY` variable. Alternatively, `wandb` can write log data to the distributed filesystem with `WANDB_MODE=of​f​line` so that it can be uploaded with `wandb sync` (cf. [Weights & Biases docs](https://docs.wandb.ai/support/run_wandb_offline/)) after the training run has finished. diff --git a/mkdocs.yml b/mkdocs.yml index bb8b560c..d7e37c5a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -128,12 +128,14 @@ nav: - 'Internet Access on Alps': guides/internet-access.md - 'Storage': guides/storage.md - 'Using the terminal': guides/terminal.md - - 'MLP Tutorials': - - guides/mlp_tutorials/index.md - - 'LLM Inference': guides/mlp_tutorials/llm-inference.md - - 'LLM Fine-tuning': guides/mlp_tutorials/llm-fine-tuning.md - - 'LLM Pre-training': guides/mlp_tutorials/llm-nanotron-training.md - 'Gordon Bell 2025': guides/gb25.md + - 'Tutorials': + - tutorials/index.md + - 'Machine Learning': + - tutorials/ml/index.md + - 'LLM Inference': tutorials/ml/llm-inference.md + - 'LLM Fine-tuning': tutorials/ml/llm-fine-tuning.md + - 'LLM Pre-training': tutorials/ml/llm-nanotron-training.md - 'Policies': - policies/index.md - 'User Regulations': policies/regulations.md