diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 9c5185b3..8c233781 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -23,7 +23,7 @@ When resources are granted the page redirects to the JupyterLab session, where y [](){#ref-jupyter-runtime-environment} ## Runtime environment -A Jupyter session can be started with either a [uenv][ref-uenv] or a [container][ref-container-engine] as a base image. The JupyterHub Spawner form provides a set of default images such as the [prgenv-gnu][ref-uenv-prgenv-gnu] uenv or the [NGC Pytorch container][ref-software-ml] to choose from in a dropdown menu. When using uenv, the software stack will be mounted at `/user-environment`, and the specified view will be activated. For a container, the Jupyter session will launch inside the container filesystem with only a select set of paths mounted from the host. Once you have found a suitable option, you can start the session with `Launch JupyterLab`. +A Jupyter session can be started with either a [uenv][ref-uenv] or a [container][ref-container-engine] as a base image. The JupyterHub Spawner form provides a set of default images such as the [prgenv-gnu][ref-uenv-prgenv-gnu] uenv or the [NGC PyTorch container][ref-software-ml] to choose from in a dropdown menu. When using uenv, the software stack will be mounted at `/user-environment`, and the specified view will be activated. For a container, the Jupyter session will launch inside the container filesystem with only a select set of paths mounted from the host. Once you have found a suitable option, you can start the session with `Launch JupyterLab`. ??? info "Using remote uenv for the first time." If the uenv is not present in the local repository, it will be automatically fetched. @@ -34,8 +34,8 @@ A Jupyter session can be started with either a [uenv][ref-uenv] or a [container] If the default base images do not meet your requirements, you can specify a custom environment instead. For this purpose, you supply either a custom uenv image/view or [container engine (CE)][ref-container-engine] TOML file under the section `Advanced options` before launching the session. The supported uenvs are compatible with the Jupyter service out of the box, whereas container images typically require the installation of some additional packages. -??? "Example of a custom Pytorch container" - A container image based on recent a NGC Pytorch release requires the installation of the following additional packages to be compatible with the Jupyter service: +??? "Example of a custom PyTorch container" + A container image based on a recent NGC PyTorch release requires the installation of the following additional packages to be compatible with the Jupyter service: ```Dockerfile FROM nvcr.io/nvidia/pytorch:25.05-py3 @@ -199,14 +199,14 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/ While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment. -A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. 
In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-finetuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell +A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell ```bash !python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ... ``` !!! warning "torchrun with virtual environments" - When using a virtual environment on top of a base image with Pytorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages not available. If not using virtual environments such as with a self-contained Pytorch container, `torchrun` is equivalent to `python -m torch.distributed.run`. + When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages will not be available. If not using virtual environments such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`. !!! note "Notebook structure" In none of these scenarios any significant memory allocations or background computations are performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively. @@ -216,19 +216,20 @@ Alternatively to using these launchers, it is also possible to use Slurm to obta ```bash !srun --overlap -ul --environment /path/to/edf.toml \ --container-workdir $PWD -n 4 bash -c "\ + . venv-/bin/activate MASTER_ADDR=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1) \ MASTER_PORT=29500 \ RANK=\$SLURM_PROCID LOCAL_RANK=\$SLURM_LOCALID WORLD_SIZE=\$SLURM_NPROCS \ python train.py ..." ``` -where `/path/to/edf.toml` should be replaced by the TOML file and `train.py` is a script using `torch.distributed` for distributed training. This can be further customized with extra Slurm options. +where `/path/to/edf.toml` should be replaced by the TOML file and `venv-` by the name of the virtual environment (if used). The script `train.py` uses `torch.distributed` for distributed training. This launch mechanism can be further customized with extra Slurm options. !!! warning "Concurrent usage of resources" Subtle bugs can occur when running multiple Jupyter notebooks concurrently that each assume access to the full node. Also, some notebooks may hold on to resources such as spawned child processes or allocated memory despite having completed. 
In this case, resources such as a GPU may still be busy, blocking another notebook from using it. Therefore, it is good practice to only keep one such notebook running that occupies the full node and restarting a kernel once a notebook has completed. If in doubt, system monitoring with `htop` and [nvdashboard](https://github.com/rapidsai/jupyterlab-nvdashboard) can be helpful for debugging. !!! warning "Multi-GPU training from a shared Jupyter process" - Running multi-GPU training workloads directly from the shared Jupyter process is generally not recommended due to potential inefficiencies and correctness issues (cf. the [Pytorch docs](https://docs.pytorch.org/docs/stable/notes/cuda.html#use-nn-parallel-distributeddataparallel-instead-of-multiprocessing-or-nn-dataparallel)). However, if you need it to e.g. reproduce existing results, it is possible to do so with utilities like `accelerate`'s `notebook_launcher` or [`transformers`](https://github.com/huggingface/transformers)' `Trainer` class. When using these in containers, you will currently need to unset the environment variables `RANK` and `LOCAL_RANK`, that is have the following in a cell at the top of the notebook: + Running multi-GPU training workloads directly from the shared Jupyter process is generally not recommended due to potential inefficiencies and correctness issues (cf. the [PyTorch docs](https://docs.pytorch.org/docs/stable/notes/cuda.html#use-nn-parallel-distributeddataparallel-instead-of-multiprocessing-or-nn-dataparallel)). However, if you need it, e.g. to reproduce existing results, it is possible to do so with utilities like `accelerate`'s `notebook_launcher` or [`transformers`](https://github.com/huggingface/transformers)' `Trainer` class. When using these in containers, you will currently need to unset the environment variables `RANK` and `LOCAL_RANK` by adding the following in a cell at the top of the notebook: ```python import os; os.environ.pop("RANK"); os.environ.pop("LOCAL_RANK"); diff --git a/docs/guides/mlp_tutorials/index.md b/docs/guides/mlp_tutorials/index.md index ef44ac41..da7cb242 100644 --- a/docs/guides/mlp_tutorials/index.md +++ b/docs/guides/mlp_tutorials/index.md @@ -1,11 +1,10 @@ [](){#ref-guides-mlp-tutorials} -# MLP Tutorials +# Machine Learning Platform Tutorials -These tutorials solve simple MLP tasks using the [Container Engine][ref-container-engine] on the ML Platform. - -1. [LLM Inference][ref-mlp-llm-inference-tutorial] -2. [LLM Fine-tuning][ref-mlp-llm-finetuning-tutorial] -3. [Nanotron Training][ref-mlp-llm-nanotron-tutorial] +These tutorials gradually introduce key concepts of the Machine Learning Platform. A particular focus is on the [Container Engine][ref-container-engine] for managing the runtime environment. +In a [first tutorial][ref-mlp-llm-inference-tutorial], you will learn how to run inference with an LLM on a single node using a container from the NVIDIA GPU Cloud (NGC). Concepts such as container environment description, layering a thin virtual environment on top of the container image, and job launching and monitoring will be introduced. +Building on the first tutorial, in the [second tutorial][ref-mlp-llm-fine-tuning-tutorial] you will learn how to train (fine-tune) an LLM on multiple GPUs on a single node. For this purpose, you will use HuggingFace's `accelerate` and see best practices for dataset management. 
+In the [third tutorial][ref-mlp-llm-nanotron-tutorial], you will apply the techniques from the previous tutorials to enable distributed (pre-)training of a model in `nanotron` on multiple nodes. In particular, this tutorial makes use of model-parallelism and introduces the usage of `torchrun` to manage jobs on individual nodes. diff --git a/docs/guides/mlp_tutorials/llm-finetuning.md b/docs/guides/mlp_tutorials/llm-fine-tuning.md similarity index 73% rename from docs/guides/mlp_tutorials/llm-finetuning.md rename to docs/guides/mlp_tutorials/llm-fine-tuning.md index 2afa885b..5d72cfd1 100644 --- a/docs/guides/mlp_tutorials/llm-finetuning.md +++ b/docs/guides/mlp_tutorials/llm-fine-tuning.md @@ -1,4 +1,4 @@ -[](){#ref-mlp-llm-finetuning-tutorial} +[](){#ref-mlp-llm-fine-tuning-tutorial} # LLM Fine-tuning Tutorial @@ -8,23 +8,25 @@ This means that we take the model and train it on some new custom data to change To complete the tutorial, we set up some extra libraries that will help us to update the state of the machine learning model. We also write a script that will allow us to unlock more of the performance offered by the cluster, by running our fine-tuning task on two or more nodes. +## Fine-tuning Gemma 7B on the OpenAssistant dataset + ### Prerequisites This tutorial assumes you've already successfully completed the [LLM Inference][ref-mlp-llm-inference-tutorial] tutorial. -For fine-tuning Gemma, we will rely on the NGC PyTorch container and the libraries we've already installed in the Python environment used previously. +For fine-tuning Gemma, we will rely on the NGC PyTorch container and the libraries we've already installed in the Python virtual environment used previously. ### Set up TRL -We will use HuggingFace TRL to fine-tune Gemma-7B on the [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25). +We will use HuggingFace TRL (Transformer Reinforcement Learning) to fine-tune Gemma-7B on the [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25). First, we need to update our Python environment with some extra libraries to support TRL. To do this, we can launch an interactive shell in the PyTorch container, just like we did in the previous tutorial. Then, we install `peft`: ```console -$ cd $SCRATCH/gemma-inference -$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash -$ source ./gemma-venv/bin/activate -$ python -m pip install peft==0.11.1 +[clariden-lnXXX]$ cd $SCRATCH/tutorials/gemma-7b +[clariden-lnXXX]$ srun --environment=./ngc-pytorch-gemma-24.01.toml --pty bash +user@nidYYYYYY$ source venv-gemma-24.01/bin/activate +(venv-gemma-24.01) user@nidYYYYYY$ pip install peft==0.11.1 ``` Next, we also need to clone and install the `trl` Git repository so that we have access to the fine-tuning scripts in it. @@ -32,21 +34,24 @@ For this purpose, we will install the package in editable mode in the virtual en This makes it available in python scripts independent of the current working directory and without creating a redundant copy of the files. ```console -$ git clone https://github.com/huggingface/trl -b v0.7.11 -$ pip install -e ./trl # install in editable mode +(venv-gemma-24.01) user@nidYYYYYY$ git clone \ + https://github.com/huggingface/trl -b v0.7.11 +(venv-gemma-24.01) user@nidYYYYYY$ pip install -e ./trl # (1)! ``` +1. Installs trl in editable mode + When this step is complete, you can exit the shell by typing `exit`. 
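+If you would like to double-check the installation before exiting, a quick sanity check inside the same shell might look like this (it should print the versions pinned above, i.e. `0.7.11` and `0.11.1`):
+
+```console
+(venv-gemma-24.01) user@nidYYYYYY$ python -c "import trl, peft; print(trl.__version__, peft.__version__)"
+```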
### Fine-tune Gemma-7B -t this point, we can set up a fine-tuning script and start training Gemma-7B. -Use your favorite text editor to create the file `fine-tune-gemma.sh` just outside the `trl` and `gemma-venv` directories: +At this point, we can set up a fine-tuning script and start training Gemma-7B. +Use your favorite text editor to create the file `fine-tune-gemma.sh` just outside the `trl` and `venv-gemma-24.01` directories: -```bash title="fine-tune-gemma.sh" +```bash title="$SCRATCH/tutorials/gemma-7b/fine-tune-gemma.sh" #!/bin/bash -source ./gemma-venv/bin/activate +source venv-gemma-24.01/bin/activate set -x @@ -73,38 +78,50 @@ accelerate launch --config_file trl/examples/accelerate_configs/multi_gpu.yaml \ --use_peft \ --lora_r 16 --lora_alpha 32 \ --lora_target_modules q_proj k_proj v_proj o_proj \ - --output_dir gemma-finetuned-openassistant + --output_dir gemma-fine-tuned-openassistant ``` This script has quite a bit more content to unpack. -We use HuggingFace accelerate to launch the fine-tuning process, so we need to make sure that accelerate understands which hardware is available and where. +We use HuggingFace `accelerate` to launch the fine-tuning process, so we need to make sure that `accelerate` understands which hardware is available and where. Setting this up will be useful in the long run because it means we can tell Slurm how much hardware to reserve, and this script will setup all the details for us. The cluster has four GH200 chips per compute node. -We can make them accessible to scripts run through srun/sbatch via the option `--gpus-per-node=4`. +We can make them accessible to scripts run through `srun`/`sbatch` via the option `--gpus-per-node=4`. Then, we calculate how many processes accelerate should launch. We want to map each GPU to a separate process, this should be four processes per node. We multiply this by the number of nodes to obtain the total number of processes. Next, we use some bash magic to extract the name of the head node from Slurm environment variables. -Accelerate expects one main node and launches tasks on the other nodes from this main node. +`accelerate` expects one main node and launches tasks on the other nodes from this main node. Having sourced our python environment at the top of the script, we can then launch Gemma fine-tuning. -The first four lines of the launch line are used to configure accelerate. +The first four lines of the launch line are used to configure `accelerate`. Everything after that configures the `trl/examples/scripts/sft.py` Python script, which we use to train Gemma. +!!! note "Dataset management and sharing" + For datasets, recommended LUSTRE settings should be used as illustrated in the tutorial on [LLM Inference][ref-mlp-llm-inference-tutorial]. As they have been set there for `HF_HOME`, which `huggingface_hub` uses for its dataset cache, they don't need to be re-applied here. + + To enable your colleagues to also use your datasets, please refer to the [storage guide][ref-guides-storage-sharing]. 
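+The "bash magic" for the head node mentioned above boils down to a single `scontrol` call; as a standalone sketch (illustrative only, `fine-tune-gemma.sh` above already contains the equivalent logic):
+
+```bash
+# Pick the first node of the Slurm allocation as the accelerate main node
+MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
+# One process per GPU, with four GH200 GPUs per node
+NUM_PROCESSES=$((4 * SLURM_NNODES))
+echo "main node: ${MASTER_ADDR}, total processes: ${NUM_PROCESSES}"
+```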
+ +Make this script executable with + +```console +[clariden-lnXXX]$ chmod u+x $SCRATCH/tutorials/gemma-7b/fine-tune-gemma.sh +``` + Next, we also need to create a short Slurm batch script to launch our fine-tuning script: -```bash title="fine-tune-sft.sbatch" +```bash title="$SCRATCH/tutorials/gemma-7b/submit-fine-tune-gemma.sh" #!/bin/bash -#SBATCH --job-name=gemma-finetune +#SBATCH --account= +#SBATCH --job-name=fine-tune-gemma #SBATCH --time=00:30:00 #SBATCH --ntasks-per-node=1 #SBATCH --gpus-per-node=4 #SBATCH --cpus-per-task=288 -#SBATCH --account= +#SBATCH --output logs/slurm-%x-%j.out set -x -srun -ul --environment=gemma-pytorch --container-workdir=$PWD bash fine-tune-gemma.sh +srun -ul --environment=./ngc-pytorch-gemma-24.01.toml fine-tune-gemma.sh ``` We set a few Slurm parameters like we already did in the previous tutorial. @@ -116,7 +133,7 @@ We'll start out by launching it on two nodes. It should take about 10-15 minutes to fine-tune Gemma: ```console -$ sbatch --nodes=1 fine-tune-sft.sbatch +[clariden-lnXXX]$ sbatch --nodes=1 submit-fine-tune-gemma.sh ``` ### Compare fine-tuned Gemma against default Gemma @@ -131,7 +148,7 @@ input_text = "What are the 5 tallest mountains in the Swiss Alps?" We can run inference using our batch script from the previous tutorial: ```console -$ sbatch ./gemma-inference.sbatch +[clariden-lnXXX]$ sbatch submit-gemma-inference.sh ``` Inspecting the output should yield something like this: @@ -152,7 +169,8 @@ the 5 tallest mountains in the Swiss Alps: Next, we can update the model line in our Python inference script to use the model that we just fine-tuned: ```python -model = AutoModelForCausalLM.from_pretrained("gemma-finetuned-openassistant/checkpoint-400", device_map="auto") +model = AutoModelForCausalLM.from_pretrained( + "gemma-fine-tuned-openassistant/checkpoint-400", device_map="auto") ``` If we re-run inference, the output will be a bit more detailed and explanatory, similar to output we might expect from a helpful chatbot. One example looks like this: diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/guides/mlp_tutorials/llm-inference.md index e24f99ee..15af5ed0 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/guides/mlp_tutorials/llm-inference.md @@ -5,10 +5,10 @@ This tutorial will guide you through the steps required to set up a PyTorch container and do ML inference. This means that we load an existing machine learning model, prompt it with some custom data, and run the model to see what output it will generate with our data. -To complete the tutorial, we get a PyTorch container from Nvidia, customize it to suit our needs, and tell the Container Engine how to run it. +To complete the tutorial, we get a PyTorch container from Nvidia's GPU Cloud (NGC), customize it to suit our needs, and tell the Container Engine how to run it. Finally, we set up and run a python script to run the machine learning model and generate some output. -The model we will be running is Google's [Gemma-7B](https://huggingface.co/google/gemma-7b#description), an LLM similar in style to the popular ChatGPT, which can generate text responses to text prompts that we feed into it. +The model we will be running is Google's [Gemma-7B](https://huggingface.co/google/gemma-7b-it#description) in the instruction-tuned variant. This is an LLM similar in style to popular chat assistants like ChatGPT, which can generate text responses to text prompts that we feed into it. 
## Gemma-7B Inference using NGC PyTorch @@ -16,43 +16,65 @@ The model we will be running is Google's [Gemma-7B](https://huggingface.co/googl This tutorial assumes you are able to access the cluster via SSH. To set up access to CSCS systems, follow the guide [here][ref-ssh], and read through the documentation about the [ML Platform][ref-platform-mlp]. -### Modify the NGC Container +For clarity, we prepend all shell commands with the hostname and any active Python virtual environment they are executed in. For example, `clariden-lnXXX` refers to a login node on Clariden, while `nidYYYYYY` is a compute node (with placeholders for numeric values). The commands listed here are run on Clariden, but can be adapted slightly to run on other vClusters as well. -In theory, we could now just go ahead and use the container to run some PyTorch code. +!!! note + Login nodes are a shared environment for editing files, preparing and submitting Slurm jobs as well as inspecting logs. They are not intended for running significant data processing or compute work. Any memory- or compute-intensive work should instead be done on compute nodes. + + If you need to move data [externally][ref-data-xfer-external] or [internally][ref-data-xfer-internal], please follow the corresponding guides using Globus or the `xfer` queue, respectively. + +### Build a modified NGC PyTorch Container + +In theory, we could just go ahead and use the vanilla container image to run some PyTorch code. However, chances are that we will need some additional libraries or software. -For this reason, we need to use some docker commands to build a container on top of what is provided by Nvidia. -To do this, we create a new directory for building containers in our home directory and set up a [Dockerfile](https://docs.docker.com/reference/dockerfile/): +For this reason, we need to build another image on top of the one provided by Nvidia. +To do this, we create a new directory for our container build recipes on `$SCRATCH` and set up a [Dockerfile](https://docs.docker.com/reference/dockerfile/): ```console -$ cd $SCRATCH -$ mkdir pytorch-24.01-py3-venv && cd pytorch-24.01-py3-venv +[clariden-lnXXX]$ cd $SCRATCH +[clariden-lnXXX]$ mkdir -p tutorials/gemma-7b +[clariden-lnXXX]$ cd tutorials/gemma-7b ``` Use your favorite text editor to create a file `Dockerfile` here. The Dockerfile should look like this: -```dockerfile title="Dockerfile" +```dockerfile title="$SCRATCH/tutorials/gemma-7b/Dockerfile" FROM nvcr.io/nvidia/pytorch:24.01-py3 ENV DEBIAN_FRONTEND=noninteractive -RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/* +RUN apt-get update && \ + apt-get install -y python3.10-venv && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* ``` The first line specifies that we are working on top of an existing container. -In this case we start `FROM` an NGC PyTorch container. -Next, we set an `ENV`ironment variable that helps us run `apt-get` in the container. +In this case we start `FROM` an [NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). +Next, we set an environment variable with `ENV` that helps us run `apt-get` in the container. Finally, we `RUN` the package installer `apt-get` to install python virtual environments. This will let us install python packages later on without having to rebuild the container again and again. There's a bunch of extra commands in this line to tidy things up. 
If you want to understand what is happening, take a look at the [Docker documentation](https://docs.docker.com/develop/develop-images/instructions/#apt-get). +!!! note "Recent changes in NGC releases" + Starting with the 24.11 release, NGC PyTorch no longer requires the installation of the Python venv module. That is, the Dockerfile simplifies to only the first line, e.g. for the `25.06` release + + ```dockerfile + FROM nvcr.io/nvidia/pytorch:25.06-py3 + ``` + + The remaining steps can then be performed equivalently, replacing the version number `24.01` by the one chosen in the Dockerfile (e.g. `25.06`). + + It is generally recommended to stick to one of the most recent versions of NGC, unless there is a strong reason from your application to stick to an old version for compatibility. + Now that we've setup the Dockerfile, we can go ahead and pass it to [Podman](https://podman.io/) to build a container. Podman is a tool that enables us to fetch, manipulate, and interact with containers on the cluster. For more information, please see the [Container Engine][ref-container-engine] page. To use Podman, we first need to configure some storage locations for it. -This step is straightforward, just make the file `$HOME/.config/containers/storage.conf` (or `$XDG_CONFIG_HOME/containers/storage.conf` if `XDG_CONFIG_HOME` is set): +This step is straightforward, just create the file in your home: -```toml +```toml title="$HOME/.config/containers/storage.conf" [storage] driver = "overlay" runroot = "/dev/shm/$USER/runroot" @@ -62,78 +84,120 @@ This step is straightforward, just make the file `$HOME/.config/containers/stora mount_program = "/usr/bin/fuse-overlayfs-1.13" ``` -To build a container with Podman, we need to request a shell on a compute node from [Slurm][ref-slurm], pass the Dockerfile to Podman, and finally import the freshly built container using enroot. +!!! warning + If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. + +Before building the container image, we create a dedicated directory to keep track of all images used with the CE. Since container images are large files and the filesystem is a shared resource, we need to apply [best practices for LUSTRE][ref-guides-storage-lustre] so they are properly distributed across storage nodes. + +```console title="Container image directory with recommended LUSTRE settings" +[clariden-lnXXX]$ mkdir -p $SCRATCH/ce-images +[clariden-lnXXX]$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M \ + $SCRATCH/ce-images # (1)! +``` + +1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB) + +To build a container with Podman, we need to request a shell on a compute node from [Slurm][ref-slurm], pass the Dockerfile to Podman, and finally import the freshly built container to the dedicated directory using enroot. Slurm is a workload manager which distributes workloads on the cluster. -Through Slurm, many people can use the supercomputer at the same time without interfering with one another in any way: +Through Slurm, many people can use the supercomputer at the same time without interfering with one another. + ```console -$ srun -A --pty bash -$ podman build -t pytorch:24.01-py3-venv . +[clariden-lnXXX]$ srun -A --pty bash +[nidYYYYYY]$ podman build -t ngc-pytorch:24.01 . # (1)! # ... lots of output here ... 
-$ enroot import -x mount -o pytorch-24.01-py3-venv.sqsh podman://pytorch:24.01-py3-venv +[nidYYYYYY]$ enroot import -x mount \ + -o $SCRATCH/ce-images/ngc-pytorch+24.01.sqsh \ + podman://ngc-pytorch:24.01 # (2)! # ... more output here ... ``` +1. This builds the container image with the current working directory as the build context. The `Dockerfile` inside that directory is implicitly used as a recipe. If it is named differently use the `-f path/to/Dockerfile` option. +2. The newly built container image is imported and stored under `$SCRATCH/ce-images`. + where you should replace `` with your project account ID. At this point, you can exit the Slurm allocation by typing `exit`. -You should be able to see a new squashfs file next to your Dockerfile: +You should be able to see a new Squashfs file in your container image directory: ```console -$ ls -Dockerfile pytorch-24.01-py3-ven.sqsh +[clariden-lnXXX]$ ls $SCRATCH/ce-images +ngc-pytorch+24.01.sqsh ``` This squashfs file is essentially a compressed container image, which can be run directly by the container engine. -We will use our freshly-built container `pytorch-24.01-py3-venv.sqsh` in the following steps to run a PyTorch script that loads the Google Gemma-7B model and performs some inference with it. +We will use our freshly-built container `ngc-pytorch+24.01.sqsh` in the following steps to run a PyTorch script that loads the Google Gemma-7B model and performs some inference with it. + +!!! note + In order to import a container image from a registry without building additional layers on top of it, we can directly use `enroot` (without `podman`). This is useful in this tutorial if we want to use a more recent NGC PyTorch container that was released since `24.11`. Use the following syntax for importing the `25.06` release: + + ```console + [nidYYYYYY]$ enroot import -x mount \ + -o $SCRATCH/ce-images/ngc-pytorch+25.06.sqsh docker://nvcr.io#nvidia/pytorch:25.06-py3 + ``` + ### Set up an EDF -We need to set up an EDF (Environment Definition File) which tells the Container Engine what container to load, where to mount it, and what plugins to load. Use your favorite text editor to create a file `~/.edf/gemma-pytorch.toml` for the container engine. The EDF should look like this: +We need to set up an EDF (Environment Definition File) which tells the Container Engine what container image to load, which paths to mount from the host filesystem, and what plugins to load. Use your favorite text editor to create a file `ngc-pytorch-gemma-24.01.toml` for the container engine. The EDF should look like this: -```toml -image = "/capstor/scratch/cscs//pytorch-24.01-py3-venv/pytorch-24.01-py3-venv.sqsh" +```toml title="$SCRATCH/tutorials/gemma-7b/ngc-pytorch-gemma-24.01.toml" +image = "${SCRATCH}/ce-images/ngc-pytorch+24.01.sqsh" # (1)! -mounts = ["/capstor", "/users"] +mounts = [ + "/capstor", + "/iopsstor" +] # (2)! -writable = true +workdir = "${SCRATCH}/tutorials/gemma-7b" # (3)! [annotations] -com.hooks.aws_ofi_nccl.enabled = "true" +com.hooks.aws_ofi_nccl.enabled = "true" # (4)! com.hooks.aws_ofi_nccl.variant = "cuda12" [env] -NCCL_DEBUG = "INFO" +NCCL_DEBUG = "INFO" # (5)! +CUDA_CACHE_DISABLE = "1" # (6)! +TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (7)! +MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! ``` -Make sure to replace `` with your actual CSCS username. +1. It is important to use curly braces for environment variables used in the EDF +2. 
The path `/users` is not mounted since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed. +3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started +4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] with libfabric. While not strictly needed for single node workloads, it is good practice to keep it always on. +5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`. +6. Disable CUDA JIT cache +7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error +8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL + If you've decided to build the container somewhere else, make sure to supply the correct path to the `image` variable. The `image` variable defines which container we want to load. This could either be a container from an online docker repository, like `nvcr.io/nvidia/pytorch:24.01-py3`, or in our case, a local squashfs file which we built ourselves. The `mounts` variable defines which directories we want to mount where in our container. -In general, it's a good idea to use the scratch directory to store outputs from any scientific software. -In our case, we will not generate a lot of output, but it's a good practice to stick to anyways. +In general, it's a good idea to use a directory under `/capstor/scratch` directory to store outputs from any scientific software as this filesystem is optimized for sequential write-operations as described in [Alps storage][ref-alps-storage]. This particularly applies to e.g. checkpoints from ML training, which we will see in the next tutorials (and there it matters also to apply good LUSTRE settings beforehand as for container images). In this tutorial, we will not generate a lot of output, but it's a good practice to stick to anyways. Finally, the `workdir` variable tells the container engine where to start working. If we request a shell, this is where we will find ourselves dropped initially after starting the container. -### Set up the Python Virtual Environment +### Set up a Python Virtual Environment This will be the first time we run our modified container. To run the container, we need allocate some compute resources using Slurm and launch a shell, just like we already did to build the container. This time, we also use the `--environment` option to specify that we want to launch the shell inside the container specified by our gemma-pytorch EDF file: ```console -$ cd $SCRATCH && mkdir -p gemma-inference && cd gemma-inference -$ srun -A --environment=gemma-pytorch --container-workdir=$PWD --pty bash +[clariden-lnXXX]$ cd $SCRATCH/tutorials/gemma-7b +[clariden-lnXXX]$ srun -A \ + --environment=./ngc-pytorch-gemma-24.01.toml --pty bash ``` PyTorch is already setup in the container for us. 
We can verify this by asking pip for a list of installed packages: ```console -$ python -m pip list | grep torch +user@nidYYYYYY$ python -m pip list | grep torch pytorch-quantization 2.1.2 torch 2.2.0a0+81ea7a4 torch-tensorrt 2.2.0a0 @@ -143,36 +207,49 @@ torchvision 0.17.0a0 ``` However, we will need to install a few more Python packages to make it easier to do inference with Gemma-7B. -We create a virtual environment using python-venv. -The `--system-site-packages` option ensures that we install packages in addition to the existing packages and don't accidentally install a new version of PyTorch over the one that has been put in place by Nvidia. +While it is best practice to install stable dependencies in the container image, we can maintain frequently changing packages in a virtual environment built on top of the container image. +The `--system-site-packages` option of the Python `venv` creation command ensures that we install packages _in addition_ to the existing packages and don't accidentally re-install a new version of PyTorch shadowing the one that has been put in place by Nvidia. Next, we activate the environment and use pip to install the two packages we need, `accelerate` and `transformers`: ```console -$ python -m venv --system-site-packages ./gemma-venv -$ source ./gemma-venv/bin/activate -(gemma-venv)$ python -m pip install accelerate==0.30.1 transformers==4.38.1 +user@nidYYYYYY$ python -m venv --system-site-packages venv-gemma-24.01 +user@nidYYYYYY$ source venv-gemma-24.01/bin/activate +(venv-gemma-24.01) user@nidYYYYYY$ pip install \ + accelerate==0.30.1 transformers==4.38.1 huggingface_hub[cli] # ... pip output ... ``` Before we move on to running the Gemma-7B model, we additionally need to make an account at [HuggingFace](https://huggingface.co), get an API token, and accept the [license agreement](https://huggingface.co/google/gemma-7b-it) for the [Gemma-7B](https://huggingface.co/google/gemma-7b) model. You can save the token to `$SCRATCH` using the huggingface-cli: ```console -$ pip install -U "huggingface_hub[cli]" -$ HF_HOME=$SCRATCH/huggingface huggingface-cli login +(venv-gemma-24.01) user@nidYYYYYY$ export HF_HOME=$SCRATCH/huggingface +(venv-gemma-24.01) user@nidYYYYYY$ huggingface-cli login ``` At this point, you can exit the Slurm allocation again by typing `exit`. -If you `ls` the contents of the `gemma-inference` folder, you will see that the `gemma-venv` virtual environment folder persists outside of the Slurm job. -Keep in mind that this virtual environment won't actually work unless you're running something from inside the PyTorch container. -This is because the virtual environment ultimately relies on the resources packaged inside the container. +If you `ls` the contents of the `gemma-7b` folder, you will see that the `venv-gemma-24.01` virtual environment folder persists outside of the Slurm job. + +!!! note + Keep in mind that + + * this virtual environment won't actually work unless you're running something from inside the PyTorch container. + This is because the virtual environment ultimately relies on the resources packaged inside the container. + * every Slurm job making use of this virtual environment will need to activate it first (_inside_ the `srun` command), as in the example below. 
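+For instance, a minimal check that the virtual environment is picked up correctly from within the container could look like this (run from `$SCRATCH/tutorials/gemma-7b`, with `<ACCOUNT>` standing for your project account):
+
+```console
+[clariden-lnXXX]$ srun -A <ACCOUNT> --environment=./ngc-pytorch-gemma-24.01.toml bash -c \
+    "source venv-gemma-24.01/bin/activate && python -c 'import accelerate, transformers; print(transformers.__version__)'"
+```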
+ +Since [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) will not only contain the API token, but also be the storage location for model, dataset and space caches of `huggingface_hub` (unless `HF_HUB_CACHE` is set), we also want to apply proper LUSTRE striping settings before it gets populated. + +```console +[clariden-lnXXX]$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M \ + $SCRATCH/huggingface +``` ### Run Inference on Gemma-7B Cool, now you have a working container with PyTorch and all the necessary Python packages installed! Let's move on to Gemma-7B. -We write a Python script `$SCRATCH/gemma-inference/gemma-inference.py` to load the model and prompt it with some custom text. +We write a Python script to load the model and prompt it with some custom text. The Python script should look like this: -```python title="$SCRATCH/gemma-inference/gemma-inference.py" +```python title="$SCRATCH/tutorials/gemma-7b/gemma-inference.py" from transformers import AutoTokenizer, AutoModelForCausalLM import torch @@ -191,64 +268,70 @@ Feel free to change the `input_text` variable to whatever prompt you like. All that remains is to run the python script inside the PyTorch container. There are several ways of doing this. As before, you could just use Slurm to get an interactive shell in the container. -Then you would source the virtual environment and run the python script we just wrote. +Then you would source the virtual environment and run the Python script we just wrote. There's nothing wrong with this approach per se, but consider that you might be running much more complex and lengthy Slurm jobs in the future. You'll want to document how you're calling Slurm, what commands you're running on the shell, and you might not want to (or might not be able to) keep a terminal open for the length of time the job might take. For this reason, it often makes sense to write a batch file, which enables you to document all these processes and run the Slurm job regardless of whether you're still connected to the cluster. -Create a Slurm batch file `gemma-inference.sbatch` anywhere you like, for example in your home directory. -The Slurm batch file should look like this: +Create a Slurm batch file `submit-gemma-inference.sh`. It should look like this: -```bash title="gemma-inference.sbatch" +```bash title="$SCRATCH/tutorials/gemma-7b/submit-gemma-inference.sh" #!/bin/bash +#SBATCH --account= #SBATCH --job-name=gemma-inference #SBATCH --time=00:15:00 #SBATCH --nodes=1 #SBATCH --ntasks-per-node=1 #SBATCH --cpus-per-task=288 -#SBATCH --environment=gemma-pytorch -#SBATCH --account= +#SBATCH --output logs/slurm-%x-%j.out export HF_HOME=$SCRATCH/huggingface export TRANSFORMERS_VERBOSITY=info -cd $SCRATCH/gemma-inference/ -source ./gemma-venv/bin/activate +cd $SCRATCH/tutorials/gemma-7b # (1)! set -x -python ./gemma-inference.py +srun -ul --environment=./ngc-pytorch-gemma-24.01.toml bash -c " + source venv-gemma-24.01/bin/activate + python gemma-inference.py +" ``` +1. Change directory if submitted with sbatch from a different directory + The first few lines of the batch script declare the shell we want to use to run this batch file and pass several options to the Slurm scheduler. -You can see that one of these options is one we used previously to load our EDF file. -After this, we `cd` to our working directory, `source` our virtual environment and finally run our inference script. 
+After this, we `cd` to our working directory and `srun` the command in our container environment that `source`s our virtual environment and finally runs our inference script. + +The operations performed before the `srun` command resemble largely the operations performed on the login node above and, in fact, happen in the host environment. If you need to perform these steps in the container environment as well, you can alternatively use the `#SBATCH --environment=path/to/ngc-pytorch-gemma-24.01.toml` option _instead of_ using `--environment` with `srun`. + +!!! warning "#SBATCH --environment" + Use of the `--environment` option for `sbatch` is still considered experimental and could result in unexpected behavior. In particular, avoid mixing `#SBATCH --environment` and `srun --environment` in the same job. -As an alternative to using the `#SBATCH --environment=gemma-pytorch` option you can also run the code in the above script wrapped into an `srun -A -ul --environment=gemma-pytorch bash -c "..."` statement. -The tutorial on nanotron e.g. uses this pattern in `run_tiny_llama.sh`. + Use of `--environment` is currently only recommended for the `srun` command. Once you've finished editing the batch file, you can save it and run it with Slurm: ```console -$ sbatch ./gemma-inference.sbatch +[clariden-lnXXX]$ sbatch submit-gemma-inference.sh ``` This command should just finish without any output and return you to your terminal. -At this point, you can follow the output in your shell using `tail -f slurm-.out`. +At this point, you can follow the output in your shell using `tail -f logs/slurm-gemma-inference-.out`. Besides you're free to do whatever you like; you can close the terminal, keep working, or just wait for the Slurm job to finish. You can always check on the state of your job by logging back into the cluster and running `squeue -l --me`. -Once your job finishes, you will find a file in the same directory you ran it from, named something like `slurm-.out`, and containing the output generated by your Slurm job. +Once your job finishes, you will find a file in the same directory you ran it from, named something like `logs/slurm-gemma-inference-.out`, and containing the output generated by your Slurm job. For this tutorial, you should see something like the following: ```console -$ cat ./slurm-543210.out -/capstor/scratch/cscs/user/gemma-inference/gemma-venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. +[clariden-lnXXX]$ cat logs/slurm-gemma-inference-543210.out +/capstor/scratch/cscs/user/gemma-inference/venv-gemma-24.01/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. warnings.warn( Gemma's activation function should be approximate GeLU and not exact GeLU. Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu` instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details. 
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.13it/s] -/capstor/scratch/cscs/user/gemma-inference/gemma-venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. +/capstor/scratch/cscs/user/gemma-inference/venv-gemma-24.01/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. warnings.warn( Write me a poem about the Swiss Alps. @@ -276,20 +359,34 @@ They inspire awe, forevermore. Congrats! You've run Google Gemma-7B inference on four GH200 chips simultaneously. Move on to the next tutorial or try the challenge. +!!! info "Collaborating with Git" + + In order to track and exchange your progress with colleagues, you can use standard `git` commands on the host, i.e. in the directory `$SCRATCH/tutorials/gemma-7b` run + ```console + [clariden-lnXXX]$ git init . + [clariden-lnXXX]$ git remote add origin \ + git@github.com:/alps-mlp-tutorials-gemma-7b.git # (1)! + [clariden-lnXXX]$ ... # git add/commit + ``` + + 1. Use any alternative Git hosting service instead of GitHub + + where you can replace `` by the owner of the GitHub repository you want to push to. + + Note that for reproducibility, it is recommended to always track the Dockerfile and EDF alongside your application code in a Git repository. + + ### Challenge Using the same approach as in the latter half of step 4, use pip to install the package `nvitop`. This is a tool that shows you a concise real-time summary of GPU activity. Then, run Gemma and launch `nvitop` at the same time: ```console -(gemma-venv)$ python ./gemma-inference.py > ./gemma-output.log 2>&1 & nvitop +(venv-gemma-24.01) user@nidYYYYYY$ python gemma-inference.py \ + > gemma-output.log 2>&1 & nvitop ``` -Note the use of bash `> ./gemma-output.log 2>&1` to hide any output from Python. -Note also the use of the single ampersand `'&'` which backgrounds the first command and runs `nvitop` on top. +Note the use of bash `> gemma-output.log 2>&1` to hide any output from Python. +Note also the use of the single ampersand `'&'` which backgrounds the first command in order to run `nvitop` exclusively in the foreground. After a moment, you will see your Python script spawn on all four GPUs, after which the GPU activity will increase a bit and then go back to idle. -At this point, you can hit `q` to quit `nvitop` and you will find the output of your Python script in `./gemma-output.log`. - -### Collaborating in Git - -In order to track and exchange your progress with colleagues, it is recommended to store the EDF, Dockerfile and your application code alongside in a Git repository in a directory on `$SCRATCH` and share it with colleagues. +At this point, you can hit `q` to quit `nvitop` and you will find the output of your Python script in `gemma-output.log`. 
diff --git a/docs/guides/mlp_tutorials/llm-nanotron-training.md b/docs/guides/mlp_tutorials/llm-nanotron-training.md index 7382657a..5072b912 100644 --- a/docs/guides/mlp_tutorials/llm-nanotron-training.md +++ b/docs/guides/mlp_tutorials/llm-nanotron-training.md @@ -1,17 +1,17 @@ [](){#ref-mlp-llm-nanotron-tutorial} -# LLM Nanotron Training Tutorial +# LLM Nanotron Pre-training Tutorial -In this tutorial, we will build a container image to run nanotron training jobs. +In this tutorial, we will build a container image to run multi-node training jobs with [nanotron](https://github.com/huggingface/nanotron). We will train a 109M parameter model with ~100M wikitext tokens as a proof of concept. -### Prerequisites +### Prerequisites -It is also recommended to follow the previous tutorials: [LLM Inference][ref-mlp-llm-inference-tutorial] and [LLM Fine-tuning][ref-mlp-llm-finetuning-tutorial], as this will build up from it. +It is recommended to follow the previous two tutorials on [LLM Inference][ref-mlp-llm-inference-tutorial] and [LLM Fine-tuning][ref-mlp-llm-fine-tuning-tutorial] first, as this will build upon them. ### Set up Podman -Edit your `$HOME/.config/containers/storage.conf` according to the following minimal template: +If not already done as part of the [LLM Inference tutorial][ref-mlp-llm-inference-tutorial], edit your podman configuration in `$HOME/.config/containers/storage.conf` as follows: ```toml title="$HOME/.config/containers/storage.conf" [storage] @@ -23,16 +23,30 @@ Edit your `$HOME/.config/containers/storage.conf` according to the following min mount_program = "/usr/bin/fuse-overlayfs-1.13" ``` -## Modify the NGC Container +Create a directory to store container images used with CE and configure it with [recommended LUSTRE settings][ref-guides-storage-lustre]: + +```console title="Container image directory with recommended LUSTRE settings" +[clariden-lnXXX]$ mkdir -p $SCRATCH/ce-images +[clariden-lnXXX]$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images # (1)! +``` + +1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB) + +## Build a modified NGC Container + +In this tutorial, we build a virtual environment on top of a customized NGC container image. This represents a typical task during development, where stable dependencies are captured in a static container image, whereas frequently changing packages are installed in a virtual environment on top. In contrast to the previous tutorials, the container in this case will be mostly self-contained. -See previous tutorial for context. Here, we assume we are already in a compute node (run `srun -A --pty bash` to get an interactive session). -In this case, we will be creating the dockerfile in `$SCRATCH/container-image/nanotron/Dockerfile`. -These are the contents of the dockerfile: +In this case, we create a Dockerfile with the following contents: -```dockerfile title="$SCRATCH/container-image/nanotron/Dockerfile" +```dockerfile title="$SCRATCH/tutorials/nanotron-pretrain/Dockerfile" FROM nvcr.io/nvidia/pytorch:24.04-py3 +RUN apt-get update && \ + apt-get install -y python3.10-venv && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* + # Update flash-attn. RUN pip install --upgrade --no-build-isolation flash-attn==2.5.8 @@ -49,49 +63,128 @@ RUN pip install \ tqdm ``` +!!! 
note "More recent NGC releases" + As discussed in the [LLM Inference tutorial][ref-mlp-llm-inference-tutorial], starting with the 24.11 release, NGC PyTorch no longer requires the installation of the Python venv module. Furthermore, FlashAttention and several other packages were integrated into the hosted image. However, as `nanotron` as of June 2025 still requires Python 3.10 (cf. this [issue](https://github.com/huggingface/nanotron/issues/217)), this example is restricted to NGC releases up to `24.10`. + + ```dockerfile title="$SCRATCH/tutorials/nanotron-pretrain/Dockerfile" + FROM nvcr.io/nvidia/pytorch:24.10-py3 + + RUN apt-get update && \ + apt-get install -y python3.10-venv && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* + + # Update flash-attn. + RUN pip install --upgrade --no-build-isolation flash-attn==2.5.8 + + # Install the rest of dependencies. + RUN pip install \ + datasets \ + transformers \ + wandb \ + dacite \ + pyyaml + ``` + + The remaining steps can then be performed equivalently, replacing the version number `24.04` by the one chosen in the Dockerfile (e.g. `24.10`). + + It is generally recommended to stick to one of the most recent versions of NGC, unless there is a strong reason from your application to stick to an old version for compatibility. + Then build and import the container. ```console -$ cd $SCRATCH/container-image/nanotron -$ podman build -t nanotron:v1.0 . -$ enroot import -x mount -o nanotron-v1.0.sqsh podman://nanotron:v1.0 +[nidYYYYYY]$ cd $SCRATCH/tutorials/nanotron-pretrain +[nidYYYYYY]$ podman build -f Dockerfile -t ngc-nanotron:24.04 . +[nidYYYYYY]$ enroot import -x mount \ + -o $SCRATCH/ce-images/ngc-nanotron+24.04.sqsh podman://ngc-nanotron:24.04 # (1)! ``` +1. We import container images into a canonical location under $SCRATCH. + + +!!! info "Debugging the container build" + If the container build fails, you can run an interactive shell using the image from the last successfully built layer with + + ```bash + podman run -it --rm -e NVIDIA_VISIBLE_DEVICES=void bash # (1)! + ``` + + 1. Setting `NVIDIA_VISIBLE_DEVICES` in the environment is required specifically to run NGC containers with podman + + replacing `` by the actual hash output in the build job and interactively test the failing command. + Now exit the interactive session by running `exit`. -### Set up an EDF +### Set up an environment description file (EDF) -See the previous tutorial for context. In this case, the EDF will be at `$HOME/.edf/nanotron.toml` and will have the following contents: +See the previous tutorial for context. In this case, the EDF will be co-located with the Dockerfile under `$SCRATCH/tutorials/nanotron-pretrain` and will have the following contents: + +```toml title="$SCRATCH/tutorials/nanotron-pretrain/ngc-nanotron-24.04.toml" +image = "${SCRATCH}/ce-images/ngc-nanotron+24.04.sqsh" # (1)! + +mounts = [ + "/capstor", + "/iopsstor" +] # (2)! + +workdir = "${SCRATCH}/tutorials/nanotron-pretrain/" # (3)! -```toml title="$HOME/.edf/nanotron.toml" -image = "/capstor/scratch/cscs//container-image/nanotron/nanotron-v1.0.sqsh" -mounts = ["/capstor", "/users"] -workdir = "/users//" -writable = true - [annotations] -com.hooks.aws_ofi_nccl.enabled = "true" +com.hooks.aws_ofi_nccl.enabled = "true" # (4)! com.hooks.aws_ofi_nccl.variant = "cuda12" - + [env] -NCCL_DEBUG = "INFO" +NCCL_DEBUG = "INFO" # (5)! +CUDA_CACHE_DISABLE = "1" # (6)! +TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (7)! +MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! 
``` -Note that, if you built your own container image, you will need to modify the image path. +1. It is important to use curly braces for environment variables used in the EDF +2. The path `/users` is not mounted since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed. +3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started +4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] with libfabric. While not strictly needed for single node workloads, it is good practice to keep it always on. +5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`. +6. Disable CUDA JIT cache +7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error +8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL + +Note that, if you built your container image elsewhere, you will need to modify the image path. -### Preparing a Training Job +## Installing nanotron in a virtual environment Now let's download nanotron. In the login node run: ```console -$ git clone https://github.com/huggingface/nanotron.git -$ cd nanotron +[clariden-lnXXX]$ cd $SCRATCH/tutorials/nanotron-pretrain +[clariden-lnXXX]$ git clone https://github.com/huggingface/nanotron.git +[clariden-lnXXX]$ cd nanotron +[clariden-lnXXX]$ git checkout 5f8a52b08b702e206f31f2660e4b6f22ac328c95 # (1)! ``` -And with your favorite text editor, create the following nanotron configuration file in `$HOME/nanotron/examples/config_tiny_llama_wikitext.yaml`: +1. This ensures the compatibility of nanotron with the following example. For general usage, there is no reason to stick to an outdated version of nanotron, though. + +We will install nanotron in a thin virtual environment on top of the container image built above. This proceeds as in the [LLM Inference][ref-mlp-llm-inference-tutorial]. -```yaml title="$HOME/nanotron/examples/config_tiny_llama_wikitext.yaml" +```console +[clariden-lnXXX]$ srun -A --environment=./ngc-nanotron-24.04.toml --pty bash +user@nidYYYYYY$ python -m venv --system-site-packages venv-24.04 +user@nidYYYYYY$ source venv-24.04/bin/activate +(venv-24.04) user@nidYYYYYY$ cd nanotron/ && pip install -e . +``` + +This creates a virtual environment on top of this container image (`--system-site-packages` ensuring access to system-installed site-packages) and installs nanotron in editable mode inside it. Because all dependencies of nanotron are already installed in the Dockerfile, no extra libraries will be installed at this point. + +!!! note + Jobs making use of this virtual environment will always need to activate it first (_inside_ the `srun` command). 
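+A quick way to confirm that the editable nanotron install is visible from inside the container is, for example (run from `$SCRATCH/tutorials/nanotron-pretrain`, with `<ACCOUNT>` standing for your project account):
+
+```console
+[clariden-lnXXX]$ srun -A <ACCOUNT> --environment=./ngc-nanotron-24.04.toml bash -c \
+    "source venv-24.04/bin/activate && python -c 'import nanotron; print(nanotron.__file__)'"
+```
+
+This should print a path inside the cloned `nanotron/` repository rather than a system location.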
+
+
+## Preparing a Training Job
+
+Now, with your favorite text editor, create the following nanotron configuration file:
+
+```yaml title="$SCRATCH/tutorials/nanotron-pretrain/nanotron/examples/config_tiny_llama_wikitext.yaml"
 general:
   benchmark_csv_path: null
   consumed_train_samples: null
@@ -192,73 +285,115 @@ logging:
   log_level_replica: info
 ```

-This configuration file will train, as a proof of concept, a gpt-2-like (109M parameters) llama model with approximately 100M tokens of wikitext with settings `tp=4, dp=2, pp=1` (which means that it requires two nodes to train).
-This training job will require approximately 10 minutes to run.
-Now, create a batch file in `$HOME/nanotron/run_tiny_llama.sh` with the contents:
+This configuration file will train, as a proof of concept, a GPT-2-like (109M parameters) Llama model with approximately 100M tokens of wikitext with 4-way tensor parallelism (`tp`), 2-way data parallelism (`dp`) and no pipeline parallelism (`pp`). As a consequence, two GH200 nodes are required to train the model.
+The training job will require approximately 10 minutes to run.
+
+Now, create a submission script in `$SCRATCH/tutorials/nanotron-pretrain/run_tiny_llama.sh` with the following content:

-```bash title="$HOME/nanotron/run_tiny_llama.sh"
+```bash title="$SCRATCH/tutorials/nanotron-pretrain/run_tiny_llama.sh"
 #!/bin/bash
-#SBATCH --job-name=nanotron # create a short name for your job
+#SBATCH --account=<account>
+#SBATCH --job-name=pretrain-nanotron # create a short name for your job
+#SBATCH --time=00:45:00
 #SBATCH --nodes=2 # total number of nodes
 #SBATCH --ntasks-per-node=1 # total number of tasks per node
 #SBATCH --gpus-per-task=4
-#SBATCH --time=1:00:00
-#SBATCH --account=<account>
-#SBATCH --output=logs/%x_%j.log # control where the stdout will be
-#SBATCH --error=logs/%x_%j.err # control where the error messages will be#
-
-mkdir -p logs
+#SBATCH --output=logs/slurm-%x-%j.log # if #SBATCH --error=... is not specified,
+                                      # this will also contain stderr (error messages)

 # Initialization.
 set -x
 cat $0
-export MASTER_PORT=25678
-export MASTER_ADDR=$(hostname)
-export HF_HOME=$SCRATCH/huggingface_home
-export CUDA_DEVICE_MAX_CONNECTIONS=1 # required by nanotron
-# export either WANDB_API_KEY=<api key> or WANDB_MODE=offline
+export HF_HOME=$SCRATCH/huggingface # (1)!
+export CUDA_DEVICE_MAX_CONNECTIONS=1 # (2)!
+
+export WANDB_API_KEY=<api key> # alternatively: export WANDB_MODE=offline

 # Run main script.
-srun -ul --environment=nanotron bash -c "
-    # Change cwd and run the main training script.
+srun -ul --environment=./ngc-nanotron-24.04.toml bash -c "
+    # activate virtual environment
+    source venv-24.04/bin/activate
+
+    # change cwd and run the training script
     cd nanotron/
-    pip install -e . # Only required the first time.
     TORCHRUN_ARGS=\"
+        --master-addr=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1) \
+        --master-port=29500 \
        --node-rank=\${SLURM_PROCID} \
-        --master-addr=\${MASTER_ADDR} \
-        --master-port=\${MASTER_PORT} \
        --nnodes=\${SLURM_NNODES} \
        --nproc-per-node=\${SLURM_GPUS_ON_NODE} \
    \"
-    torchrun \${TORCHRUN_ARGS} run_train.py --config-file examples/config_tiny_llama_wikitext.yaml
-"
+    python -m torch.distributed.run \${TORCHRUN_ARGS} \
+        run_train.py --config-file examples/config_tiny_llama_wikitext.yaml
+" # (3)!
 ```

-A few comments:
+1. Location for locally stored data from `huggingface_hub` (incl. the token, and the cache for models/datasets/spaces if `HF_HUB_CACHE` is not set) (cf. [HuggingFace docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome)).
+2. This setting is specifically required by nanotron. Note that this setting can lead to faulty Nsight Systems (`nsys`) profiles that do not show overlap of compute and communication when there actually is overlap (as observed e.g. in [this issue](https://github.com/NVIDIA/Megatron-LM/issues/1468)). The solution is to use a more recent version of `nsys`.
+3. Use `python -m torch.distributed.run` instead of `torchrun` with virtual environments.
+
+!!! note "A few comments"
+    - The parts outside the `srun` command will be run on the first node of the Slurm allocation for this job. `srun` commands without further options execute with the settings of the `sbatch` script (i.e. using all nodes allocated to the job).
+    - Note that we are setting `HF_HOME` to a directory in scratch. This is done to place the dataset downloaded from `huggingface_hub` in your scratch, instead of your home directory. The same applies to your HuggingFace token as well as any models/spaces unless `HF_HUB_CACHE` is set (cf. [HuggingFace docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome)). As discussed in the tutorial on [LLM Inference][ref-mlp-llm-inference-tutorial], it is good practice to apply the [recommended Lustre settings][ref-guides-storage-lustre] there.
+    - If instead of downloading a dataset from HuggingFace you want to re-use one managed by a colleague, please refer to the [storage guide][ref-guides-storage-sharing] for instructions on dataset sharing.
+    - If you have a [wandb API key](https://docs.wandb.ai/guides/track/environment-variables/) and want to synchronize the training run, be sure to set the `WANDB_API_KEY` variable. Alternatively, `wandb` can write log data to the distributed filesystem with `WANDB_MODE=offline` so that it can be uploaded with `wandb sync` (cf. [Weights & Biases docs](https://docs.wandb.ai/support/run_wandb_offline/)) after the training run has finished.

+!!! warning "`torchrun` with virtual environments"
+    When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages will not be available. If not using virtual environments, such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`.
+
+!!! note "Using srun instead of `torchrun`"
+    In many cases, workloads launched with `torchrun` can equivalently be launched purely with Slurm by setting some extra environment variables for `torch.distributed`. This simplifies the overall setup. That is, the `sbatch` script above can be rewritten as
+
+    ```bash title="$SCRATCH/tutorials/nanotron-pretrain/run_tiny_llama.sh"
+    #!/bin/bash
+    #SBATCH --account=<account>
+    #SBATCH --job-name=pretrain-nanotron # create a short name for your job
+    #SBATCH --time=00:45:00
+    #SBATCH --nodes=2 # total number of nodes
+    #SBATCH --ntasks-per-node=4 # total number of tasks per node
+    #SBATCH --output=logs/slurm-%x-%j.log # if #SBATCH --error=... is not specified,
+                                          # this will also contain stderr (error messages)
+
+    # Initialization.
+    set -x
+    cat $0
+    export HF_HOME=$SCRATCH/huggingface
+    export CUDA_DEVICE_MAX_CONNECTIONS=1
+
+    export WANDB_API_KEY=<api key> # alternatively: export WANDB_MODE=offline
+
+    # Run main script.
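+    # The variables below replace what torchrun would otherwise set for each worker:
+    # torch.distributed's env:// initialization reads MASTER_ADDR, MASTER_PORT, RANK and
+    # WORLD_SIZE, while LOCAL_RANK is left to the training script to select the local GPU
+    # (assuming run_train.py reads LOCAL_RANK, as torchrun-launched scripts typically do).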
+    srun -ul --environment=./ngc-nanotron-24.04.toml bash -c "
+        # activate virtual environment
+        source venv-24.04/bin/activate
+
+        # change cwd and run the training script
+        cd nanotron/
+
+        MASTER_ADDR=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1) \
+        MASTER_PORT=29500 \
+        RANK=\${SLURM_PROCID} \
+        LOCAL_RANK=\${SLURM_LOCALID} \
+        WORLD_SIZE=\${SLURM_NTASKS} \
+        python run_train.py --config-file examples/config_tiny_llama_wikitext.yaml
+    "
+    ```

-- The parts outside the srun command will be run on the first node of the Slurm allocation for this job. srun commands without further specifiers execute with the settings of the sbatch script (i.e. using all nodes allocated to the job).
-- If you have a [wandb](https://wandb.ai/) API key and want to synchronize the training run, be sure to set the `WANDB_API_KEY` variable. Otherwise, set `WANDB_MODE=offline` instead.
-- Note that we are setting `HF_HOME` in a directory in scratch. This is done to place the downloaded dataset in scratch, instead of your home directory.
-- The pip install command is only run once in every container (compute node).
-Note that this will only link the nanotron python package to be able to import it in any script irrespective of the current working directory.
-Because all dependencies of nanotron are already installed in the Dockerfile, no extra libraries will be installed at this point.
-If the installation of the package under development creates artefacts on the shared filesystem (such as binaries from compiled C++/CUDA source code), this results in a race condition when run from multiple nodes.
-Therefore, in this case and also when additional external libraries are to be installed, you should either use venv as shown in previous tutorials, or directly build everything in the Dockerfile.

-### Launch a Training Job with the new Image
+## Launch a Training Job

 Run:

 ```console
-$ sbatch run_tiny_llama.sh
+[clariden-lnXXX]$ sbatch run_tiny_llama.sh
 ```

 You can inspect if your job has been submitted successfully by running `squeue --me` and looking for your username.
 Once the run starts, there will be a new file under `logs/`.
 You can inspect the status of your run using:

 ```console
-$ tail -f logs/
+[clariden-lnXXX]$ tail -f logs/
 ```

 In the end, the checkpoints of the model will be saved in `checkpoints/`.
diff --git a/mkdocs.yml b/mkdocs.yml
index 8587af72..deb8c820 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -121,8 +121,8 @@ nav:
     - 'MLP Tutorials':
       - guides/mlp_tutorials/index.md
       - 'LLM Inference': guides/mlp_tutorials/llm-inference.md
-      - 'LLM Fine-tuning': guides/mlp_tutorials/llm-finetuning.md
-      - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md
+      - 'LLM Fine-tuning': guides/mlp_tutorials/llm-fine-tuning.md
+      - 'LLM Pre-training': guides/mlp_tutorials/llm-nanotron-training.md
   - 'Policies':
     - policies/index.md
     - 'User Regulations': policies/regulations.md