From cc50a6832067ba91c89bf6bf8d5c33cadd23ce8d Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Wed, 9 Jul 2025 11:55:10 +0200 Subject: [PATCH 01/18] Spelling corrections to MLP tutorials --- docs/access/jupyterlab.md | 2 +- docs/guides/mlp_tutorials/index.md | 2 +- .../mlp_tutorials/{llm-finetuning.md => llm-fine-tuning.md} | 6 +++--- docs/guides/mlp_tutorials/llm-nanotron-training.md | 4 ++-- docs/platforms/mlp/index.md | 2 +- mkdocs.yml | 2 +- 6 files changed, 9 insertions(+), 9 deletions(-) rename docs/guides/mlp_tutorials/{llm-finetuning.md => llm-fine-tuning.md} (99%) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 9c5185b3..55d982d2 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -199,7 +199,7 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/ While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment. -A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-finetuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell +A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell ```bash !python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ... diff --git a/docs/guides/mlp_tutorials/index.md b/docs/guides/mlp_tutorials/index.md index 2c1df914..d6db2fd9 100644 --- a/docs/guides/mlp_tutorials/index.md +++ b/docs/guides/mlp_tutorials/index.md @@ -4,7 +4,7 @@ These tutorials solve simple MLP tasks using the [Container Engine][ref-container-engine] on the ML Platform. 1. [LLM Inference][ref-mlp-llm-inference-tutorial] -2. [LLM Finetuning][ref-mlp-llm-finetuning-tutorial] +2. [LLM Fine-tuning][ref-mlp-llm-fine-tuning-tutorial] 3. 
[Nanotron Training][ref-mlp-llm-nanotron-tutorial] diff --git a/docs/guides/mlp_tutorials/llm-finetuning.md b/docs/guides/mlp_tutorials/llm-fine-tuning.md similarity index 99% rename from docs/guides/mlp_tutorials/llm-finetuning.md rename to docs/guides/mlp_tutorials/llm-fine-tuning.md index da58822b..ec731348 100644 --- a/docs/guides/mlp_tutorials/llm-finetuning.md +++ b/docs/guides/mlp_tutorials/llm-fine-tuning.md @@ -1,8 +1,8 @@ -[](){#ref-mlp-llm-finetuning-tutorial} +[](){#ref-mlp-llm-fine-tuning-tutorial} -# LLM Finetuning Tutorial +# LLM Fine-tuning Tutorial -This tutorial will take the model from the [LLM Inference][ref-mlp-llm-inference-tutorial] tutorial and show you how to perform finetuning. +This tutorial will take the model from the [LLM Inference][ref-mlp-llm-inference-tutorial] tutorial and show you how to perform fine-tuning. This means that we take the model and train it on some new custom data to change its behavior. To complete the tutorial, we set up some extra libraries that will help us to update the state of the machine learning model. diff --git a/docs/guides/mlp_tutorials/llm-nanotron-training.md b/docs/guides/mlp_tutorials/llm-nanotron-training.md index b45d8fc4..0121ea3f 100644 --- a/docs/guides/mlp_tutorials/llm-nanotron-training.md +++ b/docs/guides/mlp_tutorials/llm-nanotron-training.md @@ -5,9 +5,9 @@ In this tutorial, we will build a container image to run nanotron training jobs. We will train a 109M parameter model with ~100M wikitext tokens as a proof of concept. -### Prequisites +### Prerequisites -It is also recommended to follow the previous tutorials: [LLM Inference][ref-mlp-llm-inference-tutorial] and [LLM Finetuning][ref-mlp-llm-finetuning-tutorial], as this will build up from it. +It is also recommended to follow the previous tutorials: [LLM Inference][ref-mlp-llm-inference-tutorial] and [LLM Fine-tuning][ref-mlp-llm-fine-tuning-tutorial], as this will build up from it. ### Set up Podman diff --git a/docs/platforms/mlp/index.md b/docs/platforms/mlp/index.md index 3b2387ca..2eb07c3d 100644 --- a/docs/platforms/mlp/index.md +++ b/docs/platforms/mlp/index.md @@ -91,4 +91,4 @@ Project is per project - each project gets a project folder with project-specifi ## Guides and tutorials -Tutorials for finetuning and running inference of LLMs as well as training an LLM with Nanotron can be found in the [MLP Tutorials][ref-guides-mlp-tutorials] page. +Tutorials for fine-tuning and running inference of LLMs as well as training an LLM with Nanotron can be found in the [MLP Tutorials][ref-guides-mlp-tutorials] page. 
diff --git a/mkdocs.yml b/mkdocs.yml index 9ef866eb..795ed635 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -116,7 +116,7 @@ nav: - 'MLP Tutorials': - guides/mlp_tutorials/index.md - 'LLM Inference': guides/mlp_tutorials/llm-inference.md - - 'LLM Finetuning': guides/mlp_tutorials/llm-finetuning.md + - 'LLM Fine-tuning': guides/mlp_tutorials/llm-fine-tuning.md - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md - 'Policies': - policies/index.md From 01fe8e59dd23f2888dc577cf6aac49138286ed2d Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Tue, 22 Jul 2025 19:18:00 +0200 Subject: [PATCH 02/18] Updated MLP tutorials --- docs/access/jupyterlab.md | 13 +- docs/guides/mlp_tutorials/index.md | 11 +- docs/guides/mlp_tutorials/llm-fine-tuning.md | 72 +++-- docs/guides/mlp_tutorials/llm-inference.md | 251 +++++++++++------ .../mlp_tutorials/llm-nanotron-training.md | 264 +++++++++++++----- mkdocs.yml | 2 +- 6 files changed, 425 insertions(+), 188 deletions(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 55d982d2..8fcbd372 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -23,7 +23,7 @@ When resources are granted the page redirects to the JupyterLab session, where y [](){#ref-jupyter-runtime-environment} ## Runtime environment -A Jupyter session can be started with either a [uenv][ref-uenv] or a [container][ref-container-engine] as a base image. The JupyterHub Spawner form provides a set of default images such as the [prgenv-gnu][ref-uenv-prgenv-gnu] uenv or the [NGC Pytorch container][ref-software-ml] to choose from in a dropdown menu. When using uenv, the software stack will be mounted at `/user-environment`, and the specified view will be activated. For a container, the Jupyter session will launch inside the container filesystem with only a select set of paths mounted from the host. Once you have found a suitable option, you can start the session with `Launch JupyterLab`. +A Jupyter session can be started with either a [uenv][ref-uenv] or a [container][ref-container-engine] as a base image. The JupyterHub Spawner form provides a set of default images such as the [prgenv-gnu][ref-uenv-prgenv-gnu] uenv or the [NGC PyTorch container][ref-software-ml] to choose from in a dropdown menu. When using uenv, the software stack will be mounted at `/user-environment`, and the specified view will be activated. For a container, the Jupyter session will launch inside the container filesystem with only a select set of paths mounted from the host. Once you have found a suitable option, you can start the session with `Launch JupyterLab`. ??? info "Using remote uenv for the first time." If the uenv is not present in the local repository, it will be automatically fetched. @@ -34,8 +34,8 @@ A Jupyter session can be started with either a [uenv][ref-uenv] or a [container] If the default base images do not meet your requirements, you can specify a custom environment instead. For this purpose, you supply either a custom uenv image/view or [container engine (CE)][ref-container-engine] TOML file under the section `Advanced options` before launching the session. The supported uenvs are compatible with the Jupyter service out of the box, whereas container images typically require the installation of some additional packages. -??? "Example of a custom Pytorch container" - A container image based on recent a NGC Pytorch release requires the installation of the following additional packages to be compatible with the Jupyter service: +??? 
"Example of a custom PyTorch container" + A container image based on recent a NGC PyTorch release requires the installation of the following additional packages to be compatible with the Jupyter service: ```Dockerfile FROM nvcr.io/nvidia/pytorch:25.05-py3 @@ -206,7 +206,7 @@ A popular approach to run multi-GPU ML workloads is with [`accelerate`](https:// ``` !!! warning "torchrun with virtual environments" - When using a virtual environment on top of a base image with Pytorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages not available. If not using virtual environments such as with a self-contained Pytorch container, `torchrun` is equivalent to `python -m torch.distributed.run`. + When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages not available. If not using virtual environments such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`. !!! note "Notebook structure" In none of these scenarios any significant memory allocations or background computations are performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively. @@ -216,19 +216,20 @@ Alternatively to using these launchers, it is also possible to use Slurm to obta ```bash !srun --overlap -ul --environment /path/to/edf.toml \ --container-workdir $PWD -n 4 bash -c "\ + . venv-/bin/activate MASTER_ADDR=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1) \ MASTER_PORT=29500 \ RANK=\$SLURM_PROCID LOCAL_RANK=\$SLURM_LOCALID WORLD_SIZE=\$SLURM_NPROCS \ python train.py ..." ``` -where `/path/to/edf.toml` should be replaced by the TOML file and `train.py` is a script using `torch.distributed` for distributed training. This can be further customized with extra Slurm options. +where `/path/to/edf.toml` should be replaced by the TOML file and `venv-` by the name of the virtual environment (if used). The script `train.py` is using `torch.distributed` for distributed training. This launch mechanism can be further customized with extra Slurm options. !!! warning "Concurrent usage of resources" Subtle bugs can occur when running multiple Jupyter notebooks concurrently that each assume access to the full node. Also, some notebooks may hold on to resources such as spawned child processes or allocated memory despite having completed. In this case, resources such as a GPU may still be busy, blocking another notebook from using it. Therefore, it is good practice to only keep one such notebook running that occupies the full node and restarting a kernel once a notebook has completed. If in doubt, system monitoring with `htop` and [nvdashboard](https://github.com/rapidsai/jupyterlab-nvdashboard) can be helpful for debugging. !!! warning "Multi-GPU training from a shared Jupyter process" - Running multi-GPU training workloads directly from the shared Jupyter process is generally not recommended due to potential inefficiencies and correctness issues (cf. the [Pytorch docs](https://docs.pytorch.org/docs/stable/notes/cuda.html#use-nn-parallel-distributeddataparallel-instead-of-multiprocessing-or-nn-dataparallel)). However, if you need it to e.g. 
reproduce existing results, it is possible to do so with utilities like `accelerate`'s `notebook_launcher` or [`transformers`](https://github.com/huggingface/transformers)' `Trainer` class. When using these in containers, you will currently need to unset the environment variables `RANK` and `LOCAL_RANK`, that is have the following in a cell at the top of the notebook:
 
     ```python
     import os; os.environ.pop("RANK"); os.environ.pop("LOCAL_RANK");
diff --git a/docs/guides/mlp_tutorials/index.md b/docs/guides/mlp_tutorials/index.md
index d6db2fd9..846c73ff 100644
--- a/docs/guides/mlp_tutorials/index.md
+++ b/docs/guides/mlp_tutorials/index.md
@@ -1,11 +1,10 @@
 [](){#ref-guides-mlp-tutorials}
 
-# MLP Tutorials
+# Machine Learning Platform Tutorials
 
-These tutorials solve simple MLP tasks using the [Container Engine][ref-container-engine] on the ML Platform.
-
-1. [LLM Inference][ref-mlp-llm-inference-tutorial]
-2. [LLM Fine-tuning][ref-mlp-llm-fine-tuning-tutorial]
-3. [Nanotron Training][ref-mlp-llm-nanotron-tutorial]
+These tutorials gradually introduce key concepts of the Machine Learning Platform. A particular focus is on the [Container Engine][ref-container-engine] for managing the runtime environment.
+In the [first tutorial][ref-mlp-llm-inference-tutorial], you will learn how to run inference with an LLM on a single node using a container from the NVIDIA GPU Cloud (NGC). Concepts such as describing the container environment, layering a thin virtual environment on top of the container image, and launching and monitoring jobs will be introduced.
+Building on the first tutorial, in the [second tutorial][ref-mlp-llm-fine-tuning-tutorial] you will learn how to train (fine-tune) an LLM on multiple GPUs on a single node. For this purpose, you will use HuggingFace's `accelerate` and see best practices for dataset management.
+In the [third tutorial][ref-mlp-llm-nanotron-tutorial], you will apply the techniques from the previous tutorials to enable distributed (pre-)training of a model with `nanotron` on multiple nodes. In particular, this tutorial makes use of model parallelism and introduces the use of `torchrun` to manage jobs on individual nodes.
diff --git a/docs/guides/mlp_tutorials/llm-fine-tuning.md b/docs/guides/mlp_tutorials/llm-fine-tuning.md
index ec731348..faae839f 100644
--- a/docs/guides/mlp_tutorials/llm-fine-tuning.md
+++ b/docs/guides/mlp_tutorials/llm-fine-tuning.md
@@ -8,45 +8,49 @@ This means that we take the model and train it on some new custom data to change
 To complete the tutorial, we set up some extra libraries that will help us to update the state of the machine learning model.
 We also write a script that will allow us to unlock more of the performance offered by the cluster, by running our fine-tuning task on two or more nodes.
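+
+!!! note "Optional sanity check"
+    Before making any changes, it can be worth verifying that the container image and virtual environment from the previous tutorial still work. This is a minimal check assuming the file and directory names used there (`ngc-pytorch-gemma-24.01.toml` and `venv-gemma-24.01`); adjust the names if yours differ:
+
+    ```bash
+    $ cd $SCRATCH/tutorials/gemma-7b
+    $ srun --environment=./ngc-pytorch-gemma-24.01.toml --pty bash
+    $ source venv-gemma-24.01/bin/activate
+    $ python -c "import torch, transformers, accelerate; print(torch.__version__)"
+    $ exit
+    ```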
+## Fine-tuning Gemma 7B on the OpenAssistant dataset + ### Prerequisites This tutorial assumes you've already successfully completed the [LLM Inference][ref-mlp-llm-inference-tutorial] tutorial. -For fine-tuning Gemma, we will rely on the NGC PyTorch container and the libraries we've already installed in the Python environment used previously. +For fine-tuning Gemma, we will rely on the NGC PyTorch container and the libraries we've already installed in the Python virtual environment used previously. ### Set up TRL -We will use HuggingFace TRL to fine-tune Gemma-7B on the [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25). +We will use HuggingFace TRL (Transformer Reinforcement Learning) to fine-tune Gemma-7B on the [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25). First, we need to update our Python environment with some extra libraries to support TRL. To do this, we can launch an interactive shell in the PyTorch container, just like we did in the previous tutorial. Then, we install `peft`: -```console -$ cd $SCRATCH/gemma-inference -$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash -$ source ./gemma-venv/bin/activate -$ python -m pip install peft==0.11.1 +```bash +$ cd $SCRATCH/tutorials/gemma-7b +$ srun --environment=./ngc-pytorch-gemma-24.01.toml --pty bash +$ source venv-gemma-24.01/bin/activate +$ pip install peft==0.11.1 ``` Next, we also need to clone and install the `trl` Git repository so that we have access to the fine-tuning scripts in it. For this purpose, we will install the package in editable mode in the virtual environment. This makes it available in python scripts independent of the current working directory and without creating a redundant copy of the files. -```console +```bash $ git clone https://github.com/huggingface/trl -b v0.7.11 -$ pip install -e ./trl # install in editable mode +$ pip install -e ./trl # (1)! ``` +1. Installs trl in editable mode + When this step is complete, you can exit the shell by typing `exit`. ### Finetune Gemma-7B -t this point, we can set up a fine-tuning script and start training Gemma-7B. -Use your favorite text editor to create the file `fine-tune-gemma.sh` just outside the trl and gemma-venv directories: +At this point, we can set up a fine-tuning script and start training Gemma-7B. +Use your favorite text editor to create the file `fine-tune-gemma.sh` just outside the trl and `venv-gemma-24.01` directories: -```bash title="fine-tune-gemma.sh" +```bash title="$SCRATCH/tutorials/gemma-7b/fine-tune-gemma.sh" #!/bin/bash -source ./gemma-venv/bin/activate +source venv-gemma-24.01/bin/activate set -x @@ -73,38 +77,50 @@ accelerate launch --config_file trl/examples/accelerate_configs/multi_gpu.yaml \ --use_peft \ --lora_r 16 --lora_alpha 32 \ --lora_target_modules q_proj k_proj v_proj o_proj \ - --output_dir gemma-finetuned-openassistant + --output_dir gemma-fine-tuned-openassistant ``` This script has quite a bit more content to unpack. -We use HuggingFace accelerate to launch the fine-tuning process, so we need to make sure that accelerate understands which hardware is available and where. +We use HuggingFace `accelerate` to launch the fine-tuning process, so we need to make sure that `accelerate` understands which hardware is available and where. Setting this up will be useful in the long run because it means we can tell Slurm how much hardware to reserve, and this script will setup all the details for us. 
The cluster has four GH200 chips per compute node.
-We can make them accessible to scripts run through srun/sbatch via the option `--gpus-per-node=4`.
+We can make them accessible to scripts run through `srun`/`sbatch` via the option `--gpus-per-node=4`.
 Then, we calculate how many processes accelerate should launch.
 We want to map each GPU to a separate process, this should be four processes per node.
 We multiply this by the number of nodes to obtain the total number of processes.
 Next, we use some bash magic to extract the name of the head node from Slurm environment variables.
-Accelerate expects one main node and launches tasks on the other nodes from this main node.
+`accelerate` expects one main node and launches tasks on the other nodes from this main node.
 Having sourced our python environment at the top of the script, we can then launch Gemma fine-tuning.
-The first four lines of the launch line are used to configure accelerate.
+The first four lines of the launch line are used to configure `accelerate`.
 Everything after that configures the `trl/examples/scripts/sft.py` Python script, which we use to train Gemma.
 
+!!! note "Dataset management and sharing"
+    For datasets, recommended LUSTRE settings should be used as illustrated in the tutorial on [LLM Inference][ref-mlp-llm-inference-tutorial]. As they have been set there for `HF_HOME`, which `huggingface_hub` uses for its dataset cache, they don't need to be re-applied here.
+
+    To enable your colleagues to also use your datasets, please refer to the [storage guide][ref-guides-storage-sharing].
+
+Make this script executable with
+
+```bash
+$ chmod u+x $SCRATCH/tutorials/gemma-7b/fine-tune-gemma.sh
+```
+
 Next, we also need to create a short Slurm batch script to launch our fine-tuning script:
 
-```bash title="fine-tune-sft.sbatch"
+```bash title="$SCRATCH/tutorials/gemma-7b/submit-fine-tune-gemma.sh"
 #!/bin/bash
-#SBATCH --job-name=gemma-finetune
+#SBATCH --account=
+#SBATCH --job-name=fine-tune-gemma
 #SBATCH --time=00:30:00
 #SBATCH --ntasks-per-node=1
 #SBATCH --gpus-per-node=4
 #SBATCH --cpus-per-task=288
-#SBATCH --account=
+#SBATCH --output logs/slurm-%x-%j.out
 
 set -x
 
-srun -ul --environment=gemma-pytorch --container-workdir=$PWD bash fine-tune-gemma.sh
+srun -ul --environment=./ngc-pytorch-gemma-24.01.toml fine-tune-gemma.sh
 ```
 
 We set a few Slurm parameters like we already did in the previous tutorial.
@@ -115,11 +131,11 @@ Now that we've setup a fine-tuning script and a Slurm batch script, we can launc
 We'll start out by launching it on two nodes.
 It should take about 10-15 minutes to fine-tune Gemma:
 
-```console
-$ sbatch --nodes=1 fine-tune-sft.sbatch
+```bash
+$ sbatch --nodes=1 submit-fine-tune-gemma.sh
 ```
 
-### Compare finetuned Gemma against default Gemma
+### Compare fine-tuned Gemma against default Gemma
 
 We can reuse our python script from the first tutorial to do inference on the Gemma model that we just fine-tuned.
 Let's try out a different prompt in `gemma-inference.py`:
 
 ```python
 input_text = "What are the 5 tallest mountains in the Swiss Alps?"
We can run inference using our batch script from the previous tutorial: -```console -$ sbatch ./gemma-inference.sbatch +```bash +$ sbatch submit-gemma-inference.sh ``` Inspecting the output should yield something like this: @@ -152,7 +168,7 @@ the 5 tallest mountains in the Swiss Alps: Next, we can update the model line in our Python inference script to use the model that we just fine-tuned: ```python -model = AutoModelForCausalLM.from_pretrained("gemma-finetuned-openassistant/checkpoint-400", device_map="auto") +model = AutoModelForCausalLM.from_pretrained("gemma-fine-tuned-openassistant/checkpoint-400", device_map="auto") ``` If we re-run inference, the output will be a bit more detailed and explanatory, similar to output we might expect from a helpful chatbot. One example looks like this: diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/guides/mlp_tutorials/llm-inference.md index 550c1ce3..4c03d605 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/guides/mlp_tutorials/llm-inference.md @@ -5,10 +5,10 @@ This tutorial will guide you through the steps required to set up a PyTorch container and do ML inference. This means that we load an existing machine learning model, prompt it with some custom data, and run the model to see what output it will generate with our data. -To complete the tutorial, we get a PyTorch container from Nvidia, customize it to suit our needs, and tell the Container Engine how to run it. +To complete the tutorial, we get a PyTorch container from Nvidia's GPU Cloud (NGC), customize it to suit our needs, and tell the Container Engine how to run it. Finally, we set up and run a python script to run the machine learning model and generate some output. -The model we will be running is Google's [Gemma-7B](https://huggingface.co/google/gemma-7b#description), an LLM similar in style to the popular ChatGPT, which can generate text responses to text prompts that we feed into it. +The model we will be running is Google's [Gemma-7B](https://huggingface.co/google/gemma-7b-it#description) in the instruction-tuned variant. This is an LLM similar in style to popular chat assistants like ChatGPT, which can generate text responses to text prompts that we feed into it. ## Gemma-7B Inference using NGC PyTorch @@ -16,43 +16,58 @@ The model we will be running is Google's [Gemma-7B](https://huggingface.co/googl This tutorial assumes you are able to access the cluster via SSH. To set up access to CSCS systems, follow the guide [here][ref-ssh], and read through the documentation about the [ML Platform][ref-platform-mlp]. -### Modify the NGC Container +### Build a modified NGC PyTorch Container -In theory, we could now just go ahead and use the container to run some PyTorch code. +In theory, we could just go ahead and use the vanilla container image to run some PyTorch code. However, chances are that we will need some additional libraries or software. -For this reason, we need to use some docker commands to build a container on top of what is provided by Nvidia. -To do this, we create a new directory for building containers in our home directory and set up a [Dockerfile](https://docs.docker.com/reference/dockerfile/): +For this reason, we need to use some docker commands to build on top of what is provided by Nvidia. 
+To do this, we create a new directory for recipes to build containers in our home directory and set up a [Dockerfile](https://docs.docker.com/reference/dockerfile/): -```console +```bash $ cd $SCRATCH -$ mkdir pytorch-24.01-py3-venv && cd pytorch-24.01-py3-venv +$ mkdir -p tutorials/gemma-7b +$ cd tutorials/gemma-7b ``` Use your favorite text editor to create a file `Dockerfile` here. The Dockerfile should look like this: -```dockerfile title="Dockerfile" +```dockerfile title="$SCRATCH/tutorials/gemma-7b/Dockerfile" FROM nvcr.io/nvidia/pytorch:24.01-py3 ENV DEBIAN_FRONTEND=noninteractive -RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/* +RUN apt-get update && \ + apt-get install -y python3.10-venv && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* ``` The first line specifies that we are working on top of an existing container. -In this case we start `FROM` an NGC PyTorch container. -Next, we set an `ENV`ironment variable that helps us run `apt-get` in the container. +In this case we start `FROM` an [NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). +Next, we set an environment variable with `ENV` that helps us run `apt-get` in the container. Finally, we `RUN` the package installer `apt-get` to install python virtual environments. This will let us install python packages later on without having to rebuild the container again and again. There's a bunch of extra commands in this line to tidy things up. If you want to understand what is happening, take a look at the [Docker documentation](https://docs.docker.com/develop/develop-images/instructions/#apt-get). +!!! note "Recent changes in NGC releases" + Starting with the 24.11 release, NGC PyTorch no longer requires the installation of the Python venv module. That is, the Dockerfile simplifies to only the first line, e.g. for the `25.06` release + + ```dockerfile + FROM nvcr.io/nvidia/pytorch:25.06-py3 + ``` + + The remaining steps can then be performed equivalently, replacing the version number `24.01` by the one chosen in the Dockerfile (e.g. `25.06`). + + It is generally recommended to stick to one of the most recent versions of NGC, unless there is a strong reason from your application to stick to an old version for compatibility. + Now that we've setup the Dockerfile, we can go ahead and pass it to [Podman](https://podman.io/) to build a container. Podman is a tool that enables us to fetch, manipulate, and interact with containers on the cluster. For more information, please see the [Container Engine][ref-container-engine] page. To use Podman, we first need to configure some storage locations for it. -This step is straightforward, just make the file `$HOME/.config/containers/storage.conf` (or `$XDG_CONFIG_HOME/containers/storage.conf` if `XDG_CONFIG_HOME` is set): +This step is straightforward, just create the file in your home: -```toml +```toml title="$HOME/.config/containers/storage.conf" [storage] driver = "overlay" runroot = "/dev/shm/$USER/runroot" @@ -62,77 +77,117 @@ This step is straightforward, just make the file `$HOME/.config/containers/stora mount_program = "/usr/bin/fuse-overlayfs-1.13" ``` -To build a container with Podman, we need to request a shell on a compute node from [Slurm][ref-slurm], pass the Dockerfile to Podman, and finally import the freshly built container using enroot. +!!! warning + If `$XDG_CONFIG_HOME` is set, place this file at `$XDG_CONFIG_HOME/containers/storage.conf` instead. 
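+
+!!! note "Optional: check that Podman picks up the configuration"
+    To confirm that Podman uses these storage locations, you can inspect the storage section reported by `podman info` from within a compute-node allocation. This is a minimal check and assumes the `storage.conf` file above is in place:
+
+    ```bash
+    $ podman info | grep -i -E '(graph|run)root:'  # both paths should point under /dev/shm/$USER
+    ```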
+ +Before building the container image, we create a dedicated directory to keep track of all images used with the CE. Since container images are large files and the filesystem is a shared resource, we need to apply [best practices for LUSTRE][ref-guides-storage-lustre] so they are properly distributed across storage nodes. + +```bash title="Container image directory with recommended LUSTRE settings" +$ mkdir -p $SCRATCH/ce-images +$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images # (1)! +``` + +1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB) + +To build a container with Podman, we need to request a shell on a compute node from [Slurm][ref-slurm], pass the Dockerfile to Podman, and finally import the freshly built container to the dedicated directory using enroot. Slurm is a workload manager which distributes workloads on the cluster. -Through Slurm, many people can use the supercomputer at the same time without interfering with one another in any way: +Through Slurm, many people can use the supercomputer at the same time without interfering with one another. -```console + +```bash $ srun -A --pty bash -$ podman build -t pytorch:24.01-py3-venv . +$ podman build -t ngc-pytorch:24.01 . # (1)! # ... lots of output here ... -$ enroot import -x mount -o pytorch-24.01-py3-venv.sqsh podman://pytorch:24.01-py3-venv +$ enroot import -x mount \ +-o $SCRATCH/ce-images/ngc-pytorch+24.01.sqsh \ +podman://ngc-pytorch:24.01 # (2)! # ... more output here ... ``` +1. This builds the container image with the current working directory as the build context. The `Dockerfile` inside that directory is implicitly used as a recipe. If it is named differently use the `-f path/to/Dockerfile` option. +2. The newly built container image is imported and stored under $SCRATCH/ce-images. + where you should replace `` with your project account ID. At this point, you can exit the Slurm allocation by typing `exit`. -You should be able to see a new squashfile next to your Dockerfile: +You should be able to see a new squashfile in your container image directory: -```console -$ ls -Dockerfile pytorch-24.01-py3-ven.sqsh +```bash +$ ls $SCRATCH/ce-images +ngc-pytorch+24.01.sqsh ``` This squashfile is essentially a compressed container image, which can be run directly by the container engine. -We will use our freshly-built container `pytorch-24.01-py3-venv.sqsh` in the following steps to run a PyTorch script that loads the Google Gemma-7B model and performs some inference with it. +We will use our freshly-built container `ngc-pytorch+24.01.sqsh` in the following steps to run a PyTorch script that loads the Google Gemma-7B model and performs some inference with it. + +!!! note + In order to import a container image from a registry without building additional layers on top of it, we can directly use `enroot` (without `podman`). This is useful in this tutorial if we want to use a more recent NGC Pytorch container that was released since `24.11`. Use the following syntax for importing the `25.06` release: + + ```bash + enroot import -x mount \ + -o $SCRATCH/ce-images/ngc-pytorch+25.06.sqsh docker://nvcr.io#nvidia/pytorch:25.06-py3 + ``` + ### Set up an EDF -We need to set up an EDF (Environment Definition File) which tells the Container Engine what container to load, where to mount it, and what plugins to load. 
Use your favorite text editor to create a file `~/.edf/gemma-pytorch.toml` for the container engine. The EDF should look like this: +We need to set up an EDF (Environment Definition File) which tells the Container Engine what container image to load, which paths to mount from the host filesystem, and what plugins to load. Use your favorite text editor to create a file `ngc-pytorch-gemma-24.01.toml` for the container engine. The EDF should look like this: -```toml -image = "/capstor/scratch/cscs//pytorch-24.01-py3-venv/pytorch-24.01-py3-venv.sqsh" +```toml title="$SCRATCH/tutorials/gemma-7b/ngc-pytorch-gemma-24.01.toml" +image = "${SCRATCH}/ce-images/ngc-pytorch+24.01.sqsh" # (1)! -mounts = ["/capstor", "/users"] +mounts = [ + "/capstor", + "/iopsstor" +] # (2)! -writable = true +workdir = "${SCRATCH}/tutorials/gemma-7b" # (3)! [annotations] -com.hooks.aws_ofi_nccl.enabled = "true" +com.hooks.aws_ofi_nccl.enabled = "true" # (4)! com.hooks.aws_ofi_nccl.variant = "cuda12" [env] -NCCL_DEBUG = "INFO" +NCCL_DEBUG = "INFO" # (5)! +CUDA_CACHE_DISABLE = "1" # (6)! +TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (7)! +MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! ``` -Make sure to replace `` with your actual CSCS username. +1. It is important to use curly braces for environment variables used in the EDF +2. The path `/users` is not mounted since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed. +3. You can use `${PWD}` as an alternative to use the path submitted from when the container is started +4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] with libfabric. While not strictly needed for single node workloads, it is good practice to keep it always on. +5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`. +6. Disable CUDA JIT cache +7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error +8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL + If you've decided to build the container somewhere else, make sure to supply the correct path to the `image` variable. The `image` variable defines which container we want to load. This could either be a container from an online docker repository, like `nvcr.io/nvidia/pytorch:24.01-py3`, or in our case, a local squashfile which we built ourselves. The `mounts` variable defines which directories we want to mount where in our container. -In general, it's a good idea to use the scratch directory to store outputs from any scientific software. -In our case, we will not generate a lot of output, but it's a good practice to stick to anyways. +In general, it's a good idea to use a directory under `/capstor/scratch` directory to store outputs from any scientific software as this filesystem is optimized for sequential write-operations as described in [Alps storage][ref-alps-storage]. This particularly applies to e.g. 
checkpoints from ML training, which we will see in the next tutorials (and there it matters also to apply good LUSTRE settings beforehand as for container images). In this tutorial, we will not generate a lot of output, but it's a good practice to stick to anyways. Finally, the `workdir` variable tells the container engine where to start working. If we request a shell, this is where we will find ourselves dropped initially after starting the container. -### Set up the Python Virtual Environment +### Set up a Python Virtual Environment This will be the first time we run our modified container. To run the container, we need allocate some compute resources using Slurm and launch a shell, just like we already did to build the container. This time, we also use the `--environment` option to specify that we want to launch the shell inside the container specified by our gemma-pytorch EDF file: -```console -$ cd $SCRATCH && mkdir -p gemma-inference && cd gemma-inference -$ srun -A --environment=gemma-pytorch --container-workdir=$PWD --pty bash +```bash +$ cd $SCRATCH/tutorials/gemma-7b +$ srun -A --environment=./ngc-pytorch-gemma-24.01.toml --pty bash ``` PyTorch is already setup in the container for us. We can verify this by asking pip for a list of installed packages: -```console +```bash $ python -m pip list | grep torch pytorch-quantization 2.1.2 torch 2.2.0a0+81ea7a4 @@ -143,36 +198,48 @@ torchvision 0.17.0a0 ``` However, we will need to install a few more Python packages to make it easier to do inference with Gemma-7B. -We create a virtual environment using python-venv. -The `--system-site-packages` option ensures that we install packages in addition to the existing packages and don't accidentally install a new version of PyTorch over the one that has been put in place by Nvidia. +While it is best practice to install stable dependencies in the container image, we can maintain frequently changing packages in a virtual environment built on top of the container image. +The `--system-site-packages` option of the Python `venv` creation command ensures that we install packages _in addition_ to the existing packages and don't accidentally re-install a new version of PyTorch shadowing the one that has been put in place by Nvidia. Next, we activate the environment and use pip to install the two packages we need, `accelerate` and `transformers`: -```console -$ python -m venv --system-site-packages ./gemma-venv -$ source ./gemma-venv/bin/activate -(gemma-venv)$ python -m pip install accelerate==0.30.1 transformers==4.38.1 +```bash +$ python -m venv --system-site-packages venv-gemma-24.01 +$ source venv-gemma-24.01/bin/activate +(venv-gemma-24.01)$ pip install \ +accelerate==0.30.1 transformers==4.38.1 huggingface_hub[cli] # ... pip output ... ``` Before we move on to running the Gemma-7B model, we additionally need to make an account at [HuggingFace](https://huggingface.co), get an API token, and accept the [license agreement](https://huggingface.co/google/gemma-7b-it) for the [Gemma-7B](https://huggingface.co/google/gemma-7b) model. You can save the token to `$SCRATCH` using the huggingface-cli: -```console -$ pip install -U "huggingface_hub[cli]" -$ HF_HOME=$SCRATCH/huggingface huggingface-cli login +```bash +$ export HF_HOME=$SCRATCH/huggingface +$ huggingface-cli login ``` At this point, you can exit the Slurm allocation again by typing `exit`. -If you `ls` the contents of the `gemma-inference` folder, you will see that the `gemma-venv` virtual environment folder persists outside of the Slurm job. 
-Keep in mind that this virtual environment won't actually work unless you're running something from inside the PyTorch container. -This is because the virtual environment ultimately relies on the resources packaged inside the container. +If you `ls` the contents of the `gemma-inference` folder, you will see that the `venv-gemma-24.01` virtual environment folder persists outside of the Slurm job. + +!!! note + Keep in mind that + + * this virtual environment won't actually work unless you're running something from inside the PyTorch container. + This is because the virtual environment ultimately relies on the resources packaged inside the container. + * every SLURM job making use of this virtual environment will need to activate it first (_inside_ the `srun`-command). + +Since [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) will not only contain the API token, but also be the storage location for model, dataset and space caches of `huggingface_hub` (unless `HF_HUB_CACHE` is set), we also want to apply proper LUSTRE striping settings before it gets populated. + +```bash +$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/huggingface +``` ### Run Inference on Gemma-7B Cool, now you have a working container with PyTorch and all the necessary Python packages installed! Let's move on to Gemma-7B. -We write a Python script `$SCRATCH/gemma-inference/gemma-inference.py` to load the model and prompt it with some custom text. +We write a Python script to load the model and prompt it with some custom text. The Python script should look like this: -```python title="$SCRATCH/gemma-inference/gemma-inference.py" +```python title="$SCRATCH/tutorials/gemma-7b/gemma-inference.py" from transformers import AutoTokenizer, AutoModelForCausalLM import torch @@ -191,64 +258,70 @@ Feel free to change the `input_text` variable to whatever prompt you like. All that remains is to run the python script inside the PyTorch container. There are several ways of doing this. As before, you could just use Slurm to get an interactive shell in the container. -Then you would source the virtual environment and run the python script we just wrote. +Then you would source the virtual environment and run the Python script we just wrote. There's nothing wrong with this approach per se, but consider that you might be running much more complex and lengthy Slurm jobs in the future. You'll want to document how you're calling Slurm, what commands you're running on the shell, and you might not want to (or might not be able to) keep a terminal open for the length of time the job might take. For this reason, it often makes sense to write a batch file, which enables you to document all these processes and run the Slurm job regardless of whether you're still connected to the cluster. -Create a Slurm batch file `gemma-inference.sbatch` anywhere you like, for example in your home directory. -The Slurm batch file should look like this: +Create a Slurm batch file `submit-gemma-inference.sh`. 
It should look like this: -```bash title="gemma-inference.sbatch" +```bash title="$SCRATCH/tutorials/gemma-7b/submit-gemma-inference.sh" #!/bin/bash +#SBATCH --account= #SBATCH --job-name=gemma-inference #SBATCH --time=00:15:00 #SBATCH --nodes=1 #SBATCH --ntasks-per-node=1 #SBATCH --cpus-per-task=288 -#SBATCH --environment=gemma-pytorch -#SBATCH --account= +#SBATCH --output logs/slurm-%x-%j.out export HF_HOME=$SCRATCH/huggingface export TRANSFORMERS_VERBOSITY=info -cd $SCRATCH/gemma-inference/ -source ./gemma-venv/bin/activate +cd $SCRATCH/tutorials/gemma-7b # (1)! set -x -python ./gemma-inference.py +srun -ul --environment=./ngc-pytorch-gemma-24.01.toml bash -c " + source venv-gemma-24.01/bin/activate + python gemma-inference.py +" ``` +1. Change directory if submitted with sbatch from a different directory + The first few lines of the batch script declare the shell we want to use to run this batch file and pass several options to the Slurm scheduler. -You can see that one of these options is one we used previously to load our EDF file. -After this, we `cd` to our working directory, `source` our virtual environment and finally run our inference script. +After this, we `cd` to our working directory and `srun` the command in our container environment that `source`s our virtual environment and finally runs our inference script. + +The operations performed before the `srun` command resemble largely the operations performed on the login node above and, in fact, happen in the host environment. If you need to perform these steps in the container environment as well, you can alternatively use the `#SBATCH --environment=path/to/ngc-pytorch-gemma-24.01.toml` option _instead of_ using `--environment` with `srun`. -As an alternative to using the `#SBATCH --environment=gemma-pytorch` option you can also run the code in the above script wrapped into an `srun -A -ul --environment=gemma-pytorch bash -c "..."` statement. -The tutorial on nanotron e.g. uses this pattern in `run_tiny_llama.sh`. +!!! warning "#SBATCH --environment" + Use of the `--environment` option for `sbatch` is still considered experimental and could result in unexpected behavior. In particular, avoid mixing `#SBATCH --environment` and `srun --environment` in the same job. + + Use of `--environment` is currently only recommended for the `srun` command. Once you've finished editing the batch file, you can save it and run it with Slurm: -```console -$ sbatch ./gemma-inference.sbatch +```bash +$ sbatch submit-gemma-inference.sh ``` This command should just finish without any output and return you to your terminal. -At this point, you can follow the output in your shell using `tail -f slurm-.out`. +At this point, you can follow the output in your shell using `tail -f logs/slurm-gemma-inference-.out`. Besides you're free to do whatever you like; you can close the terminal, keep working, or just wait for the Slurm job to finish. You can always check on the state of your job by logging back into the cluster and running `squeue -l --me`. -Once your job finishes, you will find a file in the same directory you ran it from, named something like `slurm-.out`, and containing the output generated by your Slurm job. +Once your job finishes, you will find a file in the same directory you ran it from, named something like `logs/slurm-gemma-inference-.out`, and containing the output generated by your Slurm job. 
For this tutorial, you should see something like the following: -```console -$ cat ./slurm-543210.out -/capstor/scratch/cscs/user/gemma-inference/gemma-venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. +```bash +$ cat logs/slurm-gemma-inference-543210.out +/capstor/scratch/cscs/user/gemma-inference/venv-gemma-24.01/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. warnings.warn( Gemma's activation function should be approximate GeLU and not exact GeLU. Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu` instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details. Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.13it/s] -/capstor/scratch/cscs/user/gemma-inference/gemma-venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. +/capstor/scratch/cscs/user/gemma-inference/venv-gemma-24.01/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. warnings.warn( Write me a poem about the Swiss Alps. @@ -276,20 +349,32 @@ They inspire awe, forevermore. Congrats! You've run Google Gemma-7B inference on four GH200 chips simultaneously. Move on to the next tutorial or try the challenge. +!!! info "Collaborating in Git" + + In order to track and exchange your progress with colleagues, you can use standard `git` commands on the host, i.e. in the directory `$SCRATCH/tutorials/gemma-7b` run + ```bash + $ git init . + $ git remote add origin git@github.com:/alps-mlp-tutorials-gemma-7b.git # (1)! + $ ... # git add/commit + ``` + + 1. Use any alternative Git hosting service instead of Github + + where you can replace `` by the owner of the Github repository you want to push to. + + Note that for reproducibility, it is recommended to always track the Dockerfile, EDF and your application code alongside in a Git repository. + + ### Challenge Using the same approach as in the latter half of step 4, use pip to install the package `nvitop`. This is a tool that shows you a concise real-time summary of GPU activity. Then, run Gemma and launch nvitop at the same time: -```console -(gemma-venv)$ python ./gemma-inference.py > ./gemma-output.log 2>&1 & nvitop +```bash +(venv-gemma-24.01)$ python gemma-inference.py > gemma-output.log 2>&1 & nvitop ``` -Note the use of bash `> ./gemma-output.log 2>&1` to hide any output from Python. -Note also the use of the single ampersand `'&'` which backgrounds the first command and runs `nvitop` on top. +Note the use of bash `> gemma-output.log 2>&1` to hide any output from Python. +Note also the use of the single ampersand `'&'` which backgrounds the first command in order to run `nvitop` exclusively in the foreground. 
After a moment, you will see your Python script spawn on all four GPUs, after which the GPU activity will increase a bit and then go back to idle.
-At this point, you can hit `q` to quite nvitop and you will find the output of your Python script in `./gemma-output.log`.
-
-### Collaborating in Git
-
-In order to track and exchange your progress with colleagues, it is recommended to store the EDF, Dockerfile and your application code alongside in a Git repository in a directory on `$SCRATCH` and share it with colleagues.
+At this point, you can hit `q` to quit nvitop and you will find the output of your Python script in `gemma-output.log`.
diff --git a/docs/guides/mlp_tutorials/llm-nanotron-training.md b/docs/guides/mlp_tutorials/llm-nanotron-training.md
index 0121ea3f..ab716d67 100644
--- a/docs/guides/mlp_tutorials/llm-nanotron-training.md
+++ b/docs/guides/mlp_tutorials/llm-nanotron-training.md
@@ -1,17 +1,17 @@
 [](){#ref-mlp-llm-nanotron-tutorial}
 
-# LLM Nanotron Training Tutorial
+# LLM Nanotron Pre-training Tutorial
 
-In this tutorial, we will build a container image to run nanotron training jobs.
+In this tutorial, we will build a container image to run multi-node training jobs with [nanotron](https://github.com/huggingface/nanotron).
 We will train a 109M parameter model with ~100M wikitext tokens as a proof of concept.
 
 ### Prerequisites
 
-It is also recommended to follow the previous tutorials: [LLM Inference][ref-mlp-llm-inference-tutorial] and [LLM Fine-tuning][ref-mlp-llm-fine-tuning-tutorial], as this will build up from it.
+It is recommended to follow the previous two tutorials on [LLM Inference][ref-mlp-llm-inference-tutorial] and [LLM Fine-tuning][ref-mlp-llm-fine-tuning-tutorial] first, as this will build upon them.
 
 ### Set up Podman
 
-Edit your `$HOME/.config/containers/storage.conf` according to the following minimal template:
+If not already done as part of the [LLM Inference tutorial][ref-mlp-llm-inference-tutorial], edit your Podman configuration in `$HOME/.config/containers/storage.conf` as follows:
 
 ```toml title="$HOME/.config/containers/storage.conf"
 [storage]
   driver = "overlay"
   runroot = "/dev/shm/$USER/runroot"
   graphroot = "/dev/shm/$USER/root"
 
   [storage.options]
     mount_program = "/usr/bin/fuse-overlayfs-1.13"
 ```
 
-## Modify the NGC Container
+Create a directory to store container images used with CE and configure it with [recommended LUSTRE settings][ref-guides-storage-lustre]:
+
+```bash title="Container image directory with recommended LUSTRE settings"
+$ mkdir -p $SCRATCH/ce-images
+$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images # (1)!
+```
+
+1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB)
+
+## Build a modified NGC Container
+
+In this tutorial, we build a virtual environment on top of a customized NGC container image. This represents a typical task during development, where stable dependencies are captured in a static container image, whereas frequently changing packages are installed in a virtual environment on top. In contrast to the previous tutorials, the container in this case will be mostly self-contained.
 
-See previous tutorial for context. Here, we assume we are already in a compute node (run `srun -A --pty bash` to get an interactive session).
-In this case, we will be creating the dockerfile in `$SCRATCH/container-image/nanotron/Dockerfile`.
-These are the contents of the dockerfile: +In this case, we create a Dockerfile with the following contents: -```dockerfile title="$SCRATCH/container-image/nanotron/Dockerfile" +```dockerfile title="$SCRATCH/tutorials/nanotron-pretrain/Dockerfile" FROM nvcr.io/nvidia/pytorch:24.04-py3 +RUN apt-get update && \ + apt-get install -y python3.10-venv && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* + # Update flash-attn. RUN pip install --upgrade --no-build-isolation flash-attn==2.5.8 @@ -49,49 +63,129 @@ RUN pip install \ tqdm ``` +!!! note "More recent NGC releases" + As discussed in the [LLM Inference tutorial][ref-mlp-llm-inference-tutorial], starting with the 24.11 release, NGC PyTorch no longer requires the installation of the Python venv module. Furthermore, FlashAttention and several other packages were integrated into the hosted image. However, as `nanotron` as of June 2025 still requires Python 3.10 (cf. this [issue](https://github.com/huggingface/nanotron/issues/217)), this example is restricted to NGC releases up to `24.10`. + + ```dockerfile title="$SCRATCH/tutorials/nanotron-pretrain/Dockerfile" + FROM nvcr.io/nvidia/pytorch:24.10-py3 + + RUN apt-get update && \ + apt-get install -y python3.10-venv && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* + + # Update flash-attn. + RUN pip install --upgrade --no-build-isolation flash-attn==2.5.8 + + # Install the rest of dependencies. + RUN pip install \ + datasets \ + transformers \ + wandb \ + dacite \ + pyyaml + ``` + + The remaining steps can then be performed equivalently, replacing the version number `24.04` by the one chosen in the Dockerfile (e.g. `24.10`). + + It is generally recommended to stick to one of the most recent versions of NGC, unless there is a strong reason from your application to stick to an old version for compatibility. + Then build and import the container. -```console -$ cd $SCRATCH/container-image/nanotron -$ podman build -t nanotron:v1.0 . -$ enroot import -x mount -o nanotron-v1.0.sqsh podman://nanotron:v1.0 +```bash +$ cd $SCRATCH/tutorials/nanotron-pretrain +$ podman build -f Dockerfile -t ngc-nanotron:24.04 . +$ enroot import -x mount \ +-o $SCRATCH/ce-images/ngc-nanotron+24.04.sqsh podman://ngc-nanotron:24.04 # (1)! ``` +1. We import container images into a canonical location under $SCRATCH. + + +!!! info "Debugging the container build" + If the container build fails, you can run an interactive shell using the image from the last successfully built layer with + + ```bash + podman run -it --rm -e NVIDIA_VISIBLE_DEVICES=void bash # (1)! + ``` + + 1. Setting `NVIDIA_VISIBLE_DEVICES` in the environment is required specifically to run NGC containers with podman + + replacing `` by the actual hash output in the build job and interactively test the failing command. + Now exit the interactive session by running `exit`. -### Set up an EDF +### Set up an environment description file (EDF) + +See the previous tutorial for context. In this case, the EDF will be co-located with the Dockerfile under `$SCRATCH/tutorials/nanotron-pretrain` and will have the following contents: + +```toml title="$SCRATCH/tutorials/nanotron-pretrain/ngc-nanotron-24.04.toml" +image = "${SCRATCH}/ce-images/ngc-nanotron+24.04.sqsh" # (1)! + +mounts = [ + "/capstor", + "/iopsstor" +] # (2)! -See the previous tutorial for context. In this case, the edf will be at `$HOME/.edf/nanotron.toml` and will have the following contents: +workdir = "${SCRATCH}/tutorials/nanotron-pretrain/" # (3)! 

-```toml title="$HOME/.edf/nanotron.toml"
-image = "/capstor/scratch/cscs//container-image/nanotron/nanotron-v1.0.sqsh"
-mounts = ["/capstor", "/users"]
-workdir = "/users//"
-writable = true
-
 [annotations]
-com.hooks.aws_ofi_nccl.enabled = "true"
+com.hooks.aws_ofi_nccl.enabled = "true" # (4)!
 com.hooks.aws_ofi_nccl.variant = "cuda12"
-
+
 [env]
-NCCL_DEBUG = "INFO"
+NCCL_DEBUG = "INFO" # (5)!
+CUDA_CACHE_DISABLE = "1" # (6)!
+TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (7)!
+MPICH_GPU_SUPPORT_ENABLED = "0" # (8)!
 ```
 
-Note that, if you built your own container image, you will need to modify the image path.
+1. It is important to use curly braces for environment variables used in the EDF.
+2. The path `/users` is not mounted since it often contains user-specific initialization scripts for the host environment and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under `${HOME}` if needed.
+3. You can use `${PWD}` as an alternative to use the path from which the container is started.
+4. This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the [AWS OFI NCCL plugin][ref-ce-aws-ofi-hook] via libfabric. While not strictly needed for single-node workloads, it is good practice to keep it always on.
+5. This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). Subsystems with debug log can be configured with `NCCL_DEBUG_SUBSYS`.
+6. Disable the CUDA JIT cache.
+7. Async error handling when an exception is observed in the NCCL watchdog: the NCCL communicator is aborted and the process torn down upon error.
+8. Disable GPU support in MPICH, as it can lead to deadlocks when used together with NCCL.
+
+Note that, if you built your container image elsewhere, you will need to modify the image path.
 
-### Preparing a Training Job
+## Installing nanotron in a virtual environment
 
 Now let's download nanotron. In the login node run:
 
-```console
+```bash
+$ cd $SCRATCH/tutorials/nanotron-pretrain
 $ git clone https://github.com/huggingface/nanotron.git
 $ cd nanotron
+$ git checkout 5f8a52b08b702e206f31f2660e4b6f22ac328c95 # (1)!
 ```
 
-And with your favorite text editor, create the following nanotron configuration file in `$HOME/nanotron/examples/config_tiny_llama_wikitext.yaml`:
+1. This ensures the compatibility of nanotron with the following example. For general usage, there is no reason to stick to an outdated version of nanotron, though.
+
+We will install nanotron in a thin virtual environment on top of the container image built above. This proceeds as in the [LLM Inference][ref-mlp-llm-inference-tutorial].
+
+```bash
+$ srun -A --environment=./ngc-nanotron-24.04.toml --pty bash
+$ python -m venv --system-site-packages venv-24.04
+$ source venv-24.04/bin/activate
+$ cd nanotron/ && pip install -e .
+```
+
+This creates a virtual environment on top of this container image (`--system-site-packages` ensuring access to system-installed site-packages) and installs nanotron in editable mode inside it. Because all dependencies of nanotron are already installed in the Dockerfile, no extra libraries will be installed at this point.
+
+!!! note
+    Jobs making use of this virtual environment will always need to activate it first (_inside_ the `srun`-command).
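+
+For a quick sanity check that the virtual environment and the editable nanotron install are picked up inside the container, you can run a one-off command through the Container Engine. This is a minimal sketch; `<ACCOUNT>` is a placeholder for your project account, and the EDF and venv paths are the ones created above:
+
+```bash
+$ srun -A <ACCOUNT> --environment=./ngc-nanotron-24.04.toml bash -c \
+    "source venv-24.04/bin/activate && python -c 'import nanotron; print(nanotron.__file__)'"
+```
+
+If the printed path points into your `nanotron/` checkout under `$SCRATCH/tutorials/nanotron-pretrain`, the editable install is active.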
+ + +## Preparing a Training Job -```yaml title="$HOME/nanotron/examples/config_tiny_llama_wikitext.yaml" +Now, with your favorite text editor, create the following nanotron configuration file: + +```yaml title="$SCRATCH/tutorials/nanotron-pretrain/nanotron/examples/config_tiny_llama_wikitext.yaml" general: benchmark_csv_path: null consumed_train_samples: null @@ -192,62 +286,104 @@ logging: log_level_replica: info ``` -This configuration file will train, as a proof of concept, a gpt-2-like (109M parameters) llama model with approximately 100M tokens of wikitext with settings `tp=4, dp=2, pp=1` (which means that it requires two nodes to train). -This training job will require approximately 10 minutes to run. -Now, create a batchfile in `$HOME/nanotron/run_tiny_llama.sh` with the contents: +This configuration file will train, as a proof of concept, a GPT-2-like (109M parameters) Llama model with approximately 100M tokens of wikitext with 4-way tensor parallelism (`tp`), 2-way data-parallelism (`dp`) and no pipeline-parallelism (`pp`). As a consequence, two GH200 nodes are required to train the model. +The training job will require approximately 10 minutes to run. + +Now, create a submission script in `$SCRATCH/tutorials/nanotron-pretrain/nanotron/run_tiny_llama.sh` with the following content: -```bash title="$HOME/nanotron/run_tiny_llama.sh" +```bash title="$SCRATCH/tutorials/nanotron-pretrain/run_tiny_llama.sh" #!/bin/bash -#SBATCH --job-name=nanotron # create a short name for your job +#SBATCH --account= +#SBATCH --job-name=pretrain-nanotron # create a short name for your job +#SBATCH --time=00:45:00 #SBATCH --nodes=2 # total number of nodes #SBATCH --ntasks-per-node=1 # total number of tasks per node #SBATCH --gpus-per-task=4 -#SBATCH --time=1:00:00 -#SBATCH --account= -#SBATCH --output=logs/%x_%j.log # control where the stdout will be -#SBATCH --error=logs/%x_%j.err # control where the error messages will be# - -mkdir -p logs +#SBATCH --output=logs/slurm-%x-%j.log # if #SBATCH --error=... is not specified, + # this will also contain stderr (error messages) # Initialization. set -x cat $0 -export MASTER_PORT=25678 -export MASTER_ADDR=$(hostname) -export HF_HOME=$SCRATCH/huggingface_home -export CUDA_DEVICE_MAX_CONNECTIONS=1 # required by nanotron -# export either WANDB_API_KEY= or WANDB_MODE=offline +export HF_HOME=$SCRATCH/huggingface # (1)! +export CUDA_DEVICE_MAX_CONNECTIONS=1 #(2)! + +export WANDB_API_KEY= # alternatively: export WANDB_MODE=offline # Run main script. -srun -ul --environment=nanotron bash -c " - # Change cwd and run the main training script. +srun -ul --environment=./ngc-nanotron-24.04.toml bash -c " + # activate virtual environment + source venv-24.04/bin/activate + + # change cwd and run the training script cd nanotron/ - pip install -e . # Only required the first time. TORCHRUN_ARGS=\" + --master-addr=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1) \ + --master-port=29500 \ --node-rank=\${SLURM_PROCID} \ - --master-addr=\${MASTER_ADDR} \ - --master-port=\${MASTER_PORT} \ --nnodes=\${SLURM_NNODES} \ --nproc-per-node=\${SLURM_GPUS_ON_NODE} \ \" - torchrun \${TORCHRUN_ARGS} run_train.py --config-file examples/config_tiny_llama_wikitext.yaml -" + python -m torch.distributed.run \${TORCHRUN_ARGS} \ + run_train.py --config-file examples/config_tiny_llama_wikitext.yaml +" # (3)! ``` -A few comments: - -- The parts outside the srun command will be run on the first node of the Slurm allocation for this job. 
srun commands without further specifiers execute with the settings of the sbatch script (i.e. using all nodes allocated to the job). -- If you have a [wandb](https://wandb.ai/) API key and want to synchronize the training run, be sure to set the `WANDB_API_KEY` variable. Otherwise, set `WANDB_MODE=of​f​line` instead. -- Note that we are setting `HF_HOME` in a directory in scratch. This is done to place the downloaded dataset in scratch, instead of your home directory. -- The pip install command is only run once in every container (compute node). -Note that this will only link the nanotron python package to be able to import it in any script irrespective of the current working directory. -Because all dependencies of nanotron are already installed in the Dockerfile, no extra libraries will be installed at this point. -If the installation of the package under development creates artefacts on the shared filesystem (such as binaries from compiled C++/CUDA source code), this results in a race condition when run from multiple nodes. -Therefore, in this case and also when additional external libraries are to be installed, you should either use venv as shown in previous tutorials, or directly build everything in the Dockerfile. - -### Launch a Training Job with the new Image +1. Location for locally stored data (incl. token and cache for models/datasets/spaces if `HF_HUB_CACHE` is not set) from `huggingface_hub` (cf. [HuggingFace docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome). +2. This setting is specifically required by nanotron. Note that this setting can lead to faulty Nsight Systems (`nsys`) profiles that do not show overlap of compute and communication when there actually is (e.g. observed in [this issue](https://github.com/NVIDIA/Megatron-LM/issues/1468)). The solution is to use a more recent version of `nsys`. +3. Use `python -m torch.distributed.run` instead of `torchrun` with virtual environments + +!!! note "A few comments" + - The parts outside the srun command will be run on the first node of the Slurm allocation for this job. srun commands without further specifiers execute with the settings of the sbatch script (i.e. using all nodes allocated to the job). + - Note that we are setting `HF_HOME` to a directory in scratch. This is done to place the dataset downloaded from `huggingface_hub` in your scratch, instead of your home directory. The same applies to your HuggingFace token as well as any models/spaces unless `HF_HUB_CACHE` is set (cf. [HuggingFace docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome)). As discussed in the tutorial on [LLM Inference][ref-mlp-llm-inference-tutorial], it is good practice to apply the [recommended LUSTRE settings][ref-guides-storage-lustre] there. + - If instead of downloading a dataset from HuggingFace you want to re-use one managed by a colleague, please refer to the [storage guide][ref-guides-storage-sharing] for instructions on dataset sharing. + - If you have a [wandb API key](https://docs.wandb.ai/guides/track/environment-variables/) and want to synchronize the training run, be sure to set the `WANDB_API_KEY` variable. Alternatively, `wandb` can write log data to the distributed filesystem with `WANDB_MODE=of​f​line` so that it can be uploaded with `wandb sync` (cf. [Weights & Biases docs](https://docs.wandb.ai/support/run_wandb_offline/)) after the training run has finished. + +!!! 
warning "torchrun with virtual environments" + When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages not available. If not using virtual environments such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`. + +!!! note "Using srun instead of torchrun" + In many cases, workloads launched with `torchrun` can equivalently be launched purely with SLURM by setting some extra environment variables for `torch.distributed`. This simplifies the overall setup. That is, the `srun` statement in the above `sbatch` script can be rewritten as + + ```bash title="$SCRATCH/tutorials/nanotron-pretrain/run_tiny_llama.sh" + #!/bin/bash + #SBATCH --account= + #SBATCH --job-name=pretrain-nanotron # create a short name for your job + #SBATCH --time=00:45:00 + #SBATCH --nodes=2 # total number of nodes + #SBATCH --ntasks-per-node=4 # total number of tasks per node + #SBATCH --output=logs/slurm-%x-%j.log # if #SBATCH --error=... is not specified, + # this will also contain stderr (error messages) + + # Initialization. + set -x + cat $0 + export HF_HOME=$SCRATCH/huggingface + export CUDA_DEVICE_MAX_CONNECTIONS=1 + + export WANDB_API_KEY= # alternatively: export WANDB_MODE=offline + + # Run main script. + srun -ul --environment=./ngc-nanotron-24.04.toml bash -c " + # activate virtual environment + source venv-24.04/bin/activate + + # change cwd and run the training script + cd nanotron/ + + MASTER_ADDR=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1) \ + MASTER_PORT=29500 \ + RANK=\${SLURM_PROCID} \ + LOCAL_RANK=\${SLURM_LOCALID} \ + WORLD_SIZE=\${SLURM_NTASKS} \ + python run_train.py --config-file examples/config_tiny_llama_wikitext.yaml + " + ``` + + +## Launch a Training Job Run: diff --git a/mkdocs.yml b/mkdocs.yml index 795ed635..fe86d9fe 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -117,7 +117,7 @@ nav: - guides/mlp_tutorials/index.md - 'LLM Inference': guides/mlp_tutorials/llm-inference.md - 'LLM Fine-tuning': guides/mlp_tutorials/llm-fine-tuning.md - - 'LLM Training': guides/mlp_tutorials/llm-nanotron-training.md + - 'LLM Pre-training': guides/mlp_tutorials/llm-nanotron-training.md - 'Policies': - policies/index.md - 'User Regulations': policies/regulations.md From a10be369efd51f35f9337092d63da3f06c7cb0f9 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 16:03:37 +0200 Subject: [PATCH 03/18] Update docs/access/jupyterlab.md Co-authored-by: Rocco Meli --- docs/access/jupyterlab.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 8fcbd372..321e03f1 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -206,7 +206,7 @@ A popular approach to run multi-GPU ML workloads is with [`accelerate`](https:// ``` !!! warning "torchrun with virtual environments" - When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages not available. If not using virtual environments such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`. 
+    When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages will not be available. If not using virtual environments such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`.
 
 !!! note "Notebook structure"
     In none of these scenarios any significant memory allocations or background computations are performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively.

From 6017a7e6111190de950a2eec6071104272a018df Mon Sep 17 00:00:00 2001
From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com>
Date: Mon, 28 Jul 2025 16:08:32 +0200
Subject: [PATCH 04/18] Apply suggestions from code review

Co-authored-by: Rocco Meli
---
 docs/guides/mlp_tutorials/index.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/guides/mlp_tutorials/index.md b/docs/guides/mlp_tutorials/index.md
index 846c73ff..6d0b0a35 100644
--- a/docs/guides/mlp_tutorials/index.md
+++ b/docs/guides/mlp_tutorials/index.md
@@ -3,8 +3,8 @@
 These tutorials gradually introduce key concepts of the Machine Learning Platform.
 A particular focus is on the [Container Engine][ref-container-engine] for managing the runtime environment.
 
-In a [first tutorial][ref-mlp-llm-inference-tutorial], you will learn how to run an inference with an LLM on a single node using a container from the NVIDIA GPU Cloud (NGC). Concepts such as container environment description, layering a thin virtual environment on top of the container image and job launching and monitoring will be introduced.
+In a [first tutorial][ref-mlp-llm-inference-tutorial], you will learn how to run inference with a LLM on a single node using a container from the NVIDIA GPU Cloud (NGC). Concepts such as container environment description, layering a thin virtual environment on top of the container image, and job launching and monitoring will be introduced.
 
-Building on the first tutorial, in the [second tutorial][ref-mlp-llm-fine-tuning-tutorial] you will learn how to train (fine-tune) an LLM on multiple GPUs on a single node. For this purpose, you will use HuggingFace's `accelerate` and see best practices for dataset management.
+Building on the first tutorial, in the [second tutorial][ref-mlp-llm-fine-tuning-tutorial] you will learn how to train (fine-tune) a LLM on multiple GPUs on a single node. For this purpose, you will use HuggingFace's `accelerate` and see best practices for dataset management.
 
 In the [third tutorial][ref-mlp-llm-nanotron-tutorial], you will apply the techniques from the previous tutorials to enable distributed (pre-)training of a model `nanotron` on multiple nodes. In particular, this tutorial makes use of model-parallelism and introduces the usage of `torchrun` to manage jobs on individual nodes.
From e96ee0fe5cab648d406291e504b7750ddb489285 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 16:09:43 +0200 Subject: [PATCH 05/18] Update docs/access/jupyterlab.md Co-authored-by: Rocco Meli --- docs/access/jupyterlab.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 321e03f1..8c233781 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -229,7 +229,7 @@ where `/path/to/edf.toml` should be replaced by the TOML file and `venv- Date: Mon, 28 Jul 2025 16:17:55 +0200 Subject: [PATCH 06/18] Update docs/guides/mlp_tutorials/llm-inference.md Co-authored-by: Rocco Meli --- docs/guides/mlp_tutorials/llm-inference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/guides/mlp_tutorials/llm-inference.md index 0e56fb53..2abfc450 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/guides/mlp_tutorials/llm-inference.md @@ -105,7 +105,7 @@ podman://ngc-pytorch:24.01 # (2)! ``` 1. This builds the container image with the current working directory as the build context. The `Dockerfile` inside that directory is implicitly used as a recipe. If it is named differently use the `-f path/to/Dockerfile` option. -2. The newly built container image is imported and stored under $SCRATCH/ce-images. +2. The newly built container image is imported and stored under `$SCRATCH/ce-images`. where you should replace `` with your project account ID. At this point, you can exit the Slurm allocation by typing `exit`. From d0476fd4d288f7bbb53bae49359bcff133bae997 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 16:19:05 +0200 Subject: [PATCH 07/18] Update docs/guides/mlp_tutorials/llm-inference.md Co-authored-by: Rocco Meli --- docs/guides/mlp_tutorials/llm-inference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/guides/mlp_tutorials/llm-inference.md index 2abfc450..be30cd44 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/guides/mlp_tutorials/llm-inference.md @@ -109,7 +109,7 @@ podman://ngc-pytorch:24.01 # (2)! where you should replace `` with your project account ID. At this point, you can exit the Slurm allocation by typing `exit`. -You should be able to see a new squashfs file in your container image directory: +You should be able to see a new Squashfs file in your container image directory: ```bash $ ls $SCRATCH/ce-images From fb22a00b8d63018dac11f0cad3848604da07a35b Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 16:19:25 +0200 Subject: [PATCH 08/18] Update docs/guides/mlp_tutorials/llm-inference.md Co-authored-by: Rocco Meli --- docs/guides/mlp_tutorials/llm-inference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/guides/mlp_tutorials/llm-inference.md index be30cd44..5e404cd0 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/guides/mlp_tutorials/llm-inference.md @@ -120,7 +120,7 @@ This squashfs file is essentially a compressed container image, which can be run We will use our freshly-built container `ngc-pytorch+24.01.sqsh` in the following steps to run a PyTorch script that loads the Google Gemma-7B model and performs some inference with it. !!! 
note - In order to import a container image from a registry without building additional layers on top of it, we can directly use `enroot` (without `podman`). This is useful in this tutorial if we want to use a more recent NGC Pytorch container that was released since `24.11`. Use the following syntax for importing the `25.06` release: + In order to import a container image from a registry without building additional layers on top of it, we can directly use `enroot` (without `podman`). This is useful in this tutorial if we want to use a more recent NGC PyTorch container that was released since `24.11`. Use the following syntax for importing the `25.06` release: ```bash enroot import -x mount \ From 758d019a14e04c1de9bcecaa73f3f76175bad2f7 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 16:20:19 +0200 Subject: [PATCH 09/18] Update docs/guides/mlp_tutorials/llm-inference.md Co-authored-by: Rocco Meli --- docs/guides/mlp_tutorials/llm-inference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/guides/mlp_tutorials/llm-inference.md index 5e404cd0..a9a9f3ab 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/guides/mlp_tutorials/llm-inference.md @@ -124,7 +124,7 @@ We will use our freshly-built container `ngc-pytorch+24.01.sqsh` in the followin ```bash enroot import -x mount \ - -o $SCRATCH/ce-images/ngc-pytorch+25.06.sqsh docker://nvcr.io#nvidia/pytorch:25.06-py3 + -o $SCRATCH/ce-images/ngc-pytorch+25.06.sqsh docker://nvcr.io#nvidia/pytorch:25.06-py3 ``` From 0e0285fa0ac109db0be27ad6818d9ce703dc1805 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 16:21:10 +0200 Subject: [PATCH 10/18] Update docs/guides/mlp_tutorials/llm-inference.md Co-authored-by: Rocco Meli --- docs/guides/mlp_tutorials/llm-inference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/guides/mlp_tutorials/llm-inference.md index a9a9f3ab..0d19b29b 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/guides/mlp_tutorials/llm-inference.md @@ -225,7 +225,7 @@ If you `ls` the contents of the `gemma-inference` folder, you will see that the * this virtual environment won't actually work unless you're running something from inside the PyTorch container. This is because the virtual environment ultimately relies on the resources packaged inside the container. - * every SLURM job making use of this virtual environment will need to activate it first (_inside_ the `srun`-command). + * every Slurm job making use of this virtual environment will need to activate it first (_inside_ the `srun` command). Since [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) will not only contain the API token, but also be the storage location for model, dataset and space caches of `huggingface_hub` (unless `HF_HUB_CACHE` is set), we also want to apply proper LUSTRE striping settings before it gets populated. 
From 4a74f9664eee8e11b54e21b99bb346101f83474a Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 16:22:36 +0200 Subject: [PATCH 11/18] Update docs/guides/mlp_tutorials/llm-nanotron-training.md Co-authored-by: Rocco Meli --- docs/guides/mlp_tutorials/llm-nanotron-training.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/llm-nanotron-training.md b/docs/guides/mlp_tutorials/llm-nanotron-training.md index ab716d67..b1896524 100644 --- a/docs/guides/mlp_tutorials/llm-nanotron-training.md +++ b/docs/guides/mlp_tutorials/llm-nanotron-training.md @@ -178,7 +178,7 @@ $ cd nanotron/ && pip install -e . This creates a virtual environment on top of this container image (`--system-site-packages` ensuring access to system-installed site-packages) and installs nanotron in editable mode inside it. Because all dependencies of nanotron are already installed in the Dockerfile, no extra libraries will be installed at this point. !!! note - Jobs making use of this virtual environment will always need to activate it first (_inside_ the `srun`-command). + Jobs making use of this virtual environment will always need to activate it first (_inside_ the `srun` command). ## Preparing a Training Job From fb1629ff6ce641de65faf554d740f95151ee6372 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 16:22:49 +0200 Subject: [PATCH 12/18] Update docs/guides/mlp_tutorials/llm-nanotron-training.md Co-authored-by: Rocco Meli --- docs/guides/mlp_tutorials/llm-nanotron-training.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/llm-nanotron-training.md b/docs/guides/mlp_tutorials/llm-nanotron-training.md index b1896524..f896717e 100644 --- a/docs/guides/mlp_tutorials/llm-nanotron-training.md +++ b/docs/guides/mlp_tutorials/llm-nanotron-training.md @@ -333,7 +333,7 @@ srun -ul --environment=./ngc-nanotron-24.04.toml bash -c " 1. Location for locally stored data (incl. token and cache for models/datasets/spaces if `HF_HUB_CACHE` is not set) from `huggingface_hub` (cf. [HuggingFace docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome). 2. This setting is specifically required by nanotron. Note that this setting can lead to faulty Nsight Systems (`nsys`) profiles that do not show overlap of compute and communication when there actually is (e.g. observed in [this issue](https://github.com/NVIDIA/Megatron-LM/issues/1468)). The solution is to use a more recent version of `nsys`. -3. Use `python -m torch.distributed.run` instead of `torchrun` with virtual environments +3. Use `python -m torch.distributed.run` instead of `torchrun` with virtual environments. !!! note "A few comments" - The parts outside the srun command will be run on the first node of the Slurm allocation for this job. srun commands without further specifiers execute with the settings of the sbatch script (i.e. using all nodes allocated to the job). 
From 691b11f6337c3c9f456de2f4c7ee6f5ef6a74911 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 16:23:00 +0200 Subject: [PATCH 13/18] Update docs/guides/mlp_tutorials/llm-nanotron-training.md Co-authored-by: Rocco Meli --- docs/guides/mlp_tutorials/llm-nanotron-training.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/llm-nanotron-training.md b/docs/guides/mlp_tutorials/llm-nanotron-training.md index f896717e..6026a2c0 100644 --- a/docs/guides/mlp_tutorials/llm-nanotron-training.md +++ b/docs/guides/mlp_tutorials/llm-nanotron-training.md @@ -341,7 +341,7 @@ srun -ul --environment=./ngc-nanotron-24.04.toml bash -c " - If instead of downloading a dataset from HuggingFace you want to re-use one managed by a colleague, please refer to the [storage guide][ref-guides-storage-sharing] for instructions on dataset sharing. - If you have a [wandb API key](https://docs.wandb.ai/guides/track/environment-variables/) and want to synchronize the training run, be sure to set the `WANDB_API_KEY` variable. Alternatively, `wandb` can write log data to the distributed filesystem with `WANDB_MODE=of​f​line` so that it can be uploaded with `wandb sync` (cf. [Weights & Biases docs](https://docs.wandb.ai/support/run_wandb_offline/)) after the training run has finished. -!!! warning "torchrun with virtual environments" +!!! warning "`torchrun` with virtual environments" When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages not available. If not using virtual environments such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`. !!! note "Using srun instead of torchrun" From 80f2c197b689f19e8faa9873f86e10fbf168a100 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 16:25:02 +0200 Subject: [PATCH 14/18] Update docs/guides/mlp_tutorials/index.md Co-authored-by: Rocco Meli --- docs/guides/mlp_tutorials/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/index.md b/docs/guides/mlp_tutorials/index.md index 6d0b0a35..da7cb242 100644 --- a/docs/guides/mlp_tutorials/index.md +++ b/docs/guides/mlp_tutorials/index.md @@ -7,4 +7,4 @@ In a [first tutorial][ref-mlp-llm-inference-tutorial], you will learn how to run Building on the first tutorial, in the [second tutorial][ref-mlp-llm-fine-tuning-tutorial] you will learn how to train (fine-tune) a LLM on multiple GPUs on a single node. For this purpose, you will use HuggingFace's `accelerate` and see best practices for dataset management. -In the [third tutorial][ref-mlp-llm-nanotron-tutorial], you will apply the techniques from the previous tutorials to enable distributed (pre-)training of a model `nanotron` on multiple nodes. In particular, this tutorial makes use of model-parallelism and introduces the usage of `torchrun` to manage jobs on individual nodes. +In the [third tutorial][ref-mlp-llm-nanotron-tutorial], you will apply the techniques from the previous tutorials to enable distributed (pre-)training of a model in `nanotron` on multiple nodes. In particular, this tutorial makes use of model-parallelism and introduces the usage of `torchrun` to manage jobs on individual nodes. 
From b66b0eb30497aebacb3c30d8bac5ed7459852bad Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Mon, 28 Jul 2025 18:28:36 +0200 Subject: [PATCH 15/18] Using console instead of bash with hostnames in the shell prompt and mention login-node policies --- docs/guides/mlp_tutorials/llm-fine-tuning.md | 32 ++++--- docs/guides/mlp_tutorials/llm-inference.md | 96 +++++++++++-------- .../mlp_tutorials/llm-nanotron-training.md | 43 ++++----- 3 files changed, 92 insertions(+), 79 deletions(-) diff --git a/docs/guides/mlp_tutorials/llm-fine-tuning.md b/docs/guides/mlp_tutorials/llm-fine-tuning.md index 8def2ad3..5d72cfd1 100644 --- a/docs/guides/mlp_tutorials/llm-fine-tuning.md +++ b/docs/guides/mlp_tutorials/llm-fine-tuning.md @@ -22,20 +22,21 @@ First, we need to update our Python environment with some extra libraries to sup To do this, we can launch an interactive shell in the PyTorch container, just like we did in the previous tutorial. Then, we install `peft`: -```bash -$ cd $SCRATCH/tutorials/gemma-7b -$ srun --environment=./ngc-pytorch-gemma-24.01.toml --pty bash -$ source venv-gemma-24.01/bin/activate -$ pip install peft==0.11.1 +```console +[clariden-lnXXX]$ cd $SCRATCH/tutorials/gemma-7b +[clariden-lnXXX]$ srun --environment=./ngc-pytorch-gemma-24.01.toml --pty bash +user@nidYYYYYY$ source venv-gemma-24.01/bin/activate +(venv-gemma-24.01) user@nidYYYYYY$ pip install peft==0.11.1 ``` Next, we also need to clone and install the `trl` Git repository so that we have access to the fine-tuning scripts in it. For this purpose, we will install the package in editable mode in the virtual environment. This makes it available in python scripts independent of the current working directory and without creating a redundant copy of the files. -```bash -$ git clone https://github.com/huggingface/trl -b v0.7.11 -$ pip install -e ./trl # (1)! +```console +(venv-gemma-24.01) user@nidYYYYYY$ git clone \ + https://github.com/huggingface/trl -b v0.7.11 +(venv-gemma-24.01) user@nidYYYYYY$ pip install -e ./trl # (1)! ``` 1. Installs trl in editable mode @@ -102,8 +103,8 @@ Everything after that configures the `trl/examples/scripts/sft.py` Python script Make this script executable with -```bash -$ chmod u+x $SCRATCH/tutorials/gemma-7b/fine-tune-gemma.sh +```console +[clariden-lnXXX]$ chmod u+x $SCRATCH/tutorials/gemma-7b/fine-tune-gemma.sh ``` Next, we also need to create a short Slurm batch script to launch our fine-tuning script: @@ -131,8 +132,8 @@ Now that we've setup a fine-tuning script and a Slurm batch script, we can launc We'll start out by launching it on two nodes. It should take about 10-15 minutes to fine-tune Gemma: -```bash -$ sbatch --nodes=1 submit-fine-tune-gemma.sh +```console +[clariden-lnXXX]$ sbatch --nodes=1 submit-fine-tune-gemma.sh ``` ### Compare fine-tuned Gemma against default Gemma @@ -146,8 +147,8 @@ input_text = "What are the 5 tallest mountains in the Swiss Alps?" 
We can run inference using our batch script from the previous tutorial: -```bash -$ sbatch submit-gemma-inference.sh +```console +[clariden-lnXXX]$ sbatch submit-gemma-inference.sh ``` Inspecting the output should yield something like this: @@ -168,7 +169,8 @@ the 5 tallest mountains in the Swiss Alps: Next, we can update the model line in our Python inference script to use the model that we just fine-tuned: ```python -model = AutoModelForCausalLM.from_pretrained("gemma-fine-tuned-openassistant/checkpoint-400", device_map="auto") +model = AutoModelForCausalLM.from_pretrained( + "gemma-fine-tuned-openassistant/checkpoint-400", device_map="auto") ``` If we re-run inference, the output will be a bit more detailed and explanatory, similar to output we might expect from a helpful chatbot. One example looks like this: diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/guides/mlp_tutorials/llm-inference.md index 0d19b29b..e89818d3 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/guides/mlp_tutorials/llm-inference.md @@ -16,6 +16,13 @@ The model we will be running is Google's [Gemma-7B](https://huggingface.co/googl This tutorial assumes you are able to access the cluster via SSH. To set up access to CSCS systems, follow the guide [here][ref-ssh], and read through the documentation about the [ML Platform][ref-platform-mlp]. +For clarity, we prepend all shell commands with the hostname and any active Python virtual environment they are executed in. E.g. `clariden-lnXXX` refers to a login node on Clariden, while `nidYYYYYY` is a compute node (with placeholders for numeric values). The commands listed here are run on Clariden, but can be adapted slightly to run on other vClusters as well. + +!!! note + Login nodes are a shared environment for editing files, preparing and submitting SLURM jobs as well as inspecting logs. They are not intended for running significant data processing or compute work. Any memory- or compute-intensive work should instead be done on compute nodes. + + If you need to move data [externally][ref-data-xfer-external] or [internally][ref-data-xfer-internal], please follow the corresponding guides using Globus or the `xfer` queue, respectively. + ### Build a modified NGC PyTorch Container In theory, we could just go ahead and use the vanilla container image to run some PyTorch code. @@ -23,10 +30,10 @@ However, chances are that we will need some additional libraries or software. For this reason, we need to use some docker commands to build on top of what is provided by Nvidia. To do this, we create a new directory for recipes to build containers in our home directory and set up a [Dockerfile](https://docs.docker.com/reference/dockerfile/): -```bash -$ cd $SCRATCH -$ mkdir -p tutorials/gemma-7b -$ cd tutorials/gemma-7b +```console +[clariden-lnXXX]$ cd $SCRATCH +[clariden-lnXXX]$ mkdir -p tutorials/gemma-7b +[clariden-lnXXX]$ cd tutorials/gemma-7b ``` Use your favorite text editor to create a file `Dockerfile` here. The Dockerfile should look like this: @@ -82,9 +89,10 @@ This step is straightforward, just create the file in your home: Before building the container image, we create a dedicated directory to keep track of all images used with the CE. Since container images are large files and the filesystem is a shared resource, we need to apply [best practices for LUSTRE][ref-guides-storage-lustre] so they are properly distributed across storage nodes. 
-```bash title="Container image directory with recommended LUSTRE settings" -$ mkdir -p $SCRATCH/ce-images -$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images # (1)! +```console title="Container image directory with recommended LUSTRE settings" +[clariden-lnXXX]$ mkdir -p $SCRATCH/ce-images +[clariden-lnXXX]$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M \ + $SCRATCH/ce-images # (1)! ``` 1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB) @@ -94,13 +102,13 @@ Slurm is a workload manager which distributes workloads on the cluster. Through Slurm, many people can use the supercomputer at the same time without interfering with one another. -```bash -$ srun -A --pty bash -$ podman build -t ngc-pytorch:24.01 . # (1)! +```console +[clariden-lnXXX]$ srun -A --pty bash +[nidYYYYYY]$ podman build -t ngc-pytorch:24.01 . # (1)! # ... lots of output here ... -$ enroot import -x mount \ --o $SCRATCH/ce-images/ngc-pytorch+24.01.sqsh \ -podman://ngc-pytorch:24.01 # (2)! +[nidYYYYYY]$ enroot import -x mount \ + -o $SCRATCH/ce-images/ngc-pytorch+24.01.sqsh \ + podman://ngc-pytorch:24.01 # (2)! # ... more output here ... ``` @@ -111,8 +119,8 @@ where you should replace `` with your project account ID. At this point, you can exit the Slurm allocation by typing `exit`. You should be able to see a new Squashfs file in your container image directory: -```bash -$ ls $SCRATCH/ce-images +```console +[clariden-lnXXX]$ ls $SCRATCH/ce-images ngc-pytorch+24.01.sqsh ``` @@ -122,8 +130,8 @@ We will use our freshly-built container `ngc-pytorch+24.01.sqsh` in the followin !!! note In order to import a container image from a registry without building additional layers on top of it, we can directly use `enroot` (without `podman`). This is useful in this tutorial if we want to use a more recent NGC PyTorch container that was released since `24.11`. Use the following syntax for importing the `25.06` release: - ```bash - enroot import -x mount \ + ```console + [nidYYYYYY]$ enroot import -x mount \ -o $SCRATCH/ce-images/ngc-pytorch+25.06.sqsh docker://nvcr.io#nvidia/pytorch:25.06-py3 ``` @@ -179,16 +187,17 @@ This will be the first time we run our modified container. To run the container, we need allocate some compute resources using Slurm and launch a shell, just like we already did to build the container. This time, we also use the `--environment` option to specify that we want to launch the shell inside the container specified by our gemma-pytorch EDF file: -```bash -$ cd $SCRATCH/tutorials/gemma-7b -$ srun -A --environment=./ngc-pytorch-gemma-24.01.toml --pty bash +```console +[clariden-lnXXX]$ cd $SCRATCH/tutorials/gemma-7b +[clariden-lnXXX]$ srun -A \ + --environment=./ngc-pytorch-gemma-24.01.toml --pty bash ``` PyTorch is already setup in the container for us. 
We can verify this by asking pip for a list of installed packages: -```bash -$ python -m pip list | grep torch +```console +user@nidYYYYYY$ python -m pip list | grep torch pytorch-quantization 2.1.2 torch 2.2.0a0+81ea7a4 torch-tensorrt 2.2.0a0 @@ -202,19 +211,19 @@ While it is best practice to install stable dependencies in the container image, The `--system-site-packages` option of the Python `venv` creation command ensures that we install packages _in addition_ to the existing packages and don't accidentally re-install a new version of PyTorch shadowing the one that has been put in place by Nvidia. Next, we activate the environment and use pip to install the two packages we need, `accelerate` and `transformers`: -```bash -$ python -m venv --system-site-packages venv-gemma-24.01 -$ source venv-gemma-24.01/bin/activate -(venv-gemma-24.01)$ pip install \ -accelerate==0.30.1 transformers==4.38.1 huggingface_hub[cli] +```console +user@nidYYYYYY$ python -m venv --system-site-packages venv-gemma-24.01 +user@nidYYYYYY$ source venv-gemma-24.01/bin/activate +(venv-gemma-24.01) user@nidYYYYYY$ pip install \ + accelerate==0.30.1 transformers==4.38.1 huggingface_hub[cli] # ... pip output ... ``` Before we move on to running the Gemma-7B model, we additionally need to make an account at [HuggingFace](https://huggingface.co), get an API token, and accept the [license agreement](https://huggingface.co/google/gemma-7b-it) for the [Gemma-7B](https://huggingface.co/google/gemma-7b) model. You can save the token to `$SCRATCH` using the huggingface-cli: -```bash -$ export HF_HOME=$SCRATCH/huggingface -$ huggingface-cli login +```console +(venv-gemma-24.01) user@nidYYYYYY$ export HF_HOME=$SCRATCH/huggingface +(venv-gemma-24.01) user@nidYYYYYY$ huggingface-cli login ``` At this point, you can exit the Slurm allocation again by typing `exit`. @@ -229,8 +238,9 @@ If you `ls` the contents of the `gemma-inference` folder, you will see that the Since [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) will not only contain the API token, but also be the storage location for model, dataset and space caches of `huggingface_hub` (unless `HF_HUB_CACHE` is set), we also want to apply proper LUSTRE striping settings before it gets populated. -```bash -$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/huggingface +```console +[clariden-lnXXX]$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M \ + $SCRATCH/huggingface ``` ### Run Inference on Gemma-7B @@ -302,8 +312,8 @@ The operations performed before the `srun` command resemble largely the operatio Once you've finished editing the batch file, you can save it and run it with Slurm: -```bash -$ sbatch submit-gemma-inference.sh +```console +[clariden-lnXXX]$ sbatch submit-gemma-inference.sh ``` This command should just finish without any output and return you to your terminal. @@ -314,8 +324,8 @@ Once your job finishes, you will find a file in the same directory you ran it fr For this tutorial, you should see something like the following: -```bash -$ cat logs/slurm-gemma-inference-543210.out +```console +[clariden-lnXXX]$ cat logs/slurm-gemma-inference-543210.out /capstor/scratch/cscs/user/gemma-inference/venv-gemma-24.01/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. 
warnings.warn( Gemma's activation function should be approximate GeLU and not exact GeLU. @@ -352,10 +362,11 @@ Move on to the next tutorial or try the challenge. !!! info "Collaborating in Git" In order to track and exchange your progress with colleagues, you can use standard `git` commands on the host, i.e. in the directory `$SCRATCH/tutorials/gemma-7b` run - ```bash - $ git init . - $ git remote add origin git@github.com:/alps-mlp-tutorials-gemma-7b.git # (1)! - $ ... # git add/commit + ```console + [clariden-lnXXX]$ git init . + [clariden-lnXXX]$ git remote add origin \ + git@github.com:/alps-mlp-tutorials-gemma-7b.git # (1)! + [clariden-lnXXX]$ ... # git add/commit ``` 1. Use any alternative Git hosting service instead of Github @@ -369,8 +380,9 @@ Move on to the next tutorial or try the challenge. Using the same approach as in the latter half of step 4, use pip to install the package `nvitop`. This is a tool that shows you a concise real-time summary of GPU activity. Then, run Gemma and launch `nvitop` at the same time: -```bash -(venv-gemma-24.01)$ python gemma-inference.py > gemma-output.log 2>&1 & nvitop +```console +(venv-gemma-24.01) user@nidYYYYYY$ python gemma-inference.py \ + > gemma-output.log 2>&1 & nvitop ``` Note the use of bash `> gemma-output.log 2>&1` to hide any output from Python. diff --git a/docs/guides/mlp_tutorials/llm-nanotron-training.md b/docs/guides/mlp_tutorials/llm-nanotron-training.md index 6026a2c0..5072b912 100644 --- a/docs/guides/mlp_tutorials/llm-nanotron-training.md +++ b/docs/guides/mlp_tutorials/llm-nanotron-training.md @@ -25,9 +25,9 @@ If not already done as part of the [LLM Inference tutorial][ref-mlp-llm-inferenc Create a directory to store container images used with CE and configure it with [recommended LUSTRE settings][ref-guides-storage-lustre]: -```bash title="Container image directory with recommended LUSTRE settings" -$ mkdir -p $SCRATCH/ce-images -$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images # (1)! +```console title="Container image directory with recommended LUSTRE settings" +[clariden-lnXXX]$ mkdir -p $SCRATCH/ce-images +[clariden-lnXXX]$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images # (1)! ``` 1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB) @@ -92,11 +92,11 @@ RUN pip install \ Then build and import the container. -```bash -$ cd $SCRATCH/tutorials/nanotron-pretrain -$ podman build -f Dockerfile -t ngc-nanotron:24.04 . -$ enroot import -x mount \ --o $SCRATCH/ce-images/ngc-nanotron+24.04.sqsh podman://ngc-nanotron:24.04 # (1)! +```console +[nidYYYYYY]$ cd $SCRATCH/tutorials/nanotron-pretrain +[nidYYYYYY]$ podman build -f Dockerfile -t ngc-nanotron:24.04 . +[nidYYYYYY]$ enroot import -x mount \ + -o $SCRATCH/ce-images/ngc-nanotron+24.04.sqsh podman://ngc-nanotron:24.04 # (1)! ``` 1. We import container images into a canonical location under $SCRATCH. @@ -156,23 +156,22 @@ Note that, if you built your container image elsewhere, you will need to modify Now let's download nanotron. In the login node run: -```bash -$ cd $SCRATCH/tutorials/nanotron-pretrain -$ git clone https://github.com/huggingface/nanotron.git -$ cd nanotron -$ git checkout 5f8a52b08b702e206f31f2660e4b6f22ac328c95 # (1)! 
+```console +[clariden-lnXXX]$ cd $SCRATCH/tutorials/nanotron-pretrain +[clariden-lnXXX]$ git clone https://github.com/huggingface/nanotron.git +[clariden-lnXXX]$ cd nanotron +[clariden-lnXXX]$ git checkout 5f8a52b08b702e206f31f2660e4b6f22ac328c95 # (1)! ``` 1. This ensures the compatibility of nanotron with the following example. For general usage, there is no reason to stick to an outdated version of nanotron, though. We will install nanotron in a thin virtual environment on top of the container image built above. This proceeds as in the [LLM Inference][ref-mlp-llm-inference-tutorial]. -```bash -$ srun -A --environment=./ngc-nanotron-24.04.toml --pty bash -$ python -m venv --system-site-packages venv-24.04 -$ source venv-24.04/bin/activate -$ cd nanotron/ && pip install -e . -" +```console +[clariden-lnXXX]$ srun -A --environment=./ngc-nanotron-24.04.toml --pty bash +user@nidYYYYYY$ python -m venv --system-site-packages venv-24.04 +user@nidYYYYYY$ source venv-24.04/bin/activate +(venv-24.04) user@nidYYYYYY$ cd nanotron/ && pip install -e . ``` This creates a virtual environment on top of this container image (`--system-site-packages` ensuring access to system-installed site-packages) and installs nanotron in editable mode inside it. Because all dependencies of nanotron are already installed in the Dockerfile, no extra libraries will be installed at this point. @@ -344,7 +343,7 @@ srun -ul --environment=./ngc-nanotron-24.04.toml bash -c " !!! warning "`torchrun` with virtual environments" When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages not available. If not using virtual environments such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`. -!!! note "Using srun instead of torchrun" +!!! note "Using srun instead of `torchrun`" In many cases, workloads launched with `torchrun` can equivalently be launched purely with SLURM by setting some extra environment variables for `torch.distributed`. This simplifies the overall setup. That is, the `srun` statement in the above `sbatch` script can be rewritten as ```bash title="$SCRATCH/tutorials/nanotron-pretrain/run_tiny_llama.sh" @@ -388,13 +387,13 @@ srun -ul --environment=./ngc-nanotron-24.04.toml bash -c " Run: ```console -$ sbatch run_tiny_llama.sh +[clariden-lnXXX]$ sbatch run_tiny_llama.sh ``` You can inspect if your job has been submitted successfully by running `squeue --me` and looking for your username. Once the run starts, there will be a new file under `logs/`. You can inspect the status of your run using: ```console -$ tail -f logs/ +[clariden-lnXXX]$ tail -f logs/ ``` In the end, the checkpoints of the model will be saved in `checkpoints/`. 
From 404a203c73863a7a14926a79d03359bdacdb674e Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Mon, 28 Jul 2025 18:43:46 +0200 Subject: [PATCH 16/18] Integrating @Madeeks comment --- docs/guides/mlp_tutorials/llm-inference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/guides/mlp_tutorials/llm-inference.md index e89818d3..73cb8cb6 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/guides/mlp_tutorials/llm-inference.md @@ -27,7 +27,7 @@ For clarity, we prepend all shell commands with the hostname and any active Pyth In theory, we could just go ahead and use the vanilla container image to run some PyTorch code. However, chances are that we will need some additional libraries or software. -For this reason, we need to use some docker commands to build on top of what is provided by Nvidia. +For this reason, we need to build another image on top of the one provided by Nvidia. To do this, we create a new directory for recipes to build containers in our home directory and set up a [Dockerfile](https://docs.docker.com/reference/dockerfile/): ```console From 6b56fb40fbe69dcf8b32f05b9ef58a67455fd53c Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 18:45:31 +0200 Subject: [PATCH 17/18] Update docs/guides/mlp_tutorials/llm-inference.md Co-authored-by: Rocco Meli --- docs/guides/mlp_tutorials/llm-inference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/guides/mlp_tutorials/llm-inference.md index 73cb8cb6..9151add8 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/guides/mlp_tutorials/llm-inference.md @@ -359,7 +359,7 @@ They inspire awe, forevermore. Congrats! You've run Google Gemma-7B inference on four GH200 chips simultaneously. Move on to the next tutorial or try the challenge. -!!! info "Collaborating in Git" +!!! info "Collaborating with Git" In order to track and exchange your progress with colleagues, you can use standard `git` commands on the host, i.e. in the directory `$SCRATCH/tutorials/gemma-7b` run ```console From 2b7f549496cbef203ddc26c00de2027e2348390d Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Mon, 28 Jul 2025 18:47:21 +0200 Subject: [PATCH 18/18] Update docs/guides/mlp_tutorials/llm-inference.md Co-authored-by: Rocco Meli --- docs/guides/mlp_tutorials/llm-inference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/mlp_tutorials/llm-inference.md b/docs/guides/mlp_tutorials/llm-inference.md index 9151add8..15af5ed0 100644 --- a/docs/guides/mlp_tutorials/llm-inference.md +++ b/docs/guides/mlp_tutorials/llm-inference.md @@ -373,7 +373,7 @@ Move on to the next tutorial or try the challenge. where you can replace `` by the owner of the Github repository you want to push to. - Note that for reproducibility, it is recommended to always track the Dockerfile, EDF and your application code alongside in a Git repository. + Note that for reproducibility, it is recommended to always track the Dockerfile and EDF alongside your application code in a Git repository. ### Challenge