
Commit 43f6a6f

lukasgd and RMeli authored
MLP tutorials update (#209)
This updates the existing MLP tutorials, to be merged with ML software in a next step. A Megatron-LM tutorial with recommended settings is left for the future. --------- Co-authored-by: Rocco Meli <[email protected]>
1 parent f8a0227 commit 43f6a6f

File tree: 6 files changed, +431 -181 lines changed


docs/access/jupyterlab.md

Lines changed: 8 additions & 7 deletions
@@ -23,7 +23,7 @@ When resources are granted the page redirects to the JupyterLab session, where y
 [](){#ref-jupyter-runtime-environment}
 ## Runtime environment

-A Jupyter session can be started with either a [uenv][ref-uenv] or a [container][ref-container-engine] as a base image. The JupyterHub Spawner form provides a set of default images such as the [prgenv-gnu][ref-uenv-prgenv-gnu] uenv or the [NGC Pytorch container][ref-software-ml] to choose from in a dropdown menu. When using uenv, the software stack will be mounted at `/user-environment`, and the specified view will be activated. For a container, the Jupyter session will launch inside the container filesystem with only a select set of paths mounted from the host. Once you have found a suitable option, you can start the session with `Launch JupyterLab`.
+A Jupyter session can be started with either a [uenv][ref-uenv] or a [container][ref-container-engine] as a base image. The JupyterHub Spawner form provides a set of default images such as the [prgenv-gnu][ref-uenv-prgenv-gnu] uenv or the [NGC PyTorch container][ref-software-ml] to choose from in a dropdown menu. When using uenv, the software stack will be mounted at `/user-environment`, and the specified view will be activated. For a container, the Jupyter session will launch inside the container filesystem with only a select set of paths mounted from the host. Once you have found a suitable option, you can start the session with `Launch JupyterLab`.

 ??? info "Using remote uenv for the first time."
     If the uenv is not present in the local repository, it will be automatically fetched.
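
As a side note to the hunk above: if you would rather not wait for the automatic fetch on first use, a uenv image can be pulled ahead of time from a login node. A minimal sketch, assuming the `prgenv-gnu` image and a placeholder version/tag (check the available images first):

```bash
# List matching uenv images in the registry, then pull one into the local repository.
# The version/tag below is a placeholder; use one reported by `uenv image find`.
uenv image find prgenv-gnu
uenv image pull prgenv-gnu/24.11:v1
```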
@@ -34,8 +34,8 @@ A Jupyter session can be started with either a [uenv][ref-uenv] or a [container]

 If the default base images do not meet your requirements, you can specify a custom environment instead. For this purpose, you supply either a custom uenv image/view or [container engine (CE)][ref-container-engine] TOML file under the section `Advanced options` before launching the session. The supported uenvs are compatible with the Jupyter service out of the box, whereas container images typically require the installation of some additional packages.

-??? "Example of a custom Pytorch container"
-    A container image based on recent a NGC Pytorch release requires the installation of the following additional packages to be compatible with the Jupyter service:
+??? "Example of a custom PyTorch container"
+    A container image based on a recent NGC PyTorch release requires the installation of the following additional packages to be compatible with the Jupyter service:

     ```Dockerfile
     FROM nvcr.io/nvidia/pytorch:25.05-py3
@@ -199,14 +199,14 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/

 While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment.

-A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-finetuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell
+A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-fine-tuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell:

 ```bash
 !python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ...
 ```

 !!! warning "torchrun with virtual environments"
-    When using a virtual environment on top of a base image with Pytorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages not available. If not using virtual environments such as with a self-contained Pytorch container, `torchrun` is equivalent to `python -m torch.distributed.run`.
+    When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and the virtual environment packages will not be available. If not using a virtual environment, such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`.

 !!! note "Notebook structure"
     In none of these scenarios are any significant memory allocations or background computations performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively.
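
To make the `%%bash` pattern mentioned in the hunk above concrete, a minimal sketch of such a Jupyter cell follows; the config file name, process counts, and training script arguments are placeholders rather than the exact values from the fine-tuning tutorial:

```bash
%%bash
# Run this cell's contents with bash instead of the IPython kernel.
# All paths and arguments below are illustrative placeholders.
accelerate launch \
    --config_file accelerate_config.yaml \
    --num_machines 1 --num_processes 4 \
    train.py --output_dir output/
```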
@@ -216,19 +216,20 @@ Alternatively to using these launchers, it is also possible to use Slurm to obta
 ```bash
 !srun --overlap -ul --environment /path/to/edf.toml \
     --container-workdir $PWD -n 4 bash -c "\
+    . venv-<base-image-version>/bin/activate
     MASTER_ADDR=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1) \
     MASTER_PORT=29500 \
     RANK=\$SLURM_PROCID LOCAL_RANK=\$SLURM_LOCALID WORLD_SIZE=\$SLURM_NPROCS \
     python train.py ..."
 ```

-where `/path/to/edf.toml` should be replaced by the TOML file and `train.py` is a script using `torch.distributed` for distributed training. This can be further customized with extra Slurm options.
+where `/path/to/edf.toml` should be replaced by the TOML file and `venv-<base-image-version>` by the name of the virtual environment (if used). The script `train.py` uses `torch.distributed` for distributed training. This launch mechanism can be further customized with extra Slurm options.

 !!! warning "Concurrent usage of resources"
     Subtle bugs can occur when running multiple Jupyter notebooks concurrently that each assume access to the full node. Also, some notebooks may hold on to resources such as spawned child processes or allocated memory despite having completed. In this case, resources such as a GPU may still be busy, blocking another notebook from using it. Therefore, it is good practice to only keep one such notebook running that occupies the full node and to restart the kernel once a notebook has completed. If in doubt, system monitoring with `htop` and [nvdashboard](https://github.com/rapidsai/jupyterlab-nvdashboard) can be helpful for debugging.

 !!! warning "Multi-GPU training from a shared Jupyter process"
-    Running multi-GPU training workloads directly from the shared Jupyter process is generally not recommended due to potential inefficiencies and correctness issues (cf. the [Pytorch docs](https://docs.pytorch.org/docs/stable/notes/cuda.html#use-nn-parallel-distributeddataparallel-instead-of-multiprocessing-or-nn-dataparallel)). However, if you need it to e.g. reproduce existing results, it is possible to do so with utilities like `accelerate`'s `notebook_launcher` or [`transformers`](https://github.com/huggingface/transformers)' `Trainer` class. When using these in containers, you will currently need to unset the environment variables `RANK` and `LOCAL_RANK`, that is have the following in a cell at the top of the notebook:
+    Running multi-GPU training workloads directly from the shared Jupyter process is generally not recommended due to potential inefficiencies and correctness issues (cf. the [PyTorch docs](https://docs.pytorch.org/docs/stable/notes/cuda.html#use-nn-parallel-distributeddataparallel-instead-of-multiprocessing-or-nn-dataparallel)). However, if you need it, e.g. to reproduce existing results, it is possible to do so with utilities like `accelerate`'s `notebook_launcher` or [`transformers`](https://github.com/huggingface/transformers)' `Trainer` class. When using these in containers, you will currently need to unset the environment variables `RANK` and `LOCAL_RANK` by adding the following in a cell at the top of the notebook:

     ```python
     import os; os.environ.pop("RANK"); os.environ.pop("LOCAL_RANK");
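
As a practical complement to the concurrent-usage warning above, one quick way to check for stale processes before starting a new multi-GPU run is to query the GPUs directly; this is generic `nvidia-smi` usage, not something prescribed by the documentation:

```bash
# List processes currently holding GPU memory; run from a terminal, or prefix
# with "!" in a Jupyter cell. An empty list means the GPUs are free.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```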

docs/guides/mlp_tutorials/index.md

Lines changed: 5 additions & 6 deletions
@@ -1,11 +1,10 @@
 [](){#ref-guides-mlp-tutorials}
-# MLP Tutorials
+# Machine Learning Platform Tutorials

-These tutorials solve simple MLP tasks using the [Container Engine][ref-container-engine] on the ML Platform.
-
-1. [LLM Inference][ref-mlp-llm-inference-tutorial]
-2. [LLM Fine-tuning][ref-mlp-llm-finetuning-tutorial]
-3. [Nanotron Training][ref-mlp-llm-nanotron-tutorial]
+These tutorials gradually introduce key concepts of the Machine Learning Platform. A particular focus is on the [Container Engine][ref-container-engine] for managing the runtime environment.

+In a [first tutorial][ref-mlp-llm-inference-tutorial], you will learn how to run inference with an LLM on a single node using a container from the NVIDIA GPU Cloud (NGC). Concepts such as container environment description, layering a thin virtual environment on top of the container image, and job launching and monitoring will be introduced.

+Building on the first tutorial, in the [second tutorial][ref-mlp-llm-fine-tuning-tutorial] you will learn how to train (fine-tune) an LLM on multiple GPUs on a single node. For this purpose, you will use HuggingFace's `accelerate` and see best practices for dataset management.

+In the [third tutorial][ref-mlp-llm-nanotron-tutorial], you will apply the techniques from the previous tutorials to enable distributed (pre-)training of a model in `nanotron` on multiple nodes. In particular, this tutorial makes use of model parallelism and introduces the usage of `torchrun` to manage jobs on individual nodes.

docs/guides/mlp_tutorials/llm-finetuning.md renamed to docs/guides/mlp_tutorials/llm-fine-tuning.md

Lines changed: 43 additions & 25 deletions
@@ -1,4 +1,4 @@
-[](){#ref-mlp-llm-finetuning-tutorial}
+[](){#ref-mlp-llm-fine-tuning-tutorial}

 # LLM Fine-tuning Tutorial

@@ -8,45 +8,50 @@ This means that we take the model and train it on some new custom data to change
 To complete the tutorial, we set up some extra libraries that will help us to update the state of the machine learning model.
 We also write a script that will allow us to unlock more of the performance offered by the cluster, by running our fine-tuning task on two or more nodes.

+## Fine-tuning Gemma 7B on the OpenAssistant dataset
+
 ### Prerequisites

 This tutorial assumes you've already successfully completed the [LLM Inference][ref-mlp-llm-inference-tutorial] tutorial.
-For fine-tuning Gemma, we will rely on the NGC PyTorch container and the libraries we've already installed in the Python environment used previously.
+For fine-tuning Gemma, we will rely on the NGC PyTorch container and the libraries we've already installed in the Python virtual environment used previously.

 ### Set up TRL

-We will use HuggingFace TRL to fine-tune Gemma-7B on the [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25).
+We will use HuggingFace TRL (Transformer Reinforcement Learning) to fine-tune Gemma-7B on the [OpenAssistant dataset](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25).
 First, we need to update our Python environment with some extra libraries to support TRL.
 To do this, we can launch an interactive shell in the PyTorch container, just like we did in the previous tutorial.
 Then, we install `peft`:

 ```console
-$ cd $SCRATCH/gemma-inference
-$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
-$ source ./gemma-venv/bin/activate
-$ python -m pip install peft==0.11.1
+[clariden-lnXXX]$ cd $SCRATCH/tutorials/gemma-7b
+[clariden-lnXXX]$ srun --environment=./ngc-pytorch-gemma-24.01.toml --pty bash
+user@nidYYYYYY$ source venv-gemma-24.01/bin/activate
+(venv-gemma-24.01) user@nidYYYYYY$ pip install peft==0.11.1
 ```

 Next, we also need to clone and install the `trl` Git repository so that we have access to the fine-tuning scripts in it.
 For this purpose, we will install the package in editable mode in the virtual environment.
 This makes it available in python scripts independent of the current working directory and without creating a redundant copy of the files.

 ```console
-$ git clone https://github.com/huggingface/trl -b v0.7.11
-$ pip install -e ./trl # install in editable mode
+(venv-gemma-24.01) user@nidYYYYYY$ git clone \
+    https://github.com/huggingface/trl -b v0.7.11
+(venv-gemma-24.01) user@nidYYYYYY$ pip install -e ./trl # (1)!
 ```

+1. Installs trl in editable mode
+
 When this step is complete, you can exit the shell by typing `exit`.

 ### Fine-tune Gemma-7B

-t this point, we can set up a fine-tuning script and start training Gemma-7B.
-Use your favorite text editor to create the file `fine-tune-gemma.sh` just outside the `trl` and `gemma-venv` directories:
+At this point, we can set up a fine-tuning script and start training Gemma-7B.
+Use your favorite text editor to create the file `fine-tune-gemma.sh` just outside the `trl` and `venv-gemma-24.01` directories:

-```bash title="fine-tune-gemma.sh"
+```bash title="$SCRATCH/tutorials/gemma-7b/fine-tune-gemma.sh"
 #!/bin/bash

-source ./gemma-venv/bin/activate
+source venv-gemma-24.01/bin/activate

 set -x

@@ -73,38 +78,50 @@ accelerate launch --config_file trl/examples/accelerate_configs/multi_gpu.yaml \
     --use_peft \
     --lora_r 16 --lora_alpha 32 \
     --lora_target_modules q_proj k_proj v_proj o_proj \
-    --output_dir gemma-finetuned-openassistant
+    --output_dir gemma-fine-tuned-openassistant
 ```

 This script has quite a bit more content to unpack.
-We use HuggingFace accelerate to launch the fine-tuning process, so we need to make sure that accelerate understands which hardware is available and where.
+We use HuggingFace `accelerate` to launch the fine-tuning process, so we need to make sure that `accelerate` understands which hardware is available and where.
 Setting this up will be useful in the long run because it means we can tell Slurm how much hardware to reserve, and this script will set up all the details for us.

 The cluster has four GH200 chips per compute node.
-We can make them accessible to scripts run through srun/sbatch via the option `--gpus-per-node=4`.
+We can make them accessible to scripts run through `srun`/`sbatch` via the option `--gpus-per-node=4`.
 Then, we calculate how many processes accelerate should launch.
 We want to map each GPU to a separate process, which means four processes per node.
 We multiply this by the number of nodes to obtain the total number of processes.
 Next, we use some bash magic to extract the name of the head node from Slurm environment variables.
-Accelerate expects one main node and launches tasks on the other nodes from this main node.
+`accelerate` expects one main node and launches tasks on the other nodes from this main node.
 Having sourced our Python environment at the top of the script, we can then launch Gemma fine-tuning.
-The first four lines of the launch line are used to configure accelerate.
+The first four lines of the launch line are used to configure `accelerate`.
 Everything after that configures the `trl/examples/scripts/sft.py` Python script, which we use to train Gemma.

+!!! note "Dataset management and sharing"
+    For datasets, recommended LUSTRE settings should be used as illustrated in the tutorial on [LLM Inference][ref-mlp-llm-inference-tutorial]. As they have been set there for `HF_HOME`, which `huggingface_hub` uses for its dataset cache, they don't need to be re-applied here.
+
+    To enable your colleagues to also use your datasets, please refer to the [storage guide][ref-guides-storage-sharing].
+
+Make this script executable with:
+
+```console
+[clariden-lnXXX]$ chmod u+x $SCRATCH/tutorials/gemma-7b/fine-tune-gemma.sh
+```
+
 Next, we also need to create a short Slurm batch script to launch our fine-tuning script:

-```bash title="fine-tune-sft.sbatch"
+```bash title="$SCRATCH/tutorials/gemma-7b/submit-fine-tune-gemma.sh"
 #!/bin/bash
-#SBATCH --job-name=gemma-finetune
+#SBATCH --account=<ACCOUNT>
+#SBATCH --job-name=fine-tune-gemma
 #SBATCH --time=00:30:00
 #SBATCH --ntasks-per-node=1
 #SBATCH --gpus-per-node=4
 #SBATCH --cpus-per-task=288
-#SBATCH --account=<ACCOUNT>
+#SBATCH --output logs/slurm-%x-%j.out

 set -x

-srun -ul --environment=gemma-pytorch --container-workdir=$PWD bash fine-tune-gemma.sh
+srun -ul --environment=./ngc-pytorch-gemma-24.01.toml fine-tune-gemma.sh
 ```

 We set a few Slurm parameters like we already did in the previous tutorial.
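
The process-count and head-node logic described above is only partly visible in this hunk; a minimal sketch of that kind of computation, with illustrative variable names that are not necessarily the tutorial's own, could look like:

```bash
# Illustrative sketch: derive accelerate launch parameters from Slurm variables.
GPUS_PER_NODE=4                                    # GH200 modules per node
NUM_PROCESSES=$(( SLURM_NNODES * GPUS_PER_NODE ))  # one process per GPU
MAIN_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
echo "main node: ${MAIN_NODE}, total processes: ${NUM_PROCESSES}"
```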
@@ -116,7 +133,7 @@ We'll start out by launching it on two nodes.
 It should take about 10-15 minutes to fine-tune Gemma:

 ```console
-$ sbatch --nodes=1 fine-tune-sft.sbatch
+[clariden-lnXXX]$ sbatch --nodes=1 submit-fine-tune-gemma.sh
 ```

 ### Compare fine-tuned Gemma against default Gemma
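
A side note on monitoring the job submitted above: the commands below are generic Slurm usage rather than tutorial-specific, and the `logs/` directory referenced by `#SBATCH --output` must exist before submission:

```bash
# Run from the job's working directory ($SCRATCH/tutorials/gemma-7b).
mkdir -p logs                             # Slurm does not create the output directory itself
squeue -u $USER                           # check whether the job is pending or running
tail -f logs/slurm-fine-tune-gemma-*.out  # follow the log written via #SBATCH --output
```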
@@ -131,7 +148,7 @@ input_text = "What are the 5 tallest mountains in the Swiss Alps?"
 We can run inference using our batch script from the previous tutorial:

 ```console
-$ sbatch ./gemma-inference.sbatch
+[clariden-lnXXX]$ sbatch submit-gemma-inference.sh
 ```

 Inspecting the output should yield something like this:
@@ -152,7 +169,8 @@ the 5 tallest mountains in the Swiss Alps:

 Next, we can update the model line in our Python inference script to use the model that we just fine-tuned:

 ```python
-model = AutoModelForCausalLM.from_pretrained("gemma-finetuned-openassistant/checkpoint-400", device_map="auto")
+model = AutoModelForCausalLM.from_pretrained(
+    "gemma-fine-tuned-openassistant/checkpoint-400", device_map="auto")
 ```

 If we re-run inference, the output will be a bit more detailed and explanatory, similar to output we might expect from a helpful chatbot. One example looks like this:
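
If the checkpoint path is unclear, the available checkpoints can be listed before editing the inference script; the path below assumes the working directory and `--output_dir` used earlier in this tutorial:

```bash
# List the checkpoints produced by the fine-tuning run; pick one (e.g. the
# latest) and use its path in the from_pretrained(...) call above.
ls -d $SCRATCH/tutorials/gemma-7b/gemma-fine-tuned-openassistant/checkpoint-*
```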
