Merged: `docs/guides/mlp_tutorials/llm-nanotron-training.md` (2 changes: 1 addition & 1 deletion)

[](){#ref-mlp-llm-nanotron-tutorial}

# LLM Nanotron Training Tutorial

In this tutorial, we will build a container image to run nanotron training jobs.
We will train a 109M parameter model with ~100M wikitext tokens as a proof of concept.

### Prerequisites

It is also recommended to follow the previous tutorials first: [LLM Inference][ref-mlp-llm-inference-tutorial] and [LLM Finetuning][ref-mlp-llm-finetuning-tutorial], as this tutorial builds on them.

### Set up Podman

*(unchanged content of this section is not shown in this diff)*

### Set up an EDF

See the previous tutorial for context. In this case, the EDF will be at `$HOME/.edf/nanotron.toml` and will have the following contents:
```toml title="$HOME/.edf/nanotron.toml"
image = "/capstor/scratch/cscs/<USER>/container-image/nanotron/nanotron-v1.0.sqsh"
# ... (the remaining EDF entries are collapsed in this diff) ...
```
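
As an optional sanity check (not part of the original steps), you can open an interactive shell in the container described by this EDF. This assumes the named EDF is picked up via the container engine's `--environment` flag, as in the previous tutorials, and that PyTorch is installed in the image:

```console
$ srun --environment=nanotron --pty bash
$ python -c "import torch; print(torch.__version__)"
```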

### Preparing a Training Job

Now let's download nanotron.
On the login node, run:

```console
$ git clone https://github.com/huggingface/nanotron.git  # exact clone command collapsed in the diff; upstream repository assumed
$ cd nanotron
```

Then, with your favorite text editor, create the following nanotron configuration file at `$HOME/nanotron/examples/config_tiny_llama_wikitext.yaml`:

```yaml title="$HOME/nanotron/examples/config_tiny_llama_wikitext.yaml"
general:
# ... (most of the configuration is collapsed in this diff) ...
log_level_replica: info
```

This configuration file will train, as a proof of concept, a GPT-2-like Llama model (109M parameters) on approximately 100M tokens of wikitext with the parallelism settings `tp=4, dp=2, pp=1`. These settings use 4 × 2 × 1 = 8 GPUs in total, which is why the job needs two nodes (4 GPUs each) to train.
This training job will require approximately 10 minutes to run.
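
For orientation, the `tp`, `dp`, and `pp` values mentioned above live in the `parallelism` section of the configuration file. The snippet below is a minimal sketch based on nanotron's example configs, not a verbatim excerpt of the (collapsed) file above:

```yaml
# Sketch of the parallelism section (key names follow nanotron's example configs)
parallelism:
  dp: 2   # data parallelism: number of model replicas
  pp: 1   # pipeline parallelism: number of pipeline stages
  tp: 4   # tensor parallelism: GPUs sharing each layer
```
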
Now, create a batch script at `$HOME/nanotron/run_tiny_llama.sh` with the following contents:

```bash title="$HOME/nanotron/run_tiny_llama.sh"
#!/bin/bash
# ... (the SBATCH directives and the first part of the script are collapsed in this diff) ...
--master-addr=\${MASTER_ADDR} \
--master-port=\${MASTER_PORT} \
--nnodes=\${SLURM_NNODES} \
--nproc-per-node=\${SLURM_GPUS_ON_NODE} \
\"

torchrun \${TORCHRUN_ARGS} run_train.py --config-file examples/config_tiny_llama_wikitext.yaml
```

The one-line change in this diff replaces `SLURM_GPUS_PER_TASK` with `SLURM_GPUS_ON_NODE` in the `--nproc-per-node` argument.

A few comments:

- The parts outside the `srun` command will run on the first node of the Slurm allocation for this job. `srun` commands without further specifiers execute with the settings of the `sbatch` script (i.e. using all nodes allocated to the job).
- If you have a [wandb](https://wandb.ai/) API key and want to synchronize the training run, be sure to set the `WANDB_API_KEY` variable. Otherwise, set `WANDB_MODE=offline` instead (a sketch of this environment block follows this list).
- Note that we set `HF_HOME` to a directory on scratch. This places the downloaded dataset on scratch instead of in your home directory.
- The `pip install` command is run only once in every container (i.e. once per compute node).
  Note that this only links the nanotron Python package so that it can be imported in any script, irrespective of the current working directory.
  Because all dependencies of nanotron are already installed in the Dockerfile, no extra libraries will be installed at this point.
  If the installation of the package under development creates artefacts on the shared filesystem (such as binaries from compiled C++/CUDA source code), this can result in a race condition when run from multiple nodes.
  Therefore, in this case, and also when additional external libraries are to be installed, you should either use a venv as shown in previous tutorials, or build everything directly in the Dockerfile.
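
As mentioned in the list above, a minimal sketch of the environment block of the batch script could look like the following; the variable values and the scratch path are placeholders, not a verbatim excerpt of the (collapsed) script:

```bash
# Hypothetical environment block for run_tiny_llama.sh (values are placeholders)
export WANDB_API_KEY=<your-wandb-key>   # synchronize the run with wandb, or instead:
# export WANDB_MODE=offline             # disable online synchronization
export HF_HOME=/capstor/scratch/cscs/<USER>/huggingface   # keep the downloaded dataset on scratch
```
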
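
With the configuration file and batch script in place, the job can be submitted from the login node. The commands below are a generic Slurm workflow rather than steps quoted from this page; the output file name assumes Slurm's default naming:

```console
$ cd $HOME/nanotron
$ sbatch run_tiny_llama.sh
$ squeue -u $USER             # check that the job is pending or running
$ tail -f slurm-<jobid>.out   # follow the training log once the job starts
```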