Commit b66b0eb

Use `console` instead of `bash` code blocks, show hostnames in the shell prompt, and mention login-node policies
1 parent 80f2c19 commit b66b0eb

File tree

3 files changed: 92 additions, 79 deletions

docs/guides/mlp_tutorials/llm-fine-tuning.md

Lines changed: 17 additions & 15 deletions
@@ -22,20 +22,21 @@ First, we need to update our Python environment with some extra libraries to sup
 To do this, we can launch an interactive shell in the PyTorch container, just like we did in the previous tutorial.
 Then, we install `peft`:

-```bash
-$ cd $SCRATCH/tutorials/gemma-7b
-$ srun --environment=./ngc-pytorch-gemma-24.01.toml --pty bash
-$ source venv-gemma-24.01/bin/activate
-$ pip install peft==0.11.1
+```console
+[clariden-lnXXX]$ cd $SCRATCH/tutorials/gemma-7b
+[clariden-lnXXX]$ srun --environment=./ngc-pytorch-gemma-24.01.toml --pty bash
+user@nidYYYYYY$ source venv-gemma-24.01/bin/activate
+(venv-gemma-24.01) user@nidYYYYYY$ pip install peft==0.11.1
 ```

 Next, we also need to clone and install the `trl` Git repository so that we have access to the fine-tuning scripts in it.
 For this purpose, we will install the package in editable mode in the virtual environment.
 This makes it available in python scripts independent of the current working directory and without creating a redundant copy of the files.

-```bash
-$ git clone https://github.com/huggingface/trl -b v0.7.11
-$ pip install -e ./trl # (1)!
+```console
+(venv-gemma-24.01) user@nidYYYYYY$ git clone \
+    https://github.com/huggingface/trl -b v0.7.11
+(venv-gemma-24.01) user@nidYYYYYY$ pip install -e ./trl # (1)!
 ```

 1. Installs trl in editable mode
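
As a quick sanity check after the editable install (an aside, not part of this commit's diff), you can confirm that `trl` is importable from inside the virtual environment; the path in the output is illustrative:

```console
(venv-gemma-24.01) user@nidYYYYYY$ python -c "import trl; print(trl.__version__, trl.__file__)"
0.7.11 /capstor/scratch/cscs/user/tutorials/gemma-7b/trl/trl/__init__.py
```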
@@ -102,8 +103,8 @@ Everything after that configures the `trl/examples/scripts/sft.py` Python script

 Make this script executable with

-```bash
-$ chmod u+x $SCRATCH/tutorials/gemma-7b/fine-tune-gemma.sh
+```console
+[clariden-lnXXX]$ chmod u+x $SCRATCH/tutorials/gemma-7b/fine-tune-gemma.sh
 ```

 Next, we also need to create a short Slurm batch script to launch our fine-tuning script:
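
The batch script itself is not shown in this diff. As a rough sketch of what such a script could look like (the account, time limit, and output path are illustrative assumptions, not the tutorial's exact `submit-fine-tune-gemma.sh`):

```bash
#!/bin/bash
#SBATCH --job-name=fine-tune-gemma
#SBATCH --account=<ACCOUNT>            # project account, as elsewhere in the tutorial
#SBATCH --time=00:30:00                # illustrative time limit
#SBATCH --output=logs/slurm-%x-%j.out  # one log file per job

# run the fine-tuning script inside the container environment on all allocated nodes
srun --environment=./ngc-pytorch-gemma-24.01.toml ./fine-tune-gemma.sh
```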
@@ -131,8 +132,8 @@ Now that we've setup a fine-tuning script and a Slurm batch script, we can launc
 We'll start out by launching it on two nodes.
 It should take about 10-15 minutes to fine-tune Gemma:

-```bash
-$ sbatch --nodes=1 submit-fine-tune-gemma.sh
+```console
+[clariden-lnXXX]$ sbatch --nodes=1 submit-fine-tune-gemma.sh
 ```

 ### Compare fine-tuned Gemma against default Gemma
@@ -146,8 +147,8 @@ input_text = "What are the 5 tallest mountains in the Swiss Alps?"

 We can run inference using our batch script from the previous tutorial:

-```bash
-$ sbatch submit-gemma-inference.sh
+```console
+[clariden-lnXXX]$ sbatch submit-gemma-inference.sh
 ```

 Inspecting the output should yield something like this:
@@ -168,7 +169,8 @@ the 5 tallest mountains in the Swiss Alps:

 Next, we can update the model line in our Python inference script to use the model that we just fine-tuned:

 ```python
-model = AutoModelForCausalLM.from_pretrained("gemma-fine-tuned-openassistant/checkpoint-400", device_map="auto")
+model = AutoModelForCausalLM.from_pretrained(
+    "gemma-fine-tuned-openassistant/checkpoint-400", device_map="auto")
 ```

 If we re-run inference, the output will be a bit more detailed and explanatory, similar to output we might expect from a helpful chatbot. One example looks like this:

docs/guides/mlp_tutorials/llm-inference.md

Lines changed: 54 additions & 42 deletions
@@ -16,17 +16,24 @@ The model we will be running is Google's [Gemma-7B](https://huggingface.co/googl

 This tutorial assumes you are able to access the cluster via SSH. To set up access to CSCS systems, follow the guide [here][ref-ssh], and read through the documentation about the [ML Platform][ref-platform-mlp].

+For clarity, we prepend all shell commands with the hostname and any active Python virtual environment they are executed in. E.g. `clariden-lnXXX` refers to a login node on Clariden, while `nidYYYYYY` is a compute node (with placeholders for numeric values). The commands listed here are run on Clariden, but can be adapted slightly to run on other vClusters as well.
+
+!!! note
+    Login nodes are a shared environment for editing files, preparing and submitting SLURM jobs as well as inspecting logs. They are not intended for running significant data processing or compute work. Any memory- or compute-intensive work should instead be done on compute nodes.
+
+If you need to move data [externally][ref-data-xfer-external] or [internally][ref-data-xfer-internal], please follow the corresponding guides using Globus or the `xfer` queue, respectively.
+
 ### Build a modified NGC PyTorch Container

 In theory, we could just go ahead and use the vanilla container image to run some PyTorch code.
 However, chances are that we will need some additional libraries or software.
 For this reason, we need to use some docker commands to build on top of what is provided by Nvidia.
 To do this, we create a new directory for recipes to build containers in our home directory and set up a [Dockerfile](https://docs.docker.com/reference/dockerfile/):

-```bash
-$ cd $SCRATCH
-$ mkdir -p tutorials/gemma-7b
-$ cd tutorials/gemma-7b
+```console
+[clariden-lnXXX]$ cd $SCRATCH
+[clariden-lnXXX]$ mkdir -p tutorials/gemma-7b
+[clariden-lnXXX]$ cd tutorials/gemma-7b
 ```

 Use your favorite text editor to create a file `Dockerfile` here. The Dockerfile should look like this:
@@ -82,9 +89,10 @@ This step is straightforward, just create the file in your home:

 Before building the container image, we create a dedicated directory to keep track of all images used with the CE. Since container images are large files and the filesystem is a shared resource, we need to apply [best practices for LUSTRE][ref-guides-storage-lustre] so they are properly distributed across storage nodes.

-```bash title="Container image directory with recommended LUSTRE settings"
-$ mkdir -p $SCRATCH/ce-images
-$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images # (1)!
+```console title="Container image directory with recommended LUSTRE settings"
+[clariden-lnXXX]$ mkdir -p $SCRATCH/ce-images
+[clariden-lnXXX]$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M \
+    $SCRATCH/ce-images # (1)!
 ```

 1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB)
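
To verify that the striping layout was applied (an optional check, not part of this commit), `lfs getstripe -d` prints the default layout set on the directory:

```console
[clariden-lnXXX]$ lfs getstripe -d $SCRATCH/ce-images   # shows the composite (PFL) default layout
```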
@@ -94,13 +102,13 @@ Slurm is a workload manager which distributes workloads on the cluster.
 Through Slurm, many people can use the supercomputer at the same time without interfering with one another.


-```bash
-$ srun -A <ACCOUNT> --pty bash
-$ podman build -t ngc-pytorch:24.01 . # (1)!
+```console
+[clariden-lnXXX]$ srun -A <ACCOUNT> --pty bash
+[nidYYYYYY]$ podman build -t ngc-pytorch:24.01 . # (1)!
 # ... lots of output here ...
-$ enroot import -x mount \
-    -o $SCRATCH/ce-images/ngc-pytorch+24.01.sqsh \
-    podman://ngc-pytorch:24.01 # (2)!
+[nidYYYYYY]$ enroot import -x mount \
+    -o $SCRATCH/ce-images/ngc-pytorch+24.01.sqsh \
+    podman://ngc-pytorch:24.01 # (2)!
 # ... more output here ...
 ```

@@ -111,8 +119,8 @@ where you should replace `<ACCOUNT>` with your project account ID.
 At this point, you can exit the Slurm allocation by typing `exit`.
 You should be able to see a new Squashfs file in your container image directory:

-```bash
-$ ls $SCRATCH/ce-images
+```console
+[clariden-lnXXX]$ ls $SCRATCH/ce-images
 ngc-pytorch+24.01.sqsh
 ```

@@ -122,8 +130,8 @@ We will use our freshly-built container `ngc-pytorch+24.01.sqsh` in the followin
 !!! note
     In order to import a container image from a registry without building additional layers on top of it, we can directly use `enroot` (without `podman`). This is useful in this tutorial if we want to use a more recent NGC PyTorch container that was released since `24.11`. Use the following syntax for importing the `25.06` release:

-    ```bash
-    enroot import -x mount \
+    ```console
+    [nidYYYYYY]$ enroot import -x mount \
         -o $SCRATCH/ce-images/ngc-pytorch+25.06.sqsh docker://nvcr.io#nvidia/pytorch:25.06-py3
     ```

@@ -179,16 +187,17 @@ This will be the first time we run our modified container.
 To run the container, we need allocate some compute resources using Slurm and launch a shell, just like we already did to build the container.
 This time, we also use the `--environment` option to specify that we want to launch the shell inside the container specified by our gemma-pytorch EDF file:

-```bash
-$ cd $SCRATCH/tutorials/gemma-7b
-$ srun -A <ACCOUNT> --environment=./ngc-pytorch-gemma-24.01.toml --pty bash
+```console
+[clariden-lnXXX]$ cd $SCRATCH/tutorials/gemma-7b
+[clariden-lnXXX]$ srun -A <ACCOUNT> \
+    --environment=./ngc-pytorch-gemma-24.01.toml --pty bash
 ```

 PyTorch is already setup in the container for us.
 We can verify this by asking pip for a list of installed packages:

-```bash
-$ python -m pip list | grep torch
+```console
+user@nidYYYYYY$ python -m pip list | grep torch
 pytorch-quantization 2.1.2
 torch 2.2.0a0+81ea7a4
 torch-tensorrt 2.2.0a0
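
A similar quick check (not part of this diff) confirms that the GPUs are visible to PyTorch inside the container; the device count depends on the node and allocation:

```console
user@nidYYYYYY$ python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
True 4
```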
@@ -202,19 +211,19 @@ While it is best practice to install stable dependencies in the container image,
 The `--system-site-packages` option of the Python `venv` creation command ensures that we install packages _in addition_ to the existing packages and don't accidentally re-install a new version of PyTorch shadowing the one that has been put in place by Nvidia.
 Next, we activate the environment and use pip to install the two packages we need, `accelerate` and `transformers`:

-```bash
-$ python -m venv --system-site-packages venv-gemma-24.01
-$ source venv-gemma-24.01/bin/activate
-(venv-gemma-24.01)$ pip install \
-    accelerate==0.30.1 transformers==4.38.1 huggingface_hub[cli]
+```console
+user@nidYYYYYY$ python -m venv --system-site-packages venv-gemma-24.01
+user@nidYYYYYY$ source venv-gemma-24.01/bin/activate
+(venv-gemma-24.01) user@nidYYYYYY$ pip install \
+    accelerate==0.30.1 transformers==4.38.1 huggingface_hub[cli]
 # ... pip output ...
 ```

 Before we move on to running the Gemma-7B model, we additionally need to make an account at [HuggingFace](https://huggingface.co), get an API token, and accept the [license agreement](https://huggingface.co/google/gemma-7b-it) for the [Gemma-7B](https://huggingface.co/google/gemma-7b) model. You can save the token to `$SCRATCH` using the huggingface-cli:

-```bash
-$ export HF_HOME=$SCRATCH/huggingface
-$ huggingface-cli login
+```console
+(venv-gemma-24.01) user@nidYYYYYY$ export HF_HOME=$SCRATCH/huggingface
+(venv-gemma-24.01) user@nidYYYYYY$ huggingface-cli login
 ```

 At this point, you can exit the Slurm allocation again by typing `exit`.
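
To confirm that the token was stored under `HF_HOME` and is picked up by the CLI (an optional check, not part of this commit), you can query the logged-in account; the username is a placeholder:

```console
(venv-gemma-24.01) user@nidYYYYYY$ export HF_HOME=$SCRATCH/huggingface   # needs to be set again in a fresh shell
(venv-gemma-24.01) user@nidYYYYYY$ huggingface-cli whoami
<hf-username>
```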
@@ -229,8 +238,9 @@ If you `ls` the contents of the `gemma-inference` folder, you will see that the

 Since [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) will not only contain the API token, but also be the storage location for model, dataset and space caches of `huggingface_hub` (unless `HF_HUB_CACHE` is set), we also want to apply proper LUSTRE striping settings before it gets populated.

-```bash
-$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/huggingface
+```console
+[clariden-lnXXX]$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M \
+    $SCRATCH/huggingface
 ```

 ### Run Inference on Gemma-7B
@@ -302,8 +312,8 @@ The operations performed before the `srun` command resemble largely the operatio

 Once you've finished editing the batch file, you can save it and run it with Slurm:

-```bash
-$ sbatch submit-gemma-inference.sh
+```console
+[clariden-lnXXX]$ sbatch submit-gemma-inference.sh
 ```

 This command should just finish without any output and return you to your terminal.
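
While the job is queued or running, it can be monitored from the login node (an aside, not part of this diff); the job ID, partition, and timings below are illustrative:

```console
[clariden-lnXXX]$ squeue --me
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  543210    normal gemma-in     user  R       0:42      1 nidYYYYYY
```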
@@ -314,8 +324,8 @@ Once your job finishes, you will find a file in the same directory you ran it fr
 For this tutorial, you should see something like the following:


-```bash
-$ cat logs/slurm-gemma-inference-543210.out
+```console
+[clariden-lnXXX]$ cat logs/slurm-gemma-inference-543210.out
 /capstor/scratch/cscs/user/gemma-inference/venv-gemma-24.01/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
 warnings.warn(
 Gemma's activation function should be approximate GeLU and not exact GeLU.
@@ -352,10 +362,11 @@ Move on to the next tutorial or try the challenge.
 !!! info "Collaborating in Git"

     In order to track and exchange your progress with colleagues, you can use standard `git` commands on the host, i.e. in the directory `$SCRATCH/tutorials/gemma-7b` run
-    ```bash
-    $ git init .
-    $ git remote add origin git@github.com:<github-username>/alps-mlp-tutorials-gemma-7b.git # (1)!
-    $ ... # git add/commit
+    ```console
+    [clariden-lnXXX]$ git init .
+    [clariden-lnXXX]$ git remote add origin \
+        git@github.com:<github-username>/alps-mlp-tutorials-gemma-7b.git # (1)!
+    [clariden-lnXXX]$ ... # git add/commit
     ```

     1. Use any alternative Git hosting service instead of Github
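
The elided `git add/commit` placeholder could, for example, expand to the following sequence (file list, commit message, and branch name are illustrative):

```console
[clariden-lnXXX]$ git add Dockerfile ngc-pytorch-gemma-24.01.toml gemma-inference.py   # illustrative file list
[clariden-lnXXX]$ git commit -m "Add Gemma-7B inference setup"
[clariden-lnXXX]$ git push -u origin main
```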
@@ -369,8 +380,9 @@ Move on to the next tutorial or try the challenge.

 Using the same approach as in the latter half of step 4, use pip to install the package `nvitop`. This is a tool that shows you a concise real-time summary of GPU activity. Then, run Gemma and launch `nvitop` at the same time:

-```bash
-(venv-gemma-24.01)$ python gemma-inference.py > gemma-output.log 2>&1 & nvitop
+```console
+(venv-gemma-24.01) user@nidYYYYYY$ python gemma-inference.py \
+    > gemma-output.log 2>&1 & nvitop
 ```

 Note the use of bash `> gemma-output.log 2>&1` to hide any output from Python.
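
After quitting `nvitop`, the backgrounded inference process can still be inspected with standard shell job control (an aside, not part of this commit):

```console
(venv-gemma-24.01) user@nidYYYYYY$ jobs                       # list background jobs in this shell
(venv-gemma-24.01) user@nidYYYYYY$ tail -f gemma-output.log   # follow the redirected Python output
```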

docs/guides/mlp_tutorials/llm-nanotron-training.md

Lines changed: 21 additions & 22 deletions
@@ -25,9 +25,9 @@ If not already done as part of the [LLM Inference tutorial][ref-mlp-llm-inferenc

 Create a directory to store container images used with CE and configure it with [recommended LUSTRE settings][ref-guides-storage-lustre]:

-```bash title="Container image directory with recommended LUSTRE settings"
-$ mkdir -p $SCRATCH/ce-images
-$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images # (1)!
+```console title="Container image directory with recommended LUSTRE settings"
+[clariden-lnXXX]$ mkdir -p $SCRATCH/ce-images
+[clariden-lnXXX]$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M $SCRATCH/ce-images # (1)!
 ```

 1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB) or are striped across all storage nodes (above 64 MB)
@@ -92,11 +92,11 @@ RUN pip install \

 Then build and import the container.

-```bash
-$ cd $SCRATCH/tutorials/nanotron-pretrain
-$ podman build -f Dockerfile -t ngc-nanotron:24.04 .
-$ enroot import -x mount \
-    -o $SCRATCH/ce-images/ngc-nanotron+24.04.sqsh podman://ngc-nanotron:24.04 # (1)!
+```console
+[nidYYYYYY]$ cd $SCRATCH/tutorials/nanotron-pretrain
+[nidYYYYYY]$ podman build -f Dockerfile -t ngc-nanotron:24.04 .
+[nidYYYYYY]$ enroot import -x mount \
+    -o $SCRATCH/ce-images/ngc-nanotron+24.04.sqsh podman://ngc-nanotron:24.04 # (1)!
 ```

 1. We import container images into a canonical location under $SCRATCH.
@@ -156,23 +156,22 @@ Note that, if you built your container image elsewhere, you will need to modify
 Now let's download nanotron.
 In the login node run:

-```bash
-$ cd $SCRATCH/tutorials/nanotron-pretrain
-$ git clone https://github.com/huggingface/nanotron.git
-$ cd nanotron
-$ git checkout 5f8a52b08b702e206f31f2660e4b6f22ac328c95 # (1)!
+```console
+[clariden-lnXXX]$ cd $SCRATCH/tutorials/nanotron-pretrain
+[clariden-lnXXX]$ git clone https://github.com/huggingface/nanotron.git
+[clariden-lnXXX]$ cd nanotron
+[clariden-lnXXX]$ git checkout 5f8a52b08b702e206f31f2660e4b6f22ac328c95 # (1)!
 ```

 1. This ensures the compatibility of nanotron with the following example. For general usage, there is no reason to stick to an outdated version of nanotron, though.

 We will install nanotron in a thin virtual environment on top of the container image built above. This proceeds as in the [LLM Inference][ref-mlp-llm-inference-tutorial].

-```bash
-$ srun -A <ACCOUNT> --environment=./ngc-nanotron-24.04.toml --pty bash
-$ python -m venv --system-site-packages venv-24.04
-$ source venv-24.04/bin/activate
-$ cd nanotron/ && pip install -e .
-"
+```console
+[clariden-lnXXX]$ srun -A <ACCOUNT> --environment=./ngc-nanotron-24.04.toml --pty bash
+user@nidYYYYYY$ python -m venv --system-site-packages venv-24.04
+user@nidYYYYYY$ source venv-24.04/bin/activate
+(venv-24.04) user@nidYYYYYY$ cd nanotron/ && pip install -e .
 ```

 This creates a virtual environment on top of this container image (`--system-site-packages` ensuring access to system-installed site-packages) and installs nanotron in editable mode inside it. Because all dependencies of nanotron are already installed in the Dockerfile, no extra libraries will be installed at this point.
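
As a quick check (not part of this commit) that the editable install is visible inside the virtual environment, nanotron can be imported and its source location printed; the exact path is illustrative:

```console
(venv-24.04) user@nidYYYYYY$ python -c "import nanotron; print(nanotron.__file__)"
/capstor/scratch/cscs/user/tutorials/nanotron-pretrain/nanotron/src/nanotron/__init__.py
```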
@@ -344,7 +343,7 @@ srun -ul --environment=./ngc-nanotron-24.04.toml bash -c "
 !!! warning "`torchrun` with virtual environments"
     When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages not available. If not using virtual environments such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`.

-!!! note "Using srun instead of torchrun"
+!!! note "Using srun instead of `torchrun`"
     In many cases, workloads launched with `torchrun` can equivalently be launched purely with SLURM by setting some extra environment variables for `torch.distributed`. This simplifies the overall setup. That is, the `srun` statement in the above `sbatch` script can be rewritten as

     ```bash title="$SCRATCH/tutorials/nanotron-pretrain/run_tiny_llama.sh"
@@ -388,13 +387,13 @@ srun -ul --environment=./ngc-nanotron-24.04.toml bash -c "
 Run:

 ```console
-$ sbatch run_tiny_llama.sh
+[clariden-lnXXX]$ sbatch run_tiny_llama.sh
 ```

 You can inspect if your job has been submitted successfully by running `squeue --me` and looking for your username. Once the run starts, there will be a new file under `logs/`. You can inspect the status of your run using:

 ```console
-$ tail -f logs/<logfile>
+[clariden-lnXXX]$ tail -f logs/<logfile>
 ```

 In the end, the checkpoints of the model will be saved in `checkpoints/`.
