If we re-run inference, the output will be a bit more detailed and explanatory, similar to output we might expect from a helpful chatbot. One example looks like this:
`docs/guides/mlp_tutorials/llm-inference.md` (54 additions, 42 deletions)
@@ -16,17 +16,24 @@ The model we will be running is Google's [Gemma-7B](https://huggingface.co/googl
This tutorial assumes you are able to access the cluster via SSH. To set up access to CSCS systems, follow the guide [here][ref-ssh], and read through the documentation about the [ML Platform][ref-platform-mlp].

For clarity, we prepend all shell commands with the hostname and any active Python virtual environment they are executed in. E.g. `clariden-lnXXX` refers to a login node on Clariden, while `nidYYYYYY` is a compute node (with placeholders for numeric values). The commands listed here are run on Clariden, but can be adapted slightly to run on other vClusters as well.

!!! note
    Login nodes are a shared environment for editing files, preparing and submitting SLURM jobs, and inspecting logs. They are not intended for running significant data processing or compute work. Any memory- or compute-intensive work should instead be done on compute nodes.

    If you need to move data [externally][ref-data-xfer-external] or [internally][ref-data-xfer-internal], please follow the corresponding guides using Globus or the `xfer` queue, respectively.
### Build a modified NGC PyTorch Container
In theory, we could just go ahead and use the vanilla container image to run some PyTorch code.
However, chances are that we will need some additional libraries or software.
For this reason, we use some Docker commands to build on top of what is provided by Nvidia.
To do this, we create a new directory for container build recipes under `$SCRATCH` and set up a [Dockerfile](https://docs.docker.com/reference/dockerfile/):
```console
[clariden-lnXXX]$ cd $SCRATCH
[clariden-lnXXX]$ mkdir -p tutorials/gemma-7b
[clariden-lnXXX]$ cd tutorials/gemma-7b
```
Use your favorite text editor to create a file `Dockerfile` here. The Dockerfile should look like this:
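As a rough sketch of the general shape such a file takes (the base image tag and the installed package are illustrative assumptions, not the tutorial's actual recipe), a Dockerfile building on the NGC PyTorch image could look like:

```dockerfile
# Illustrative sketch only: tag and packages are assumptions.
# Start from Nvidia's NGC PyTorch base image.
FROM nvcr.io/nvidia/pytorch:24.01-py3

# Add extra system packages on top of what the NGC image provides.
RUN apt-get update \
    && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*
```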
@@ -82,9 +89,10 @@ This step is straightforward, just create the file in your home:
Before building the container image, we create a dedicated directory to keep track of all images used with the CE. Since container images are large files and the filesystem is a shared resource, we need to apply [best practices for LUSTRE][ref-guides-storage-lustre] so they are properly distributed across storage nodes.
```bash title="Container image directory with recommended LUSTRE settings"
1. This makes sure that files stored subsequently end up on the same storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB), or are striped across all storage nodes (above 64 MB).
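As a hedged sketch of such a progressive file layout (the thresholds mirror the footnote; the exact command and options used on CSCS systems may differ, so treat this as an assumption based on Lustre's composite-layout syntax):

```console
[clariden-lnXXX]$ mkdir -p $SCRATCH/ce-images
[clariden-lnXXX]$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 $SCRATCH/ce-images
```

Files created in the directory afterwards inherit this layout automatically.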
@@ -94,13 +102,13 @@ Slurm is a workload manager which distributes workloads on the cluster.
Through Slurm, many people can use the supercomputer at the same time without interfering with one another.
@@ -111,8 +119,8 @@ where you should replace `<ACCOUNT>` with your project account ID.
At this point, you can exit the Slurm allocation by typing `exit`.
You should be able to see a new Squashfs file in your container image directory:
```console
[clariden-lnXXX]$ ls $SCRATCH/ce-images
ngc-pytorch+24.01.sqsh
```
@@ -122,8 +130,8 @@ We will use our freshly-built container `ngc-pytorch+24.01.sqsh` in the followin
!!! note
    In order to import a container image from a registry without building additional layers on top of it, we can directly use `enroot` (without `podman`). This is useful in this tutorial if we want to use a more recent NGC PyTorch container that was released since `24.11`. Use the following syntax for importing the `25.06` release:
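    As a hedged sketch (the output path follows this tutorial's `ce-images` convention; `docker://REGISTRY#IMAGE:TAG` is enroot's URI syntax), such an import might look like:

    ```console
    [clariden-lnXXX]$ enroot import -o $SCRATCH/ce-images/ngc-pytorch+25.06.sqsh docker://nvcr.io#nvidia/pytorch:25.06-py3
    ```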
@@ -179,16 +187,17 @@ This will be the first time we run our modified container.
To run the container, we need to allocate some compute resources using Slurm and launch a shell, just like we already did to build the container.
This time, we also use the `--environment` option to specify that we want to launch the shell inside the container specified by our gemma-pytorch EDF file:
```console
[clariden-lnXXX]$ cd $SCRATCH/tutorials/gemma-7b
[clariden-lnXXX]$ srun -A <ACCOUNT> --environment=./ngc-pytorch-gemma-24.01.toml --pty bash
```
We can verify this by asking pip for a list of installed packages:
```console
user@nidYYYYYY$ python -m pip list | grep torch
pytorch-quantization         2.1.2
torch                        2.2.0a0+81ea7a4
torch-tensorrt               2.2.0a0
```
@@ -202,19 +211,19 @@ While it is best practice to install stable dependencies in the container image,
The `--system-site-packages` option of the Python `venv` creation command ensures that we install packages _in addition_ to the existing packages and don't accidentally re-install a new version of PyTorch shadowing the one that has been put in place by Nvidia.
Next, we activate the environment and use pip to install the two packages we need, `accelerate` and `transformers`:
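A sketch of these steps (the virtual environment name is assumed for illustration):

```console
user@nidYYYYYY$ python -m venv --system-site-packages venv-gemma-24.01
user@nidYYYYYY$ source venv-gemma-24.01/bin/activate
(venv-gemma-24.01) user@nidYYYYYY$ pip install accelerate transformers
```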
Before we move on to running the Gemma-7B model, we additionally need to make an account at [HuggingFace](https://huggingface.co), get an API token, and accept the [license agreement](https://huggingface.co/google/gemma-7b-it) for the [Gemma-7B](https://huggingface.co/google/gemma-7b) model. You can save the token to `$SCRATCH` using the huggingface-cli:
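One possible shape for this, assuming the `huggingface-cli` entry point that ships with `huggingface_hub` (the cache path is illustrative):

```console
user@nidYYYYYY$ export HF_HOME=$SCRATCH/huggingface
user@nidYYYYYY$ huggingface-cli login
```

With `HF_HOME` pointing at `$SCRATCH`, the token entered at the login prompt is stored there rather than under your home directory.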
At this point, you can exit the Slurm allocation again by typing `exit`.
@@ -229,8 +238,9 @@ If you `ls` the contents of the `gemma-inference` folder, you will see that the
Since [`HF_HOME`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) will not only contain the API token, but also be the storage location for model, dataset and space caches of `huggingface_hub` (unless `HF_HUB_CACHE` is set), we also want to apply proper LUSTRE striping settings before it gets populated.
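A minimal sketch of preparing that location before first use (the `huggingface` subdirectory name is an assumption, and the fallback to `$HOME` exists only so the snippet also runs outside the cluster):

```shell
# Pick the cache root: $SCRATCH on the cluster, $HOME as an off-cluster fallback.
export HF_HOME="${SCRATCH:-$HOME}/huggingface"

# Create it up front, so striping settings can be applied before it is populated.
mkdir -p "$HF_HOME"
echo "HF_HOME=$HF_HOME"
```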
/capstor/scratch/cscs/user/gemma-inference/venv-gemma-24.01/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Gemma's activation function should be approximate GeLU and not exact GeLU.
@@ -352,10 +362,11 @@ Move on to the next tutorial or try the challenge.
!!! info "Collaborating in Git"
    In order to track and exchange your progress with colleagues, you can use standard `git` commands on the host, i.e. in the directory `$SCRATCH/tutorials/gemma-7b` run
1. Use any alternative Git hosting service instead of GitHub
@@ -369,8 +380,9 @@ Move on to the next tutorial or try the challenge.
Using the same approach as in the latter half of step 4, use pip to install the package `nvitop`. This is a tool that shows you a concise real-time summary of GPU activity. Then, run Gemma and launch `nvitop` at the same time:
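A sketch of one way to arrange this (the inference script name is assumed; running it in the background of the same shell is just one option, a second `srun` shell on the same node works as well):

```console
user@nidYYYYYY$ pip install nvitop
user@nidYYYYYY$ python gemma-inference.py &
user@nidYYYYYY$ nvitop
```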
1. This ensures the compatibility of nanotron with the following example. For general usage, there is no reason to stick to an outdated version of nanotron, though.
We will install nanotron in a thin virtual environment on top of the container image built above. This proceeds as in the [LLM Inference][ref-mlp-llm-inference-tutorial].
```console
[clariden-lnXXX]$ srun -A <ACCOUNT> --environment=./ngc-nanotron-24.04.toml --pty bash
(venv-24.04) user@nidYYYYYY$ cd nanotron/ && pip install -e .
```
This creates a virtual environment on top of this container image (`--system-site-packages` ensuring access to system-installed site-packages) and installs nanotron in editable mode inside it. Because all dependencies of nanotron are already installed in the Dockerfile, no extra libraries will be installed at this point.
!!! warning "`torchrun` with virtual environments"
    When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and packages installed in the virtual environment will not be available. If you are not using a virtual environment, such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`.
!!! note "Using srun instead of `torchrun`"
    In many cases, workloads launched with `torchrun` can equivalently be launched purely with SLURM by setting some extra environment variables for `torch.distributed`. This simplifies the overall setup. That is, the `srun` statement in the above `sbatch` script can be rewritten as
You can check whether your job has been submitted successfully by running `squeue --me` and looking for your username. Once the run starts, there will be a new file under `logs/`. You can inspect the status of your run using:
```console
[clariden-lnXXX]$ tail -f logs/<logfile>
```
In the end, the checkpoints of the model will be saved in `checkpoints/`.