
Commit ec581f3

Author: Harsh Shah (committed)
Consolidate Llama3.1 and Llama3.3 recipes in a single recipe
1 parent 95190af commit ec581f3

File tree: 2 files changed (+26, -245 lines)


inference/trillium/vLLM/Llama3.1/README.md

Lines changed: 0 additions & 231 deletions
This file was deleted.

inference/trillium/vLLM/Llama3.3/README.md renamed to inference/trillium/vLLM/Llama3/README.md

Lines changed: 26 additions & 14 deletions
@@ -1,25 +1,28 @@
-# Serve Llama3.3 with vLLM on TPU VMs
+# Serve Llama3.1 and Llama3.3 with vLLM on TPU VMs

 In this guide, we show how to serve
+[Llama3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct),
+[Llama3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) and
 [Llama3.3-70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).

 > **Note:** Access to Llama models on Hugging Face requires accepting the Community License Agreement and awaiting approval before you can download and serve them.

-## Step 0: Install `gcloud cli`
+## Step 0: Install `gcloud CLI`

 You can reproduce this experiment from your dev environment
 (e.g. your laptop).
 You need to install `gcloud` locally to complete this tutorial.

-To install `gcloud cli` please follow this guide:
+To install `gcloud CLI`, please follow this guide:
 [Install the gcloud CLI](https://cloud.google.com/sdk/docs/install#mac)

 Once it is installed, you can log in to GCP from your terminal with this
 command: `gcloud auth login`.

 ## Step 1: Create a v6e TPU instance

-We create a single VM. For Llama3.3-70B, at least 8 chips are required. If you need a different number of
+We create a single VM. For Llama3.1-8B, 1 chip is sufficient, and for the 70B
+models, at least 8 chips are required. If you need a different number of
 chips, you can set a different value for `--topology` such as `1x1`,
 `2x4`, etc.
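As a quick reference, the chip guidance above maps models to topologies as sketched below; the shell variables here are illustrative, not part of the guide:

```bash
# Illustrative mapping, summarizing the guidance above:
#   Llama3.1-8B                 -> 1 chip  (v6e-1, topology 1x1)
#   Llama3.1-70B / Llama3.3-70B -> 8 chips (v6e-8, topology 2x4)
MODEL=meta-llama/Llama-3.3-70B-Instruct
case "$MODEL" in
  *Llama-3.1-8B*) TOPOLOGY=1x1 ;;  # single chip
  *70B*)          TOPOLOGY=2x4 ;;  # 8 chips
esac
```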

@@ -53,7 +56,7 @@ export ZONE=your-tpu-zone
 export PROJECT=your-tpu-project
 export QR_ID=your-queued-resource-id # e.g. my-qr-request

-# This command requests a v6e-8 (8 chips). Adjust accelerator-type for different sizes. For 1 chip, use --accelerator-type v6e-1.
+# This command requests a v6e-8 (8 chips). Adjust accelerator-type for different sizes. For 1 chip (Llama3.1-8B), use --accelerator-type v6e-1.
 gcloud alpha compute tpus queued-resources create $QR_ID \
   --node-id $TPU_NAME \
   --project $PROJECT --zone $ZONE \
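The tail of the `create` command falls outside this hunk. A minimal sketch of a complete request is below; the `--accelerator-type` and `--runtime-version` values are assumptions based on the v6e guidance above, not lines from this commit:

```bash
# Hypothetical complete invocation; use v6e-1 for Llama3.1-8B,
# v6e-8 for the 70B models.
gcloud alpha compute tpus queued-resources create $QR_ID \
  --node-id $TPU_NAME \
  --project $PROJECT --zone $ZONE \
  --accelerator-type v6e-8 \
  --runtime-version v2-alpha-tpuv6e
```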
@@ -69,25 +72,25 @@ gcloud alpha compute tpus queued-resources list --project $PROJECT --zone $ZONE

 Once the state is `ACTIVE`, your TPU VM is ready and you can proceed to the next steps.

-## Step 2: ssh to the instance
+## Step 2: SSH to the instance

 ```bash
 gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
 ```

-## Step 3: Use the vllm docker image for TPU
+## Step 3: Use the latest vLLM Docker image for TPU

 ```bash
 export DOCKER_URI=vllm/vllm-tpu:nightly-20251129-28607fc-39e63de
 ```

-The docker image is pinged here for users to reproduce the [results below](#section-benchmarking).
+The Docker image is pinned here for users to reproduce the [results below](#section-benchmarking).

 To use the latest stable version, set `DOCKER_URI=vllm/vllm-tpu:latest`.

 To use the latest nightly-built image that has more recent features/improvements, set `DOCKER_URI=vllm/vllm-tpu:nightly`.

-## Step 4: Run the docker container in the TPU instance
+## Step 4: Run the Docker container in the TPU instance

 ```bash
 sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
@@ -97,7 +100,7 @@ sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
   --entrypoint /bin/bash ${DOCKER_URI}
 ```

-> **Note:** 150GB should be sufficient for the 70B model. For the 8B model allocate at least 17GB for the weights.
+> **Note:** 150GB should be sufficient for the 70B models. For the 8B model, allocate at least 17GB for the weights.

 > **Note:** See [this guide](https://cloud.google.com/tpu/docs/attach-durable-block-storage) for attaching durable block storage to TPUs.
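The middle of the `docker run` command is elided by the diff context. A minimal sketch of a full invocation, assuming a durable disk mounted at `/mnt/disks/persist` (both the mount path and the `-v` flag usage are illustrative, not from this commit):

```bash
# Hypothetical full invocation; the -v mount assumes a durable disk
# attached at /mnt/disks/persist, per the storage note above.
sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
  -v /mnt/disks/persist:/workspace/data \
  --entrypoint /bin/bash ${DOCKER_URI}
```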
@@ -116,6 +119,8 @@ export HF_TOKEN=<your HF token>
 Now we start the vllm server.
 Make sure you keep this terminal open for the entire duration of this experiment.

+Here is the serving command for the 70B models:
+
 ```bash
 export MAX_MODEL_LEN=2048
 export TP=8 # number of chips
@@ -132,6 +137,13 @@ vllm serve meta-llama/Llama-3.3-70B-Instruct \
   --max-model-len $MAX_MODEL_LEN
 ```

+| Model | Input/Output Scenario | max-num-batched-tokens | max-num-seqs |
+| :--- | :--- | :--- | :--- |
+| Llama-3.3-70B-Instruct / Llama-3.1-70B-Instruct | Prefill Heavy | 2048 | 256 |
+| Llama-3.3-70B-Instruct / Llama-3.1-70B-Instruct | Decode Heavy / Balanced | 512 | 256 |
+
+For the 8B model on a v6e-1 (1-chip) instance, we recommend `--max-num-batched-tokens 1024 --max-num-seqs 128` (see the sketch below).
+
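A minimal sketch of the corresponding 8B serving command on a v6e-1, mirroring the 70B invocation above; the exact flag set is an assumption, not part of this commit:

```bash
# Hypothetical 8B serving command on a single v6e chip (TP=1).
export MAX_MODEL_LEN=2048
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-num-batched-tokens 1024 \
  --max-num-seqs 128 \
  --max-model-len $MAX_MODEL_LEN
```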
 It takes a few minutes depending on the model size to prepare the server.
 Once you see the snippet below in the logs, it means that the server is ready
 to serve requests or run benchmarks:
@@ -157,7 +169,7 @@ export PROJECT=your-tpu-project
 gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
 ```

-## Step 8: access the running container
+## Step 8: Access the running container

 ```bash
 sudo docker exec -it $USER-vllm bash
@@ -178,7 +190,7 @@ curl http://localhost:8000/v1/completions \
 }'
 ```
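Only the tail of the completion request survives the diff context. A sketch of a full request against vLLM's OpenAI-compatible endpoint, with an illustrative prompt and sampling values:

```bash
# Illustrative request; set "model" to the model you are serving.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 32,
    "temperature": 0
  }'
```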

-## Step 10: Preparing the test image
+## Step 10: Prepare the test image

 You will need to install `datasets` as it's not available in the base vllm
 image.
@@ -187,7 +199,7 @@ image.
 pip install datasets
 ```

-## <a id="section-benchmarking"></a>Step 11: Run the benchmarking
+## <a id="section-benchmarking"></a>Step 11: Run the benchmark

 Finally, we are ready to run the benchmark:
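The benchmark invocation itself falls outside the diff context. A sketch using vLLM's bundled `benchmarks/benchmark_serving.py` follows; the script choice, dataset, and request counts are assumptions, not lines from this commit:

```bash
# Hypothetical benchmark run; adjust the model and request mix to your scenario.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dataset-name random \
  --random-input-len 1800 \
  --random-output-len 128 \
  --num-prompts 1000
```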

@@ -271,4 +283,4 @@ Mean ITL (ms): 35.12
 Median ITL (ms): 30.73
 P99 ITL (ms): 47.03
 ==================================================
-```
+```
