Commit b005d56 (1 parent: af30a56)

Create Llama3.3-70B vLLM v6e recipe with in/out=1k/1k benchmark results (#114)

2 files changed: +275 -0 lines
# Serve Llama3.3 with vLLM on TPU VMs

In this guide, we show how to serve
[Llama3.3-70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).

> **Note:** Access to Llama models on Hugging Face requires accepting the Community License Agreement and awaiting approval before you can download and serve them.

## Step 0: Install the `gcloud` CLI

You can reproduce this experiment from your dev environment
(e.g. your laptop).
You need to install `gcloud` locally to complete this tutorial.

To install the `gcloud` CLI, please follow this guide:
[Install the gcloud CLI](https://cloud.google.com/sdk/docs/install#mac)

Once it is installed, you can log in to GCP from your terminal with this
command: `gcloud auth login`.
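Optionally, if you work across multiple projects, you can also set default `gcloud` properties after logging in. This is a minimal sketch that is not part of the original recipe; the project and zone values are placeholders.

```bash
gcloud auth login

# Optional: set default project and zone for your gcloud config.
# Replace the placeholders with your own values.
gcloud config set project your-tpu-project
gcloud config set compute/zone your-tpu-zone
```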
## Step 1: Create a v6e TPU instance

We create a single VM. For Llama3.3-70B, at least 8 chips are required. If you need a different number of
chips, you can set a different value for `--topology`, such as `1x1`,
`2x4`, etc.

To learn more about topologies, see [v6e VM Types](https://cloud.google.com/tpu/docs/v6e#vm-types).

> **Note:** Acquiring on-demand TPUs can be challenging due to high demand. We recommend using [Queued Resources](https://cloud.google.com/tpu/docs/queued-resources) to ensure you get the required capacity.

### Option 1: Create an on-demand TPU VM

This command attempts to create a TPU VM immediately.

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

# This command creates a TPU VM with 8 Trillium (v6e) chips - adjust it to suit your needs.
gcloud alpha compute tpus tpu-vm create $TPU_NAME \
  --type v6e --topology 2x4 \
  --project $PROJECT --zone $ZONE --version v2-alpha-tpuv6e
```
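Creation can take a few minutes. As an optional check that is not part of the original recipe, you can list the TPU VMs in the zone to confirm the new instance is `READY`:

```bash
# Optional: the new VM should appear with state READY once provisioning finishes.
gcloud compute tpus tpu-vm list --project $PROJECT --zone $ZONE
```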
### Option 2: Use Queued Resources (Recommended)

With Queued Resources, you submit a request for TPUs and it gets fulfilled
when capacity is available.

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project
export QR_ID=your-queued-resource-id # e.g. my-qr-request

# This command requests a v6e-8 (8 chips). Adjust accelerator-type for different sizes. For 1 chip, use --accelerator-type v6e-1.
gcloud alpha compute tpus queued-resources create $QR_ID \
  --node-id $TPU_NAME \
  --project $PROJECT --zone $ZONE \
  --accelerator-type v6e-8 \
  --runtime-version v2-alpha-tpuv6e
```

You can check the status of your request with:

```bash
gcloud alpha compute tpus queued-resources list --project $PROJECT --zone $ZONE
```

Once the state is `ACTIVE`, your TPU VM is ready and you can proceed to the next steps.
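If you want the details of a single request rather than the whole list, the same command group also offers `describe`. A brief sketch using the variables defined above:

```bash
# Show the full status of one queued-resource request by ID.
gcloud alpha compute tpus queued-resources describe $QR_ID \
  --project $PROJECT --zone $ZONE
```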
## Step 2: SSH to the instance

```bash
gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 3: Use the vLLM Docker image for TPU

```bash
export DOCKER_URI=vllm/vllm-tpu:nightly-20251129-28607fc-39e63de
```

The Docker image is pinned here so users can reproduce the [results below](#section-benchmarking).

To use the latest stable version, set `DOCKER_URI=vllm/vllm-tpu:latest`.

To use the latest nightly-built image, which has more recent features and improvements, set `DOCKER_URI=vllm/vllm-tpu:nightly`.
## Step 4: Run the Docker container in the TPU instance

```bash
sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
  -v /dev/shm:/dev/shm \
  --shm-size 150gb \
  -p 8000:8000 \
  --entrypoint /bin/bash ${DOCKER_URI}
```

> **Note:** 150GB should be sufficient for the 70B model. For the 8B model, allocate at least 17GB for the weights.

> **Note:** See [this guide](https://cloud.google.com/tpu/docs/attach-durable-block-storage) for attaching durable block storage to TPUs.
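These sizes follow from a rough rule of thumb: bf16 weights take about 2 bytes per parameter, so the 70B model needs roughly 140GB for its weights (hence 150GB) and the 8B model roughly 16GB (hence at least 17GB).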
## Step 5: Set up env variables

Export your Hugging Face token, along with other environment variables, inside
the container.

```bash
export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
```
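As an optional sanity check before downloading roughly 140GB of weights, you can confirm the token is accepted. This assumes the image ships `huggingface_hub` (which vLLM depends on) and that your version of it reads the `HF_TOKEN` environment variable:

```bash
# Should print your Hugging Face username if the token is valid.
huggingface-cli whoami
```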
## Step 6: Serve the model

Now we start the vLLM server.
Make sure you keep this terminal open for the entire duration of this experiment.

```bash
export MAX_MODEL_LEN=2048
export TP=8 # number of chips

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --seed 42 \
  --disable-log-requests \
  --no-enable-prefix-caching \
  --async-scheduling \
  --gpu-memory-utilization 0.98 \
  --max-num-batched-tokens 512 \
  --max-num-seqs 256 \
  --tensor-parallel-size $TP \
  --max-model-len $MAX_MODEL_LEN
```

Depending on the model size, it takes a few minutes to prepare the server.
Once you see the snippet below in the logs, the server is ready
to serve requests or run benchmarks:

```bash
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
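As an optional check that is not part of the original recipe, vLLM's OpenAI-compatible server also exposes a models endpoint you can query from another shell on the VM to confirm the model is loaded:

```bash
# Should return a JSON payload that lists meta-llama/Llama-3.3-70B-Instruct.
curl -s http://localhost:8000/v1/models
```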
## Step 7: Prepare the test environment

Open a new terminal to test the server and run the benchmark (keep the previous terminal open).

First, we SSH into the TPU VM from the new terminal:

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 8: Access the running container

```bash
sudo docker exec -it $USER-vllm bash
```

## Step 9: Test the server

Let's submit a test request to the server. This verifies that the server launched properly and returns a legitimate response from the model.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "I love the mornings, because ",
    "max_tokens": 200,
    "temperature": 0
  }'
```
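Since this is an instruct-tuned model, you can also exercise the chat endpoint. This is a minimal sketch against vLLM's OpenAI-compatible `/v1/chat/completions` route; the prompt is just an example:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Name one use case for TPUs."}],
    "max_tokens": 100,
    "temperature": 0
  }'
```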
## Step 10: Prepare the test image

You will need to install `datasets`, as it's not available in the base vLLM
image.

```bash
pip install datasets
```
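Optionally, you can confirm the package imports cleanly before starting a long benchmark run:

```bash
# Quick check that the datasets package is importable inside the container.
python3 -c "import datasets; print(datasets.__version__)"
```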
## <a id="section-benchmarking"></a>Step 11: Run the benchmark

Finally, we are ready to run the benchmark:

```bash
export MAX_INPUT_LEN=1000
export MAX_OUTPUT_LEN=1000
export MAX_CONCURRENCY=128
export HF_TOKEN=<your HF token>

cd /workspace/vllm

vllm bench serve \
  --backend vllm \
  --model "meta-llama/Llama-3.3-70B-Instruct" \
  --dataset-name random \
  --ignore-eos \
  --num-prompts 3000 \
  --random-input-len=$MAX_INPUT_LEN \
  --random-output-len=$MAX_OUTPUT_LEN \
  --max-concurrency=$MAX_CONCURRENCY \
  --seed 100
```
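The results below were collected at two concurrency levels (64 and 128). One way to reproduce both, sketched here rather than taken from the original recipe, is to rerun the same command in a loop over `MAX_CONCURRENCY`:

```bash
# Rerun the benchmark once per concurrency level; all other flags unchanged.
for MAX_CONCURRENCY in 64 128; do
  vllm bench serve \
    --backend vllm \
    --model "meta-llama/Llama-3.3-70B-Instruct" \
    --dataset-name random \
    --ignore-eos \
    --num-prompts 3000 \
    --random-input-len=$MAX_INPUT_LEN \
    --random-output-len=$MAX_OUTPUT_LEN \
    --max-concurrency=$MAX_CONCURRENCY \
    --seed 100
done
```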
The snippets below are what you’d expect to see; the numbers vary based on the vLLM version, the model size, and the TPU instance type/size.

With `MAX_CONCURRENCY=64`:

```text
============ Serving Benchmark Result ============
Successful requests: 3000
Failed requests: 0
Maximum request concurrency: 64
Benchmark duration (s): 1202.50
Total input tokens: 2997000
Total generated tokens: 3000000
Request throughput (req/s): 2.49
Output token throughput (tok/s): 2494.80
Peak output token throughput (tok/s): 2872.00
Peak concurrent requests: 77.00
Total Token throughput (tok/s): 4987.10
---------------Time to First Token----------------
Mean TTFT (ms): 251.04
Median TTFT (ms): 198.81
P99 TTFT (ms): 2501.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 25.35
Median TPOT (ms): 25.40
P99 TPOT (ms): 25.43
---------------Inter-token Latency----------------
Mean ITL (ms): 25.35
Median ITL (ms): 23.11
P99 ITL (ms): 40.75
==================================================
```

With `MAX_CONCURRENCY=128`:

```text
============ Serving Benchmark Result ============
Successful requests: 3000
Failed requests: 0
Maximum request concurrency: 128
Benchmark duration (s): 846.39
Total input tokens: 2997000
Total generated tokens: 3000000
Request throughput (req/s): 3.54
Output token throughput (tok/s): 3544.46
Peak output token throughput (tok/s): 4352.00
Peak concurrent requests: 139.00
Total Token throughput (tok/s): 7085.38
---------------Time to First Token----------------
Mean TTFT (ms): 512.34
Median TTFT (ms): 229.09
P99 TTFT (ms): 8250.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 35.12
Median TPOT (ms): 35.43
P99 TPOT (ms): 35.49
---------------Inter-token Latency----------------
Mean ITL (ms): 35.12
Median ITL (ms): 30.73
P99 ITL (ms): 47.03
==================================================
```
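As a quick sanity check on these numbers, output token throughput is roughly total generated tokens divided by benchmark duration: 3,000,000 / 1202.50 s ≈ 2,495 tok/s at concurrency 64 and 3,000,000 / 846.39 s ≈ 3,545 tok/s at concurrency 128, matching the reported values.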

inference/trillium/vLLM/README.md

Lines changed: 1 addition & 0 deletions

@@ -5,6 +5,7 @@ Although vLLM TPU’s [new unified backend](https://github.com/vllm-project/tpu-

For this reason, we’ve provided a set of stress-tested recipes for deploying and serving vLLM on Trillium TPUs using Google Compute Engine (GCE).

- [Llama3.1-8B/70B](./Llama3.1/README.md)
- [Llama3.3-70B](./Llama3.3/README.md) (added in this commit)
- [Qwen2.5-32B](./Qwen2.5-32B/README.md)
- [Qwen2.5-VL-7B](./Qwen2.5-VL/README.md)
- [Qwen3-4B/32B](./Qwen3/README.md)
