
Commit 3b4f051

added v5e and fixed a couple of typos
1 parent af2a7cd commit 3b4f051

6 files changed: +375 -5 lines changed

inference/trillium/vLLM/Llama3-8b/README.md

Lines changed: 2 additions & 2 deletions

@@ -12,7 +12,7 @@ Once it is installed, you can login to GCP from your terminal with this command:

## Step 1: Create a v6e TPU instance

-We create a single VM with 1 trillium chip as it's enought to serve an 8B parameter model - if you need larger instances, you can set a different value for `--topology` such as `2x2`, `4x2`, etc.
+We create a single VM with 1 trillium chip as it's enough to serve an 8B parameter model - if you need larger instances, you can set a different value for `--topology` such as `2x2`, `2x4`, etc.

To learn more about topologies: [v6e VM Types](https://cloud.google.com/tpu/docs/v6e#vm-types).

@@ -60,7 +60,7 @@ Now we serve the vllm server. Make sure you keep this terminal open for the entire duration of this experiment.

```bash
export MAX_MODEL_LEN=4096
-export TP=4 # number of chips
+export TP=1 # number of chips
# export RATIO=0.8
# export PREFIX_LEN=0

inference/trillium/vLLM/Qwen2.5-32B/README.md

Lines changed: 3 additions & 3 deletions

@@ -12,7 +12,7 @@ Once it is installed, you can login to GCP from your terminal with this command:

## Step 1: Create a v6e TPU instance

-We create a single VM with 4 trillium chips - if you need a different number of chips, you can set a different value for `--topology` such as `1x1`, `4x2`, etc.
+We create a single VM with 4 trillium chips - if you need a different number of chips, you can set a different value for `--topology` such as `1x1`, `2x4`, etc.

To learn more about topologies: [v6e VM Types](https://cloud.google.com/tpu/docs/v6e#vm-types).

@@ -21,7 +21,7 @@ export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

-# this command creates a tpu vm with 1 Trillium (v6e) chips - adjust it to suit your needs
+# this command creates a tpu vm with 4 Trillium (v6e) chips - adjust it to suit your needs
gcloud alpha compute tpus tpu-vm create $TPU_NAME \
--type v6e --topology 2x2 \
--project $PROJECT --zone $ZONE --version v2-alpha-tpuv6e

@@ -59,7 +59,7 @@ export HF_TOKEN=<your HF token>
Now we serve the vllm server. Make sure you keep this terminal open for the entire duration of this experiment.

```bash
-export MAX_MODEL_LEN=2048
+export MAX_MODEL_LEN=4096
export TP=4 # number of chips
# export RATIO=0.8
# export PREFIX_LEN=0

inference/trillium/vLLM/README.md

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
# Serve vLLM on Trillium TPUs (v6e):

This repository provides examples demonstrating how to deploy and serve vLLM on Trillium TPUs using GCE (Google Compute Engine) for a select set of models.

- [Llama3-8b](./Llama3-8b/README.md)
- [Qwen2.5-32B](./Qwen2.5-32B/README.md)

These models were chosen for demonstration purposes only. You can serve any model from this list: [vLLM Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)

If you are looking for GKE-based deployment, please refer to this documentation: [Serve an LLM using TPU Trillium on GKE with vLLM](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-vllm-tpu)

To serve vLLM on v5e TPUs, please refer to this page: [Serve vLLM on v5e TPUs](../../v5e/vLLM/README.md)
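
As a quick illustration of that last point (our sketch, not part of the committed files): swapping in another supported model is mostly a matter of changing the Hugging Face model id passed to `vllm serve`, as the per-model guides do. Assuming you are already inside the vLLM TPU container with `HF_TOKEN` exported:

```bash
# hypothetical example: serve a different supported model by changing the model id
export MODEL_ID=Qwen/Qwen2.5-7B-Instruct   # any entry from the supported-models list
export TP=1                                # tensor parallelism = number of TPU chips

vllm serve $MODEL_ID --tensor-parallel-size $TP --max-model-len 4096
```
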
Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
# Serve Llama-3.1-8B (or any other model) with vLLM on TPU VMs

In this guide, we show how to serve Llama-3.1-8B ([deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)). You can host [any other supported model](https://docs.vllm.ai/en/latest/models/supported_models.html) based on your needs.

## Step 0: Install `gcloud cli`

You can reproduce this experiment from your dev environment (e.g. your laptop). You need to install `gcloud` locally to complete this tutorial.

To install the `gcloud cli`, please follow this guide: [Install the gcloud CLI](https://cloud.google.com/sdk/docs/install#mac)

Once it is installed, you can log in to GCP from your terminal with this command: `gcloud auth login`.

## Step 1: Create a v5e TPU instance

We create a single VM with 4 v5e chips to serve an 8B model<sup>1</sup> - if you need larger instances, you can set a different value for `--topology` such as `2x2`, `2x4`, etc.

<small>*1 - Why 4 chips for an 8B model? We need at least 16GB of HBM for the weights (assuming the model is served in BF16 => 8B params * 2 bytes = 16GB) and some room in HBM for the KV cache - given that each v5e chip has [16GB](https://cloud.google.com/tpu/docs/v5e) of HBM, we provision 4 chips to accommodate both the weights and the KV cache.*</small>
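
To make the footnote's arithmetic concrete, here is a small back-of-the-envelope sketch (illustrative only; it uses the same assumptions as above, i.e. BF16 weights and 16GB of HBM per v5e chip):

```bash
# rough HBM estimate for serving an 8B-parameter model in BF16 on v5e
PARAMS_BILLIONS=8      # model size, in billions of parameters
BYTES_PER_PARAM=2      # BF16 => 2 bytes per parameter
HBM_PER_CHIP_GB=16     # HBM available on each v5e chip

WEIGHTS_GB=$(( PARAMS_BILLIONS * BYTES_PER_PARAM ))   # ~16 GB just for the weights
echo "weights need ~${WEIGHTS_GB} GB; each chip has ${HBM_PER_CHIP_GB} GB of HBM"
# The weights alone would fill a single chip, leaving no headroom for the KV cache,
# which is why the guide provisions 4 chips (a 2x2 topology).
```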

To learn more about topologies: [v5e VM Types](https://cloud.google.com/tpu/docs/v5e#vm-types).

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

# this command creates a tpu vm with 4 v5e chips - adjust it to suit your needs
gcloud alpha compute tpus tpu-vm create $TPU_NAME \
  --type v5litepod --topology 2x2 \
  --project $PROJECT --zone $ZONE --version v2-alpha-tpuv5-lite
```
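
Provisioning can take a little while. As an optional check (not part of the original guide), you can describe the VM and wait for it to report a READY state before you ssh in:

```bash
# inspect the TPU VM; the output includes a state field (e.g. CREATING, READY)
gcloud compute tpus tpu-vm describe $TPU_NAME --project $PROJECT --zone $ZONE
```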

## Step 2: SSH to the instance

```bash
gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 3: Use the latest vllm docker image for TPU

The snippet below uses `vllm/vllm-tpu:nightly`, the latest TPU nightly image - if you need reproducible results, point `DOCKER_URI` at a pinned image instead.

```bash
export DOCKER_URI=vllm/vllm-tpu:nightly
```

## Step 4: Run the docker container in the TPU instance

```bash
sudo docker run -t --rm --name $USER-vllm --privileged --net=host -v /dev/shm:/dev/shm --shm-size 10gb -p 8000:8000 --entrypoint /bin/bash -it ${DOCKER_URI}
```
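
The one-liner above packs in quite a few flags. For reference, here is the same command reformatted with comments explaining what each flag does (our annotations; the behavior is identical):

```bash
# -t, -it                : allocate a TTY and keep the session interactive
# --rm                   : remove the container when it exits
# --name $USER-vllm      : the name Step 8 uses with `docker exec`
# --privileged           : gives the container access to the host's TPU devices
# --net=host             : share the host network, so port 8000 is reachable directly
# -v /dev/shm:/dev/shm, --shm-size 10gb : generous shared memory (Step 5 points HF_HOME there)
# -p 8000:8000           : port mapping (a no-op with --net=host, kept as written)
# --entrypoint /bin/bash : start a shell instead of the image's default entrypoint
sudo docker run -t --rm --name $USER-vllm --privileged --net=host \
  -v /dev/shm:/dev/shm --shm-size 10gb -p 8000:8000 \
  --entrypoint /bin/bash -it ${DOCKER_URI}
```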

## Step 5: Set up env variables

Export your Hugging Face token, along with the other environment variables, inside the container.

```bash
export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
```

## Step 6: Serve the model

Now we start the vLLM server. Make sure you keep this terminal open for the entire duration of this experiment.

```bash
export MAX_MODEL_LEN=4096
export TP=4 # number of chips
# export RATIO=0.8
# export PREFIX_LEN=0

VLLM_USE_V1=1 vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-8B --seed 42 --disable-log-requests --gpu-memory-utilization 0.95 --max-num-batched-tokens 8192 --max-num-seqs 128 --tensor-parallel-size $TP --max-model-len $MAX_MODEL_LEN
```

It takes a few minutes, depending on the model size, to prepare the server - once you see the snippet below in the logs, the server is ready to serve requests or run benchmarks:

```bash
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

## Step 7: Prepare the test environment

Open a new terminal to test the server and run the benchmark (keep the previous terminal open).

First, we ssh into the TPU VM via the new terminal:

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 8: Access the running container

```bash
sudo docker exec -it $USER-vllm bash
```

## Step 9: Test the server

Let's submit a test request to the server. This lets us verify that the server launched properly and that we get a legitimate response from the model.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "prompt": "I love the mornings, because ",
    "max_tokens": 200,
    "temperature": 0
  }'
```
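
As an optional extra check (not part of the original steps), vLLM's OpenAI-compatible server also exposes a models endpoint you can query to confirm which model is loaded:

```bash
# should list deepseek-ai/DeepSeek-R1-Distill-Llama-8B among the served models
curl http://localhost:8000/v1/models
```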

## Step 10: Install benchmark dependencies

You might need to install `datasets`, as it's not available in the base vllm image.

```bash
pip install datasets
```

## Step 11: Run the benchmark

Finally, we are ready to run the benchmark:

```bash
export MAX_INPUT_LEN=1800
export MAX_OUTPUT_LEN=128
export HF_TOKEN=<your HF token>

cd /workspace/vllm

python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
  --dataset-name random \
  --num-prompts 1000 \
  --random-input-len=$MAX_INPUT_LEN \
  --random-output-len=$MAX_OUTPUT_LEN \
  --seed 100
  # --random-range-ratio=$RATIO \
  # --random-prefix-len=$PREFIX_LEN
```
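
If you want to exercise the two commented-out knobs (an optional variation; the flag names are the ones already referenced in the script above), set the RATIO and PREFIX_LEN variables that Step 6 leaves commented out and pass them through:

```bash
export RATIO=0.8        # passed to --random-range-ratio (varies the sampled request lengths)
export PREFIX_LEN=0     # passed to --random-prefix-len (fixed prefix tokens per prompt)

python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
  --dataset-name random \
  --num-prompts 1000 \
  --random-input-len=$MAX_INPUT_LEN \
  --random-output-len=$MAX_OUTPUT_LEN \
  --random-range-ratio=$RATIO \
  --random-prefix-len=$PREFIX_LEN \
  --seed 100
```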

The snippet below is what you’d expect to see - the numbers vary based on the vllm version, the model size and the TPU instance type/size.

```bash
============ Serving Benchmark Result ============
Successful requests: xxxxxxx
Benchmark duration (s): xxxxxxx
Total input tokens: xxxxxxx
Total generated tokens: xxxxxxx
Request throughput (req/s): xxxxxxx
Output token throughput (tok/s): xxxxxxx
Total Token throughput (tok/s): xxxxxxx
---------------Time to First Token----------------
Mean TTFT (ms): xxxxxxx
Median TTFT (ms): xxxxxxx
P99 TTFT (ms): xxxxxxx
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): xxxxxxx
Median TPOT (ms): xxxxxxx
P99 TPOT (ms): xxxxxxx
---------------Inter-token Latency----------------
Mean ITL (ms): xxxxxxx
Median ITL (ms): xxxxxxx
P99 ITL (ms): xxxxxxx
==================================================
```
