
Commit a7cae1d

Merge pull request #118 from AI-Hypercomputer/consolidate-llama-recipes
Consolidate Llama3.1 and Llama3.3 recipes in a single recipe
2 parents e6ac6f6 + d44ab3b

File tree

2 files changed (+26, -244 lines)

inference/trillium/vLLM/Llama3.1/README.md

Lines changed: 0 additions & 230 deletions
This file was deleted.

inference/trillium/vLLM/Llama3.3/README.md renamed to inference/trillium/vLLM/Llama3.x/README.md

Lines changed: 26 additions & 14 deletions
@@ -1,25 +1,27 @@
-# Serve Llama3.3 with vLLM on TPU VMs
+# Serve Llama3.x with vLLM on TPU VMs
 
 In this guide, we show how to serve
+[Llama3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct),
 [Llama3.3-70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).
 
 > **Note:** Access to Llama models on Hugging Face requires accepting the Community License Agreement and awaiting approval before you can download and serve them.
 
-## Step 0: Install `gcloud cli`
+## Step 0: Install `gcloud CLI`
 
 You can reproduce this experiment from your dev environment
 (e.g. your laptop).
 You need to install `gcloud` locally to complete this tutorial.
 
-To install `gcloud cli` please follow this guide:
+To install `gcloud CLI` please follow this guide:
 [Install the gcloud CLI](https://cloud.google.com/sdk/docs/install#mac)
 
 Once it is installed, you can login to GCP from your terminal with this
 command: `gcloud auth login`.
 
 ## Step 1: Create a v6e TPU instance
 
-We create a single VM. For Llama3.3-70B, at least 8 chips are required. If you need a different number of
+We create a single VM. For Llama3.1-8B, 1 chip is sufficient and for the 70B
+models, at least 8 chips are required. If you need a different number of
 chips, you can set a different value for `--topology` such as `1x1`,
 `2x4`, etc.
 
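For context, Step 0 ends with `gcloud auth login`. A minimal sketch of that setup, with placeholder project/zone values; the optional `gcloud config set` defaults are an addition, not part of the recipe:

```bash
# Authenticate this terminal with Google Cloud.
gcloud auth login

# Optionally pin defaults so later commands can omit --project/--zone.
# Placeholder values; substitute your own project and TPU zone.
gcloud config set project your-tpu-project
gcloud config set compute/zone your-tpu-zone
```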
@@ -53,7 +55,7 @@ export ZONE=your-tpu-zone
 export PROJECT=your-tpu-project
 export QR_ID=your-queued-resource-id # e.g. my-qr-request
 
-# This command requests a v6e-8 (8 chips). Adjust accelerator-type for different sizes. For 1 chip, use --accelerator-type v6e-1.
+# This command requests a v6e-8 (8 chips). Adjust accelerator-type for different sizes. For 1 chip (Llama3.1-8B), use --accelerator-type v6e-1.
 gcloud alpha compute tpus queued-resources create $QR_ID \
 --node-id $TPU_NAME \
 --project $PROJECT --zone $ZONE \
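The hunk above cuts off mid-command. Assembling the visible pieces into one runnable sketch, where the `--accelerator-type` and `--runtime-version` values are assumptions rather than lines taken from this diff:

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project
export QR_ID=your-queued-resource-id

# Request a v6e-8 (8 chips); for Llama3.1-8B, v6e-1 is sufficient.
gcloud alpha compute tpus queued-resources create $QR_ID \
  --node-id $TPU_NAME \
  --project $PROJECT --zone $ZONE \
  --accelerator-type v6e-8 \
  --runtime-version v2-alpha-tpuv6e
```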
@@ -69,25 +71,25 @@ gcloud alpha compute tpus queued-resources list --project $PROJECT --zone $ZONE
 
 Once the state is `ACTIVE`, your TPU VM is ready and you can proceed to the next steps.
 
-## Step 2: ssh to the instance
+## Step 2: SSH to the instance
 
 ```bash
 gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
 ```
 
-## Step 3: Use the vllm docker image for TPU
+## Step 3: Use the latest vLLM Docker image for TPU
 
 ```bash
 export DOCKER_URI=vllm/vllm-tpu:nightly-20251129-28607fc-39e63de
 ```
 
-The docker image is pinged here for users to reproduce the [results below](#section-benchmarking).
+The docker image is pinned here for users to reproduce the [results below](#section-benchmarking).
 
 To use the latest stable version, set `DOCKER_URI=vllm/vllm-tpu:latest`.
 
 To use the latest nightly built image that has more recent features/improvements, set `DOCKER_URI=vllm/vllm-tpu:nightly`.
 
-## Step 4: Run the docker container in the TPU instance
+## Step 4: Run the Docker container in the TPU instance
 
 ```bash
 sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
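One optional step the recipe does not show: pre-pulling the pinned image on the TPU VM, so the `docker run` in Step 4 starts immediately. This sketch uses only the `DOCKER_URI` defined in Step 3:

```bash
# Pre-fetch the pinned vLLM TPU image (docker run would otherwise
# pull it on first use).
sudo docker pull ${DOCKER_URI}
```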
@@ -96,7 +98,7 @@ sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
 --entrypoint /bin/bash ${DOCKER_URI}
 ```
 
-> **Note:** 150GB should be sufficient for the 70B model. For the 8B model allocate at least 17GB for the weights.
+> **Note:** 150GB should be sufficient for the 70B models. For the 8B model allocate at least 17GB for the weights.
 
 > **Note:** See [this guide](https://cloud.google.com/tpu/docs/attach-durable-block-storage) for attaching durable block storage to TPUs.
 
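The two hunks above show only the first and last lines of the `docker run` command; the middle flags are elided. A hypothetical complete invocation, where `--shm-size` is an assumed stand-in for the elided part, sized per the note above:

```bash
# Run the vLLM TPU image interactively. --privileged and --net=host give
# the container access to the TPU chips and the host network.
# --shm-size is an assumption, not taken from the diff; the note above
# suggests ~150GB for the 70B models and at least 17GB for the 8B model.
sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
  --shm-size=150gb \
  --entrypoint /bin/bash ${DOCKER_URI}
```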
@@ -115,6 +117,8 @@ export HF_TOKEN=<your HF token>
 Now we start the vllm server.
 Make sure you keep this terminal open for the entire duration of this experiment.
 
+Here is the serving command for the 70B model:
+
 ```bash
 export MAX_MODEL_LEN=2048
 export TP=8 # number of chips
@@ -131,6 +135,14 @@ vllm serve meta-llama/Llama-3.3-70B-Instruct \
 --max-model-len $MAX_MODEL_LEN
 ```
 
+| Model | Input/Output Scenario | max-num-batched-tokens | max-num-seqs | tensor-parallel-size |
+|:--- | :--- | :--- | :--- | :--- |
+| Llama-3.x-70B-Instruct | Prefill Heavy | 2048 | 256 | 8 |
+| Llama-3.x-70B-Instruct | Decode Heavy / Balanced | 512 | 256 | 8 |
+| Llama3.1-8B-Instruct | Prefill Heavy | 1024 | 128 | 1 |
+
 It takes a few minutes depending on the model size to prepare the server.
 Once you see the below snippet in the logs, it means that the server is ready
 to serve requests or run benchmarks:
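The new table gives the recommended knobs per model. Applying its Llama3.1-8B row to the same `vllm serve` pattern shown for the 70B model gives a sketch like the following; the `MAX_MODEL_LEN` value is assumed to match the 70B example, and the flag names are standard `vllm serve` options rather than lines from this diff:

```bash
export MAX_MODEL_LEN=2048  # assumed to match the 70B example above
export TP=1                # tensor-parallel-size from the 8B table row

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-batched-tokens 1024 \
  --max-num-seqs 128 \
  --tensor-parallel-size $TP \
  --max-model-len $MAX_MODEL_LEN
```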
@@ -156,7 +168,7 @@ export PROJECT=your-tpu-project
 gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
 ```
 
-## Step 8: access the running container
+## Step 8: Access the running container
 
 ```bash
 sudo docker exec -it $USER-vllm bash
@@ -177,7 +189,7 @@ curl http://localhost:8000/v1/completions \
 }'
 ```
 
-## Step 10: Preparing the test image
+## Step 10: Prepare the test image
 
 You will need to install datasets as it's not available in the base vllm
 image.
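Step 9's request (truncated above) goes to vLLM's OpenAI-compatible API. Before benchmarking, a quick sanity check from inside the container is to list the served models via the standard endpoint:

```bash
# Confirm the server is up and which model it is serving.
curl http://localhost:8000/v1/models
```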
@@ -186,7 +198,7 @@ image.
 pip install datasets
 ```
 
-## <a id="section-benchmarking"></a>Step 11: Run the benchmarking
+## <a id="section-benchmarking"></a>Step 11: Run the benchmark
 
 Finally, we are ready to run the benchmark:
 
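The benchmark command itself falls outside the shown hunks. vLLM ships a serving benchmark script; a typical invocation against the server above might look like this sketch, where the dataset choice, input/output lengths, and prompt count are illustrative assumptions rather than the recipe's settings:

```bash
# Run vLLM's serving benchmark against the local server.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dataset-name random \
  --random-input-len 1800 \
  --random-output-len 128 \
  --num-prompts 1000
```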
@@ -270,4 +282,4 @@ Mean ITL (ms): 35.12
 Median ITL (ms): 30.73
 P99 ITL (ms): 47.03
 ==================================================
-```
+```
