# Serve Llama3.3 with vLLM on TPU VMs

In this guide, we show how to serve
[Llama3.3-70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).

> **Note:** Access to Llama models on Hugging Face requires accepting the Community License Agreement and awaiting approval before you can download and serve them.

## Step 0: Install the `gcloud` CLI

You can reproduce this experiment from your dev environment
(e.g. your laptop).
You need to install `gcloud` locally to complete this tutorial.

To install the `gcloud` CLI, follow this guide:
[Install the gcloud CLI](https://cloud.google.com/sdk/docs/install#mac)

Once it is installed, you can log in to GCP from your terminal with this
command: `gcloud auth login`.
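
For example, a minimal login flow might look like this (setting a default project is optional, and `your-tpu-project` is a placeholder):

```bash
# Authenticate your terminal session with Google Cloud
gcloud auth login

# Optionally set a default project so later commands can omit --project
gcloud config set project your-tpu-project
```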

## Step 1: Create a v6e TPU instance

We create a single VM. For Llama3.3-70B, at least 8 chips are required. If you
need a different number of chips, you can set a different value for
`--topology`, such as `1x1` (1 chip) or `2x4` (8 chips).

To learn more about topologies, see [v6e VM Types](https://cloud.google.com/tpu/docs/v6e#vm-types).

> **Note:** Acquiring on-demand TPUs can be challenging due to high demand. We recommend using [Queued Resources](https://cloud.google.com/tpu/docs/queued-resources) to ensure you get the required capacity.

### Option 1: Create an on-demand TPU VM

This command attempts to create a TPU VM immediately.

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

# This command creates a TPU VM with 8 Trillium (v6e) chips; adjust it to suit your needs.
gcloud alpha compute tpus tpu-vm create $TPU_NAME \
  --type v6e --topology 2x4 \
  --project $PROJECT --zone $ZONE --version v2-alpha-tpuv6e
```
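
Once the create command returns, you can confirm the VM is up before connecting; the `describe` output should report a `state` of `READY`:

```bash
# Inspect the newly created TPU VM
gcloud compute tpus tpu-vm describe $TPU_NAME \
  --project $PROJECT --zone $ZONE
```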

### Option 2: Use Queued Resources (Recommended)

With Queued Resources, you submit a request for TPUs and it gets fulfilled
when capacity is available.

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project
export QR_ID=your-queued-resource-id # e.g. my-qr-request

# This command requests a v6e-8 (8 chips). Adjust --accelerator-type for
# different sizes; for 1 chip, use --accelerator-type v6e-1.
gcloud alpha compute tpus queued-resources create $QR_ID \
  --node-id $TPU_NAME \
  --project $PROJECT --zone $ZONE \
  --accelerator-type v6e-8 \
  --runtime-version v2-alpha-tpuv6e
```

You can check the status of your request with:

```bash
gcloud alpha compute tpus queued-resources list --project $PROJECT --zone $ZONE
```

Once the state is `ACTIVE`, your TPU VM is ready and you can proceed to the next steps.
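
To watch a single request rather than the whole list, you can also describe it by ID:

```bash
# Show details (including the state) for one queued-resource request
gcloud alpha compute tpus queued-resources describe $QR_ID \
  --project $PROJECT --zone $ZONE
```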

## Step 2: SSH into the instance

```bash
gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 3: Use the vLLM Docker image for TPU

```bash
export DOCKER_URI=vllm/vllm-tpu:nightly-20251129-28607fc-39e63de
```

The Docker image is pinned here so that users can reproduce the [results below](#section-benchmarking).

To use the latest stable version, set `DOCKER_URI=vllm/vllm-tpu:latest`.

To use the latest nightly image, which has more recent features and improvements, set `DOCKER_URI=vllm/vllm-tpu:nightly`.
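
Optionally, you can pre-pull the image on the VM so the next step starts faster:

```bash
# Fetch the pinned image ahead of time
sudo docker pull ${DOCKER_URI}
```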

## Step 4: Run the Docker container on the TPU instance

```bash
sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
  -v /dev/shm:/dev/shm \
  --shm-size 150gb \
  -p 8000:8000 \
  --entrypoint /bin/bash ${DOCKER_URI}
```

> **Note:** 150GB should be sufficient for the 70B model. For the 8B model, allocate at least 17GB for the weights.

> **Note:** See [this guide](https://cloud.google.com/tpu/docs/attach-durable-block-storage) for attaching durable block storage to TPUs.

## Step 5: Set up environment variables

Export your Hugging Face token along with other environment variables inside
the container.

```bash
export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
```
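
As an optional sanity check, you can verify that the token is accepted before downloading the weights. This sketch assumes the `huggingface_hub` Python package is available in the image (it ships as a vLLM dependency):

```bash
# Prints your Hugging Face username if HF_TOKEN is valid
python -c "from huggingface_hub import whoami; print(whoami()['name'])"
```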

## Step 6: Serve the model

Now we start the vLLM server.
Make sure you keep this terminal open for the entire duration of this experiment.

```bash
export MAX_MODEL_LEN=2048
export TP=8 # number of chips

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --seed 42 \
  --disable-log-requests \
  --no-enable-prefix-caching \
  --async-scheduling \
  --gpu-memory-utilization 0.98 \
  --max-num-batched-tokens 512 \
  --max-num-seqs 256 \
  --tensor-parallel-size $TP \
  --max-model-len $MAX_MODEL_LEN
```

Depending on the model size, it takes a few minutes to prepare the server.
Once you see the snippet below in the logs, the server is ready
to serve requests or run benchmarks:

```text
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
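
You can also probe readiness directly from another shell on the VM; the vLLM OpenAI-compatible server exposes a `/health` endpoint that returns HTTP 200 once the engine is up:

```bash
# Returns 200 OK when the server is ready to accept requests
curl -i http://localhost:8000/health
```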

## Step 7: Prepare the test environment

Open a new terminal to test the server and run the benchmark (keep the previous terminal open).

First, we SSH into the TPU VM from the new terminal:

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 8: Access the running container

```bash
sudo docker exec -it $USER-vllm bash
```

## Step 9: Test the server

Let's submit a test request to the server. This confirms that the server launched properly and that the model returns a legitimate response.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "I love the mornings, because ",
    "max_tokens": 200,
    "temperature": 0
  }'
```
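
The server is OpenAI-compatible, so you can also exercise the chat endpoint. A minimal sketch with the same model:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 50,
    "temperature": 0
  }'
```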

## Step 10: Prepare the test image

You will need to install `datasets`, as it is not available in the base vLLM
image.

```bash
pip install datasets
```
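
You can confirm the package imports cleanly before running the benchmark:

```bash
python -c "import datasets; print(datasets.__version__)"
```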

## <a id="section-benchmarking"></a>Step 11: Run the benchmark

Finally, we are ready to run the benchmark:

```bash
export MAX_INPUT_LEN=1000
export MAX_OUTPUT_LEN=1000
export MAX_CONCURRENCY=128
export HF_TOKEN=<your HF token>

cd /workspace/vllm

vllm bench serve \
  --backend vllm \
  --model "meta-llama/Llama-3.3-70B-Instruct" \
  --dataset-name random \
  --ignore-eos \
  --num-prompts 3000 \
  --random-input-len=$MAX_INPUT_LEN \
  --random-output-len=$MAX_OUTPUT_LEN \
  --max-concurrency=$MAX_CONCURRENCY \
  --seed 100
```

The snippets below show what you'd expect to see from two runs, with `MAX_CONCURRENCY` set to 64 and 128; the exact numbers vary based on the vLLM version, the model size, and the TPU instance type/size.

With `MAX_CONCURRENCY=64`:

```text
============ Serving Benchmark Result ============
Successful requests:                     3000
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  1202.50
Total input tokens:                      2997000
Total generated tokens:                  3000000
Request throughput (req/s):              2.49
Output token throughput (tok/s):         2494.80
Peak output token throughput (tok/s):    2872.00
Peak concurrent requests:                77.00
Total Token throughput (tok/s):          4987.10
---------------Time to First Token----------------
Mean TTFT (ms):                          251.04
Median TTFT (ms):                        198.81
P99 TTFT (ms):                           2501.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.35
Median TPOT (ms):                        25.40
P99 TPOT (ms):                           25.43
---------------Inter-token Latency----------------
Mean ITL (ms):                           25.35
Median ITL (ms):                         23.11
P99 ITL (ms):                            40.75
==================================================
```
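
As a sanity check, the headline numbers follow directly from the reported totals:

```text
Output token throughput ≈ 3,000,000 generated tokens / 1202.50 s ≈ 2494.8 tok/s
Request throughput      ≈ 3,000 requests / 1202.50 s ≈ 2.49 req/s
```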

With `MAX_CONCURRENCY=128`:

```text
============ Serving Benchmark Result ============
Successful requests:                     3000
Failed requests:                         0
Maximum request concurrency:             128
Benchmark duration (s):                  846.39
Total input tokens:                      2997000
Total generated tokens:                  3000000
Request throughput (req/s):              3.54
Output token throughput (tok/s):         3544.46
Peak output token throughput (tok/s):    4352.00
Peak concurrent requests:                139.00
Total Token throughput (tok/s):          7085.38
---------------Time to First Token----------------
Mean TTFT (ms):                          512.34
Median TTFT (ms):                        229.09
P99 TTFT (ms):                           8250.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.12
Median TPOT (ms):                        35.43
P99 TPOT (ms):                           35.49
---------------Inter-token Latency----------------
Mean ITL (ms):                           35.12
Median ITL (ms):                         30.73
P99 ITL (ms):                            47.03
==================================================
```