# Serve Llama-3.1-8B (or any other model) with vLLM on TPU VMs

In this guide, we show how to serve Llama-3.1-8B ([deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)). You can host [any other supported model](https://docs.vllm.ai/en/latest/models/supported_models.html) based on your needs.

## Step 0: Install the `gcloud` CLI

You can reproduce this experiment from your dev environment (e.g. your laptop). You need to install `gcloud` locally to complete this tutorial.

To install the `gcloud` CLI, please follow this guide: [Install the gcloud CLI](https://cloud.google.com/sdk/docs/install#mac)

Once it is installed, you can log in to GCP from your terminal with this command: `gcloud auth login`.

## Step 1: Create a v5e TPU instance

We create a single VM with 4 v5e chips to serve an 8B model.<sup>1</sup> If you need a larger instance, you can set a different value for `--topology`, such as `2x4`, `4x4`, etc.

<small>*1- Why 4 chips for an 8B model? We need at least 16 GB of HBM for the weights (assuming the model is served in BF16: 8B params × 2 bytes ≈ 16 GB) plus some room in HBM for the KV cache. Given that each v5e chip has [16GB](https://cloud.google.com/tpu/docs/v5e) of HBM, we provision 4 chips to accommodate both the weights and the KV cache.*</small>

To learn more about topologies: [v5e VM Types](https://cloud.google.com/tpu/docs/v5e#vm-types).
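
As a quick sanity check on the sizing, here is a minimal sketch of the footnote's back-of-the-envelope arithmetic. The parameter count, BF16 byte width, and per-chip HBM figure are assumptions taken from the note above; a real deployment also needs headroom for activations and runtime overhead.

```bash
# Rough sizing sketch (assumptions: 8B params, BF16 weights, 16 GB HBM per v5e chip)
PARAMS_B=8           # parameters, in billions
BYTES_PER_PARAM=2    # BF16
HBM_PER_CHIP_GB=16   # v5e HBM per chip
NUM_CHIPS=4          # 2x2 topology

WEIGHTS_GB=$((PARAMS_B * BYTES_PER_PARAM))
TOTAL_HBM_GB=$((NUM_CHIPS * HBM_PER_CHIP_GB))
echo "Weights need ~${WEIGHTS_GB} GB; ${NUM_CHIPS} chips provide ${TOTAL_HBM_GB} GB of HBM"
echo "That leaves ~$((TOTAL_HBM_GB - WEIGHTS_GB)) GB for the KV cache and activations"
```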

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

# This command creates a TPU VM with 4 v5e chips - adjust it to suit your needs
gcloud alpha compute tpus tpu-vm create $TPU_NAME \
  --type v5litepod --topology 2x2 \
  --project $PROJECT --zone $ZONE --version v2-alpha-tpuv5-lite
```
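
Provisioning can take a little while. Optionally, you can confirm the VM is up before moving on (this check is just a convenience, not part of the required steps):

```bash
# Optional: inspect the TPU VM and check its state (look for READY)
gcloud compute tpus tpu-vm describe $TPU_NAME --project $PROJECT --zone $ZONE
```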

## Step 2: SSH into the instance

```bash
gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 3: Use the latest vLLM Docker image for TPU
Here we use `vllm/vllm-tpu:nightly`, the latest TPU nightly image. For reproducible results, you can pin a specific image tag instead.

```bash
export DOCKER_URI=vllm/vllm-tpu:nightly
```
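
Optionally, you can pull the image ahead of time so the `docker run` in the next step starts faster (not required; `docker run` pulls the image automatically if it is missing):

```bash
# Optional: pre-fetch the image on the TPU VM
sudo docker pull ${DOCKER_URI}
```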

## Step 4: Run the Docker container in the TPU instance

```bash
sudo docker run -t --rm --name $USER-vllm --privileged --net=host -v /dev/shm:/dev/shm --shm-size 10gb -p 8000:8000 --entrypoint /bin/bash -it ${DOCKER_URI}
```

## Step 5: Set up env variables
Export your Hugging Face token along with other environment variables inside the container.

```bash
export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
```
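
Optionally, you can verify that the token authenticates before the model download starts. This sketch assumes the `huggingface_hub` CLI is available inside the vLLM image and that it picks up `HF_TOKEN` from the environment (recent versions do):

```bash
# Optional: should print your Hugging Face username if the token is valid
huggingface-cli whoami
```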

## Step 6: Serve the model

Now we start the vLLM server. Make sure you keep this terminal open for the entire duration of this experiment.

```bash
export MAX_MODEL_LEN=4096
export TP=4 # number of chips
# export RATIO=0.8
# export PREFIX_LEN=0

VLLM_USE_V1=1 vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-8B --seed 42 --disable-log-requests --gpu-memory-utilization 0.95 --max-num-batched-tokens 8192 --max-num-seqs 128 --tensor-parallel-size $TP --max-model-len $MAX_MODEL_LEN
```

Depending on the model size, it takes a few minutes to prepare the server. Once you see the snippet below in the logs, the server is ready to serve requests or run benchmarks:

```bash
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
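
Instead of watching the logs, once you have a second shell into the container (Steps 7-8 below) you can also poll the server's health endpoint; the OpenAI-compatible vLLM server answers `/health` with HTTP 200 once it is ready:

```bash
# Prints 200 when the server is up (the endpoint has no response body)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
```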

## Step 7: Prepare the test environment

Open a new terminal to test the server and run the benchmark (keep the previous terminal open).

First, we SSH into the TPU VM via the new terminal:

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 8: Access the running container

```bash
sudo docker exec -it $USER-vllm bash
```
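
From inside the container, you can quickly confirm that the server from Step 6 is reachable and see which model it is serving; `/v1/models` is part of the OpenAI-compatible API that vLLM exposes:

```bash
# Lists the model(s) registered with the running server
curl http://localhost:8000/v1/models
```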

## Step 9: Test the server

Let's submit a test request to the server. This lets us verify that the server launched properly and that we get a legitimate response from the model.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "prompt": "I love the mornings, because ",
    "max_tokens": 200,
    "temperature": 0
  }'
```
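
Since DeepSeek-R1-Distill-Llama-8B is an instruction-tuned model, you may prefer the OpenAI-compatible chat endpoint, which applies the model's chat template for you. A minimal example against the same server (the prompt is just an illustration):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": "Explain the KV cache in one sentence."}],
    "max_tokens": 200,
    "temperature": 0
  }'
```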

## Step 10: Install benchmark dependencies

You might need to install `datasets`, as it is not available in the base vLLM image.

```bash
pip install datasets
```

## Step 11: Run the benchmark

Finally, we are ready to run the benchmark:

```bash
export MAX_INPUT_LEN=1800
export MAX_OUTPUT_LEN=128
export HF_TOKEN=<your HF token>

cd /workspace/vllm

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --dataset-name random \
    --num-prompts 1000 \
    --random-input-len=$MAX_INPUT_LEN \
    --random-output-len=$MAX_OUTPUT_LEN \
    --seed 100
    # --random-range-ratio=$RATIO \
    # --random-prefix-len=$PREFIX_LEN
```

The commented `--random-range-ratio` and `--random-prefix-len` flags correspond to the `RATIO` and `PREFIX_LEN` variables commented out in Step 6; uncomment both sides if you want to vary request lengths or add a shared prefix.
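
If you want to keep the numbers for later comparison, recent vLLM checkouts of `benchmark_serving.py` can also write the results to a JSON file. The exact flags may differ by version, so verify with `python benchmarks/benchmark_serving.py --help` first; a sketch under that assumption:

```bash
# Sketch only: --save-result / --result-dir are assumed to be supported by your
# vLLM checkout; confirm with --help before relying on them.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --dataset-name random \
    --num-prompts 1000 \
    --random-input-len=$MAX_INPUT_LEN \
    --random-output-len=$MAX_OUTPUT_LEN \
    --seed 100 \
    --save-result \
    --result-dir /dev/shm/benchmark-results
```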

The snippet below is what you'd expect to see. The exact numbers vary based on the vLLM version, the model size, and the TPU instance type/size.

```bash
============ Serving Benchmark Result ============
Successful requests:                     xxxxxxx
Benchmark duration (s):                  xxxxxxx
Total input tokens:                      xxxxxxx
Total generated tokens:                  xxxxxxx
Request throughput (req/s):              xxxxxxx
Output token throughput (tok/s):         xxxxxxx
Total Token throughput (tok/s):          xxxxxxx
---------------Time to First Token----------------
Mean TTFT (ms):                          xxxxxxx
Median TTFT (ms):                        xxxxxxx
P99 TTFT (ms):                           xxxxxxx
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          xxxxxxx
Median TPOT (ms):                        xxxxxxx
P99 TPOT (ms):                           xxxxxxx
---------------Inter-token Latency----------------
Mean ITL (ms):                           xxxxxxx
Median ITL (ms):                         xxxxxxx
P99 ITL (ms):                            xxxxxxx
==================================================
```