
Commit 3b4f051

added v5e and fixed a couple of typos
1 parent af2a7cd commit 3b4f051

6 files changed: +375 -5 lines changed

inference/trillium/vLLM/Llama3-8b/README.md

Lines changed: 2 additions & 2 deletions

@@ -12,7 +12,7 @@ Once it is installed, you can login to GCP from your terminal with this command:

## Step 1: Create a v6e TPU instance

-We create a single VM with 1 trillium chip as it's enought to serve an 8B parameter model - if you need larger instances, you can set a different value for `--topology` such as `2x2`, `4x2`, etc.
+We create a single VM with 1 trillium chip as it's enough to serve an 8B parameter model - if you need larger instances, you can set a different value for `--topology` such as `2x2`, `2x4`, etc.

To learn more about topologies: [v6e VM Types](https://cloud.google.com/tpu/docs/v6e#vm-types).

@@ -60,7 +60,7 @@ Now we serve the vllm server. Make sure you keep this terminal open for the entire duration of this experiment.

```bash
export MAX_MODEL_LEN=4096
-export TP=4 # number of chips
+export TP=1 # number of chips
# export RATIO=0.8
# export PREFIX_LEN=0

inference/trillium/vLLM/Qwen2.5-32B/README.md

Lines changed: 3 additions & 3 deletions

@@ -12,7 +12,7 @@ Once it is installed, you can login to GCP from your terminal with this command:

## Step 1: Create a v6e TPU instance

-We create a single VM with 4 trillium chips - if you need a different number of chips, you can set a different value for `--topology` such as `1x1`, `4x2`, etc.
+We create a single VM with 4 trillium chips - if you need a different number of chips, you can set a different value for `--topology` such as `1x1`, `2x4`, etc.

To learn more about topologies: [v6e VM Types](https://cloud.google.com/tpu/docs/v6e#vm-types).

@@ -21,7 +21,7 @@ export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

-# this command creates a tpu vm with 1 Trillium (v6e) chips - adjust it to suit your needs
+# this command creates a tpu vm with 4 Trillium (v6e) chips - adjust it to suit your needs
gcloud alpha compute tpus tpu-vm create $TPU_NAME \
--type v6e --topology 2x2 \
--project $PROJECT --zone $ZONE --version v2-alpha-tpuv6e

@@ -59,7 +59,7 @@ export HF_TOKEN=<your HF token>
Now we serve the vllm server. Make sure you keep this terminal open for the entire duration of this experiment.

```bash
-export MAX_MODEL_LEN=2048
+export MAX_MODEL_LEN=4096
export TP=4 # number of chips
# export RATIO=0.8
# export PREFIX_LEN=0

inference/trillium/vLLM/README.md

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
# Serve vLLM on Trillium TPUs (v6e):

This repository provides examples demonstrating how to deploy and serve vLLM on Trillium TPUs using GCE (Google Compute Engine) for a select set of models.

- [Llama3-8b](./Llama3-8b/README.md)
- [Qwen2.5-32B](./Qwen2.5-32B/README.md)

These models were chosen for demonstration purposes only. You can serve any model from this list: [vLLM Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)

If you are looking for GKE-based deployment, please refer to this documentation: [Serve an LLM using TPU Trillium on GKE with vLLM](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-vllm-tpu)

To serve vLLM on v5e TPUs, please refer to this page: [Serve vLLM on v5e TPUs](../../v5e/vLLM/README.md)
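
As a quick illustration of that last point (our sketch, not part of the committed files): swapping in another supported model is mostly a matter of changing the Hugging Face model id passed to `vllm serve`, as the per-model guides do. Assuming you are already inside the vLLM TPU container with `HF_TOKEN` exported:

```bash
# hypothetical example: serve a different supported model by changing the model id
export MODEL_ID=Qwen/Qwen2.5-7B-Instruct   # any entry from the supported-models list
export TP=1                                # tensor parallelism = number of TPU chips

vllm serve $MODEL_ID --tensor-parallel-size $TP --max-model-len 4096
```
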
Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
# Serve Llama-3.1-8B (or any other model) with vLLM on TPU VMs

In this guide, we show how to serve Llama-3.1-8B ([deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)). You can host [any other supported model](https://docs.vllm.ai/en/latest/models/supported_models.html) based on your needs.

## Step 0: Install `gcloud cli`

You can reproduce this experiment from your dev environment (e.g. your laptop). You need to install `gcloud` locally to complete this tutorial.

To install the `gcloud cli`, please follow this guide: [Install the gcloud CLI](https://cloud.google.com/sdk/docs/install#mac)

Once it is installed, you can log in to GCP from your terminal with this command: `gcloud auth login`.

## Step 1: Create a v5e TPU instance

We create a single VM with 4 v5e chips to serve an 8B model<sup>1</sup> - if you need larger instances, you can set a different value for `--topology` such as `2x2`, `2x4`, etc.

<small>*1 - Why 4 chips for an 8B model? We need at least 16GB of HBM for the weights (assuming the model is served in BF16 => 8B params * 2 bytes = 16GB) and some room in HBM for the KV cache - given that each v5e chip has [16GB](https://cloud.google.com/tpu/docs/v5e) of HBM, we provision 4 chips to accommodate both the weights and the KV cache.*</small>
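
To make the footnote's arithmetic concrete, here is a small back-of-the-envelope sketch (illustrative only; it uses the same assumptions as above, i.e. BF16 weights and 16GB of HBM per v5e chip):

```bash
# rough HBM estimate for serving an 8B-parameter model in BF16 on v5e
PARAMS_BILLIONS=8      # model size, in billions of parameters
BYTES_PER_PARAM=2      # BF16 => 2 bytes per parameter
HBM_PER_CHIP_GB=16     # HBM available on each v5e chip

WEIGHTS_GB=$(( PARAMS_BILLIONS * BYTES_PER_PARAM ))   # ~16 GB just for the weights
echo "weights need ~${WEIGHTS_GB} GB; each chip has ${HBM_PER_CHIP_GB} GB of HBM"
# The weights alone would fill a single chip, leaving no headroom for the KV cache,
# which is why the guide provisions 4 chips (a 2x2 topology).
```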

To learn more about topologies: [v5e VM Types](https://cloud.google.com/tpu/docs/v5e#vm-types).

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

# this command creates a tpu vm with 4 v5e chips - adjust it to suit your needs
gcloud alpha compute tpus tpu-vm create $TPU_NAME \
  --type v5litepod --topology 2x2 \
  --project $PROJECT --zone $ZONE --version v2-alpha-tpuv5-lite
```
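
Provisioning can take a little while. As an optional check (not part of the original guide), you can describe the VM and wait for it to report a READY state before you ssh in:

```bash
# inspect the TPU VM; the output includes a state field (e.g. CREATING, READY)
gcloud compute tpus tpu-vm describe $TPU_NAME --project $PROJECT --zone $ZONE
```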

## Step 2: SSH to the instance

```bash
gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 3: Use the latest vllm docker image for TPU

The snippet below uses `vllm/vllm-tpu:nightly`, the latest TPU nightly image - if you need reproducible results, point `DOCKER_URI` at a pinned image instead.

```bash
export DOCKER_URI=vllm/vllm-tpu:nightly
```

## Step 4: Run the docker container in the TPU instance

```bash
sudo docker run -t --rm --name $USER-vllm --privileged --net=host -v /dev/shm:/dev/shm --shm-size 10gb -p 8000:8000 --entrypoint /bin/bash -it ${DOCKER_URI}
```
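
The one-liner above packs in quite a few flags. For reference, here is the same command reformatted with comments explaining what each flag does (our annotations; the behavior is identical):

```bash
# -t, -it                : allocate a TTY and keep the session interactive
# --rm                   : remove the container when it exits
# --name $USER-vllm      : the name Step 8 uses with `docker exec`
# --privileged           : gives the container access to the host's TPU devices
# --net=host             : share the host network, so port 8000 is reachable directly
# -v /dev/shm:/dev/shm, --shm-size 10gb : generous shared memory (Step 5 points HF_HOME there)
# -p 8000:8000           : port mapping (a no-op with --net=host, kept as written)
# --entrypoint /bin/bash : start a shell instead of the image's default entrypoint
sudo docker run -t --rm --name $USER-vllm --privileged --net=host \
  -v /dev/shm:/dev/shm --shm-size 10gb -p 8000:8000 \
  --entrypoint /bin/bash -it ${DOCKER_URI}
```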

## Step 5: Set up env variables

Export your Hugging Face token, along with the other environment variables, inside the container.

```bash
export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
```

## Step 6: Serve the model

Now we start the vLLM server. Make sure you keep this terminal open for the entire duration of this experiment.

```bash
export MAX_MODEL_LEN=4096
export TP=4 # number of chips
# export RATIO=0.8
# export PREFIX_LEN=0

VLLM_USE_V1=1 vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-8B --seed 42 --disable-log-requests --gpu-memory-utilization 0.95 --max-num-batched-tokens 8192 --max-num-seqs 128 --tensor-parallel-size $TP --max-model-len $MAX_MODEL_LEN
```

It takes a few minutes, depending on the model size, to prepare the server - once you see the snippet below in the logs, the server is ready to serve requests or run benchmarks:

```bash
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

## Step 7: Prepare the test environment

Open a new terminal to test the server and run the benchmark (keep the previous terminal open).

First, we ssh into the TPU VM via the new terminal:

```bash
export TPU_NAME=your-tpu-name
export ZONE=your-tpu-zone
export PROJECT=your-tpu-project

gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
```

## Step 8: Access the running container

```bash
sudo docker exec -it $USER-vllm bash
```

## Step 9: Test the server

Let's submit a test request to the server. This lets us verify that the server launched properly and that we get a legitimate response from the model.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "prompt": "I love the mornings, because ",
    "max_tokens": 200,
    "temperature": 0
  }'
```
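
As an optional extra check (not part of the original steps), vLLM's OpenAI-compatible server also exposes a models endpoint you can query to confirm which model is loaded:

```bash
# should list deepseek-ai/DeepSeek-R1-Distill-Llama-8B among the served models
curl http://localhost:8000/v1/models
```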

## Step 10: Install benchmark dependencies

You might need to install `datasets`, as it's not available in the base vllm image.

```bash
pip install datasets
```

## Step 11: Run the benchmark

Finally, we are ready to run the benchmark:

```bash
export MAX_INPUT_LEN=1800
export MAX_OUTPUT_LEN=128
export HF_TOKEN=<your HF token>

cd /workspace/vllm

python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
  --dataset-name random \
  --num-prompts 1000 \
  --random-input-len=$MAX_INPUT_LEN \
  --random-output-len=$MAX_OUTPUT_LEN \
  --seed 100
  # --random-range-ratio=$RATIO \
  # --random-prefix-len=$PREFIX_LEN
```
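
If you want to exercise the two commented-out knobs (an optional variation; the flag names are the ones already referenced in the script above), set the RATIO and PREFIX_LEN variables that Step 6 leaves commented out and pass them through:

```bash
export RATIO=0.8        # passed to --random-range-ratio (varies the sampled request lengths)
export PREFIX_LEN=0     # passed to --random-prefix-len (fixed prefix tokens per prompt)

python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
  --dataset-name random \
  --num-prompts 1000 \
  --random-input-len=$MAX_INPUT_LEN \
  --random-output-len=$MAX_OUTPUT_LEN \
  --random-range-ratio=$RATIO \
  --random-prefix-len=$PREFIX_LEN \
  --seed 100
```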

The snippet below is what you’d expect to see - the numbers vary based on the vllm version, the model size and the TPU instance type/size.

```bash
============ Serving Benchmark Result ============
Successful requests: xxxxxxx
Benchmark duration (s): xxxxxxx
Total input tokens: xxxxxxx
Total generated tokens: xxxxxxx
Request throughput (req/s): xxxxxxx
Output token throughput (tok/s): xxxxxxx
Total Token throughput (tok/s): xxxxxxx
---------------Time to First Token----------------
Mean TTFT (ms): xxxxxxx
Median TTFT (ms): xxxxxxx
P99 TTFT (ms): xxxxxxx
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): xxxxxxx
Median TPOT (ms): xxxxxxx
P99 TPOT (ms): xxxxxxx
---------------Inter-token Latency----------------
Mean ITL (ms): xxxxxxx
Median ITL (ms): xxxxxxx
P99 ITL (ms): xxxxxxx
==================================================
```
