From 00c8229d1d64bd703544e87c3f5e7b99fe7de6ca Mon Sep 17 00:00:00 2001
From: Harsh Shah
Date: Tue, 9 Dec 2025 14:13:41 +0000
Subject: [PATCH 1/2] Consolidate Llama3.1 and Llama3.3 recipes in a single recipe

---
 inference/trillium/vLLM/Llama3.1/README.md | 230 ------------------
 .../vLLM/{Llama3.3 => Llama3.x}/README.md | 39 +--
 2 files changed, 25 insertions(+), 244 deletions(-)
 delete mode 100644 inference/trillium/vLLM/Llama3.1/README.md
 rename inference/trillium/vLLM/{Llama3.3 => Llama3.x}/README.md (87%)

diff --git a/inference/trillium/vLLM/Llama3.1/README.md b/inference/trillium/vLLM/Llama3.1/README.md
deleted file mode 100644
index bb9d706..0000000
--- a/inference/trillium/vLLM/Llama3.1/README.md
+++ /dev/null
@@ -1,230 +0,0 @@
-# Serve Llama3.1 with vLLM on TPU VMs
-
-In this guide, we show how to serve
-[Llama3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and
-[Llama3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct).
-
-> **Note:** Access to Llama models on Hugging Face requires accepting the Community License Agreement and awaiting approval before you can download and serve them.
-
-## Step 0: Install `gcloud cli`
-
-You can reproduce this experiment from your dev environment
-(e.g. your laptop).
-You need to install `gcloud` locally to complete this tutorial.
-
-To install `gcloud cli` please follow this guide:
-[Install the gcloud CLI](https://cloud.google.com/sdk/docs/install#mac)
-
-Once it is installed, you can login to GCP from your terminal with this
-command: `gcloud auth login`.
-
-## Step 1: Create a v6e TPU instance
-
-We create a single VM. For Llama3.1-8B, 1 chip is sufficient and for the 70B
-model, at least 8 chips are required. If you need a different number of
-chips, you can set a different value for `--topology` such as `1x1`,
-`2x4`, etc.
-
-To learn more about topologies: [v6e VM Types](https://cloud.google.com/tpu/docs/v6e#vm-types).
-
-> **Note:** Acquiring on-demand TPUs can be challenging due to high demand. We recommend using [Queued Resources](https://cloud.google.com/tpu/docs/queued-resources) to ensure you get the required capacity.
-
-### Option 1: Create an on-demand TPU VM
-
-This command attempts to create a TPU VM immediately.
-
-```bash
-export TPU_NAME=your-tpu-name
-export ZONE=your-tpu-zone
-export PROJECT=your-tpu-project
-
-# this command creates a tpu vm with 8 Trillium (v6e) chips - adjust it to suit your needs
-gcloud alpha compute tpus tpu-vm create $TPU_NAME \
-  --type v6e --topology 2x4 \
-  --project $PROJECT --zone $ZONE --version v2-alpha-tpuv6e
-```
-
-### Option 2: Use Queued Resources (Recommended)
-
-With Queued Resources, you submit a request for TPUs and it gets fulfilled
-when capacity is available.
-
-```bash
-export TPU_NAME=your-tpu-name
-export ZONE=your-tpu-zone
-export PROJECT=your-tpu-project
-export QR_ID=your-queued-resource-id # e.g. my-qr-request
-
-# This command requests a v6e-8 (8 chips). Adjust accelerator-type for different sizes. For 1 chip (Llama3.1-8B), use --accelerator-type v6e-1.
-gcloud alpha compute tpus queued-resources create $QR_ID \
-  --node-id $TPU_NAME \
-  --project $PROJECT --zone $ZONE \
-  --accelerator-type v6e-8 \
-  --runtime-version v2-alpha-tpuv6e
-```
-
-You can check the status of your request with:
-
-```bash
-gcloud alpha compute tpus queued-resources list --project $PROJECT --zone $ZONE
-```
-
-Once the state is `ACTIVE`, your TPU VM is ready and you can proceed to the next steps.
-
-## Step 2: ssh to the instance
-
-```bash
-gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
-```
-
-## Step 3: Use the latest vllm docker image for TPU
-
-```bash
-export DOCKER_URI=vllm/vllm-tpu:latest
-```
-
-## Step 4: Run the docker container in the TPU instance
-
-```bash
-sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
-  -v /dev/shm:/dev/shm \
-  --shm-size 150gb \
-  --entrypoint /bin/bash ${DOCKER_URI}
-```
-
-> **Note:** 150GB should be sufficient for the 70B model. For the 8B model allocate at least 17GB for the weights.
-
-> **Note:** See [this guide](https://cloud.google.com/tpu/docs/attach-durable-block-storage) for attaching durable block storage to TPUs.
-
-## Step 5: Set up env variables
-
-Export your hugging face token along with other environment variables inside
-the container.
-
-```bash
-export HF_HOME=/dev/shm
-export HF_TOKEN=
-```
-
-## Step 6: Serve the model
-
-Now we start the vllm server.
-Make sure you keep this terminal open for the entire duration of this experiment.
-
-```bash
-export MAX_MODEL_LEN=4096
-export TP=8 # number of chips
-
-vllm serve meta-llama/Llama-3.1-70B-Instruct \
-  --seed 42 \
-  --disable-log-requests \
-  --gpu-memory-utilization 0.98 \
-  --max-num-batched-tokens 2048 \
-  --max-num-seqs 256 \
-  --tensor-parallel-size $TP \
-  --max-model-len $MAX_MODEL_LEN
-```
-
-For the 8B model on a v6e-1 (1-chip) instance, we recommend `--max-num-batched-tokens 1024 --max-num-seqs 128`.
-
-It takes a few minutes depending on the model size to prepare the server.
-Once you see the below snippet in the logs, it means that the server is ready
-to serve requests or run benchmarks:
-
-```bash
-INFO: Started server process [7]
-INFO: Waiting for application startup.
-INFO: Application startup complete.
-INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
-```
-
-## Step 7: Prepare the test environment
-
-Open a new terminal to test the server and run the benchmark (keep the previous terminal open).
-
-First, we ssh into the TPU vm via the new terminal:
-
-```bash
-export TPU_NAME=your-tpu-name
-export ZONE=your-tpu-zone
-export PROJECT=your-tpu-project
-
-gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
-```
-
-## Step 8: access the running container
-
-```bash
-sudo docker exec -it $USER-vllm bash
-```
-
-## Step 9: Test the server
-
-Let's submit a test request to the server. This helps us to see if the server is launched properly and we can see legitimate response from the model.
-
-```bash
-curl http://localhost:8000/v1/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "meta-llama/Llama-3.1-70B-Instruct",
-    "prompt": "I love the mornings, because ",
-    "max_tokens": 200,
-    "temperature": 0
-  }'
-```
-
-## Step 10: Preparing the test image
-
-You will need to install datasets as it's not available in the base vllm
-image.
-
-```bash
-pip install datasets
-```
-
-## Step 11: Run the benchmarking
-
-Finally, we are ready to run the benchmark:
-
-```bash
-export MAX_INPUT_LEN=1800
-export MAX_OUTPUT_LEN=128
-export HF_TOKEN=
-
-cd /workspace/vllm
-
-vllm bench serve \
-  --backend vllm \
-  --model "meta-llama/Llama-3.1-70B-Instruct" \
-  --dataset-name random \
-  --num-prompts 1000 \
-  --random-input-len=$MAX_INPUT_LEN \
-  --random-output-len=$MAX_OUTPUT_LEN \
-  --seed 100
-```
-
-The snippet below is what you’d expect to see - the numbers vary based on the vllm version, the model size and the TPU instance type/size.
-
-```bash
-============ Serving Benchmark Result ============
-Successful requests: xxxxxxx
-Benchmark duration (s): xxxxxxx
-Total input tokens: xxxxxxx
-Total generated tokens: xxxxxxx
-Request throughput (req/s): xxxxxxx
-Output token throughput (tok/s): xxxxxxx
-Total Token throughput (tok/s): xxxxxxx
----------------Time to First Token----------------
-Mean TTFT (ms): xxxxxxx
-Median TTFT (ms): xxxxxxx
-P99 TTFT (ms): xxxxxxx
------Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): xxxxxxx
-Median TPOT (ms): xxxxxxx
-P99 TPOT (ms): xxxxxxx
----------------Inter-token Latency----------------
-Mean ITL (ms): xxxxxxx
-Median ITL (ms): xxxxxxx
-P99 ITL (ms): xxxxxxx
-==================================================
-```
diff --git a/inference/trillium/vLLM/Llama3.3/README.md b/inference/trillium/vLLM/Llama3.x/README.md
similarity index 87%
rename from inference/trillium/vLLM/Llama3.3/README.md
rename to inference/trillium/vLLM/Llama3.x/README.md
index 5c1e988..5e87814 100644
--- a/inference/trillium/vLLM/Llama3.3/README.md
+++ b/inference/trillium/vLLM/Llama3.x/README.md
@@ -1,17 +1,18 @@
-# Serve Llama3.3 with vLLM on TPU VMs
+# Serve Llama3.x with vLLM on TPU VMs
 
 In this guide, we show how to serve
+[Llama3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and
 [Llama3.3-70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).
 
 > **Note:** Access to Llama models on Hugging Face requires accepting the Community License Agreement and awaiting approval before you can download and serve them.
 
-## Step 0: Install `gcloud cli`
+## Step 0: Install `gcloud CLI`
 
 You can reproduce this experiment from your dev environment
 (e.g. your laptop).
 You need to install `gcloud` locally to complete this tutorial.
 
-To install `gcloud cli` please follow this guide:
+To install `gcloud CLI` please follow this guide:
 [Install the gcloud CLI](https://cloud.google.com/sdk/docs/install#mac)
 
 Once it is installed, you can login to GCP from your terminal with this
@@ -19,7 +20,8 @@ command: `gcloud auth login`.
 
 ## Step 1: Create a v6e TPU instance
 
-We create a single VM. For Llama3.3-70B, at least 8 chips are required. If you need a different number of
+We create a single VM. For Llama3.1-8B, 1 chip is sufficient and for the 70B
+models, at least 8 chips are required. If you need a different number of
 chips, you can set a different value for `--topology` such as `1x1`,
 `2x4`, etc.
@@ -53,7 +55,7 @@ export ZONE=your-tpu-zone
 export PROJECT=your-tpu-project
 export QR_ID=your-queued-resource-id # e.g. my-qr-request
 
-# This command requests a v6e-8 (8 chips). Adjust accelerator-type for different sizes. For 1 chip, use --accelerator-type v6e-1.
+# This command requests a v6e-8 (8 chips). Adjust accelerator-type for different sizes. For 1 chip (Llama3.1-8B), use --accelerator-type v6e-1.
 gcloud alpha compute tpus queued-resources create $QR_ID \
   --node-id $TPU_NAME \
   --project $PROJECT --zone $ZONE \
   --accelerator-type v6e-8 \
   --runtime-version v2-alpha-tpuv6e
@@ -69,25 +71,25 @@ gcloud alpha compute tpus queued-resources list --project $PROJECT --zone $ZONE
 
 Once the state is `ACTIVE`, your TPU VM is ready and you can proceed to the next steps.
 
-## Step 2: ssh to the instance
+## Step 2: SSH to the instance
 
 ```bash
 gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
 ```
 
-## Step 3: Use the vllm docker image for TPU
+## Step 3: Use the latest vLLM Docker image for TPU
 
 ```bash
 export DOCKER_URI=vllm/vllm-tpu:nightly-20251129-28607fc-39e63de
 ```
 
-The docker image is pinged here for users to reproduce the [results below](#section-benchmarking).
+The docker image is pinned here for users to reproduce the [results below](#section-benchmarking).
 
 To use the latest stable version, set `DOCKER_URI=vllm/vllm-tpu:latest`.
 
 To use the latest nightly built image that has more recent features/improvements, set `DOCKER_URI=vllm/vllm-tpu:nightly`.
 
-## Step 4: Run the docker container in the TPU instance
+## Step 4: Run the Docker container in the TPU instance
 
 ```bash
 sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
@@ -96,7 +98,7 @@ sudo docker run -it --rm --name $USER-vllm --privileged --net=host \
   -v /dev/shm:/dev/shm \
   --shm-size 150gb \
   --entrypoint /bin/bash ${DOCKER_URI}
 ```
 
-> **Note:** 150GB should be sufficient for the 70B model. For the 8B model allocate at least 17GB for the weights.
+> **Note:** 150GB should be sufficient for the 70B models. For the 8B model allocate at least 17GB for the weights.
 
 > **Note:** See [this guide](https://cloud.google.com/tpu/docs/attach-durable-block-storage) for attaching durable block storage to TPUs.
@@ -115,6 +117,8 @@ export HF_TOKEN=
 ## Step 6: Serve the model
 
 Now we start the vllm server.
 Make sure you keep this terminal open for the entire duration of this experiment.
 
+Here is the serving command for the 70B model:
+
 ```bash
 export MAX_MODEL_LEN=2048
 export TP=8 # number of chips
@@ -131,6 +135,13 @@ vllm serve meta-llama/Llama-3.3-70B-Instruct \
   --max-model-len $MAX_MODEL_LEN
 ```
 
+| Model | Input/Output Scenario | max-num-batched-tokens | max-num-seqs |
+| :--- | :--- | :--- | :--- |
+| Llama-3.x-70B-Instruct | Prefill Heavy | 2048 | 256 |
+| Llama-3.x-70B-Instruct | Decode Heavy/ Balanced | 512 | 256 |
+| Llama3.1-8B-Instruct | Prefill Heavy | 1024 | 128 |
+
+
 It takes a few minutes depending on the model size to prepare the server.
 Once you see the below snippet in the logs, it means that the server is ready
 to serve requests or run benchmarks:
@@ -156,7 +167,7 @@ export PROJECT=your-tpu-project
 gcloud compute tpus tpu-vm ssh $TPU_NAME --project $PROJECT --zone=$ZONE
 ```
 
-## Step 8: access the running container
+## Step 8: Access the running container
 
 ```bash
 sudo docker exec -it $USER-vllm bash
 ```
@@ -177,7 +188,7 @@ curl http://localhost:8000/v1/completions \
   }'
 ```
 
-## Step 10: Preparing the test image
+## Step 10: Prepare the test image
 
 You will need to install datasets as it's not available in the base vllm
 image.
 
 ```bash
 pip install datasets
 ```
 
-## Step 11: Run the benchmarking
+## Step 11: Run the benchmark
 
 Finally, we are ready to run the benchmark:
@@ -270,4 +281,4 @@ Mean ITL (ms): 35.12
 Median ITL (ms): 30.73
 P99 ITL (ms): 47.03
 ==================================================
-```
+```
\ No newline at end of file

From d44ab3b447eebb1fdffe37fa5b3c8d90085ca219 Mon Sep 17 00:00:00 2001
From: Harsh Shah
Date: Wed, 10 Dec 2025 08:25:36 +0000
Subject: [PATCH 2/2] Add a column for tensor parallel

---
 inference/trillium/vLLM/Llama3.x/README.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/inference/trillium/vLLM/Llama3.x/README.md b/inference/trillium/vLLM/Llama3.x/README.md
index 5e87814..72cab6f 100644
--- a/inference/trillium/vLLM/Llama3.x/README.md
+++ b/inference/trillium/vLLM/Llama3.x/README.md
@@ -135,11 +135,12 @@ vllm serve meta-llama/Llama-3.3-70B-Instruct \
   --max-model-len $MAX_MODEL_LEN
 ```
 
-| Model | Input/Output Scenario | max-num-batched-tokens | max-num-seqs |
-| :--- | :--- | :--- | :--- |
-| Llama-3.x-70B-Instruct | Prefill Heavy | 2048 | 256 |
-| Llama-3.x-70B-Instruct | Decode Heavy/ Balanced | 512 | 256 |
-| Llama3.1-8B-Instruct | Prefill Heavy | 1024 | 128 |
+| Model | Input/Output Scenario | max-num-batched-tokens | max-num-seqs | tensor-parallel-size |
+| :--- | :--- | :--- | :--- | :--- |
+| Llama-3.x-70B-Instruct | Prefill Heavy | 2048 | 256 | 8 |
+| Llama-3.x-70B-Instruct | Decode Heavy / Balanced | 512 | 256 | 8 |
+| Llama-3.1-8B-Instruct | Prefill Heavy | 1024 | 128 | 1 |
+
 
 
 It takes a few minutes depending on the model size to prepare the server.
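
For the 8B row in the tuning table above, the consolidated README lists only the parameter values, not a full command. Below is a minimal sketch of the corresponding serve invocation: it mirrors the recipe's 70B command on a v6e-1 host and swaps in the table's values (`--tensor-parallel-size 1`, `--max-num-batched-tokens 1024`, `--max-num-seqs 128`). The `MAX_MODEL_LEN` value here is an assumption, not something the recipe specifies.

```bash
# Sketch: 8B serving command derived from the tuning table above.
# MAX_MODEL_LEN=2048 is an assumed value; adjust it for your workload.
export MAX_MODEL_LEN=2048
export TP=1 # number of chips on a v6e-1 instance

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --seed 42 \
  --disable-log-requests \
  --gpu-memory-utilization 0.98 \
  --max-num-batched-tokens 1024 \
  --max-num-seqs 128 \
  --tensor-parallel-size $TP \
  --max-model-len $MAX_MODEL_LEN
```

If you test this variant with the curl request or the `vllm bench serve` command from the README, the model name in those commands needs to be changed to `meta-llama/Llama-3.1-8B-Instruct` as well.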
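Step 1 of the recipe checks readiness by listing all queued resources. If you prefer to poll a single request, a describe call along the lines of the sketch below should work; the `describe` subcommand and the `state.state` format path are assumptions about the gcloud CLI, not part of the recipe.

```bash
# Sketch (assumed gcloud invocation): poll the state of one queued-resource request.
gcloud alpha compute tpus queued-resources describe $QR_ID \
  --project $PROJECT --zone $ZONE \
  --format="value(state.state)" # expected to print ACTIVE once the TPU VM is ready
```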