Commit 0c28053

Authored and committed by Copybara
Copybara import of gpu-recipes:
- 54fe6d92f32a01bdcfe19c4d2829896ef1d37b9f Single node TRT-LLM Benchmarking of Llama 3.1 405B GitOrigin-RevId: 54fe6d92f32a01bdcfe19c4d2829896ef1d37b9f
1 parent 611d8ff commit 0c28053

File tree: 10 files changed (+842 −4 lines)

README.md

Lines changed: 12 additions & 4 deletions
@@ -30,15 +30,23 @@ Welcome to the reproducible benchmark recipes repository for GPUs! This reposito

 | Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
 | ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
-| **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/README.md)
-| **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md)
-| **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/maxtext-pretraining-gke/README.md)
-| **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md) |
+| **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/maxtext-pretraining-gke/README.md)
+| **Llama-3.1-70B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/llama-3.1-70b/nemo-pretraining-gke/README.md)
+| **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | MaxText | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/maxtext-pretraining-gke/README.md)
+| **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md) |
+
+### Inference benchmarks A3 Ultra
+
+| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
+| ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
+| **Llama-3.1-405B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a3ultra/llama-3.1-405b/trtllm-inference-gke/single-node/README.md)


 ## Repository structure

 * **[training/](./training)**: Contains recipes to reproduce training benchmarks with GPUs.
+* **[inference/](./inference)**: Contains recipes to reproduce inference benchmarks with GPUs.
 * **[src/](./src)**: Contains shared dependencies required to run benchmarks, such as Docker and Helm charts.
 * **[docs/](./docs)**: Contains supporting documentation for the recipes, such as explanation of benchmark methodologies or configurations.

inference/a3ultra/llama-3.1-405b/trtllm-inference-gke/single-node/README.md

Lines changed: 381 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

clusterName:

huggingface:
  secretName: hf-secret
  secretData:
    token: "hf_api_token"

model:
  name: meta-llama/Llama-3.1-405B
  tp_size: 8
  pp_size: 1

job:
  ttlSecondsAfterFinished: 3600
  image:
    repository:
    tag:
  gpus: 8

volumes:
  ssdMountPath: "/ssd"
  gcsMounts:
    - bucketName:
      mountPath: "/gcs"

network:
  subnetworks[]:

benchmarks:
  experiments:
    # - isl: 1000
    #   osl: 1000
    #   num_requests: 3000
    - isl: 128
      osl: 128
      num_requests: 30000
    # - isl: 128
    #   osl: 2048
    #   num_requests: 3000
    # - isl: 128
    #   osl: 4096
    #   num_requests: 1500
    # - isl: 20000
    #   osl: 2000
    #   num_requests: 1000
    # - isl: 2048
    #   osl: 128
    #   num_requests: 3000
    # - isl: 2048
    #   osl: 2048
    #   num_requests: 1500
    # - isl: 500
    #   osl: 2000
    #   num_requests: 3000
    # - isl: 5000
    #   osl: 500
    #   num_requests: 1500
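
A usage sketch, not part of this commit: assuming the values file above is saved as values.yaml and the recipe's Helm chart (defined later in this diff) sits at a local path written here as <chart-dir>, the benchmark job could be launched against an existing GKE cluster roughly as follows, after filling in clusterName, the image repository and tag, and the GCS bucket name. All names and paths below are placeholders, not values from the recipe.

  # Hypothetical invocation; release name, <chart-dir>, bucket and image values are placeholders.
  helm install trtllm-llama-405b <chart-dir> \
    -f values.yaml \
    --set clusterName=my-a3ultra-cluster \
    --set job.image.repository=us-docker.pkg.dev/my-project/my-repo/trtllm \
    --set job.image.tag=0.16.0 \
    --set volumes.gcsMounts[0].bucketName=my-benchmark-bucket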
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

steps:
- name: 'gcr.io/cloud-builders/docker'
  args:
  - 'build'
  - '--tag=${_ARTIFACT_REGISTRY}/${_TRT_LLM_IMAGE}:${_TRT_LLM_VERSION}'
  - '--file=trtllm.Dockerfile'
  - '.'
  automapSubstitutions: true

images:
- '${_ARTIFACT_REGISTRY}/${_TRT_LLM_IMAGE}:${_TRT_LLM_VERSION}'
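
A hedged usage sketch, not included in the commit: assuming this build config is saved as cloudbuild.yml (the file name is not visible in this view), it would typically be submitted to Cloud Build with the three substitutions it references. The registry path, image name, and version below are placeholders.

  # Hypothetical submission; replace the substitution values with your own.
  gcloud builds submit . \
    --config cloudbuild.yml \
    --substitutions=_ARTIFACT_REGISTRY=us-docker.pkg.dev/my-project/my-repo,_TRT_LLM_IMAGE=trtllm,_TRT_LLM_VERSION=0.16.0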
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
hf_transfer==0.1.9
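
hf_transfer is the Rust-based transfer accelerator for huggingface_hub, presumably pinned here so the Llama 3.1 405B checkpoint can be downloaded quickly to the node. A minimal sketch of how it is usually enabled, as an illustration rather than a command taken from this recipe:

  # Hypothetical download step with hf_transfer enabled; the target path mirrors the /ssd mount used by the benchmark script.
  export HF_HUB_ENABLE_HF_TRANSFER=1
  huggingface-cli download meta-llama/Llama-3.1-405B --local-dir /ssd/meta-llama/Llama-3.1-405B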
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
#
# This file is autogenerated by pip-compile with Python 3.11
# by the following command:
#
# pip-compile --generate-hashes requirements.in
#
hf-transfer==0.1.9 \
    --hash=sha256:035572865dab29d17e783fbf1e84cf1cb24f3fcf8f1b17db1cfc7fdf139f02bf \
    --hash=sha256:0d991376f0eac70a60f0cbc95602aa708a6f7c8617f28b4945c1431d67b8e3c8 \
    --hash=sha256:16f208fc678911c37e11aa7b586bc66a37d02e636208f18b6bc53d29b5df40ad \
    --hash=sha256:1a6bd16c667ebe89a069ca163060127a794fa3a3525292c900b8c8cc47985b0d \
    --hash=sha256:2c7fc1b85f4d0f76e452765d7648c9f4bfd0aedb9ced2ae1ebfece2d8cfaf8e2 \
    --hash=sha256:3a736dfbb2c84f5a2c975478ad200c0c8bfcb58a25a35db402678fb87ce17fa4 \
    --hash=sha256:3ebc4ab9023414880c8b1d3c38174d1c9989eb5022d37e814fa91a3060123eb0 \
    --hash=sha256:435cc3cdc8524ce57b074032b8fd76eed70a4224d2091232fa6a8cef8fd6803e \
    --hash=sha256:504b8427fd785dd8546d53b9fafe6e436bd7a3adf76b9dce556507650a7b4567 \
    --hash=sha256:57fd9880da1ee0f47250f735f791fab788f0aa1ee36afc49f761349869c8b4d9 \
    --hash=sha256:5828057e313de59300dd1abb489444bc452efe3f479d3c55b31a8f680936ba42 \
    --hash=sha256:5d561f0520f493c66b016d99ceabe69c23289aa90be38dd802d2aef279f15751 \
    --hash=sha256:6e94e8822da79573c9b6ae4d6b2f847c59a7a06c5327d7db20751b68538dc4f6 \
    --hash=sha256:8669dbcc7a3e2e8d61d42cd24da9c50d57770bd74b445c65123291ca842a7e7a \
    --hash=sha256:8674026f21ed369aa2a0a4b46000aca850fc44cd2b54af33a172ce5325b4fc82 \
    --hash=sha256:89a23f58b7b7effbc047b8ca286f131b17728c99a9f972723323003ffd1bb916 \
    --hash=sha256:8fd0167c4407a3bc4cdd0307e65ada2294ec04f1813d8a69a5243e379b22e9d8 \
    --hash=sha256:a5b366d34cd449fe9b20ef25941e6eef0460a2f74e7389f02e673e1f88ebd538 \
    --hash=sha256:cdca9bfb89e6f8f281890cc61a8aff2d3cecaff7e1a4d275574d96ca70098557 \
    --hash=sha256:d2fde99d502093ade3ab1b53f80da18480e9902aa960dab7f74fb1b9e5bc5746 \
    --hash=sha256:dc7fff1345980d6c0ebb92c811d24afa4b98b3e07ed070c8e38cc91fd80478c5 \
    --hash=sha256:e66acf91df4a8b72f60223059df3003062a5ae111757187ed1a06750a30e911b \
    --hash=sha256:e6ac4eddcd99575ed3735ed911ddf9d1697e2bd13aa3f0ad7e3904dd4863842e \
    --hash=sha256:ee8b10afedcb75f71091bcc197c526a6ebf5c58bbbadb34fdeee6160f55f619f \
    --hash=sha256:fc6bd19e1cc177c66bdef15ef8636ad3bde79d5a4f608c158021153b4573509d
    # via -r requirements.in
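
As the autogenerated header notes, this lock file is produced from requirements.in; a sketch of regenerating it after changing the pin, assuming pip-tools is available in the environment:

  pip install pip-tools
  pip-compile --generate-hashes requirements.in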
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
WORKDIR /workspace

# Copy the directories
COPY --from=nvcr.io/nvidia/pytorch:24.11-py3 /usr/local/lib/python3.12/dist-packages/functorch /usr/local/lib/python3.12/dist-packages/functorch
COPY --from=nvcr.io/nvidia/pytorch:24.11-py3 /usr/local/lib/python3.12/dist-packages/triton /usr/local/lib/python3.12/dist-packages/triton

# GCSfuse components (used to provide shared storage, not intended for high performance)
RUN apt update && apt install --yes --no-install-recommends \
    ca-certificates \
    curl \
    gnupg \
    cmake \
  && echo "deb https://packages.cloud.google.com/apt gcsfuse-buster main" \
    | tee /etc/apt/sources.list.d/gcsfuse.list \
  && echo "deb https://packages.cloud.google.com/apt cloud-sdk main" \
    | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list \
  && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
  && apt-get update \
  && apt-get install --yes gcsfuse \
  && apt-get install --yes google-cloud-cli \
  && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
  && mkdir /gcs

RUN git clone -b v0.16.0 https://github.com/triton-inference-server/tensorrtllm_backend.git && \
    cd tensorrtllm_backend && \
    git submodule update --init --recursive && \
    git lfs install && \
    git lfs pull

COPY requirements.txt /workspace/requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

ENTRYPOINT [ "/bin/bash" ]
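
For local iteration outside Cloud Build, a hedged sketch of building and pushing the same image directly with Docker; the registry path and tag are placeholders:

  # Hypothetical local build; mirrors the Cloud Build step defined earlier in this diff.
  docker build --file trtllm.Dockerfile --tag us-docker.pkg.dev/my-project/my-repo/trtllm:0.16.0 .
  docker push us-docker.pkg.dev/my-project/my-repo/trtllm:0.16.0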
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: trtllm-llama-3-1-405b-inference
description: trtllm-llama-3-1-405b-inference
type: application
version: 0.1.0
appVersion: "1.16.0"
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-benchmark-script
data:
  run_trtllm_bench.sh: |-
    #!/bin/bash

    # Function to run benchmarks
    run_benchmark() {
      local model_name=$1
      local isl=$2
      local osl=$3
      local num_requests=$4
      local tp_size=$5

      echo "Running benchmark for $model_name with ISL=$isl, OSL=$osl, TP=$tp_size"

      dataset_file="/ssd/token-norm-dist_${model_name##*/}_${isl}_${osl}_tp${tp_size}.json"
      output_file="/ssd/output_${model_name##*/}_isl${isl}_osl${osl}_tp${tp_size}.txt"

      python3 /workspace/tensorrtllm_backend/tensorrt_llm/benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist \
        --num-requests=$num_requests --input-mean=$isl --output-mean=$osl \
        --input-stdev=0 --output-stdev=0 > $dataset_file

      pp_size=1

      trtllm-bench --model $model_name --model_path /ssd/${model_name} --workspace /ssd build \
        --tp_size $tp_size --quantization FP8 --dataset $dataset_file

      engine_dir="/ssd/${model_name}/tp_${tp_size}_pp_${pp_size}"

      # Save throughput output to a file
      trtllm-bench --model $model_name --model_path /ssd/${model_name} throughput \
        --dataset $dataset_file --engine_dir $engine_dir \
        --kv_cache_free_gpu_mem_fraction 0.95 > $output_file

      cat $output_file
      gsutil cp $output_file /gcs/benchmark_logs/

      rm -rf $engine_dir
      rm -f $dataset_file
    }

    # Generated benchmark executions
    model_name="{{ .Values.model.name }}"
    tp_size={{ .Values.model.tp_size }}

    {{- range .Values.benchmarks.experiments }}
    run_benchmark "$model_name" {{ .isl }} {{ .osl }} {{ .num_requests }} $tp_size
    {{- end }}
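
To see what the Helm loop at the end expands to, the chart can be rendered locally. With the sample values file shown earlier (one active experiment: isl=128, osl=128, 30000 requests, tp_size=8), the rendered tail of run_trtllm_bench.sh reduces to a single call; <chart-dir> and the values file name are placeholders in this sketch.

  helm template trtllm-llama-405b <chart-dir> -f values.yaml
  # Expected tail of the rendered script for the sample values:
  #   model_name="meta-llama/Llama-3.1-405B"
  #   tp_size=8
  #   run_benchmark "$model_name" 128 128 30000 $tp_size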
