* Create an artifact repository:

  ```bash
  gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
  ```

* Prepare datasets for [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) and [billsum](https://huggingface.co/datasets/FiscalNote/billsum):

  ```bash
  pip install datasets transformers numpy pandas tqdm matplotlib
  python datasets/import_dataset.py --hf_token YOUR_TOKEN
  ```

* Build the benchmark Docker image:

  ```bash
  docker build -t inference-benchmark .
  ```

* Push the Docker image to your artifact registry:

  ```bash
  docker tag inference-benchmark us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
  docker push us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
  ```

## Conduct Regression Tests

Run benchmarks using the configurations below, which are optimized for NVIDIA H100 GPUs (80 GB). Adjust configurations for other hardware as necessary.

### Test Case 1: Single Workload

- **Dataset:** `billsum_conversations.json` (created from the [HuggingFace billsum dataset](https://huggingface.co/datasets/FiscalNote/billsum)).
  *This dataset features long prompts, making it prefill-heavy and ideal for testing scenarios that emphasize initial token generation.*
- **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (*critical*)
- **Replicas:** 10 (vLLM)
- **Request Rates:** 300–350 QPS (increments of 10)

Refer to the example manifest:
`./config/manifests/regression-testing/single-workload-regression.yaml`

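As a rough sketch, assuming your kubeconfig already points at the prepared benchmark cluster and the default namespace, the run can be launched by applying the manifest with `kubectl`:

```bash
# Sketch: launch the single-workload regression benchmark against the current cluster.
kubectl apply -f ./config/manifests/regression-testing/single-workload-regression.yaml

# Watch the benchmark pods come up and run to completion
# (pod names depend on the Deployment defined in the manifest).
kubectl get pods -w
```
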
### Test Case 2: Multi-LoRA

- **Dataset:** `Infinity-Instruct_conversations.json` (created from the [HuggingFace Infinity-Instruct dataset](https://huggingface.co/datasets/BAAI/Infinity-Instruct)).
  *This dataset has long outputs, making it decode-heavy and useful for testing scenarios focusing on sustained token generation.*
- **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **LoRA Adapters:** 15 adapters (`nvidia/llama-3.1-nemoguard-8b-topic-control`, rank 8, critical)
- **Traffic Distribution:**
  - 60% on the first 5 adapters (12% each)
  - 30% on the next 5 adapters (6% each)
  - 10% on the last 5 adapters (2% each)
- **Max LoRA:** 3
- **Replicas:** 10 (vLLM)
- **Request Rates:** 20–200 QPS (increments of 20)

Optionally, you can also run benchmarks against the `ShareGPT` dataset for additional coverage.

Update the deployments for multi-LoRA support:
- vLLM Deployment: `./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml`
- InferenceModel: `./config/manifests/inferencemodel.yaml`

Refer to the example manifest:
`./config/manifests/regression-testing/multi-lora-regression.yaml`

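The same pattern applies here. A sketch, assuming the cluster from the setup above, that applies the multi-LoRA deployment and InferenceModel updates before launching the benchmark:

```bash
# Sketch: roll out the multi-LoRA vLLM deployment and InferenceModel objects first,
# then launch the multi-LoRA regression benchmark.
kubectl apply -f ./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml
kubectl apply -f ./config/manifests/inferencemodel.yaml
kubectl apply -f ./config/manifests/regression-testing/multi-lora-regression.yaml
```
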
### Execute Benchmarks

Benchmark in two phases: before and after applying your changes:

- **Before changes:**

  ```bash
  benchmark_id='regression-before' ./tools/benchmark/download-benchmark-results.bash
  ```

- **After changes:**

  ```bash
  benchmark_id='regression-after' ./tools/benchmark/download-benchmark-results.bash
  ```

## Analyze Benchmark Results

Use the provided Jupyter notebook (`./tools/benchmark/benchmark.ipynb`) to analyze results:

- Update benchmark IDs to `regression-before` and `regression-after`.
- Compare latency and throughput metrics, performing regression analysis.
- Check R² values specifically:
  - **Prompts Attempted/Succeeded:** Expect R² ≈ 1
  - **Output Tokens per Minute, P90 per Output Token Latency, P90 Latency:** Expect R² close to 1 (allow minor variance).

Identify significant deviations, investigate causes, and confirm performance meets expected standards.

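If you want to run the notebook locally rather than in an existing environment, a minimal sketch (assuming a Python environment where installing Jupyter is acceptable):

```bash
# Sketch: install Jupyter and open the analysis notebook locally.
pip install notebook
jupyter notebook ./tools/benchmark/benchmark.ipynb
```
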
# Nightly Benchmarking

To catch regressions early, we run a fully automated benchmark suite every night against the **latest `main` image** of the Gateway API inference extension. This pipeline uses LPG and the same manifests as above, running three benchmark configurations built from two standard datasets:

1. **Prefill-Heavy** (`billsum_conversations.json`)
   Emphasizes TTFT performance.
2. **Decode-Heavy** (`Infinity-Instruct_conversations.json`)
   Stresses sustained TPOT behavior.
3. **Multi-LoRA** (`billsum_conversations.json`)
   Uses 15 adapters with the traffic split defined above to capture complex adapter-loading and LoRA affinity scenarios.

**How it works**:

- The benchmarking runs are triggered every 6 hours.
- Each run provisions a GKE cluster with several NVIDIA H100 (80 GB) GPUs, deploys N vLLM server replicas along with the latest Gateway API extension and monitoring manifests, and then launches the benchmarking script.
- The Endpoint Picker deployed in this step is the latest Docker image built from the `main` branch:
  ```
  us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
  ```
- The pipeline then sequentially launches three benchmark runs (as described above) using the existing regression manifests.
- Results are uploaded to a central GCS bucket.
- A Looker Studio dashboard automatically refreshes to display key metrics:
  https://lookerstudio.google.com/u/0/reporting/c7ceeda6-6d5e-4688-bcad-acd076acfba6/page/6S4MF
- After the benchmark runs complete, the cluster is torn down.

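For reference, the image a given nightly run exercised can also be pulled and pinned by digest locally; this is only a sketch, assuming the `k8s-staging-images` Artifact Registry is publicly readable from your environment:

```bash
# Sketch: pull the nightly Endpoint Picker image and record its digest,
# which helps correlate dashboard results with an exact build.
docker pull us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
docker inspect --format='{{index .RepoDigests 0}}' \
  us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
```
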
**Alerting**:

- If any regression is detected, an on-call rotation (internal to GKE) is engaged for further investigation.