Follow the detailed instructions [here](https://github.com/AI-Hypercomputer/infe

* Create an artifact repository:

```bash
gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
```

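If you want to sanity-check the new repository and let Docker authenticate to it before the push step below, a minimal sketch (assuming the gcloud CLI is installed and pointed at the correct project):

```bash
# Confirm the repository exists in the expected location.
gcloud artifacts repositories describe ai-benchmark --location=us-central1

# Register gcloud as a Docker credential helper for this registry host,
# so the later `docker push` to us-central1-docker.pkg.dev is authorized.
gcloud auth configure-docker us-central1-docker.pkg.dev
```
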
* Prepare datasets for [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) and [billsum](https://huggingface.co/datasets/FiscalNote/billsum):

```bash
pip install datasets transformers numpy pandas tqdm matplotlib
python datasets/import_dataset.py --hf_token YOUR_TOKEN
```

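As a quick sanity check, you can confirm that the converted conversation files were produced and parse as valid JSON. The file paths below are assumptions (the import script may write elsewhere); adjust them to match your setup:

```bash
# Hypothetical check: both converted datasets should exist and load as JSON.
python3 -c "import json; print('billsum entries:', len(json.load(open('billsum_conversations.json'))))"
python3 -c "import json; print('Infinity-Instruct entries:', len(json.load(open('Infinity-Instruct_conversations.json'))))"
```
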
* Build the benchmark Docker image:

```bash
docker build -t inference-benchmark .
```

* Push the Docker image to your artifact registry:

```bash
docker tag inference-benchmark us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
docker push us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
```

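In the commands above, `{project-name}` is your Google Cloud project ID. If you prefer not to type it, a small sketch that reads it from the active gcloud configuration:

```bash
# Reuse the active gcloud project ID for tagging and pushing the image.
PROJECT=$(gcloud config get-value project)
docker tag inference-benchmark "us-central1-docker.pkg.dev/${PROJECT}/ai-benchmark/inference-benchmark"
docker push "us-central1-docker.pkg.dev/${PROJECT}/ai-benchmark/inference-benchmark"
```
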
## Conduct Regression Tests

Run benchmarks using the configurations below, which are optimized for NVIDIA H100 GPUs (80 GB). Adjust configurations for other hardware as necessary.

### Test Case 1: Single Workload

- **Dataset:** `billsum_conversations.json` (created from [HuggingFace billsum dataset](https://huggingface.co/datasets/FiscalNote/billsum)).
  *This dataset features long prompts, making it prefill-heavy and ideal for testing scenarios that emphasize initial token generation.*
- **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (*critical*)
- **Replicas:** 10 (vLLM)
- **Request Rates:** 300–350 QPS (increments of 10)

Refer to example manifest:
`./config/manifests/regression-testing/single-workload-regression.yaml`

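One way to launch this test case is to apply the manifest with kubectl; this is a minimal sketch that assumes your kubeconfig already points at the cluster running the vLLM replicas and the benchmarking tool:

```bash
# Apply the single-workload regression manifest to the current cluster context.
kubectl apply -f ./config/manifests/regression-testing/single-workload-regression.yaml

# Watch the benchmark pods come up (pod names depend on the manifest contents).
kubectl get pods -w
```
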
### Test Case 2: Multi-LoRA

- **Dataset:** `Infinity-Instruct_conversations.json` (created from [HuggingFace Infinity-Instruct dataset](https://huggingface.co/datasets/BAAI/Infinity-Instruct)).
  *This dataset has long outputs, making it decode-heavy and useful for testing scenarios focusing on sustained token generation.*
- **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **LoRA Adapters:** 15 adapters (`nvidia/llama-3.1-nemoguard-8b-topic-control`, rank 8, critical)
- **Traffic Distribution:**
    - 60% on the first 5 adapters (12% each)
    - 30% on the next 5 adapters (6% each)
    - 10% on the last 5 adapters (2% each)
- **Max LoRA:** 3
- **Replicas:** 10 (vLLM)
- **Request Rates:** 20–200 QPS (increments of 20)

Optionally, you can also run benchmarks against the `ShareGPT` dataset for additional coverage.

Update deployments for multi-LoRA support:

- vLLM Deployment: `./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml`
- InferenceModel: `./config/manifests/inferencemodel.yaml`

Refer to example manifest:
`./config/manifests/regression-testing/multi-lora-regression.yaml`

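As with the single workload, a minimal sketch for deploying this configuration with kubectl, assuming the manifests are applied to the same cluster in order (model servers first, then the InferenceModel objects, then the benchmark itself):

```bash
# Deploy the multi-LoRA vLLM servers and InferenceModel definitions,
# then start the multi-LoRA regression benchmark.
kubectl apply -f ./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml
kubectl apply -f ./config/manifests/inferencemodel.yaml
kubectl apply -f ./config/manifests/regression-testing/multi-lora-regression.yaml
```
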
### Execute Benchmarks

Benchmark in two phases: before and after applying your changes:

- **Before changes:**

```bash
benchmark_id='regression-before' ./tools/benchmark/download-benchmark-results.bash
```

- **After changes:**

```bash
benchmark_id='regression-after' ./tools/benchmark/download-benchmark-results.bash
```

## Analyze Benchmark Results

Use the provided Jupyter notebook (`./tools/benchmark/benchmark.ipynb`) to analyze the results:

- Update benchmark IDs to `regression-before` and `regression-after`.
- Compare latency and throughput metrics, performing regression analysis.
- Check R² values specifically:
    - **Prompts Attempted/Succeeded:** Expect R² ≈ 1
    - **Output Tokens per Minute, P90 per Output Token Latency, P90 Latency:** Expect R² close to 1 (allow minor variance).
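
For interpretation, R² here can be read as the usual coefficient of determination between the before and after series for each metric (the notebook's exact fitting procedure may differ in detail); values near 1 mean the post-change numbers track the pre-change numbers almost exactly:

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

where $y_i$ are the observed values, $\hat{y}_i$ the fitted values, and $\bar{y}$ their mean.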

Identify significant deviations, investigate causes, and confirm performance meets expected standards.

# Nightly Benchmarking

To catch regressions early, we run a fully automated benchmark suite every night against the **latest `main` image** of the Gateway API inference extension. This pipeline uses LPG and the same manifests as above, exercising three standard benchmark scenarios built from two datasets:

1. **Prefill-Heavy** (`billsum_conversations.json`)
    Emphasizes time-to-first-token (TTFT) performance.
2. **Decode-Heavy** (`Infinity-Instruct_conversations.json`)
    Stresses sustained time-per-output-token (TPOT) behavior.
3. **Multi-LoRA** (`billsum_conversations.json`)
    Uses 15 adapters with the traffic split defined above to capture complex adapter-loading and LoRA affinity scenarios.

**How it works**:

- The benchmarking runs are triggered every 6 hours.
- It provisions a GKE cluster with several NVIDIA H100 (80 GB) GPUs, deploys the configured number of vLLM server replicas along with the latest Gateway API extension and monitoring manifests, then launches the benchmarking script.
- As part of that step, it deploys the latest Endpoint Picker image built from the `main` branch:
  ```
  us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
  ```
- It sequentially launches three benchmark runs (as described above) using the existing regression manifests.
- Results are uploaded to a central GCS bucket.
- A Looker Studio dashboard automatically refreshes to display key metrics:
  https://lookerstudio.google.com/u/0/reporting/c7ceeda6-6d5e-4688-bcad-acd076acfba6/page/6S4MF
- After the benchmark runs complete, it tears down the cluster.

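To reproduce a nightly run locally, you can pull the same EPP image referenced above; this assumes the staging registry is readable from your environment:

```bash
# Fetch the nightly Endpoint Picker image built from the main branch.
docker pull us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
```
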
**Alerting**:

- If any regression is detected, an internal GKE oncall is alerted for further investigation.