
Commit f33ff43

Add nightly benchmarking documentation (#1234)
* add nightly benchmarking documentation
* move nightly benchmarking documentation to the end of index.md
* update regression testing index.md
* update regression testing index.md
* update regression testing index.md
* update regression testing index.md
* update regression testing index.md
1 parent 62f9431 commit f33ff43

File tree

  • site-src/performance/regression-testing

1 file changed: +71 additions, −40 deletions

site-src/performance/regression-testing/index.md

Lines changed: 71 additions & 40 deletions
@@ -14,64 +14,66 @@ Follow the detailed instructions [here](https://github.com/AI-Hypercomputer/infe

* Create an artifact repository:

```bash
gcloud artifacts repositories create ai-benchmark --location=us-central1 --repository-format=docker
```

* Prepare datasets for [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) and [billsum](https://huggingface.co/datasets/FiscalNote/billsum):

```bash
pip install datasets transformers numpy pandas tqdm matplotlib
python datasets/import_dataset.py --hf_token YOUR_TOKEN
```

* Build the benchmark Docker image:

```bash
docker build -t inference-benchmark .
```

* Push the Docker image to your artifact registry:

```bash
docker tag inference-benchmark us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
docker push us-central1-docker.pkg.dev/{project-name}/ai-benchmark/inference-benchmark
```

## Conduct Regression Tests

Run benchmarks using the configurations below, which are optimized for NVIDIA H100 GPUs (80 GB). Adjust configurations for other hardware as necessary.

### Test Case 1: Single Workload

- **Dataset:** `billsum_conversations.json` (created from the [HuggingFace billsum dataset](https://huggingface.co/datasets/FiscalNote/billsum)).
  *This dataset features long prompts, making it prefill-heavy and ideal for testing scenarios that emphasize initial token generation.*
- **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (*critical*)
- **Replicas:** 10 (vLLM)
- **Request Rates:** 300–350 QPS (increments of 10)

Refer to the example manifest:
`./config/manifests/regression-testing/single-workload-regression.yaml`
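
As a rough end-to-end sketch (assuming `kubectl` access to the benchmark cluster and that the request-rate sweep is encoded in the manifest itself), this test case can be launched and its results collected like so:

```bash
# Deploy the single-workload benchmark job defined by the example manifest.
kubectl apply -f ./config/manifests/regression-testing/single-workload-regression.yaml

# Once the run finishes, pull the results down under a descriptive benchmark id
# (the same helper script is used in the "Execute Benchmarks" section below).
benchmark_id='regression-before' ./tools/benchmark/download-benchmark-results.bash
```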

### Test Case 2: Multi-LoRA

- **Dataset:** `Infinity-Instruct_conversations.json` (created from the [HuggingFace Infinity-Instruct dataset](https://huggingface.co/datasets/BAAI/Infinity-Instruct)).
  *This dataset has long outputs, making it decode-heavy and useful for testing scenarios focusing on sustained token generation.*
- **Model:** [Llama 3 (8B)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **LoRA Adapters:** 15 adapters (`nvidia/llama-3.1-nemoguard-8b-topic-control`, rank 8, critical)
- **Traffic Distribution:**
  - 60% on the first 5 adapters (12% each)
  - 30% on the next 5 adapters (6% each)
  - 10% on the last 5 adapters (2% each)
- **Max LoRA:** 3
- **Replicas:** 10 (vLLM)
- **Request Rates:** 20–200 QPS (increments of 20)

Optionally, you can also run benchmarks against the `ShareGPT` dataset for additional coverage.

Update the deployments for multi-LoRA support:
- vLLM Deployment: `./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml`
- InferenceModel: `./config/manifests/inferencemodel.yaml`

Refer to the example manifest:
`./config/manifests/regression-testing/multi-lora-regression.yaml`
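
A minimal sketch of wiring up this multi-LoRA variant with the manifests listed above (assuming `kubectl` access to the cluster; the apply order may differ in your setup):

```bash
# Switch the serving stack to the multi-LoRA vLLM deployment and InferenceModel.
kubectl apply -f ./config/manifests/regression-testing/vllm/multi-lora-deployment.yaml
kubectl apply -f ./config/manifests/inferencemodel.yaml

# Launch the multi-LoRA regression benchmark run.
kubectl apply -f ./config/manifests/regression-testing/multi-lora-regression.yaml
```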

### Execute Benchmarks
@@ -80,15 +82,15 @@ Benchmark in two phases: before and after applying your changes:

- **Before changes:**

```bash
benchmark_id='regression-before' ./tools/benchmark/download-benchmark-results.bash
```

- **After changes:**

```bash
benchmark_id='regression-after' ./tools/benchmark/download-benchmark-results.bash
```

## Analyze Benchmark Results

@@ -97,7 +99,36 @@ Use the provided Jupyter notebook (`./tools/benchmark/benchmark.ipynb`) to analy

- Update benchmark IDs to `regression-before` and `regression-after`.
- Compare latency and throughput metrics, performing regression analysis.
- Check R² values specifically:
  - **Prompts Attempted/Succeeded:** Expect R² ≈ 1
  - **Output Tokens per Minute, P90 per Output Token Latency, P90 Latency:** Expect R² close to 1 (allow minor variance).

Identify significant deviations, investigate causes, and confirm performance meets expected standards.
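
If Jupyter is not already part of your environment, one minimal way to open the notebook locally (a sketch; any Jupyter frontend works) is:

```bash
# Install the classic notebook frontend and open the analysis notebook.
pip install notebook
jupyter notebook ./tools/benchmark/benchmark.ipynb
```
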
# Nightly Benchmarking

To catch regressions early, we run a fully automated benchmark suite every night against the **latest `main` image** of the Gateway API inference extension. This pipeline uses LPG and the same manifests as above, but runs three standard workloads built from two datasets:

1. **Prefill-Heavy** (`billsum_conversations.json`)
   Emphasizes TTFT (time to first token) performance.
2. **Decode-Heavy** (`Infinity-Instruct_conversations.json`)
   Stresses sustained TPOT (time per output token) behavior.
3. **Multi-LoRA** (`billsum_conversations.json`)
   Uses 15 adapters with the traffic split defined above to capture complex adapter-loading and LoRA affinity scenarios.

**How it works**:

- Benchmark runs are triggered every 6 hours.
- Each run provisions a GKE cluster with several NVIDIA H100 (80 GB) GPUs, deploys N vLLM server replicas along with the latest Gateway API extension and monitoring manifests, then launches the benchmarking script.
- The deployment uses the latest Endpoint Picker image built from the `main` branch:

  ```
  us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
  ```

- It sequentially launches three benchmark runs (as described above) using the existing regression manifests.
- Results are uploaded to a central GCS bucket.
- A Looker Studio dashboard automatically refreshes to display key metrics:
  https://lookerstudio.google.com/u/0/reporting/c7ceeda6-6d5e-4688-bcad-acd076acfba6/page/6S4MF
- After the benchmark runs complete, the cluster is torn down.
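
To inspect the exact Endpoint Picker build a nightly run used, you can pull that image locally (a sketch; this assumes you have pull access to the staging registry):

```bash
# Pull the latest main-branch Endpoint Picker image used by the nightly pipeline.
docker pull us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main
```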

**Alerting**:

- If a regression is detected, an on-call rotation (internal to GKE) is in place to investigate further.
