# vLLM benchmark suite

This directory contains two sets of benchmarks for vLLM:

- Performance benchmark: benchmarks vLLM's performance under various workloads, for **developers** to gain clarity on whether their PR improves/degrades vLLM's performance.
- Nightly benchmark: compares vLLM's performance against alternatives (TGI, TRT-LLM, and LMDeploy), for **the public** to know when to choose vLLM.

See the [vLLM performance dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm) for the latest performance benchmark results and the [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for the latest nightly benchmark results.

## Performance benchmark quick overview
## Trigger the benchmark

The performance benchmark will be triggered when:

- A PR gets merged into vLLM.
- Every commit for PRs with both the `perf-benchmarks` and `ready` labels.

The nightly benchmark will be triggered when:

- Every commit for PRs with both the `perf-benchmarks` and `nightly-benchmarks` labels.

## Performance benchmark details
See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
### Latency test
Here is an example of one test inside `latency-tests.json`:
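(The block below is a reconstruction consistent with the parameters discussed in this section; the exact contents of `latency-tests.json` in the repository may differ slightly.)

```json
[
  {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
      "model": "meta-llama/Meta-Llama-3-8B",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  }
]
```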
In this example:
- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute controls the command line arguments to be used for `vllm bench latency`. Note that you should use an underscore `_` instead of a dash `-` when specifying the command line arguments; `run-performance-benchmarks.sh` will convert the underscores to dashes when feeding the arguments to `vllm bench latency`. For example, the corresponding command line arguments for `vllm bench latency` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`.
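The underscore-to-dash conversion above can be sketched as follows. This is a hypothetical illustration: the real conversion lives in `run-performance-benchmarks.sh` (Bash), and `params_to_cli` is not part of the repository.

```python
# Hypothetical sketch of the underscore-to-dash conversion described above.
# The actual logic is implemented in run-performance-benchmarks.sh;
# this function is only an illustration of the idea.
def params_to_cli(params: dict) -> list:
    args = []
    for key, value in params.items():
        # "tensor_parallel_size" becomes "--tensor-parallel-size", etc.
        args.append("--" + key.replace("_", "-"))
        # Flag-style parameters are given as empty strings in the JSON
        # (e.g. "disable_log_stats": "") and take no value on the CLI.
        if value != "":
            args.append(str(value))
    return args

print(params_to_cli({
    "model": "meta-llama/Meta-Llama-3-8B",
    "tensor_parallel_size": 1,
    "load_format": "dummy",
    "num_iters_warmup": 5,
    "num_iters": 15,
}))
```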
Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
WARNING: The benchmarking script saves the JSON results by itself, so please do not configure the `--output-json` parameter in the JSON file.
### Throughput test
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters will be forwarded to `vllm bench throughput`.
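For illustration, a throughput test entry might look like the following. This is a hedged sketch following the latency-test syntax; the exact test names, dataset path, and parameter values in the repository's `throughput-tests.json` may differ.

```json
[
  {
    "test_name": "throughput_llama8B_tp1",
    "parameters": {
      "model": "meta-llama/Meta-Llama-3-8B",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200,
      "backend": "vllm"
    }
  }
]
```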
The number reported by this test is also stable, so even a slight change in this number can indicate a large change in the performance numbers.
### Serving test
We test the throughput by using `vllm bench serve` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```json
[
  {
    "test_name": "serving_llama8B_tp1_sharegpt",
    "qps_list": [1, 4, 16, "inf"],
    "server_parameters": {
      "model": "meta-llama/Meta-Llama-3-8B",
      "tensor_parallel_size": 1,
      "swap_space": 16,
      "disable_log_stats": "",
      "load_format": "dummy"
    },
    "client_parameters": {
      "model": "meta-llama/Meta-Llama-3-8B",
      "backend": "vllm",
      "dataset_name": "sharegpt",
      "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
      "num_prompts": 200
    }
  }
]
```

Inside this example:

- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server-parameters` includes the command line arguments for the vLLM server.
- The `client-parameters` includes the command line arguments for `vllm bench serve`.
- The `qps_list` controls the list of QPS values for the test. It will be used to configure the `--request-rate` parameter in `vllm bench serve`.
The number of this test is less stable compared to the latency and throughput benchmarks (due to randomized ShareGPT dataset sampling inside `vllm bench serve`), but a large change in this number (e.g. a 5% change) still indicates a meaningful difference in performance.

## Visualizing the results

The raw benchmarking results (in the format of JSON files) are in the `Artifacts` tab of the benchmarking run.

The `compare-json-results.py` script helps to compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
When run, the benchmark script generates results under the `benchmark/results` folder, along with `benchmark_results.md` and `benchmark_results.json`.
`compare-json-results.py` compares two `benchmark_results.json` files and reports performance ratios, e.g. for output throughput, median TTFT, and median TPOT.
If only one `benchmark_results.json` is passed, `compare-json-results.py` compares different TP and PP configurations within that file instead.
Here is an example of using the script to compare `result_a` and `result_b` without detailed test names:
| | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|---|-------|--------------|-----------|------------|----------------------|-----|----------------------------------|----------------------------------|------------|

See [nightly-descriptions.md](nightly-descriptions.md) for the detailed descriptions.

### Workflow
- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
- Inside each container, we run [scripts/run-nightly-benchmarks.sh](scripts/run-nightly-benchmarks.sh), which will probe the serving engine of the current container.
- `scripts/run-nightly-benchmarks.sh` will parse the workload described in [nightly-tests.json](tests/nightly-tests.json) and launch the right benchmark for the specified serving engine via `scripts/launch-server.sh`.
- Finally, we run [scripts/summary-nightly-results.py](scripts/summary-nightly-results.py) to collect and plot the final benchmarking results, and upload the results to Buildkite.
### Nightly tests
In [nightly-tests.json](tests/nightly-tests.json), we include the command line arguments for benchmarking.

The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `scripts/run-nightly-benchmarks.sh` and `scripts/launch-server.sh`.
WARNING: updating `trt-llm` to the latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).