
Commit 43ada39

INFERENG-1814 Sync v0.10.1.1 - CUDA (red-hat-data-services#268)
## What's new

- Sync up to v0.10.1.1 tag upstream: https://github.com/vllm-project/vllm/releases/tag/v0.10.1.1
- Include cherry picks requested in https://issues.redhat.com/browse/INFERENG-1800
  - neuralmagic/nm-vllm-ent@362a3a5
  - neuralmagic/nm-vllm-ent@698e377
- Update Dockerfile with DeepGEMM installation from https://issues.redhat.com/browse/INFERENG-1823
  - neuralmagic/nm-vllm-ent@88ce821
- Note: this targets only CUDA. ROCm will be done in a separate PR.

## Accept-sync

- CUDA: https://github.com/neuralmagic/nm-cicd/actions/runs/17277658249
- Image: quay.io/vllm/automation-vllm:cuda-17277658249
- nm-cicd PR: neuralmagic/nm-cicd#245
2 parents 39b8acc + 3255598 commit 43ada39

File tree: 1,128 files changed, +69598 / -33129 lines changed


.buildkite/nightly-benchmarks/README.md

Lines changed: 25 additions & 29 deletions
@@ -7,7 +7,7 @@ This directory contains two sets of benchmark for vllm.
 - Performance benchmark: benchmark vllm's performance under various workload, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
 - Nightly benchmark: compare vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.
 
-See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
+See [vLLM performance dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
 
 ## Performance benchmark quick overview
 
@@ -28,6 +28,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc
 ## Trigger the benchmark
 
 Performance benchmark will be triggered when:
+
 - A PR being merged into vllm.
 - Every commit for those PRs with `perf-benchmarks` label AND `ready` label.
 
@@ -38,6 +39,7 @@ bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
 ```
 
 Runtime environment variables:
+
 - `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
 - `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
 - `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
@@ -46,12 +48,14 @@ Runtime environment variables:
 - `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
 
 Nightly benchmark will be triggered when:
+
 - Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
 
 ## Performance benchmark details
 
 See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
 > NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
+>
 ### Latency test
 
 Here is an example of one test inside `latency-tests.json`:
@@ -74,21 +78,21 @@ Here is an example of one test inside `latency-tests.json`:
 In this example:
 
 - The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
-- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+- The `parameters` attribute control the command line arguments to be used for `vllm bench latency`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `vllm bench latency`. For example, the corresponding command line arguments for `vllm bench latency` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
 
 Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.
 
 ### Throughput test
 
-The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `benchmark_throughput.py`.
+The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `vllm bench throughput`.
 
 The number of this test is also stable -- a slight change on the value of this number might vary the performance numbers by a lot.
 
 ### Serving test
 
-We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
+We test the throughput by using `vllm bench serve` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
 
 ```json
 [
@@ -100,7 +104,6 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t
             "tensor_parallel_size": 1,
             "swap_space": 16,
             "disable_log_stats": "",
-            "disable_log_requests": "",
             "load_format": "dummy"
         },
         "client_parameters": {
@@ -118,8 +121,8 @@ Inside this example:
 
 - The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
 - The `server-parameters` includes the command line arguments for vLLM server.
-- The `client-parameters` includes the command line arguments for `benchmark_serving.py`.
-- The `qps_list` controls the list of qps for test. It will be used to configure the `--request-rate` parameter in `benchmark_serving.py`
+- The `client-parameters` includes the command line arguments for `vllm bench serve`.
+- The `qps_list` controls the list of qps for test. It will be used to configure the `--request-rate` parameter in `vllm bench serve`
 
 The number of this test is less stable compared to the delay and latency benchmarks (due to randomized sharegpt dataset sampling inside `benchmark_serving.py`), but a large change on this number (e.g. 5% change) still vary the output greatly.
 
@@ -135,27 +138,20 @@ The raw benchmarking results (in the format of json files) are in the `Artifacts
 
 The `compare-json-results.py` helps to compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
 When run, benchmark script generates results under `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`.
-`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
+`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
+If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead.
 
-Here is an example using the script to compare result_a and result_b without detail test name.
-`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json --ignore_test_name`
+Here is an example using the script to compare result_a and result_b with Model, Dataset name, input/output length, max concurrency and qps.
+`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
 
-| | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
-|----|----------------------------------------|----------------------------------------|----------|
-| 0 | 142.633982 | 156.526018 | 1.097396 |
-| 1 | 241.620334 | 294.018783 | 1.216863 |
-| 2 | 218.298905 | 262.664916 | 1.203235 |
-| 3 | 242.743860 | 299.816190 | 1.235113 |
+| | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
+|----|---------------------------------------|--------|-----|-----|------|-----|-----------|----------|----------|
+| 0 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | 1 | 142.633982 | 156.526018 | 1.097396 |
+| 1 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | inf| 241.620334 | 294.018783 | 1.216863 |
 
-Here is an example using the script to compare result_a and result_b with detail test name.
-`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
-| | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio |
-|---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------|
-| 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 |
-| 1 | serving_llama8B_tp1_sharegpt_qps_16 | 241.620334 | serving_llama8B_tp1_sharegpt_qps_16 | 294.018783 | 1.216863 |
-| 2 | serving_llama8B_tp1_sharegpt_qps_4 | 218.298905 | serving_llama8B_tp1_sharegpt_qps_4 | 262.664916 | 1.203235 |
-| 3 | serving_llama8B_tp1_sharegpt_qps_inf | 242.743860 | serving_llama8B_tp1_sharegpt_qps_inf | 299.816190 | 1.235113 |
-| 4 | serving_llama8B_tp2_random_1024_128_qps_1 | 96.613390 | serving_llama8B_tp4_random_1024_128_qps_1 | 108.404853 | 1.122048 |
+A comparison diagram will be generated below the table.
+Here is an example to compare between 96c/results_gnr_96c_091_tp2pp3 and 128c/results_gnr_128c_091_tp2pp3
+<img width="1886" height="828" alt="image" src="https://github.com/user-attachments/assets/c02a43ef-25d0-4fd6-90e5-2169a28682dd" />
 
 ## Nightly test details
 
@@ -164,9 +160,9 @@ See [nightly-descriptions.md](nightly-descriptions.md) for the detailed descript
 ### Workflow
 
 - The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
-- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
-- The `run-nightly-suite.sh` will redirect the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
-- At last, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.
+- Inside each container, we run [scripts/run-nightly-benchmarks.sh](scripts/run-nightly-benchmarks.sh), which will probe the serving engine of the current container.
+- The `scripts/run-nightly-benchmarks.sh` will parse the workload described in [nightly-tests.json](tests/nightly-tests.json) and launch the right benchmark for the specified serving engine via `scripts/launch-server.sh`.
+- At last, we run [scripts/summary-nightly-results.py](scripts/summary-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.
 
 ### Nightly tests
 
@@ -176,6 +172,6 @@ In [nightly-tests.json](tests/nightly-tests.json), we include the command line a
 
 The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
 
-WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
+WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `scripts/run-nightly-benchmarks.sh` and `scripts/launch-server.sh`.
 
 WARNING: populating `trt-llm` to latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).
Lines changed: 11 additions & 10 deletions
@@ -1,3 +1,4 @@
+# Nightly benchmark annotation
 
 ## Description
 
@@ -13,15 +14,15 @@ Please download the visualization scripts in the post
 
 - Find the docker we use in `benchmarking pipeline`
 - Deploy the docker, and inside the docker:
-  - Download `nightly-benchmarks.zip`.
-  - In the same folder, run the following code:
-
-  ```bash
-  export HF_TOKEN=<your HF token>
-  apt update
-  apt install -y git
-  unzip nightly-benchmarks.zip
-  VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
-  ```
+    - Download `nightly-benchmarks.zip`.
+    - In the same folder, run the following code:
+
+    ```bash
+    export HF_TOKEN=<your HF token>
+    apt update
+    apt install -y git
+    unzip nightly-benchmarks.zip
+    VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
+    ```
 
 And the results will be inside `./benchmarks/results`.

.buildkite/nightly-benchmarks/nightly-descriptions.md

Lines changed: 17 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -13,25 +13,25 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
1313
## Setup
1414

1515
- Docker images:
16-
- vLLM: `vllm/vllm-openai:v0.6.2`
17-
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
18-
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
19-
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
20-
- *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
21-
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
16+
- vLLM: `vllm/vllm-openai:v0.6.2`
17+
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
18+
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
19+
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
20+
- *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
21+
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
2222
- Hardware
23-
- 8x Nvidia A100 GPUs
23+
- 8x Nvidia A100 GPUs
2424
- Workload:
25-
- Dataset
26-
- ShareGPT dataset
27-
- Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output)
28-
- Decode-heavy dataset (in average 462 input tokens, 256 output tokens)
29-
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
30-
- Models: llama-3 8B, llama-3 70B.
31-
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
32-
- Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
33-
- Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
34-
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
25+
- Dataset
26+
- ShareGPT dataset
27+
- Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output)
28+
- Decode-heavy dataset (in average 462 input tokens, 256 output tokens)
29+
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
30+
- Models: llama-3 8B, llama-3 70B.
31+
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
32+
- Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
33+
- Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
34+
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
3535

3636
## Known issues
3737

.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+# Performance benchmarks descriptions
 
 ## Latency tests
 