
Commit b6ddf62

Merge pull request #613 from ROCm/upstream_merge_2025_07_29
Upstream merge 2025 07 29
2 parents (7545048 + 4fe15a8) · commit b6ddf62


1,187 files changed: +92,327 additions, -51,278 deletions


.buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh

Lines changed: 1 addition & 1 deletion
@@ -46,6 +46,6 @@ while getopts "m:b:l:f:t:" OPT; do
 done

 lm_eval --model vllm \
-  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
+  --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,add_bos_token=true,trust_remote_code=true,max_model_len=4096" \
   --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
   --batch_size "$BATCH_SIZE"

.buildkite/lm-eval-harness/test_lm_eval_correctness.py

Lines changed: 3 additions & 1 deletion
@@ -18,12 +18,14 @@

 def launch_lm_eval(eval_config, tp_size):
     trust_remote_code = eval_config.get("trust_remote_code", False)
+    max_model_len = eval_config.get("max_model_len", 4096)
     model_args = (
         f"pretrained={eval_config['model_name']},"
         f"tensor_parallel_size={tp_size},"
         f"enforce_eager=true,"
         f"add_bos_token=true,"
-        f"trust_remote_code={trust_remote_code}"
+        f"trust_remote_code={trust_remote_code},"
+        f"max_model_len={max_model_len}"
     )
     results = lm_eval.simple_evaluate(
         model="vllm",

.buildkite/nightly-benchmarks/README.md

Lines changed: 45 additions & 7 deletions
@@ -11,7 +11,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc

 ## Performance benchmark quick overview

-**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!), with different models.
+**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) and Intel® Xeon® Processors, with different models.

 **Benchmarking Duration**: about 1hr.

@@ -31,13 +31,27 @@ Performance benchmark will be triggered when:
 - A PR being merged into vllm.
 - Every commit for those PRs with `perf-benchmarks` label AND `ready` label.

+Manually Trigger the benchmark
+
+```bash
+bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
+```
+
+Runtime environment variables:
+- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
+- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
+- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
+- `THROUGHPUT_JSON`: JSON file to use for the throughout tests. Default value is empty string (use default file).
+- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
+- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
+
 Nightly benchmark will be triggered when:
 - Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.

 ## Performance benchmark details

 See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
-
+> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
 ### Latency test

 Here is an example of one test inside `latency-tests.json`:
@@ -60,21 +74,21 @@ Here is an example of one test inside `latency-tests.json`:
 In this example:

 - The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
-- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+- The `parameters` attribute control the command line arguments to be used for `vllm bench latency`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `vllm bench latency`. For example, the corresponding command line arguments for `vllm bench latency` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`

 Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.

 WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.

 ### Throughput test

-The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `benchmark_throughput.py`.
+The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `vllm bench throughput`.

 The number of this test is also stable -- a slight change on the value of this number might vary the performance numbers by a lot.

 ### Serving test

-We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
+We test the throughput by using `vllm bench serve` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

 ```json
 [
@@ -104,8 +118,8 @@ Inside this example:

 - The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
 - The `server-parameters` includes the command line arguments for vLLM server.
-- The `client-parameters` includes the command line arguments for `benchmark_serving.py`.
-- The `qps_list` controls the list of qps for test. It will be used to configure the `--request-rate` parameter in `benchmark_serving.py`
+- The `client-parameters` includes the command line arguments for `vllm bench serve`.
+- The `qps_list` controls the list of qps for test. It will be used to configure the `--request-rate` parameter in `vllm bench serve`

 The number of this test is less stable compared to the delay and latency benchmarks (due to randomized sharegpt dataset sampling inside `benchmark_serving.py`), but a large change on this number (e.g. 5% change) still vary the output greatly.

@@ -119,6 +133,30 @@ If you do not see the table, please wait till the benchmark finish running.
 The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
 The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.

+The `compare-json-results.py` helps to compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
+When run, benchmark script generates results under `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`.
+`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
+
+Here is an example using the script to compare result_a and result_b without detail test name.
+`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json --ignore_test_name`
+
+|    | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
+|----|----------------------------------|----------------------------------|------------|
+| 0  | 142.633982                       | 156.526018                       | 1.097396   |
+| 1  | 241.620334                       | 294.018783                       | 1.216863   |
+| 2  | 218.298905                       | 262.664916                       | 1.203235   |
+| 3  | 242.743860                       | 299.816190                       | 1.235113   |
+
+Here is an example using the script to compare result_a and result_b with detail test name.
+`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
+| | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio |
+|---|---------------------------------------------|------------|---------------------------------------------|------------|----------|
+| 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 |
+| 1 | serving_llama8B_tp1_sharegpt_qps_16 | 241.620334 | serving_llama8B_tp1_sharegpt_qps_16 | 294.018783 | 1.216863 |
+| 2 | serving_llama8B_tp1_sharegpt_qps_4 | 218.298905 | serving_llama8B_tp1_sharegpt_qps_4 | 262.664916 | 1.203235 |
+| 3 | serving_llama8B_tp1_sharegpt_qps_inf | 242.743860 | serving_llama8B_tp1_sharegpt_qps_inf | 299.816190 | 1.235113 |
+| 4 | serving_llama8B_tp2_random_1024_128_qps_1 | 96.613390 | serving_llama8B_tp4_random_1024_128_qps_1 | 108.404853 | 1.122048 |
+
 ## Nightly test details

 See [nightly-descriptions.md](nightly-descriptions.md) for the detailed description on test workload, models and docker containers of benchmarking other llm engines.
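
As a reading aid for the comparison tables above: `perf_ratio` is simply the metric from the second results file divided by the metric from the first. A minimal pandas sketch (not part of the commit) that reproduces the ratios in the first example table:

```python
# Sketch only: perf_ratio = results_b metric / results_a metric.
# The numbers are copied from the first example table in the README diff.
import pandas as pd

results_a = pd.Series([142.633982, 241.620334, 218.298905, 242.743860], name="results_a")
results_b = pd.Series([156.526018, 294.018783, 262.664916, 299.816190], name="results_b")

perf_ratio = (results_b / results_a).rename("perf_ratio")
print(pd.concat([results_a, results_b, perf_ratio], axis=1))
# perf_ratio ≈ 1.097396, 1.216863, 1.203235, 1.235113 -- matching the table above.
```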

.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md

Lines changed: 12 additions & 4 deletions
@@ -4,7 +4,8 @@
 - Input length: 32 tokens.
 - Output length: 128 tokens.
 - Batch size: fixed (8).
-- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- GPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- CPU Models: llama-3.1 8B.
 - Evaluation metrics: end-to-end latency (mean, median, p99).

 {latency_tests_markdown_table}
@@ -14,7 +15,8 @@
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
 - Output length: the corresponding output length of these 200 prompts.
 - Batch size: dynamically determined by vllm to achieve maximum throughput.
-- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- GPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- CPU Models: llama-3.1 8B.
 - Evaluation metrics: throughput.

 {throughput_tests_markdown_table}
@@ -25,12 +27,18 @@
 - Output length: the corresponding output length of these 200 prompts.
 - Batch size: dynamically determined by vllm and the arrival pattern of the requests.
 - **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
-- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
-- We also added a speculative decoding test for llama-3 70B, under QPS 2
+- GPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- We also added a speculative decoding test for llama-3 70B on GPU, under QPS 2
+- CPU Models: llama-3.1 8B.
 - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
+- For CPU, we added random dataset tests to benchmark fixed input/output length with 100 prompts.

 {serving_tests_markdown_table}

+## Platform Information
+
+{platform_markdown_table}
+
 ## json version of the benchmarking tables

 This section contains the data of the markdown tables above in JSON format.
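
The descriptions file above is a `str.format` template: the `{..._markdown_table}` fields, including the new `{platform_markdown_table}`, are filled in by `convert-results-json-to-markdown.py` (shown further down). A minimal sketch of that substitution (not part of the commit; the table strings are placeholders):

```python
# Sketch only: the real tables come from tabulate(...) in
# convert-results-json-to-markdown.py; the strings below are placeholders.
template = (
    "{serving_tests_markdown_table}\n"
    "\n"
    "## Platform Information\n"
    "\n"
    "{platform_markdown_table}\n"
)

print(
    template.format(
        serving_tests_markdown_table="| Test name | Tput (req/s) |\n|---|---|\n| serving_example | 1.23 |",
        platform_markdown_table="|  | Platform Info |\n|---|---|\n| Physical cores | 8 |",
    )
)
```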

.buildkite/nightly-benchmarks/scripts/compare-json-results.py

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+import argparse
+
+import pandas as pd
+
+
+def compare_data_columns(
+    files, name_column, data_column, drop_column, ignore_test_name=False
+):
+    print("\ncompare_data_column: " + data_column)
+    frames = []
+    compare_frames = []
+    for file in files:
+        data_df = pd.read_json(file)
+        serving_df = data_df.dropna(subset=[drop_column], ignore_index=True)
+        if ignore_test_name is False:
+            serving_df = serving_df.rename(columns={name_column: file + "_name"})
+            frames.append(serving_df[file + "_name"])
+        serving_df = serving_df.rename(columns={data_column: file})
+        frames.append(serving_df[file])
+        compare_frames.append(serving_df[file])
+        if len(compare_frames) >= 2:
+            # Compare numbers among two files
+            ratio_df = compare_frames[1] / compare_frames[0]
+            frames.append(ratio_df)
+            compare_frames.pop(1)
+
+    concat_df = pd.concat(frames, axis=1)
+    return concat_df
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-f", "--file", action="append", type=str, help="input file name"
+    )
+    parser.add_argument(
+        "--ignore_test_name", action="store_true", help="ignore_test_name or not"
+    )
+    args = parser.parse_args()
+    files = args.file
+    print("comparing : " + ", ".join(files))
+
+    drop_column = "P99"
+    name_column = "Test name"
+    data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
+    html_msgs_for_data_cols = [
+        "Compare Output Tokens /n",
+        "Median TTFT /n",
+        "Median TPOT /n",
+    ]
+    ignore_test_name = args.ignore_test_name
+    with open("perf_comparison.html", "w") as text_file:
+        for i in range(len(data_cols_to_compare)):
+            output_df = compare_data_columns(
+                files,
+                name_column,
+                data_cols_to_compare[i],
+                drop_column,
+                ignore_test_name=ignore_test_name,
+            )
+            print(output_df)
+            html = output_df.to_html()
+            text_file.write(html_msgs_for_data_cols[i])
+            text_file.write(html)
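
A usage sketch for the new script (not part of the commit): because the file name is hyphenated, it can be loaded with `importlib` and `compare_data_columns` called directly. The script path and the two results paths below are assumptions for illustration; the column names match the ones the README compares.

```python
# Sketch only: load the hyphen-named script as a module and call
# compare_data_columns directly. Paths are placeholders for illustration.
import importlib.util

spec = importlib.util.spec_from_file_location(
    "compare_json_results",
    ".buildkite/nightly-benchmarks/scripts/compare-json-results.py",
)
compare_json_results = importlib.util.module_from_spec(spec)
spec.loader.exec_module(compare_json_results)

df = compare_json_results.compare_data_columns(
    files=["results_a/benchmark_results.json", "results_b/benchmark_results.json"],
    name_column="Test name",
    data_column="Output Tput (tok/s)",  # one of the columns the README compares
    drop_column="P99",                  # rows without a P99 value are dropped
    ignore_test_name=False,
)
print(df)
```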

.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py

Lines changed: 56 additions & 13 deletions
@@ -3,9 +3,11 @@

 import json
 import os
+from importlib import util
 from pathlib import Path

 import pandas as pd
+import psutil
 from tabulate import tabulate

 results_folder = Path("results/")
@@ -29,28 +31,30 @@
 throughput_results_column_mapping = {
     "test_name": "Test name",
     "gpu_type": "GPU",
-    # "num_requests": "# of req.",
-    # "total_num_tokens": "Total # of tokens",
-    # "elapsed_time": "Elapsed time (s)",
+    "num_requests": "# of req.",
+    "total_num_tokens": "Total # of tokens",
+    "elapsed_time": "Elapsed time (s)",
     "requests_per_second": "Tput (req/s)",
-    # "tokens_per_second": "Tput (tok/s)",
+    "tokens_per_second": "Tput (tok/s)",
 }

 # serving results and the keys that will be printed into markdown
 serving_results = []
 serving_column_mapping = {
     "test_name": "Test name",
     "gpu_type": "GPU",
-    # "completed": "# of req.",
+    "completed": "# of req.",
     "request_throughput": "Tput (req/s)",
-    # "input_throughput": "Input Tput (tok/s)",
-    # "output_throughput": "Output Tput (tok/s)",
+    "total_token_throughput": "Total Token Tput (tok/s)",
+    "output_throughput": "Output Tput (tok/s)",
+    "total_input_tokens": "Total input tokens",
+    "total_output_tokens": "Total output tokens",
     "mean_ttft_ms": "Mean TTFT (ms)",
     "median_ttft_ms": "Median TTFT (ms)",
     "p99_ttft_ms": "P99 TTFT (ms)",
-    # "mean_tpot_ms": "Mean TPOT (ms)",
-    # "median_tpot_ms": "Median",
-    # "p99_tpot_ms": "P99",
+    "mean_tpot_ms": "Mean TPOT (ms)",
+    "median_tpot_ms": "Median",
+    "p99_tpot_ms": "P99",
     "mean_itl_ms": "Mean ITL (ms)",
     "median_itl_ms": "Median ITL (ms)",
     "p99_itl_ms": "P99 ITL (ms)",
@@ -75,14 +79,28 @@ def results_to_json(latency, throughput, serving):
     )


+def get_size_with_unit(bytes, suffix="B"):
+    """
+    Scale bytes to its proper format
+    e.g:
+        1253656 => '1.20MB'
+        1253656678 => '1.17GB'
+    """
+    factor = 1024
+    for unit in ["", "K", "M", "G", "T", "P"]:
+        if bytes < factor:
+            return f"{bytes:.2f}{unit}{suffix}"
+        bytes /= factor
+
+
 if __name__ == "__main__":
     # collect results
     for test_file in results_folder.glob("*.json"):
         with open(test_file) as f:
             raw_result = json.loads(f.read())

         if "serving" in str(test_file):
-            # this result is generated via `benchmark_serving.py`
+            # this result is generated via `vllm bench serve` command

             # attach the benchmarking command to raw_result
             try:
@@ -102,7 +120,7 @@ def results_to_json(latency, throughput, serving):
                 continue

         elif "latency" in f.name:
-            # this result is generated via `benchmark_latency.py`
+            # this result is generated via `vllm bench latency` command

             # attach the benchmarking command to raw_result
             try:
@@ -130,7 +148,7 @@ def results_to_json(latency, throughput, serving):
                 continue

         elif "throughput" in f.name:
-            # this result is generated via `benchmark_throughput.py`
+            # this result is generated via `vllm bench throughput` command

             # attach the benchmarking command to raw_result
             try:
@@ -155,6 +173,27 @@ def results_to_json(latency, throughput, serving):
     serving_results = pd.DataFrame.from_dict(serving_results)
     throughput_results = pd.DataFrame.from_dict(throughput_results)

+    svmem = psutil.virtual_memory()
+    platform_data = {
+        "Physical cores": [psutil.cpu_count(logical=False)],
+        "Total cores": [psutil.cpu_count(logical=True)],
+        "Total Memory": [get_size_with_unit(svmem.total)],
+    }
+
+    if util.find_spec("numa") is not None:
+        from numa import info
+
+        platform_data["Total NUMA nodes"] = [info.get_num_configured_nodes()]
+
+    if util.find_spec("cpuinfo") is not None:
+        from cpuinfo import get_cpu_info
+
+        platform_data["CPU Brand"] = [get_cpu_info()["brand_raw"]]
+
+    platform_results = pd.DataFrame.from_dict(
+        platform_data, orient="index", columns=["Platform Info"]
+    )
+
     raw_results_json = results_to_json(
         latency_results, throughput_results, serving_results
     )
@@ -200,6 +239,9 @@ def results_to_json(latency, throughput, serving):
     throughput_md_table = tabulate(
         throughput_results, headers="keys", tablefmt="pipe", showindex=False
     )
+    platform_md_table = tabulate(
+        platform_results, headers="keys", tablefmt="pipe", showindex=True
+    )

     # document the result
     with open(results_folder / "benchmark_results.md", "w") as f:
@@ -211,6 +253,7 @@ def results_to_json(latency, throughput, serving):
             latency_tests_markdown_table=latency_md_table,
             throughput_tests_markdown_table=throughput_md_table,
             serving_tests_markdown_table=serving_md_table,
+            platform_markdown_table=platform_md_table,
             benchmarking_results_in_json_string=processed_results_json,
         )
         f.write(results)
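
To see roughly what the new `{platform_markdown_table}` renders to, here is a standalone sketch (not part of the commit) that mirrors the psutil/tabulate logic added above, leaving out the optional numa and cpuinfo lookups:

```python
# Sketch only: builds a one-column "Platform Info" table like the convert
# script does, without the optional numa / cpuinfo extras.
import pandas as pd
import psutil
from tabulate import tabulate


def get_size_with_unit(n_bytes, suffix="B"):
    factor = 1024
    for unit in ["", "K", "M", "G", "T", "P"]:
        if n_bytes < factor:
            return f"{n_bytes:.2f}{unit}{suffix}"
        n_bytes /= factor


platform_data = {
    "Physical cores": [psutil.cpu_count(logical=False)],
    "Total cores": [psutil.cpu_count(logical=True)],
    "Total Memory": [get_size_with_unit(psutil.virtual_memory().total)],
}
platform_results = pd.DataFrame.from_dict(
    platform_data, orient="index", columns=["Platform Info"]
)
# Prints a pipe-style markdown table, e.g. a row like "| Physical cores | 8 |".
print(tabulate(platform_results, headers="keys", tablefmt="pipe", showindex=True))
```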
