Commit 2fd76c4 (parent: 03df2ca)

Add qwen example for smaller GPU testing

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

File tree: 11 files changed, +1303 -0 lines
Lines changed: 14 additions & 0 deletions

# Benchmark results
results/

# Generated data
data/*.pkl

# Logs
*.log
benchmark_output.log

# Python cache
__pycache__/
*.pyc
*.pyo
Lines changed: 30 additions & 0 deletions

# Quick Start - Qwen2.5-0.5B

Use the wrapper script from the repo root:

```bash
# vLLM offline
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline

# SGLang offline
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang offline

# Online concurrency sweep
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm online
```

Outputs:

- vLLM offline: `results/qwen_offline_benchmark/`
- vLLM online: `results/qwen_online_benchmark/concurrency_sweep/`
- SGLang offline: `results/qwen_sglang_offline_benchmark/`
- SGLang online: `results/qwen_sglang_online_benchmark/concurrency_sweep/`

Summarize an online sweep:

```bash
python scripts/concurrency_sweep/summarize.py \
    results/qwen_online_benchmark/concurrency_sweep/
```

For manual setup, server commands, and config details, see [README.md](README.md).
Lines changed: 120 additions & 0 deletions

# Qwen2.5-0.5B-Instruct Benchmark Example

This example benchmarks `Qwen/Qwen2.5-0.5B-Instruct` against either a vLLM or
SGLang server. It is intended as a small-GPU example that works on typical
8-16 GB cards.

## Requirements

- Python 3.12+
- Docker with NVIDIA GPU support
- NVIDIA GPU with at least 8 GB VRAM

## Fastest Path

From the repo root:

```bash
# Offline benchmark with vLLM
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline

# Offline benchmark with SGLang
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang offline

# Online concurrency sweep
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm online
```

The script prepares the dataset, starts or reuses a container, waits for the
server, and runs the benchmark.

## Manual Flow

If you do not want to use `run_benchmark.sh`, the minimum manual flow is:

```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[test]"

python examples/08_Qwen2.5-0.5B_Example/prepare_dataset.py
```
Start one server:

```bash
# vLLM
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e PYTORCH_ALLOC_CONF=expandable_segments:True \
    -p 8000:8000 \
    --ipc=host \
    --name vllm-qwen \
    -d \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --gpu-memory-utilization 0.85

# SGLang
docker run --runtime nvidia --gpus all --net host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    --name sglang-qwen \
    -d \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-0.5B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --mem-fraction-static 0.9 \
    --attention-backend flashinfer
```
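The wrapper script waits for the server automatically; in the manual flow the container can take a while to load the model before it accepts requests. A minimal readiness poll, sketched in Python against the vLLM port (use 30000 for SGLang); `wait_for_server` is a hypothetical helper, not part of the tool:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until the server answers any HTTP response, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5)
            return True
        except urllib.error.HTTPError:
            return True  # server answered, even if with an error status
        except (urllib.error.URLError, OSError):
            time.sleep(interval_s)  # not listening yet; retry
    return False


# Example: wait for vLLM's OpenAI-compatible endpoint before benchmarking.
# wait_for_server("http://localhost:8000/v1/models")
```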
Run one benchmark:

```bash
# vLLM offline
inference-endpoint benchmark from-config \
    -c examples/08_Qwen2.5-0.5B_Example/offline_qwen_benchmark.yaml

# SGLang offline
inference-endpoint benchmark from-config \
    -c examples/08_Qwen2.5-0.5B_Example/sglang_offline_qwen_benchmark.yaml

# vLLM online sweep
python scripts/concurrency_sweep/run.py \
    --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml

# SGLang online sweep
python scripts/concurrency_sweep/run.py \
    --config examples/08_Qwen2.5-0.5B_Example/sglang_online_qwen_benchmark.yaml
```
## Files

- `offline_qwen_benchmark.yaml`: vLLM offline benchmark
- `online_qwen_benchmark.yaml`: vLLM online concurrency sweep
- `sglang_offline_qwen_benchmark.yaml`: SGLang offline benchmark
- `sglang_online_qwen_benchmark.yaml`: SGLang online concurrency sweep
- `prepare_dataset.py`: converts `tests/datasets/dummy_1k.pkl` into the example dataset
- `run_benchmark.sh`: wrapper that automates dataset prep, container startup, and benchmark execution

## Results

- vLLM offline: `results/qwen_offline_benchmark/`
- vLLM online: `results/qwen_online_benchmark/concurrency_sweep/`
- SGLang offline: `results/qwen_sglang_offline_benchmark/`
- SGLang online: `results/qwen_sglang_online_benchmark/concurrency_sweep/`

To summarize an online sweep:

```bash
python scripts/concurrency_sweep/summarize.py \
    results/qwen_online_benchmark/concurrency_sweep/
```

## Notes

- The online sweep defaults to `1 2 4 8 16 32 64 128 256 512 1024`.
- Use `scripts/concurrency_sweep/run.py --concurrency ... --duration-ms ...` to shorten or customize the sweep.
- If vLLM runs out of memory at higher concurrency, lower `--gpu-memory-utilization`.
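The default sweep levels above are consecutive powers of two; a one-line sketch of how such a ladder can be generated when customizing a sweep:

```python
# Default concurrency ladder: powers of two from 1 to 1024.
levels = [2 ** i for i in range(11)]
print(" ".join(map(str, levels)))  # → 1 2 4 8 16 32 64 128 256 512 1024
```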
Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
name: "qwen-0.5b-offline-benchmark"
2+
version: "1.0"
3+
type: "offline"
4+
5+
model_params:
6+
name: "Qwen/Qwen2.5-0.5B-Instruct"
7+
temperature: 1.0
8+
max_new_tokens: 100
9+
top_p: 1.0
10+
streaming: "on"
11+
12+
datasets:
13+
- name: "qwen-perf-test"
14+
type: "performance"
15+
path: "examples/08_Qwen2.5-0.5B_Example/data/test_dataset.pkl"
16+
samples: 1000
17+
18+
settings:
19+
runtime:
20+
min_duration_ms: 100
21+
max_duration_ms: 60000
22+
scheduler_random_seed: 42
23+
dataloader_random_seed: 42
24+
25+
client:
26+
workers: 1
27+
max_connections: 100
28+
warmup_connections: 0
29+
record_worker_events: false
30+
31+
metrics:
32+
collect:
33+
- "throughput"
34+
- "latency"
35+
- "ttft"
36+
- "tpot"
37+
38+
endpoint_config:
39+
endpoints:
40+
- "http://localhost:8000"
41+
api_key: null
42+
43+
report_dir: "results/qwen_offline_benchmark/"
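The config above is plain YAML; a minimal sketch of a pre-flight check for the top-level keys the example configs share. The `check_config` helper is hypothetical (not part of the tool), and a plain dict stands in for parsed YAML to stay stdlib-only:

```python
# Top-level structure shared by the example benchmark configs,
# shown as the dict a YAML parser would produce (abridged).
config = {
    "name": "qwen-0.5b-offline-benchmark",
    "version": "1.0",
    "type": "offline",
    "model_params": {"name": "Qwen/Qwen2.5-0.5B-Instruct"},
    "datasets": [{"name": "qwen-perf-test"}],
    "settings": {},
    "metrics": {},
    "endpoint_config": {"endpoints": ["http://localhost:8000"]},
    "report_dir": "results/qwen_offline_benchmark/",
}

REQUIRED = {"name", "type", "model_params", "datasets", "endpoint_config", "report_dir"}


def check_config(cfg: dict) -> list[str]:
    """Return the required top-level keys missing from `cfg`, sorted."""
    return sorted(REQUIRED - cfg.keys())


print(check_config(config))  # → []
```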
Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,47 @@
1+
name: "qwen-0.5b-online-benchmark"
2+
version: "1.0"
3+
type: "online"
4+
5+
model_params:
6+
name: "Qwen/Qwen2.5-0.5B-Instruct"
7+
temperature: 0.7
8+
max_new_tokens: 128
9+
top_p: 0.95
10+
streaming: "on"
11+
12+
datasets:
13+
- name: "qwen-perf-test"
14+
type: "performance"
15+
path: "examples/08_Qwen2.5-0.5B_Example/data/test_dataset.pkl"
16+
samples: 500
17+
18+
settings:
19+
runtime:
20+
min_duration_ms: 600000
21+
max_duration_ms: 600000
22+
scheduler_random_seed: 42
23+
dataloader_random_seed: 42
24+
25+
load_pattern:
26+
type: "concurrency"
27+
target_concurrency: 1
28+
29+
client:
30+
workers: 1
31+
max_connections: 2048
32+
warmup_connections: 0
33+
record_worker_events: false
34+
35+
metrics:
36+
collect:
37+
- "throughput"
38+
- "latency"
39+
- "ttft"
40+
- "tpot"
41+
42+
endpoint_config:
43+
endpoints:
44+
- "http://localhost:8000"
45+
api_key: null
46+
47+
report_dir: "results/qwen_online_benchmark/"
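With `min_duration_ms` and `max_duration_ms` both set to 600000, each concurrency level runs for ten minutes. A quick wall-clock estimate for the default 11-level sweep, assuming the sweep runner applies this duration at every level:

```python
# Each sweep level runs for the configured duration (600000 ms = 10 min).
per_level_ms = 600_000
num_levels = 11  # default ladder: 1 2 4 8 16 32 64 128 256 512 1024
total_minutes = num_levels * per_level_ms / 60_000
print(total_minutes)  # → 110.0
```

This is why the README suggests `--duration-ms` to shorten the sweep for quick tests.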
Lines changed: 83 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,83 @@
1+
#!/usr/bin/env python3
2+
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
# SPDX-License-Identifier: Apache-2.0
4+
#
5+
# Licensed under the Apache License, Version 2.0 (the "License");
6+
# you may not use this file except in compliance with the License.
7+
# You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing, software
12+
# distributed under the License is distributed on an "AS IS" BASIS,
13+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
# See the License for the specific language governing permissions and
15+
# limitations under the License.
16+
17+
"""
18+
Prepare test dataset for Qwen benchmark.
19+
20+
This script creates a test dataset with the 'prompt' column required by
21+
the inference-endpoint benchmarking tool.
22+
"""
23+
24+
import pickle
25+
import sys
26+
from pathlib import Path
27+
28+
29+
def prepare_dataset(
30+
input_path: str = "tests/datasets/dummy_1k.pkl",
31+
output_dir: str = "examples/08_Qwen2.5-0.5B_Example/data",
32+
output_filename: str = "test_dataset.pkl",
33+
) -> None:
34+
"""
35+
Prepare the test dataset by renaming columns to match expected format.
36+
37+
Args:
38+
input_path: Path to the input dataset
39+
output_dir: Directory to save the output dataset
40+
output_filename: Name of the output file
41+
"""
42+
print(f"Loading dataset from: {input_path}")
43+
44+
# Load the original dataset
45+
try:
46+
with open(input_path, "rb") as f:
47+
data = pickle.load(f)
48+
except FileNotFoundError:
49+
print(f"ERROR: Input dataset not found at {input_path}")
50+
print("Make sure you're running from the repository root directory")
51+
sys.exit(1)
52+
53+
print(f"Loaded dataset with {len(data)} samples")
54+
print(f"Original columns: {data.columns.tolist()}")
55+
56+
# Rename text_input to prompt
57+
if "text_input" in data.columns:
58+
data = data.rename(columns={"text_input": "prompt"})
59+
print("Renamed 'text_input' to 'prompt'")
60+
elif "prompt" not in data.columns:
61+
print("ERROR: Dataset must have 'text_input' or 'prompt' column")
62+
sys.exit(1)
63+
64+
print(f"Final columns: {data.columns.tolist()}")
65+
66+
# Create output directory
67+
output_path = Path(output_dir)
68+
output_path.mkdir(parents=True, exist_ok=True)
69+
70+
# Save the dataset
71+
full_output_path = output_path / output_filename
72+
with open(full_output_path, "wb") as f:
73+
pickle.dump(data, f)
74+
75+
print(f"✅ Dataset saved to: {full_output_path}")
76+
print(f" Samples: {len(data)}")
77+
print(f" Columns: {data.columns.tolist()}")
78+
79+
80+
if __name__ == "__main__":
81+
# Allow custom input path as command-line argument
82+
input_path = sys.argv[1] if len(sys.argv) > 1 else "tests/datasets/dummy_1k.pkl"
83+
prepare_dataset(input_path=input_path)
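The core transformation in `prepare_dataset.py` is the column rename followed by a pickle round-trip. A minimal stand-alone sketch of that step, using a plain dict of columns in place of the pandas DataFrame the script actually loads:

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the pickled dataset: a dict of columns instead of a DataFrame.
data = {"text_input": ["What is 2+2?", "Name a prime number."]}

# The rename step: expose the prompts under the 'prompt' column name.
if "text_input" in data:
    data["prompt"] = data.pop("text_input")

# Round-trip through pickle, as the script does for the real dataset.
out = Path(tempfile.mkdtemp()) / "test_dataset.pkl"
with open(out, "wb") as f:
    pickle.dump(data, f)
with open(out, "rb") as f:
    loaded = pickle.load(f)

print(sorted(loaded))  # → ['prompt']
```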
