# Qwen2.5-0.5B-Instruct Benchmark Example

Benchmarks `Qwen/Qwen2.5-0.5B-Instruct` with offline (max-throughput) and online
(concurrency sweep) load patterns. Designed for small GPUs (8–16 GB VRAM).

Supported inference servers: **vLLM**, **SGLang**, **TRT-LLM**.

---

## Requirements

- Python 3.12+
- Docker with NVIDIA GPU support (`--runtime nvidia`)
- NVIDIA GPU with at least 8 GB VRAM

---

## Step 1 — Install and prepare dataset

Run all commands from the **repository root**.

```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[test]"
python examples/08_Qwen2.5-0.5B_Example/prepare_dataset.py
```
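
After the commands above, you can optionally confirm the dataset landed where later steps expect it. This check is a sketch added for convenience, not part of `prepare_dataset.py`; the path is the one stated below:

```shell
# Optional: confirm dataset prep produced the file later steps expect.
# The path comes from this README; adjust if your layout differs.
dataset="examples/08_Qwen2.5-0.5B_Example/data/test_dataset.pkl"
if [ -f "$dataset" ]; then
    echo "dataset ready: $dataset"
else
    echo "dataset missing: $dataset (re-run prepare_dataset.py)"
fi
```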

This converts `tests/datasets/dummy_1k.pkl` into
`examples/08_Qwen2.5-0.5B_Example/data/test_dataset.pkl`.

---

## Step 2 — Start the inference server

Pick one backend. The server must be fully ready before running benchmarks.

### vLLM (port 8000)

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -p 8000:8000 \
  --ipc=host \
  --name vllm-qwen \
  -d \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --gpu-memory-utilization 0.85
```

### SGLang (port 30000)

```bash
docker run --runtime nvidia --gpus all --net host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host \
  --name sglang-qwen \
  -d \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-0.5B-Instruct \
  --host 0.0.0.0 --port 30000 \
  --attention-backend flashinfer
```

### TRT-LLM (port 8000)

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  --name trtllm-qwen \
  -d \
  nvcr.io/nvidia/tritonserver:latest \
  # Add your TRT-LLM engine launch arguments here
```

> **Note:** No pre-built TRT-LLM config is provided. Use
> `examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml` as a template and
> point `endpoint_config.endpoints` at `http://localhost:8000`.

---

## Step 3 — Wait for the server to be ready

Poll until the health endpoint responds:

```bash
# vLLM / TRT-LLM (port 8000)
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "Waiting for server..."; sleep 5
done
echo "Server ready."

# SGLang (port 30000)
until curl -sf http://localhost:30000/health > /dev/null; do
  echo "Waiting for server..."; sleep 5
done
echo "Server ready."
```
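
The loops above retry forever. For CI or scripting, a bounded variant that gives up after a fixed number of attempts is safer; `wait_for_url` below is a helper name introduced here for illustration, not part of this repo:

```shell
# Bounded readiness check: stop retrying after a fixed number of attempts
# instead of looping forever. wait_for_url is illustrative, not project code.
wait_for_url() {
    url=$1
    tries=${2:-60}          # default: 60 attempts
    i=0
    while [ "$i" -lt "$tries" ]; do
        if curl -sf "$url" > /dev/null 2>&1; then
            echo "ready: $url"
            return 0
        fi
        i=$((i + 1))
        sleep 1             # bump to 5s for slower model loads
    done
    echo "timed out waiting for $url" >&2
    return 1
}
```

For example, `wait_for_url http://localhost:8000/v1/models 120` waits up to about two minutes for vLLM before failing.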

---

## Step 4 — Run the concurrency sweep

Choose the config that matches your server. The sweep script overrides
`load_pattern` and `report_dir` for each concurrency level, taking all other
settings (model, dataset, endpoint) from the config file.

```bash
# vLLM
python scripts/concurrency_sweep/run.py \
  --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml

# SGLang
python scripts/concurrency_sweep/run.py \
  --config examples/08_Qwen2.5-0.5B_Example/sglang_online_qwen_benchmark.yaml

# TRT-LLM (use the vLLM config or a custom one pointing at port 8000)
python scripts/concurrency_sweep/run.py \
  --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml
```

**Common options:**

| Flag | Default | Description |
|---|---|---|
| `--concurrency N [N ...]` | `1 2 4 8 16 32 64 128 256 512 1024` | Concurrency levels to test |
| `--duration-ms MS` | `600000` (10 min) | Duration per run |
| `--output-dir DIR` | from `report_dir` in config | Root directory for sweep output |
| `--timeout-seconds S` | `720` (12 min) | Per-run subprocess timeout |
| `--verbose` | off | Stream output live to the terminal (useful for debugging) |

Example — quick 3-minute sweep at a few concurrency levels:

```bash
python scripts/concurrency_sweep/run.py \
  --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml \
  --concurrency 1 4 16 64 \
  --duration-ms 180000 \
  --verbose
```
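
A sweep's minimum wall-clock time is roughly the number of concurrency levels times the per-run duration (server startup and warm-up add more). A back-of-the-envelope check for the quick example above:

```shell
# Lower bound on sweep wall-clock time: levels x per-run duration.
# Values mirror the quick-sweep example above (4 levels, 3 minutes each).
levels=4
duration_ms=180000
total_min=$(( levels * duration_ms / 60000 ))
echo "sweep takes at least ${total_min} minutes"
```

With the default 11 levels at 10 minutes each, the same arithmetic gives close to two hours, which is why shortening `--duration-ms` is worthwhile for smoke tests.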

Results land in subdirectories under the config's `report_dir`:

```
results/qwen_online_benchmark/concurrency_sweep/
  concurrency_1/  benchmark.log  result_summary.json
  concurrency_4/  benchmark.log  result_summary.json
  ...
  summary.json  summary.csv
```

If a run fails, check the per-run log:

```bash
cat results/qwen_online_benchmark/concurrency_sweep/concurrency_64/benchmark.log
```

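To spot failed runs without opening each log, you can scan for runs that never wrote a per-run summary. The scan below is an illustrative sketch based on the directory layout shown above:

```shell
# Flag sweep runs that never wrote a result_summary.json (layout as above).
sweep_dir="results/qwen_online_benchmark/concurrency_sweep"
for d in "$sweep_dir"/concurrency_*/; do
    [ -d "$d" ] || continue                        # sweep not run yet
    [ -f "${d}result_summary.json" ] || echo "no summary in $d"
done
```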
---

## Step 5 — Summarize results and generate plots

```bash
# vLLM
python scripts/concurrency_sweep/summarize.py \
  results/qwen_online_benchmark/concurrency_sweep/

# SGLang
python scripts/concurrency_sweep/summarize.py \
  results/qwen_sglang_online_benchmark/concurrency_sweep/
```

This prints formatted tables to stdout and writes three files into the sweep
directory:

| File | Contents |
|---|---|
| `metrics_summary.csv` | All metrics in CSV form |
| `metrics_summary.md` | Markdown tables with throughput, latency, TTFT, TPOT |
| `metrics_summary.png` | Line plots of TPS, TTFT P99, and TPOT P50 vs concurrency |

Pass `--no-save` to print the tables without writing files.

---

## Step 6 — Stop the server

```bash
docker stop vllm-qwen   # or sglang-qwen / trtllm-qwen
docker rm vllm-qwen
```

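If you are unsure which backend's container is still around, the loop below stops and removes whichever of the three names from this guide exists. It is a convenience sketch, not project tooling:

```shell
# Stop and remove whichever benchmark container exists (names from this guide).
for name in vllm-qwen sglang-qwen trtllm-qwen; do
    if docker ps -a --format '{{.Names}}' 2>/dev/null | grep -qx "$name"; then
        docker stop "$name" && docker rm "$name"
    fi
done
```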
---

## Offline (max-throughput) benchmark

For a single offline run (no sweep):

```bash
# vLLM
inference-endpoint benchmark from-config \
  -c examples/08_Qwen2.5-0.5B_Example/offline_qwen_benchmark.yaml

# SGLang
inference-endpoint benchmark from-config \
  -c examples/08_Qwen2.5-0.5B_Example/sglang_offline_qwen_benchmark.yaml
```

Results: `results/qwen_offline_benchmark/` or `results/qwen_sglang_offline_benchmark/`.

---

## Automated wrapper

`run_benchmark.sh` automates Steps 1–4 (dataset prep, container start, readiness wait, and benchmark run):

```bash
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm online
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang offline
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang online
```
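
The wrapper takes a backend and a mode argument, in that order. A minimal sketch of that argument contract (the validation logic here is illustrative, not copied from `run_benchmark.sh`):

```shell
# Illustrative argument check matching the wrapper's usage: <backend> <mode>.
backend=${1:-vllm}
mode=${2:-offline}
case "$backend" in
    vllm|sglang) ;;
    *) echo "usage: run_benchmark.sh {vllm|sglang} {offline|online}" >&2; exit 2 ;;
esac
case "$mode" in
    offline|online) ;;
    *) echo "usage: run_benchmark.sh {vllm|sglang} {offline|online}" >&2; exit 2 ;;
esac
echo "backend=$backend mode=$mode"
```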

---

## Config files

| File | Server | Mode |
|---|---|---|
| `offline_qwen_benchmark.yaml` | vLLM (`:8000`) | Offline |
| `online_qwen_benchmark.yaml` | vLLM (`:8000`) | Online sweep |
| `sglang_offline_qwen_benchmark.yaml` | SGLang (`:30000`) | Offline |
| `sglang_online_qwen_benchmark.yaml` | SGLang (`:30000`) | Online sweep |
| `prepare_dataset.py` | — | Converts `dummy_1k.pkl` to the example dataset |
| `run_benchmark.sh` | vLLM / SGLang | Automated wrapper |