Commit 644f755

Fix matplotlib + shell

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
1 parent 2fd76c4 commit 644f755

5 files changed: +309 −92 lines changed

Lines changed: 61 additions & 15 deletions
@@ -1,30 +1,76 @@
-# Quick Start - Qwen2.5-0.5B
+# Quick Start Qwen2.5-0.5B
 
-Use the wrapper script from the repo root:
+All commands run from the **repository root**.
+
+## Setup
 
 ```bash
-# vLLM offline
-bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline
+python3.12 -m venv .venv && source .venv/bin/activate
+pip install -e ".[test]"
+python examples/08_Qwen2.5-0.5B_Example/prepare_dataset.py
+```
 
-# SGLang offline
-bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang offline
+## Option A — Automated (vLLM or SGLang)
 
-# Online concurrency sweep
+```bash
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline
 bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm online
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang online
+```
+
+## Option B — Manual step-by-step
+
+**1. Start server** (pick one):
+
+```bash
+# vLLM
+docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface \
+  -e PYTORCH_ALLOC_CONF=expandable_segments:True -p 8000:8000 --ipc=host \
+  --name vllm-qwen -d vllm/vllm-openai:latest \
+  --model Qwen/Qwen2.5-0.5B-Instruct --gpu-memory-utilization 0.85
+
+# SGLang
+docker run --runtime nvidia --gpus all --net host \
+  -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host \
+  --name sglang-qwen -d lmsysorg/sglang:latest \
+  python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-0.5B-Instruct \
+  --host 0.0.0.0 --port 30000 --mem-fraction-static 0.9 --attention-backend flashinfer
+```
+
+**2. Wait for ready:**
+
+```bash
+until curl -sf http://localhost:8000/v1/models > /dev/null; do sleep 5; done   # vLLM
+until curl -sf http://localhost:30000/health > /dev/null; do sleep 5; done     # SGLang
 ```
 
-Outputs:
+**3. Run concurrency sweep:**
 
-- vLLM offline: `results/qwen_offline_benchmark/`
-- vLLM online: `results/qwen_online_benchmark/concurrency_sweep/`
-- SGLang offline: `results/qwen_sglang_offline_benchmark/`
-- SGLang online: `results/qwen_sglang_online_benchmark/concurrency_sweep/`
+```bash
+python scripts/concurrency_sweep/run.py \
+  --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml   # vLLM
+# or: --config examples/08_Qwen2.5-0.5B_Example/sglang_online_qwen_benchmark.yaml
+
+# Add --verbose to stream output live; add --concurrency / --duration-ms to customize
+```
 
-Summarize an online sweep:
+**4. Summarize and plot:**
 
 ```bash
 python scripts/concurrency_sweep/summarize.py \
-  results/qwen_online_benchmark/concurrency_sweep/
+  results/qwen_online_benchmark/concurrency_sweep/   # vLLM
+# or: results/qwen_sglang_online_benchmark/concurrency_sweep/
 ```
 
-For manual setup, server commands, and config details, see [README.md](README.md).
+Writes `metrics_summary.csv`, `metrics_summary.md`, and `metrics_summary.png`.
+
+**5. Stop server:**
+
+```bash
+docker stop vllm-qwen && docker rm vllm-qwen
+# or: docker stop sglang-qwen && docker rm sglang-qwen
+```
+
+---
+
+For TRT-LLM setup, config customization, and output file locations, see [README.md](README.md).
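The one-line wait loops in step 2 never give up. The same idea with a retry budget can be sketched as below; this helper is illustrative only (the function name, attempt count, and interval are not part of the repo):

```shell
#!/usr/bin/env bash
# Illustrative helper (not from the repo): poll a health URL with a retry budget.
wait_ready() {
  local url=$1 attempts=${2:-60} interval=${3:-5} i=0
  until curl -sf "$url" > /dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$attempts" ]; then
      echo "server at $url not ready after $((attempts * interval))s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Usage once a server from step 1 is up:
# wait_ready http://localhost:8000/v1/models   # vLLM
# wait_ready http://localhost:30000/health     # SGLang
```

Failing fast here is preferable to a hung CI job when the container crashed on startup.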
Lines changed: 174 additions & 52 deletions
@@ -1,36 +1,23 @@
 # Qwen2.5-0.5B-Instruct Benchmark Example
 
-This example benchmarks `Qwen/Qwen2.5-0.5B-Instruct` against either a vLLM or
-SGLang server. It is intended as a small-GPU example that works on typical
-8-16 GB cards.
+Benchmarks `Qwen/Qwen2.5-0.5B-Instruct` with offline (max-throughput) and online
+(concurrency sweep) load patterns. Designed for small GPUs (8–16 GB VRAM).
+
+Supported inference servers: **vLLM**, **SGLang**, **TRT-LLM**.
+
+---
 
 ## Requirements
 
 - Python 3.12+
-- Docker with NVIDIA GPU support
+- Docker with NVIDIA GPU support (`--runtime nvidia`)
 - NVIDIA GPU with at least 8 GB VRAM
 
-## Fastest Path
-
-From the repo root:
-
-```bash
-# Offline benchmark with vLLM
-bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline
-
-# Offline benchmark with SGLang
-bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang offline
-
-# Online concurrency sweep
-bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm online
-```
-
-The script prepares the dataset, starts or reuses a container, waits for the
-server, and runs the benchmark.
+---
 
-## Manual Flow
+## Step 1 — Install and prepare dataset
 
-If you do not want to use `run_benchmark.sh`, the minimum manual flow is:
+Run all commands from the **repository root**.
 
 ```bash
 python3.12 -m venv .venv
@@ -40,10 +27,18 @@ pip install -e ".[test]"
 python examples/08_Qwen2.5-0.5B_Example/prepare_dataset.py
 ```
 
-Start one server:
+This converts `tests/datasets/dummy_1k.pkl` into
+`examples/08_Qwen2.5-0.5B_Example/data/test_dataset.pkl`.
+
+---
+
+## Step 2 — Start the inference server
+
+Pick one backend. The server must be fully ready before running benchmarks.
+
+### vLLM (port 8000)
 
 ```bash
-# vLLM
 docker run --runtime nvidia --gpus all \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
   -e PYTORCH_ALLOC_CONF=expandable_segments:True \
@@ -54,8 +49,11 @@ docker run --runtime nvidia --gpus all \
   vllm/vllm-openai:latest \
   --model Qwen/Qwen2.5-0.5B-Instruct \
   --gpu-memory-utilization 0.85
+```
 
-# SGLang
+### SGLang (port 30000)
+
+```bash
 docker run --runtime nvidia --gpus all --net host \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
   --ipc=host \
@@ -70,51 +68,175 @@ docker run --runtime nvidia --gpus all --net host \
   --attention-backend flashinfer
 ```
 
-Run one benchmark:
+### TRT-LLM (port 8000)
 
 ```bash
-# vLLM offline
-inference-endpoint benchmark from-config \
-  -c examples/08_Qwen2.5-0.5B_Example/offline_qwen_benchmark.yaml
+docker run --runtime nvidia --gpus all \
+  -v ~/.cache/huggingface:/root/.cache/huggingface \
+  -p 8000:8000 \
+  --ipc=host \
+  --name trtllm-qwen \
+  -d \
+  nvcr.io/nvidia/tritonserver:latest
+# Add your TRT-LLM engine launch arguments here
 
-# SGLang offline
-inference-endpoint benchmark from-config \
-  -c examples/08_Qwen2.5-0.5B_Example/sglang_offline_qwen_benchmark.yaml
+```
+
+> **Note:** No pre-built TRT-LLM config is provided. Use
+> `examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml` as a template and
+> point `endpoint_config.endpoints` at `http://localhost:8000`.
+
+---
+
+## Step 3 — Wait for the server to be ready
 
-# vLLM online sweep
+Poll until the health endpoint responds:
+
+```bash
+# vLLM / TRT-LLM (port 8000)
+until curl -sf http://localhost:8000/v1/models > /dev/null; do
+  echo "Waiting for server..."; sleep 5
+done
+echo "Server ready."
+
+# SGLang (port 30000)
+until curl -sf http://localhost:30000/health > /dev/null; do
+  echo "Waiting for server..."; sleep 5
+done
+echo "Server ready."
+```
+
+---
+
+## Step 4 — Run the concurrency sweep
+
+Choose the config that matches your server. The sweep script overrides
+`load_pattern` and `report_dir` for each concurrency level, leaving all other
+settings (model, dataset, endpoint) from the config file.
+
+```bash
+# vLLM
 python scripts/concurrency_sweep/run.py \
   --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml
 
-# SGLang online sweep
+# SGLang
 python scripts/concurrency_sweep/run.py \
   --config examples/08_Qwen2.5-0.5B_Example/sglang_online_qwen_benchmark.yaml
+
+# TRT-LLM (use the vLLM config or a custom one pointing at port 8000)
+python scripts/concurrency_sweep/run.py \
+  --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml
+```
+
+**Common options:**
+
+| Flag | Default | Description |
+|---|---|---|
+| `--concurrency N [N ...]` | `1 2 4 8 16 32 64 128 256 512 1024` | Concurrency levels to test |
+| `--duration-ms MS` | `600000` (10 min) | Duration per run |
+| `--output-dir DIR` | from `report_dir` in config | Root directory for sweep output |
+| `--timeout-seconds S` | `720` (12 min) | Per-run subprocess timeout |
+| `--verbose` | off | Stream output live to the terminal (useful for debugging) |
+
+Example — quick 3-minute sweep at a few concurrency levels:
+
+```bash
+python scripts/concurrency_sweep/run.py \
+  --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml \
+  --concurrency 1 4 16 64 \
+  --duration-ms 180000 \
+  --verbose
 ```
 
-## Files
+Results land in subdirectories under the config's `report_dir`:
 
-- `offline_qwen_benchmark.yaml`: vLLM offline benchmark
-- `online_qwen_benchmark.yaml`: vLLM online concurrency sweep
-- `sglang_offline_qwen_benchmark.yaml`: SGLang offline benchmark
-- `sglang_online_qwen_benchmark.yaml`: SGLang online concurrency sweep
-- `prepare_dataset.py`: converts `tests/datasets/dummy_1k.pkl` into the example dataset
-- `run_benchmark.sh`: wrapper that automates dataset prep, container startup, and benchmark execution
+```
+results/qwen_online_benchmark/concurrency_sweep/
+  concurrency_1/    benchmark.log  result_summary.json
+  concurrency_4/    benchmark.log  result_summary.json
+  ...
+  summary.json  summary.csv
+```
+
+If a run fails, check the per-run log:
 
-## Results
+```bash
+cat results/qwen_online_benchmark/concurrency_sweep/concurrency_64/benchmark.log
+```
 
-- vLLM offline: `results/qwen_offline_benchmark/`
-- vLLM online: `results/qwen_online_benchmark/concurrency_sweep/`
-- SGLang offline: `results/qwen_sglang_offline_benchmark/`
-- SGLang online: `results/qwen_sglang_online_benchmark/concurrency_sweep/`
+---
 
-To summarize an online sweep:
+## Step 5 — Summarize results and generate plots
 
 ```bash
+# vLLM
 python scripts/concurrency_sweep/summarize.py \
   results/qwen_online_benchmark/concurrency_sweep/
+
+# SGLang
+python scripts/concurrency_sweep/summarize.py \
+  results/qwen_sglang_online_benchmark/concurrency_sweep/
+```
+
+This prints formatted tables to stdout and writes three files into the sweep
+directory:
+
+| File | Contents |
+|---|---|
+| `metrics_summary.csv` | All metrics in CSV form |
+| `metrics_summary.md` | Markdown tables with throughput, latency, TTFT, TPOT |
+| `metrics_summary.png` | Line plots of TPS, TTFT P99, and TPOT P50 vs concurrency |
+
+Pass `--no-save` to print tables only without writing files.
+
+---
+
+## Step 6 — Stop the server
+
+```bash
+docker stop vllm-qwen   # or sglang-qwen / trtllm-qwen
+docker rm vllm-qwen
 ```
 
-## Notes
+---
+
+## Offline (max-throughput) benchmark
+
+For a single offline run (no sweep):
+
+```bash
+# vLLM
+inference-endpoint benchmark from-config \
+  -c examples/08_Qwen2.5-0.5B_Example/offline_qwen_benchmark.yaml
+
+# SGLang
+inference-endpoint benchmark from-config \
+  -c examples/08_Qwen2.5-0.5B_Example/sglang_offline_qwen_benchmark.yaml
+```
+
+Results: `results/qwen_offline_benchmark/` or `results/qwen_sglang_offline_benchmark/`.
+
+---
+
+## Automated wrapper
+
+`run_benchmark.sh` automates Steps 1–4 (dataset prep, container start, benchmark):
+
+```bash
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm online
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang offline
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang online
+```
+
+---
+
+## Config files
 
-- The online sweep defaults to `1 2 4 8 16 32 64 128 256 512 1024`.
-- Use `scripts/concurrency_sweep/run.py --concurrency ... --duration-ms ...` to shorten or customize the sweep.
-- If vLLM runs out of memory at higher concurrency, lower `--gpu-memory-utilization`.
+| File | Server | Mode |
+|---|---|---|
+| `offline_qwen_benchmark.yaml` | vLLM (`:8000`) | Offline |
+| `online_qwen_benchmark.yaml` | vLLM (`:8000`) | Online sweep |
+| `sglang_offline_qwen_benchmark.yaml` | SGLang (`:30000`) | Offline |
+| `sglang_online_qwen_benchmark.yaml` | SGLang (`:30000`) | Online sweep |
+| `prepare_dataset.py` | | Converts `dummy_1k.pkl` to example dataset |
+| `run_benchmark.sh` | vLLM / SGLang | Automated wrapper |
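As background for the summary metrics above: the P50/P99 figures are empirical percentiles over per-request samples (TTFT per request, TPOT per output token, and so on). A minimal nearest-rank sketch, which is not the repository's actual implementation, looks like:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, with p in [0, 100]."""
    ordered = sorted(samples)
    # Index of the value sitting at the p-th fraction of the sorted sample.
    k = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

# Hypothetical per-request TTFT samples in milliseconds:
ttft_ms = [12.1, 12.2, 12.4, 12.6, 12.8, 12.9, 13.0, 13.1, 13.3, 55.2]
print(percentile(ttft_ms, 50))  # 12.8: the typical request
print(percentile(ttft_ms, 99))  # 55.2: the tail, dominated by the one slow request
```

This is why the plots track TTFT P99 rather than the mean: a single straggler at high concurrency shows up in the tail long before it moves the median.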

examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh

Lines changed: 3 additions & 3 deletions
@@ -129,13 +129,13 @@ RETRY_COUNT=0
 
 # Different ready indicators for vLLM vs SGLang
 if [[ "$SERVER_TYPE" == "vllm" ]]; then
-    READY_PATTERN="Uvicorn running\|Application startup complete"
+    READY_PATTERN="Uvicorn running|Application startup complete"
 else
-    READY_PATTERN="Uvicorn running\|Server is ready"
+    READY_PATTERN="Uvicorn running|Server is ready"
 fi
 
 while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
-    if docker logs ${CONTAINER_NAME} 2>&1 | grep -q "$READY_PATTERN"; then
+    if docker logs ${CONTAINER_NAME} 2>&1 | grep -qE "$READY_PATTERN"; then
        echo "✅ Server is ready!"
        break
    fi