14 changes: 14 additions & 0 deletions examples/08_Qwen2.5-0.5B_Example/.gitignore
@@ -0,0 +1,14 @@
# Benchmark results
results/

# Generated data
data/*.pkl

# Logs
*.log
benchmark_output.log

# Python cache
__pycache__/
*.pyc
*.pyo
76 changes: 76 additions & 0 deletions examples/08_Qwen2.5-0.5B_Example/QUICKSTART.md
@@ -0,0 +1,76 @@
# Quick Start — Qwen2.5-0.5B

All commands run from the **repository root**.

## Setup

```bash
python3.12 -m venv .venv && source .venv/bin/activate
pip install -e ".[test]"
python examples/08_Qwen2.5-0.5B_Example/prepare_dataset.py
```

## Option A — Automated (vLLM or SGLang)

```bash
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm online
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang online
```

## Option B — Manual step-by-step

**1. Start server** (pick one):

```bash
# vLLM
docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface \
-e PYTORCH_ALLOC_CONF=expandable_segments:True -p 8000:8000 --ipc=host \
--name vllm-qwen -d vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-0.5B-Instruct --gpu-memory-utilization 0.85

# SGLang
docker run --runtime nvidia --gpus all --net host \
-v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host \
--name sglang-qwen -d lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-0.5B-Instruct \
--host 0.0.0.0 --port 30000 --mem-fraction-static 0.9 --attention-backend flashinfer
```

**2. Wait for ready:**

```bash
until curl -sf http://localhost:8000/v1/models > /dev/null; do sleep 5; done # vLLM
until curl -sf http://localhost:30000/health > /dev/null; do sleep 5; done # SGLang
```

**3. Run concurrency sweep:**

```bash
python scripts/concurrency_sweep/run.py \
--config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml # vLLM
# or: --config examples/08_Qwen2.5-0.5B_Example/sglang_online_qwen_benchmark.yaml

# Add --verbose to stream output live; add --concurrency / --duration-ms to customize
```

**4. Summarize and plot:**

```bash
python scripts/concurrency_sweep/summarize.py \
results/qwen_online_benchmark/concurrency_sweep/ # vLLM
# or: results/qwen_sglang_online_benchmark/concurrency_sweep/
```

Writes `metrics_summary.csv`, `metrics_summary.md`, and `metrics_summary.png`.

**5. Stop server:**

```bash
docker stop vllm-qwen && docker rm vllm-qwen
# or: docker stop sglang-qwen && docker rm sglang-qwen
```

---

For TRT-LLM setup, config customization, and output file locations, see [README.md](README.md).
242 changes: 242 additions & 0 deletions examples/08_Qwen2.5-0.5B_Example/README.md
@@ -0,0 +1,242 @@
# Qwen2.5-0.5B-Instruct Benchmark Example

Benchmarks `Qwen/Qwen2.5-0.5B-Instruct` with offline (max-throughput) and online
(concurrency sweep) load patterns. Designed for small GPUs (8–16 GB VRAM).

Supported inference servers: **vLLM**, **SGLang**, **TRT-LLM**.

---

## Requirements

- Python 3.12+
- Docker with NVIDIA GPU support (`--runtime nvidia`)
- NVIDIA GPU with at least 8 GB VRAM

---

## Step 1 — Install and prepare dataset

Run all commands from the **repository root**.

```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[test]"

python examples/08_Qwen2.5-0.5B_Example/prepare_dataset.py
```

This converts `tests/datasets/dummy_1k.pkl` into
`examples/08_Qwen2.5-0.5B_Example/data/test_dataset.pkl`.
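
To sanity-check the converted file before starting a server, a short inspection script along these lines may help. It is only a sketch: it assumes the file is a standard pickle and makes no assumption about the structure of the object inside, so the helper (`describe_pickle`, a hypothetical name that is not part of the repository) reports just the type and, where available, the length. Loading may require the same packages that were installed when the file was written.

```python
import pickle
from pathlib import Path


def describe_pickle(path):
    """Return (type name, length or None) for the object stored in a pickle file."""
    with Path(path).open("rb") as fh:
        obj = pickle.load(fh)  # only load pickles you trust
    length = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, length


if __name__ == "__main__":
    dataset = Path("examples/08_Qwen2.5-0.5B_Example/data/test_dataset.pkl")
    if dataset.exists():
        kind, size = describe_pickle(dataset)
        print(f"{dataset}: {kind}, length={size}")
    else:
        print(f"{dataset} not found; run prepare_dataset.py first")
```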

---

## Step 2 — Start the inference server

Pick one backend. The server must be fully ready before running benchmarks.

### vLLM (port 8000)

```bash
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
-p 8000:8000 \
--ipc=host \
--name vllm-qwen \
-d \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-0.5B-Instruct \
--gpu-memory-utilization 0.85
```

### SGLang (port 30000)

```bash
docker run --runtime nvidia --gpus all --net host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
--name sglang-qwen \
-d \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path Qwen/Qwen2.5-0.5B-Instruct \
--host 0.0.0.0 \
--port 30000 \
--mem-fraction-static 0.9 \
--attention-backend flashinfer
```

### TRT-LLM (port 8000)

```bash
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
--name trtllm-qwen \
-d \
nvcr.io/nvidia/tritonserver:latest \
# Add your TRT-LLM engine launch arguments here
```

> **Note:** No pre-built TRT-LLM config is provided. Use
> `examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml` as a template and
> point `endpoint_config.endpoints` at `http://localhost:8000`.

---

## Step 3 — Wait for the server to be ready

Poll the server until it responds. The vLLM / TRT-LLM loop queries `/v1/models`; the SGLang loop queries `/health`:

```bash
# vLLM / TRT-LLM (port 8000)
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "Waiting for server..."; sleep 5
done
echo "Server ready."

# SGLang (port 30000)
until curl -sf http://localhost:30000/health > /dev/null; do
  echo "Waiting for server..."; sleep 5
done
echo "Server ready."
```
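
The `until` loops above wait forever if the container never comes up. As a hedged alternative, the same polling can be sketched in Python with an upper bound on the wait. The function names (`http_ok`, `wait_until_ready`) and the 10-minute default are assumptions made for this illustration; they are not part of the benchmark tooling.

```python
import time
import urllib.request


def http_ok(url, timeout_s=5.0):
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


def wait_until_ready(url, max_wait_s=600.0, interval_s=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return False once `max_wait_s` has elapsed."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For vLLM this would be called as `wait_until_ready("http://localhost:8000/v1/models")`, and for SGLang as `wait_until_ready("http://localhost:30000/health")`. A `False` return is a cue to inspect the container with `docker logs`.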

---

## Step 4 — Run the concurrency sweep

Choose the config that matches your server. For each concurrency level, the sweep
script overrides `load_pattern` and `report_dir`; all other settings (model,
dataset, endpoint) are taken from the config file unchanged.

```bash
# vLLM
python scripts/concurrency_sweep/run.py \
--config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml

# SGLang
python scripts/concurrency_sweep/run.py \
--config examples/08_Qwen2.5-0.5B_Example/sglang_online_qwen_benchmark.yaml

# TRT-LLM (use the vLLM config or a custom one pointing at port 8000)
python scripts/concurrency_sweep/run.py \
--config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml
```
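
The per-level override described above can be pictured with a small sketch. This is not the code in `run.py`: the helper name `config_for_level`, the placement of `load_pattern` under `settings`, and its field names (`type`, `target_concurrency`) are assumptions made purely for illustration. Only the `report_dir` key and the `concurrency_<N>` directory naming come from this example.

```python
import copy
from pathlib import PurePosixPath


def config_for_level(base_config, concurrency, sweep_root):
    """Return a copy of `base_config` with only `load_pattern` and `report_dir` replaced."""
    cfg = copy.deepcopy(base_config)  # never mutate the shared base config
    # ASSUMPTION: the location and field names of load_pattern are illustrative only.
    cfg.setdefault("settings", {})["load_pattern"] = {
        "type": "concurrency",
        "target_concurrency": concurrency,
    }
    cfg["report_dir"] = str(PurePosixPath(sweep_root) / f"concurrency_{concurrency}")
    return cfg


base = {
    "model_params": {"name": "Qwen/Qwen2.5-0.5B-Instruct"},
    "endpoint_config": {"endpoints": ["http://localhost:8000"]},
    "report_dir": "results/qwen_online_benchmark/",
}
for level in (1, 4, 16):
    cfg = config_for_level(base, level, "results/qwen_online_benchmark/concurrency_sweep")
    print(level, cfg["report_dir"])
```

Copying the base config for every level keeps the runs independent, so a change made for one concurrency level cannot leak into the next.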

**Common options:**

| Flag | Default | Description |
|---|---|---|
| `--concurrency N [N ...]` | `1 2 4 8 16 32 64 128 256 512 1024` | Concurrency levels to test |
| `--duration-ms MS` | `600000` (10 min) | Duration per run |
| `--output-dir DIR` | from `report_dir` in config | Root directory for sweep output |
| `--timeout-seconds S` | `720` (12 min) | Per-run subprocess timeout |
| `--verbose` | off | Stream output live to the terminal (useful for debugging) |

Example — quick 3-minute sweep at a few concurrency levels:

```bash
python scripts/concurrency_sweep/run.py \
--config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml \
--concurrency 1 4 16 64 \
--duration-ms 180000 \
--verbose
```
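
Before starting a sweep it can be useful to estimate how long it will take. The sketch below uses the defaults from the options table and assumes that levels run one after another, that each run lasts about `--duration-ms`, and that `--timeout-seconds` caps a stuck run. Startup and teardown time are ignored, and `sweep_minutes` is a hypothetical helper, not part of the sweep script.

```python
DEFAULT_LEVELS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]


def sweep_minutes(levels, duration_ms=600_000, timeout_s=720):
    """Return (nominal, worst-case) sweep time in minutes for the given levels."""
    nominal = len(levels) * duration_ms / 60_000  # every run lasts duration_ms
    worst = len(levels) * timeout_s / 60          # every run hits the timeout
    return nominal, worst


print(sweep_minutes(DEFAULT_LEVELS))                       # -> (110.0, 132.0)
print(sweep_minutes([1, 4, 16, 64], duration_ms=180_000))  # -> (12.0, 48.0)
```

Under these assumptions the default sweep needs close to two hours, while the quick example above finishes in roughly 12 minutes.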

Results land under the config's `report_dir`, with one subdirectory per concurrency level:

```
results/qwen_online_benchmark/concurrency_sweep/
  concurrency_1/    benchmark.log  result_summary.json
  concurrency_4/    benchmark.log  result_summary.json
  ...
  summary.json  summary.csv
```

If a run fails, check the per-run log:

```bash
cat results/qwen_online_benchmark/concurrency_sweep/concurrency_64/benchmark.log
```
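
With many concurrency levels it can be tedious to open every log by hand. The sketch below lists the levels whose directory has no `result_summary.json`, based on the directory layout shown above. It assumes that a run which did not finish leaves no summary file behind, which is a guess rather than documented behavior, and `runs_without_summary` is a hypothetical helper name.

```python
from pathlib import Path


def runs_without_summary(sweep_dir):
    """Return concurrency levels whose run directory has no result_summary.json."""
    missing = []
    for run_dir in Path(sweep_dir).glob("concurrency_*"):
        if not run_dir.is_dir():
            continue
        level = run_dir.name.split("_", 1)[1]
        # ASSUMPTION: a run that did not finish leaves no result_summary.json behind.
        if level.isdigit() and not (run_dir / "result_summary.json").is_file():
            missing.append(int(level))
    return sorted(missing)


if __name__ == "__main__":
    sweep = "results/qwen_online_benchmark/concurrency_sweep"
    for level in runs_without_summary(sweep):
        print(f"no summary for concurrency {level}: see {sweep}/concurrency_{level}/benchmark.log")
```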

---

## Step 5 — Summarize results and generate plots

```bash
# vLLM
python scripts/concurrency_sweep/summarize.py \
results/qwen_online_benchmark/concurrency_sweep/

# SGLang
python scripts/concurrency_sweep/summarize.py \
results/qwen_sglang_online_benchmark/concurrency_sweep/
```

This prints formatted tables to stdout and writes three files into the sweep
directory:

| File | Contents |
|---|---|
| `metrics_summary.csv` | All metrics in CSV form |
| `metrics_summary.md` | Markdown tables with throughput, latency, TTFT, TPOT |
| `metrics_summary.png` | Line plots of TPS, TTFT P99, and TPOT P50 vs concurrency |

Pass `--no-save` to print tables only without writing files.
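
For further analysis, `metrics_summary.csv` can be loaded with the standard library. The column names are not listed in this README, so the sketch below makes no assumption about them and simply reports whatever header the file contains. `load_metrics` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def load_metrics(csv_path):
    """Read a CSV file into (column names, list of row dicts)."""
    with Path(csv_path).open(newline="") as fh:
        reader = csv.DictReader(fh)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


if __name__ == "__main__":
    path = Path("results/qwen_online_benchmark/concurrency_sweep/metrics_summary.csv")
    if path.exists():
        columns, rows = load_metrics(path)
        print("columns:", columns)
        print("rows:", len(rows))
```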

---

## Step 6 — Stop the server

```bash
docker stop vllm-qwen # or sglang-qwen / trtllm-qwen
docker rm vllm-qwen
```

---

## Offline (max-throughput) benchmark

For a single offline run (no sweep):

```bash
# vLLM
inference-endpoint benchmark from-config \
-c examples/08_Qwen2.5-0.5B_Example/offline_qwen_benchmark.yaml

# SGLang
inference-endpoint benchmark from-config \
-c examples/08_Qwen2.5-0.5B_Example/sglang_offline_qwen_benchmark.yaml
```

Results: `results/qwen_offline_benchmark/` or `results/qwen_sglang_offline_benchmark/`.

---

## Automated wrapper

`run_benchmark.sh` automates Steps 1–4 (dataset prep, container start, benchmark):

```bash
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm online
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang offline
bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang online
```

---

## Files

| File | Server | Mode |
|---|---|---|
| `offline_qwen_benchmark.yaml` | vLLM (`:8000`) | Offline |
| `online_qwen_benchmark.yaml` | vLLM (`:8000`) | Online sweep |
| `sglang_offline_qwen_benchmark.yaml` | SGLang (`:30000`) | Offline |
| `sglang_online_qwen_benchmark.yaml` | SGLang (`:30000`) | Online sweep |
| `prepare_dataset.py` | — | Converts `dummy_1k.pkl` to example dataset |
| `run_benchmark.sh` | vLLM / SGLang | Automated wrapper |
43 changes: 43 additions & 0 deletions examples/08_Qwen2.5-0.5B_Example/offline_qwen_benchmark.yaml
@@ -0,0 +1,43 @@
name: "qwen-0.5b-offline-benchmark"
version: "1.0"
type: "offline"

model_params:
  name: "Qwen/Qwen2.5-0.5B-Instruct"
  temperature: 1.0
  max_new_tokens: 100
  top_p: 1.0
  streaming: "on"

datasets:
  - name: "qwen-perf-test"
    type: "performance"
    path: "examples/08_Qwen2.5-0.5B_Example/data/test_dataset.pkl"
    samples: 1000

settings:
  runtime:
    min_duration_ms: 100
    max_duration_ms: 60000
    scheduler_random_seed: 42
    dataloader_random_seed: 42

  client:
    workers: 1
    max_connections: 100
    warmup_connections: 0
    record_worker_events: false

metrics:
  collect:
    - "throughput"
    - "latency"
    - "ttft"
    - "tpot"

endpoint_config:
  endpoints:
    - "http://localhost:8000"
  api_key: null

report_dir: "results/qwen_offline_benchmark/"