Commit 644f755

Fix matplotlib + shell

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
1 parent 2fd76c4 commit 644f755

5 files changed: +309 −92 lines changed

Lines changed: 61 additions & 15 deletions
@@ -1,30 +1,76 @@
-# Quick Start - Qwen2.5-0.5B
+# Quick Start Qwen2.5-0.5B
 
-Use the wrapper script from the repo root:
+All commands run from the **repository root**.
+
+## Setup
 
 ```bash
-# vLLM offline
-bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline
+python3.12 -m venv .venv && source .venv/bin/activate
+pip install -e ".[test]"
+python examples/08_Qwen2.5-0.5B_Example/prepare_dataset.py
+```
 
-# SGLang offline
-bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang offline
+## Option A — Automated (vLLM or SGLang)
 
-# Online concurrency sweep
+```bash
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline
 bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm online
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang online
+```
+
+## Option B — Manual step-by-step
+
+**1. Start server** (pick one):
+
+```bash
+# vLLM
+docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface \
+  -e PYTORCH_ALLOC_CONF=expandable_segments:True -p 8000:8000 --ipc=host \
+  --name vllm-qwen -d vllm/vllm-openai:latest \
+  --model Qwen/Qwen2.5-0.5B-Instruct --gpu-memory-utilization 0.85
+
+# SGLang
+docker run --runtime nvidia --gpus all --net host \
+  -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host \
+  --name sglang-qwen -d lmsysorg/sglang:latest \
+  python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-0.5B-Instruct \
+  --host 0.0.0.0 --port 30000 --mem-fraction-static 0.9 --attention-backend flashinfer
+```
+
+**2. Wait for ready:**
+
+```bash
+until curl -sf http://localhost:8000/v1/models > /dev/null; do sleep 5; done   # vLLM
+until curl -sf http://localhost:30000/health > /dev/null; do sleep 5; done     # SGLang
 ```
 
-Outputs:
+**3. Run concurrency sweep:**
 
-- vLLM offline: `results/qwen_offline_benchmark/`
-- vLLM online: `results/qwen_online_benchmark/concurrency_sweep/`
-- SGLang offline: `results/qwen_sglang_offline_benchmark/`
-- SGLang online: `results/qwen_sglang_online_benchmark/concurrency_sweep/`
+```bash
+python scripts/concurrency_sweep/run.py \
+  --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml   # vLLM
+# or: --config examples/08_Qwen2.5-0.5B_Example/sglang_online_qwen_benchmark.yaml
+
+# Add --verbose to stream output live; add --concurrency / --duration-ms to customize
+```
 
-Summarize an online sweep:
+**4. Summarize and plot:**
 
 ```bash
 python scripts/concurrency_sweep/summarize.py \
-  results/qwen_online_benchmark/concurrency_sweep/
+  results/qwen_online_benchmark/concurrency_sweep/   # vLLM
+# or: results/qwen_sglang_online_benchmark/concurrency_sweep/
 ```
 
-For manual setup, server commands, and config details, see [README.md](README.md).
+Writes `metrics_summary.csv`, `metrics_summary.md`, and `metrics_summary.png`.
+
+**5. Stop server:**
+
+```bash
+docker stop vllm-qwen && docker rm vllm-qwen
+# or: docker stop sglang-qwen && docker rm sglang-qwen
+```
+
+---
+
+For TRT-LLM setup, config customization, and output file locations, see [README.md](README.md).
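The one-line wait loops in step 2 never give up. The same idea with a retry budget can be sketched as below; this helper is illustrative only (the function name, attempt count, and interval are not part of the repo):

```shell
#!/usr/bin/env bash
# Illustrative helper (not from the repo): poll a health URL with a retry budget.
wait_ready() {
  local url=$1 attempts=${2:-60} interval=${3:-5} i=0
  until curl -sf "$url" > /dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$attempts" ]; then
      echo "server at $url not ready after $((attempts * interval))s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Usage once a server from step 1 is up:
# wait_ready http://localhost:8000/v1/models   # vLLM
# wait_ready http://localhost:30000/health     # SGLang
```

Failing fast here is preferable to a hung CI job when the container crashed on startup.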
Lines changed: 174 additions & 52 deletions
@@ -1,36 +1,23 @@
 # Qwen2.5-0.5B-Instruct Benchmark Example
 
-This example benchmarks `Qwen/Qwen2.5-0.5B-Instruct` against either a vLLM or
-SGLang server. It is intended as a small-GPU example that works on typical
-8-16 GB cards.
+Benchmarks `Qwen/Qwen2.5-0.5B-Instruct` with offline (max-throughput) and online
+(concurrency sweep) load patterns. Designed for small GPUs (8–16 GB VRAM).
+
+Supported inference servers: **vLLM**, **SGLang**, **TRT-LLM**.
+
+---
 
 ## Requirements
 
 - Python 3.12+
-- Docker with NVIDIA GPU support
+- Docker with NVIDIA GPU support (`--runtime nvidia`)
 - NVIDIA GPU with at least 8 GB VRAM
 
-## Fastest Path
-
-From the repo root:
-
-```bash
-# Offline benchmark with vLLM
-bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline
-
-# Offline benchmark with SGLang
-bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang offline
-
-# Online concurrency sweep
-bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm online
-```
-
-The script prepares the dataset, starts or reuses a container, waits for the
-server, and runs the benchmark.
+---
 
-## Manual Flow
+## Step 1 — Install and prepare dataset
 
-If you do not want to use `run_benchmark.sh`, the minimum manual flow is:
+Run all commands from the **repository root**.
 
 ```bash
 python3.12 -m venv .venv
@@ -40,10 +27,18 @@ pip install -e ".[test]"
 python examples/08_Qwen2.5-0.5B_Example/prepare_dataset.py
 ```
 
-Start one server:
+This converts `tests/datasets/dummy_1k.pkl` into
+`examples/08_Qwen2.5-0.5B_Example/data/test_dataset.pkl`.
+
+---
+
+## Step 2 — Start the inference server
+
+Pick one backend. The server must be fully ready before running benchmarks.
+
+### vLLM (port 8000)
 
 ```bash
-# vLLM
 docker run --runtime nvidia --gpus all \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
   -e PYTORCH_ALLOC_CONF=expandable_segments:True \
@@ -54,8 +49,11 @@ docker run --runtime nvidia --gpus all \
   vllm/vllm-openai:latest \
   --model Qwen/Qwen2.5-0.5B-Instruct \
   --gpu-memory-utilization 0.85
+```
 
-# SGLang
+### SGLang (port 30000)
+
+```bash
 docker run --runtime nvidia --gpus all --net host \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
   --ipc=host \
@@ -70,51 +68,175 @@ docker run --runtime nvidia --gpus all --net host \
   --attention-backend flashinfer
 ```
 
-Run one benchmark:
+### TRT-LLM (port 8000)
 
 ```bash
-# vLLM offline
-inference-endpoint benchmark from-config \
-  -c examples/08_Qwen2.5-0.5B_Example/offline_qwen_benchmark.yaml
+docker run --runtime nvidia --gpus all \
+  -v ~/.cache/huggingface:/root/.cache/huggingface \
+  -p 8000:8000 \
+  --ipc=host \
+  --name trtllm-qwen \
+  -d \
+  nvcr.io/nvidia/tritonserver:latest
+# Add your TRT-LLM engine launch arguments here
 
-# SGLang offline
-inference-endpoint benchmark from-config \
-  -c examples/08_Qwen2.5-0.5B_Example/sglang_offline_qwen_benchmark.yaml
+```
+
+> **Note:** No pre-built TRT-LLM config is provided. Use
+> `examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml` as a template and
+> point `endpoint_config.endpoints` at `http://localhost:8000`.
+
+---
+
+## Step 3 — Wait for the server to be ready
 
-# vLLM online sweep
+Poll until the health endpoint responds:
+
+```bash
+# vLLM / TRT-LLM (port 8000)
+until curl -sf http://localhost:8000/v1/models > /dev/null; do
+  echo "Waiting for server..."; sleep 5
+done
+echo "Server ready."
+
+# SGLang (port 30000)
+until curl -sf http://localhost:30000/health > /dev/null; do
+  echo "Waiting for server..."; sleep 5
+done
+echo "Server ready."
+```
+
+---
+
+## Step 4 — Run the concurrency sweep
+
+Choose the config that matches your server. The sweep script overrides
+`load_pattern` and `report_dir` for each concurrency level, leaving all other
+settings (model, dataset, endpoint) from the config file.
+
+```bash
+# vLLM
 python scripts/concurrency_sweep/run.py \
   --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml
 
-# SGLang online sweep
+# SGLang
 python scripts/concurrency_sweep/run.py \
   --config examples/08_Qwen2.5-0.5B_Example/sglang_online_qwen_benchmark.yaml
+
+# TRT-LLM (use the vLLM config or a custom one pointing at port 8000)
+python scripts/concurrency_sweep/run.py \
+  --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml
+```
+
+**Common options:**
+
+| Flag | Default | Description |
+|---|---|---|
+| `--concurrency N [N ...]` | `1 2 4 8 16 32 64 128 256 512 1024` | Concurrency levels to test |
+| `--duration-ms MS` | `600000` (10 min) | Duration per run |
+| `--output-dir DIR` | from `report_dir` in config | Root directory for sweep output |
+| `--timeout-seconds S` | `720` (12 min) | Per-run subprocess timeout |
+| `--verbose` | off | Stream output live to the terminal (useful for debugging) |
+
+Example — quick 3-minute sweep at a few concurrency levels:
+
+```bash
+python scripts/concurrency_sweep/run.py \
+  --config examples/08_Qwen2.5-0.5B_Example/online_qwen_benchmark.yaml \
+  --concurrency 1 4 16 64 \
+  --duration-ms 180000 \
+  --verbose
 ```
 
-## Files
+Results land in subdirectories under the config's `report_dir`:
 
-- `offline_qwen_benchmark.yaml`: vLLM offline benchmark
-- `online_qwen_benchmark.yaml`: vLLM online concurrency sweep
-- `sglang_offline_qwen_benchmark.yaml`: SGLang offline benchmark
-- `sglang_online_qwen_benchmark.yaml`: SGLang online concurrency sweep
-- `prepare_dataset.py`: converts `tests/datasets/dummy_1k.pkl` into the example dataset
-- `run_benchmark.sh`: wrapper that automates dataset prep, container startup, and benchmark execution
+```
+results/qwen_online_benchmark/concurrency_sweep/
+  concurrency_1/    benchmark.log  result_summary.json
+  concurrency_4/    benchmark.log  result_summary.json
+  ...
+  summary.json  summary.csv
+```
+
+If a run fails, check the per-run log:
 
-## Results
+```bash
+cat results/qwen_online_benchmark/concurrency_sweep/concurrency_64/benchmark.log
+```
 
-- vLLM offline: `results/qwen_offline_benchmark/`
-- vLLM online: `results/qwen_online_benchmark/concurrency_sweep/`
-- SGLang offline: `results/qwen_sglang_offline_benchmark/`
-- SGLang online: `results/qwen_sglang_online_benchmark/concurrency_sweep/`
+---
 
-To summarize an online sweep:
+## Step 5 — Summarize results and generate plots
 
 ```bash
+# vLLM
 python scripts/concurrency_sweep/summarize.py \
   results/qwen_online_benchmark/concurrency_sweep/
+
+# SGLang
+python scripts/concurrency_sweep/summarize.py \
+  results/qwen_sglang_online_benchmark/concurrency_sweep/
+```
+
+This prints formatted tables to stdout and writes three files into the sweep
+directory:
+
+| File | Contents |
+|---|---|
+| `metrics_summary.csv` | All metrics in CSV form |
+| `metrics_summary.md` | Markdown tables with throughput, latency, TTFT, TPOT |
+| `metrics_summary.png` | Line plots of TPS, TTFT P99, and TPOT P50 vs concurrency |
+
+Pass `--no-save` to print tables only without writing files.
+
+---
+
+## Step 6 — Stop the server
+
+```bash
+docker stop vllm-qwen   # or sglang-qwen / trtllm-qwen
+docker rm vllm-qwen
 ```
 
-## Notes
+---
+
+## Offline (max-throughput) benchmark
+
+For a single offline run (no sweep):
+
+```bash
+# vLLM
+inference-endpoint benchmark from-config \
+  -c examples/08_Qwen2.5-0.5B_Example/offline_qwen_benchmark.yaml
+
+# SGLang
+inference-endpoint benchmark from-config \
+  -c examples/08_Qwen2.5-0.5B_Example/sglang_offline_qwen_benchmark.yaml
+```
+
+Results: `results/qwen_offline_benchmark/` or `results/qwen_sglang_offline_benchmark/`.
+
+---
+
+## Automated wrapper
+
+`run_benchmark.sh` automates Steps 1–4 (dataset prep, container start, benchmark):
+
+```bash
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm offline
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh vllm online
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang offline
+bash examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh sglang online
+```
+
+---
+
+## Config files
 
-- The online sweep defaults to `1 2 4 8 16 32 64 128 256 512 1024`.
-- Use `scripts/concurrency_sweep/run.py --concurrency ... --duration-ms ...` to shorten or customize the sweep.
-- If vLLM runs out of memory at higher concurrency, lower `--gpu-memory-utilization`.
+| File | Server | Mode |
+|---|---|---|
+| `offline_qwen_benchmark.yaml` | vLLM (`:8000`) | Offline |
+| `online_qwen_benchmark.yaml` | vLLM (`:8000`) | Online sweep |
+| `sglang_offline_qwen_benchmark.yaml` | SGLang (`:30000`) | Offline |
+| `sglang_online_qwen_benchmark.yaml` | SGLang (`:30000`) | Online sweep |
+| `prepare_dataset.py` | | Converts `dummy_1k.pkl` to example dataset |
+| `run_benchmark.sh` | vLLM / SGLang | Automated wrapper |
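As background for the summary metrics above: the P50/P99 figures are empirical percentiles over per-request samples (TTFT per request, TPOT per output token, and so on). A minimal nearest-rank sketch, which is not the repository's actual implementation, looks like:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list, with p in [0, 100]."""
    ordered = sorted(samples)
    # Index of the value sitting at the p-th fraction of the sorted sample.
    k = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

# Hypothetical per-request TTFT samples in milliseconds:
ttft_ms = [12.1, 12.2, 12.4, 12.6, 12.8, 12.9, 13.0, 13.1, 13.3, 55.2]
print(percentile(ttft_ms, 50))  # 12.8: the typical request
print(percentile(ttft_ms, 99))  # 55.2: the tail, dominated by the one slow request
```

This is why the plots track TTFT P99 rather than the mean: a single straggler at high concurrency shows up in the tail long before it moves the median.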

examples/08_Qwen2.5-0.5B_Example/run_benchmark.sh

Lines changed: 3 additions & 3 deletions
@@ -129,13 +129,13 @@ RETRY_COUNT=0
 
 # Different ready indicators for vLLM vs SGLang
 if [[ "$SERVER_TYPE" == "vllm" ]]; then
-    READY_PATTERN="Uvicorn running\|Application startup complete"
+    READY_PATTERN="Uvicorn running|Application startup complete"
 else
-    READY_PATTERN="Uvicorn running\|Server is ready"
+    READY_PATTERN="Uvicorn running|Server is ready"
 fi
 
 while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
-    if docker logs ${CONTAINER_NAME} 2>&1 | grep -q "$READY_PATTERN"; then
+    if docker logs ${CONTAINER_NAME} 2>&1 | grep -qE "$READY_PATTERN"; then
        echo "✅ Server is ready!"
        break
    fi