A production-grade benchmarking suite for measuring vLLM inference performance (latency, throughput, VRAM, stability) across model configurations, quantization schemes, and deployment environments.
vllm-bench provides:
- Config-driven experiments - Define workloads, model configs, and quantization in YAML
- Compare anything - FP16 vs AWQ vs GPTQ, different model versions, engine settings
- Local and cloud execution - Run on your GPU or on Modal cloud with a --modal flag
- Structured artifacts - Logs, traces, metrics, and summaries for downstream analysis
- Automatic report generation - Comparison tables and performance plots
# Using uv (recommended)
uv sync
# Or with pip
pip install -e ".[dev]"

vllm-bench/
├── pyproject.toml           # Project config and dependencies
├── README.md
├── Makefile                 # Helpful dev commands
├── .gitignore
├── .python-version
│
├── src/vllm_bench/          # Core tooling library
│   ├── cli.py               # Command-line interface
│   ├── modal.py             # Modal cloud integration
│   ├── hashing.py
│   │
│   ├── configs/             # Config schema and loading
│   │   ├── __init__.py
│   │   ├── schema.py
│   │   └── load.py
│   │
│   ├── workloads/           # Workload loaders
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── jsonl_prompts.py
│   │   ├── synthetic_lengths.py
│   │   ├── rag_like_jsonl.py
│   │   └── registry.py
│   │
│   ├── runners/             # Experiment execution
│   │   ├── __init__.py
│   │   ├── single_run.py
│   │   └── matrix_runner.py
│   │
│   ├── vllm_client/         # vLLM server and client
│   │   ├── __init__.py
│   │   ├── server.py
│   │   └── client.py
│   │
│   ├── metrics/             # Metrics collection and aggregation
│   │   ├── __init__.py
│   │   ├── record.py
│   │   ├── aggregate.py
│   │   └── quality.py
│   │
│   ├── artifacts/           # Artifact I/O
│   │   ├── __init__.py
│   │   ├── writer.py
│   │   └── reader.py
│   │
│   └── report/              # Report generation
│       ├── __init__.py
│       ├── build.py
│       └── plots.py
│
├── experiments/             # Experiment configs and prompts
│   ├── prompts/             # JSONL prompt files
│   ├── configs/             # YAML experiment configs
│   └── sweeps/              # YAML sweep configs
│
└── tests/
    └── unit/
Experiments are defined in YAML config files with sections for model, vLLM settings, quantization, workload, runner, and artifacts.
Quick example:
experiment_name: "qwen_single"

model:
  model_id: "Qwen/Qwen2.5-0.5B-Instruct"
  max_model_len: 500

vllm:
  port: 8000
  gpu_memory_utilization: 0.6

quant:
  name: "fp16"

workload:
  name: "synthetic_lengths"
  args:
    prompt_lengths: [32, 64]
    max_new_tokens: 100

runner:
  concurrency_schedule: [2, 4]
  num_requests_per_level: 20

artifacts:
  out_dir: "artifacts/runs/qwen_single"

See the Configuration Guide for a complete parameter reference and examples. Sample configs are available in experiments/configs/.
# Run a single experiment locally
vllm-bench run experiments/configs/my_experiment.yaml
# Run on Modal cloud (A10G GPU by default)
vllm-bench run experiments/configs/my_experiment.yaml --modal
# Validate config without running
vllm-bench validate experiments/configs/my_experiment.yaml

Run experiments on your local GPU:
# Single experiment
vllm-bench run experiments/configs/qwen_single.yaml
# Benchmark comparing multiple variants (FP16, AWQ, GPTQ, etc.)
vllm-bench run experiments/configs/qwen_comparison.yaml
# Enable verbose logging
vllm-bench run experiments/configs/qwen_single.yaml --verbose

Output artifacts:
artifacts/runs/{experiment_name}_{timestamp}/
└── {experiment_name}/
    ├── config.yaml
    ├── config_hash.txt
    ├── env_snapshot.json
    ├── log.out              # Full execution logs
    ├── trace.jsonl          # Per-request metrics
    ├── summary.json         # Aggregated statistics
    └── quality.json         # Quality metrics (if computed)
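The traces and summaries are plain JSONL/JSON, so they can be loaded directly for downstream analysis. Below is a minimal sketch; the run directory and the e2e_latency_s key are assumptions, so check a real trace.jsonl for the keys actually emitted by metrics/record.py.

```python
# Sketch: load per-request traces and the aggregated summary for one run.
# The run directory path and the "e2e_latency_s" key are assumptions; inspect
# a real trace.jsonl for the keys emitted by metrics/record.py.
import json
import statistics
from pathlib import Path

run_dir = Path("artifacts/runs/qwen_single_20250101T000000/qwen_single")  # hypothetical run

records = [
    json.loads(line)
    for line in (run_dir / "trace.jsonl").read_text().splitlines()
    if line.strip()
]
latencies = [r["e2e_latency_s"] for r in records if "e2e_latency_s" in r]

if len(latencies) >= 2:
    print(f"requests: {len(latencies)}")
    print(f"median e2e latency: {statistics.median(latencies):.3f}s")
    print(f"p95 e2e latency:    {statistics.quantiles(latencies, n=20)[-1]:.3f}s")

summary = json.loads((run_dir / "summary.json").read_text())
print(json.dumps(summary, indent=2))
```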
Run experiments on Modal cloud GPUs with automatic container caching and persistent storage:
# Default GPU (A10G - 24GB)
vllm-bench run experiments/configs/qwen_comparison.yaml --modal
# Specific GPU type
GPU=A100 vllm-bench run experiments/configs/qwen_comparison.yaml --modal
# Available GPUs: T4, A10G, A100, V100, L40, etc.

Modal benefits:
- Access to powerful GPUs without local hardware
- Automatic container image caching (the first run builds the image; subsequent runs reuse it and start much faster)
- Persistent volume storage with per-experiment commits
- Automatic log capture and download
Auto-download: Results are automatically downloaded to results/from_modal/ after completion.
GPU selection:
- Default: A10G (24GB)
- Set via: GPU=<type> (see the sketch after this list)
- Common: T4 (16GB), A100 (40GB/80GB), V100 (16GB)
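For reference, Modal's Python SDK selects the GPU through a gpu argument on the function decorator, so the GPU environment variable plausibly maps to something like the sketch below. This is a hypothetical illustration, not the actual wiring in src/vllm_bench/modal.py.

```python
# Hypothetical sketch of GPU selection from the GPU environment variable.
# The real Modal integration lives in src/vllm_bench/modal.py and may differ.
import os

import modal

app = modal.App("vllm-bench-sketch")
gpu_type = os.environ.get("GPU", "A10G")  # default: A10G (24GB)

@app.function(gpu=gpu_type, timeout=3600)
def run_remote(config_yaml: str) -> str:
    # The real function would start a vLLM server, drive the workload,
    # and commit artifacts to the persistent volume.
    return f"would run {config_yaml} on {gpu_type}"
```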
Compare different model configurations (FP16 vs AWQ vs GPTQ, different context windows, etc.) in a single run using the benchmark config format:
name: "qwen_comparison"
defaults:
model:
model_id: "Qwen/Qwen2.5-0.5B-Instruct"
max_model_len: 2048
# ... vllm, quant, workload, runner settings ...
variants:
- {name: "official_fp16"}
- name: "community_awq"
model: {model_id: "casperhansen/qwen2-0.5b-awq"}
quant: {name: "awq"}See Configuration Guide for complete benchmark examples with all variants.
Run it:
vllm-bench run experiments/configs/qwen_comparison.yaml

Output includes comparison tables and plots:
artifacts/benchmarks/qwen_comparison_{timestamp}/
├── qwen_comparison_official_fp16/
├── qwen_comparison_community_awq/
├── comparison.md
└── plots/
    ├── latency_vs_concurrency.png
    ├── throughput_vs_concurrency.png
    └── vram_vs_concurrency.png
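Each per-variant directory contains the same artifacts as a single run, so you can also build custom plots from them. Here is a rough sketch assuming a hypothetical summary.json layout with a levels list keyed by concurrency and throughput_tok_s; inspect a real summary.json for the structure actually produced by metrics/aggregate.py.

```python
# Sketch: overlay throughput vs. concurrency for every variant in a benchmark
# run. The "levels" / "concurrency" / "throughput_tok_s" keys are hypothetical;
# check a real summary.json for the structure written by metrics/aggregate.py.
import json
from pathlib import Path

import matplotlib.pyplot as plt

bench_dir = Path("artifacts/benchmarks/qwen_comparison_20250101T000000")  # hypothetical

for variant_dir in sorted(p for p in bench_dir.iterdir() if p.is_dir() and p.name != "plots"):
    summary = json.loads((variant_dir / "summary.json").read_text())
    levels = summary.get("levels", [])
    xs = [level["concurrency"] for level in levels]
    ys = [level["throughput_tok_s"] for level in levels]
    plt.plot(xs, ys, marker="o", label=variant_dir.name)

plt.xlabel("concurrency")
plt.ylabel("throughput (tokens/s)")
plt.legend()
plt.savefig("throughput_vs_concurrency_custom.png")
```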
Validate configs before running:
# Check validity
vllm-bench validate experiments/configs/qwen_comparison.yaml
# Show full config dump
vllm-bench validate experiments/configs/qwen_comparison.yaml --verbose

# List artifacts on Modal volume
modal volume ls vllm-bench-artifacts
# Manually download results
modal volume get vllm-bench-artifacts <run_name>/ local_results/

See the Modal volume documentation for more details.
# Run linter
make lint
# Run formatter
make format
# Run tests
make test

Running an experiment proceeds through these stages:

- Config validation - YAML schemas validate model, vLLM, workload, and runner settings
- Server startup - Local vLLM server or Modal container with your configuration
- Workload execution - Closed-loop client sends requests at configured concurrency levels
- Metrics collection - Latency (TTFT, TPOT, E2E), throughput (tokens/s), VRAM usage (see the timing sketch after this list)
- Artifact writing - Config hash, environment snapshot, request traces, aggregated summaries
- Report generation - Comparison tables and plots for benchmark runs
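As a rough illustration of what the closed-loop client measures, the sketch below times a single streamed request against a vLLM OpenAI-compatible completions endpoint and derives TTFT, TPOT, and E2E latency. It is a simplified stand-in, not the actual client in src/vllm_bench/vllm_client/client.py; the endpoint, model name, and one-token-per-chunk assumption are illustrative.

```python
# Simplified timing sketch against a vLLM OpenAI-compatible completions
# endpoint. The real closed-loop client in src/vllm_bench/vllm_client/client.py
# additionally handles concurrency levels, scheduling, and trace recording.
import time

import requests

def timed_request(base_url: str, model: str, prompt: str, max_tokens: int = 100) -> dict:
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens, "stream": True}
    start = time.perf_counter()
    first_chunk_at = None
    chunks = 0
    with requests.post(f"{base_url}/v1/completions", json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter()
            chunks += 1  # roughly one token per streamed chunk with default settings
    end = time.perf_counter()
    ttft = (first_chunk_at or end) - start       # time to first token
    e2e = end - start                            # end-to-end latency
    tpot = (e2e - ttft) / max(chunks - 1, 1)     # avg time per output token after the first
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e, "chunks": chunks}

print(timed_request("http://localhost:8000", "Qwen/Qwen2.5-0.5B-Instruct", "Hello, world"))
```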
MIT License - see LICENSE for details.