A production-grade benchmarking suite for measuring vLLM inference performance (latency, throughput, VRAM, stability) across model configurations, quantization schemes, and deployment environments.
vllm-bench provides:
- Config-driven experiments - Define workloads, model configs, and quantization in YAML
- Compare anything - FP16 vs AWQ vs GPTQ, different model versions, engine settings
- Local and cloud execution - Run on your GPU or on Modal cloud with a --modal flag
- Structured artifacts - Logs, traces, metrics, and summaries for downstream analysis
- Automatic report generation - Comparison tables and performance plots
# Using uv (recommended)
uv sync
# Or with pip
pip install -e ".[dev]"

vllm-bench/
├── pyproject.toml           # Project config and dependencies
├── README.md
├── Makefile                 # Helpful dev commands
├── .gitignore
├── .python-version
│
├── src/vllm_bench/          # Core tooling library
│   ├── cli.py               # Command-line interface
│   ├── modal.py             # Modal cloud integration
│   ├── hashing.py
│   │
│   ├── configs/             # Config schema and loading
│   │   ├── __init__.py
│   │   ├── schema.py
│   │   └── load.py
│   │
│   ├── workloads/           # Workload loaders
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── jsonl_prompts.py
│   │   ├── synthetic_lengths.py
│   │   ├── rag_like_jsonl.py
│   │   └── registry.py
│   │
│   ├── runners/             # Experiment execution
│   │   ├── __init__.py
│   │   ├── single_run.py
│   │   └── matrix_runner.py
│   │
│   ├── vllm_client/         # vLLM server and client
│   │   ├── __init__.py
│   │   ├── server.py
│   │   └── client.py
│   │
│   ├── metrics/             # Metrics collection and aggregation
│   │   ├── __init__.py
│   │   ├── record.py
│   │   ├── aggregate.py
│   │   └── quality.py
│   │
│   ├── artifacts/           # Artifact I/O
│   │   ├── __init__.py
│   │   ├── writer.py
│   │   └── reader.py
│   │
│   └── report/              # Report generation
│       ├── __init__.py
│       ├── build.py
│       └── plots.py
│
├── experiments/             # Experiment configs and prompts
│   ├── prompts/             # JSONL prompt files
│   ├── configs/             # YAML experiment configs
│   └── sweeps/              # YAML sweep configs
│
└── tests/
    └── unit/
Experiments are defined in YAML config files with sections for model, vLLM settings, quantization, workload, runner, and artifacts.
Quick example:
experiment_name: "qwen_single"

model:
  model_id: "Qwen/Qwen2.5-0.5B-Instruct"
  max_model_len: 500

vllm:
  port: 8000
  gpu_memory_utilization: 0.6

quant:
  name: "fp16"

workload:
  name: "synthetic_lengths"
  args:
    prompt_lengths: [32, 64]
    max_new_tokens: 100

runner:
  concurrency_schedule: [2, 4]
  num_requests_per_level: 20

artifacts:
  out_dir: "artifacts/runs/qwen_single"

See the Configuration Guide for a complete parameter reference and examples. Sample configs are available in experiments/configs/.
# Run a single experiment locally
vllm-bench run experiments/configs/my_experiment.yaml
# Run on Modal cloud (A10G GPU by default)
vllm-bench run experiments/configs/my_experiment.yaml --modal
# Validate config without running
vllm-bench validate experiments/configs/my_experiment.yaml

Run experiments on your local GPU:
# Single experiment
vllm-bench run experiments/configs/qwen_single.yaml
# Benchmark comparing multiple variants (FP16, AWQ, GPTQ, etc.)
vllm-bench run experiments/configs/qwen_comparison.yaml
# Enable verbose logging
vllm-bench run experiments/configs/qwen_single.yaml --verbose

Output artifacts:
artifacts/runs/{experiment_name}_{timestamp}/
└── {experiment_name}/
    ├── config.yaml
    ├── config_hash.txt
    ├── env_snapshot.json
    ├── log.out              # Full execution logs
    ├── trace.jsonl          # Per-request metrics
    ├── summary.json         # Aggregated statistics
    └── quality.json         # Quality metrics (if computed)
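The traces and summaries are plain JSONL/JSON, so they can be loaded directly for downstream analysis. Below is a minimal sketch; the run directory and the e2e_latency_s key are assumptions, so check a real trace.jsonl for the keys actually emitted by metrics/record.py.

```python
# Sketch: load per-request traces and the aggregated summary for one run.
# The run directory path and the "e2e_latency_s" key are assumptions; inspect
# a real trace.jsonl for the keys emitted by metrics/record.py.
import json
import statistics
from pathlib import Path

run_dir = Path("artifacts/runs/qwen_single_20250101T000000/qwen_single")  # hypothetical run

records = [
    json.loads(line)
    for line in (run_dir / "trace.jsonl").read_text().splitlines()
    if line.strip()
]
latencies = [r["e2e_latency_s"] for r in records if "e2e_latency_s" in r]

if len(latencies) >= 2:
    print(f"requests: {len(latencies)}")
    print(f"median e2e latency: {statistics.median(latencies):.3f}s")
    print(f"p95 e2e latency:    {statistics.quantiles(latencies, n=20)[-1]:.3f}s")

summary = json.loads((run_dir / "summary.json").read_text())
print(json.dumps(summary, indent=2))
```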
Run experiments on Modal cloud GPUs with automatic container caching and persistent storage:
# Default GPU (A10G - 24GB)
vllm-bench run experiments/configs/qwen_comparison.yaml --modal
# Specific GPU type
GPU=A100 vllm-bench run experiments/configs/qwen_comparison.yaml --modal
# Available GPUs: T4, A10G, A100, V100, L40, etc.

Modal benefits:
- Access to powerful GPUs without local hardware
- Automatic container image caching (the first run builds the image; subsequent runs reuse it and start much faster)
- Persistent volume storage with per-experiment commits
- Automatic log capture and download
Auto-download: Results are automatically downloaded to results/from_modal/ after completion.
GPU selection:
- Default: A10G (24GB)
- Set via: GPU=<type> (see the sketch after this list)
- Common: T4 (16GB), A100 (40GB/80GB), V100 (16GB)
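For reference, Modal's Python SDK selects the GPU through a gpu argument on the function decorator, so the GPU environment variable plausibly maps to something like the sketch below. This is a hypothetical illustration, not the actual wiring in src/vllm_bench/modal.py.

```python
# Hypothetical sketch of GPU selection from the GPU environment variable.
# The real Modal integration lives in src/vllm_bench/modal.py and may differ.
import os

import modal

app = modal.App("vllm-bench-sketch")
gpu_type = os.environ.get("GPU", "A10G")  # default: A10G (24GB)

@app.function(gpu=gpu_type, timeout=3600)
def run_remote(config_yaml: str) -> str:
    # The real function would start a vLLM server, drive the workload,
    # and commit artifacts to the persistent volume.
    return f"would run {config_yaml} on {gpu_type}"
```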
Compare different model configurations (FP16 vs AWQ vs GPTQ, different context windows, etc.) in a single run using the benchmark config format:
name: "qwen_comparison"
defaults:
model:
model_id: "Qwen/Qwen2.5-0.5B-Instruct"
max_model_len: 2048
# ... vllm, quant, workload, runner settings ...
variants:
- {name: "official_fp16"}
- name: "community_awq"
model: {model_id: "casperhansen/qwen2-0.5b-awq"}
quant: {name: "awq"}See Configuration Guide for complete benchmark examples with all variants.
Run it:
vllm-bench run experiments/configs/qwen_comparison.yaml

Output includes comparison tables and plots:
artifacts/benchmarks/qwen_comparison_{timestamp}/
├── qwen_comparison_official_fp16/
├── qwen_comparison_community_awq/
├── comparison.md
└── plots/
    ├── latency_vs_concurrency.png
    ├── throughput_vs_concurrency.png
    └── vram_vs_concurrency.png
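Each per-variant directory contains the same artifacts as a single run, so you can also build custom plots from them. Here is a rough sketch assuming a hypothetical summary.json layout with a levels list keyed by concurrency and throughput_tok_s; inspect a real summary.json for the structure actually produced by metrics/aggregate.py.

```python
# Sketch: overlay throughput vs. concurrency for every variant in a benchmark
# run. The "levels" / "concurrency" / "throughput_tok_s" keys are hypothetical;
# check a real summary.json for the structure written by metrics/aggregate.py.
import json
from pathlib import Path

import matplotlib.pyplot as plt

bench_dir = Path("artifacts/benchmarks/qwen_comparison_20250101T000000")  # hypothetical

for variant_dir in sorted(p for p in bench_dir.iterdir() if p.is_dir() and p.name != "plots"):
    summary = json.loads((variant_dir / "summary.json").read_text())
    levels = summary.get("levels", [])
    xs = [level["concurrency"] for level in levels]
    ys = [level["throughput_tok_s"] for level in levels]
    plt.plot(xs, ys, marker="o", label=variant_dir.name)

plt.xlabel("concurrency")
plt.ylabel("throughput (tokens/s)")
plt.legend()
plt.savefig("throughput_vs_concurrency_custom.png")
```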
Validate configs before running:
# Check validity
vllm-bench validate experiments/configs/qwen_comparison.yaml
# Show full config dump
vllm-bench validate experiments/configs/qwen_comparison.yaml --verbose

# List artifacts on Modal volume
modal volume ls vllm-bench-artifacts
# Manually download results
modal volume get vllm-bench-artifacts <run_name>/ local_results/

See the Modal volume documentation for more details.
# Run linter
make lint
# Run formatter
make format
# Run tests
make test

Running an experiment proceeds through these stages:

- Config validation - YAML schemas validate model, vLLM, workload, and runner settings
- Server startup - Local vLLM server or Modal container with your configuration
- Workload execution - Closed-loop client sends requests at configured concurrency levels
- Metrics collection - Latency (TTFT, TPOT, E2E), throughput (tokens/s), VRAM usage (see the timing sketch after this list)
- Artifact writing - Config hash, environment snapshot, request traces, aggregated summaries
- Report generation - Comparison tables and plots for benchmark runs
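As a rough illustration of what the closed-loop client measures, the sketch below times a single streamed request against a vLLM OpenAI-compatible completions endpoint and derives TTFT, TPOT, and E2E latency. It is a simplified stand-in, not the actual client in src/vllm_bench/vllm_client/client.py; the endpoint, model name, and one-token-per-chunk assumption are illustrative.

```python
# Simplified timing sketch against a vLLM OpenAI-compatible completions
# endpoint. The real closed-loop client in src/vllm_bench/vllm_client/client.py
# additionally handles concurrency levels, scheduling, and trace recording.
import time

import requests

def timed_request(base_url: str, model: str, prompt: str, max_tokens: int = 100) -> dict:
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens, "stream": True}
    start = time.perf_counter()
    first_chunk_at = None
    chunks = 0
    with requests.post(f"{base_url}/v1/completions", json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
                continue
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter()
            chunks += 1  # roughly one token per streamed chunk with default settings
    end = time.perf_counter()
    ttft = (first_chunk_at or end) - start       # time to first token
    e2e = end - start                            # end-to-end latency
    tpot = (e2e - ttft) / max(chunks - 1, 1)     # avg time per output token after the first
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e, "chunks": chunks}

print(timed_request("http://localhost:8000", "Qwen/Qwen2.5-0.5B-Instruct", "Hello, world"))
```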
MIT License - see LICENSE for details.