vllm-bench

A production-grade benchmarking suite for measuring vLLM inference performance (latency, throughput, VRAM, stability) across model configurations, quantization schemes, and deployment environments.

Overview

vllm-bench provides:

  • Config-driven experiments - Define workloads, model configs, and quantization in YAML
  • Flexible comparisons - FP16 vs AWQ vs GPTQ, different model versions, engine settings
  • Local and cloud execution - Run on your GPU or on Modal cloud with a --modal flag
  • Structured artifacts - Logs, traces, metrics, and summaries for downstream analysis
  • Automatic report generation - Comparison tables and performance plots

Installation

# Using uv (recommended)
uv sync

# Or with pip
pip install -e ".[dev]"

Repository Structure

vllm-bench/
├── pyproject.toml          # Project config and dependencies
├── README.md
├── Makefile                # Helpful dev commands
├── .gitignore
├── .python-version
│
├── src/vllm_bench/         # Core tooling library
│   ├── cli.py              # Command-line interface
│   ├── modal.py            # Modal cloud integration
│   ├── hashing.py
│   │
│   ├── configs/            # Config schema and loading
│   │   ├── __init__.py
│   │   ├── schema.py
│   │   └── load.py
│   │
│   ├── workloads/          # Workload loaders
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── jsonl_prompts.py
│   │   ├── synthetic_lengths.py
│   │   ├── rag_like_jsonl.py
│   │   └── registry.py
│   │
│   ├── runners/            # Experiment execution
│   │   ├── __init__.py
│   │   ├── single_run.py
│   │   └── matrix_runner.py
│   │
│   ├── vllm_client/        # vLLM server and client
│   │   ├── __init__.py
│   │   ├── server.py
│   │   └── client.py
│   │
│   ├── metrics/            # Metrics collection and aggregation
│   │   ├── __init__.py
│   │   ├── record.py
│   │   ├── aggregate.py
│   │   └── quality.py
│   │
│   ├── artifacts/          # Artifact I/O
│   │   ├── __init__.py
│   │   ├── writer.py
│   │   └── reader.py
│   │
│   └── report/             # Report generation
│       ├── __init__.py
│       ├── build.py
│       └── plots.py
│
├── experiments/            # Experiment configs and prompts
│   ├── prompts/            # JSONL prompt files
│   ├── configs/            # YAML experiment configs
│   └── sweeps/             # YAML sweep configs
│
└── tests/
    └── unit/

Configuration

Experiments are defined in YAML config files with sections for model, vLLM settings, quantization, workload, runner, and artifacts.

Quick example:

experiment_name: "qwen_single"
model:
  model_id: "Qwen/Qwen2.5-0.5B-Instruct"
  max_model_len: 500
vllm:
  port: 8000
  gpu_memory_utilization: 0.6
quant:
  name: "fp16"
workload:
  name: "synthetic_lengths"
  args:
    prompt_lengths: [32, 64]
    max_new_tokens: 100
runner:
  concurrency_schedule: [2, 4]
  num_requests_per_level: 20
artifacts:
  out_dir: "artifacts/runs/qwen_single"

See Configuration Guide for complete parameter reference and examples. Sample configs are available in experiments/configs/.

Usage

Quick Start

# Run a single experiment locally
vllm-bench run experiments/configs/my_experiment.yaml

# Run on Modal cloud (A10G GPU by default)
vllm-bench run experiments/configs/my_experiment.yaml --modal

# Validate config without running
vllm-bench validate experiments/configs/my_experiment.yaml

Local Execution

Run experiments on your local GPU:

# Single experiment
vllm-bench run experiments/configs/qwen_single.yaml

# Benchmark comparing multiple variants (FP16, AWQ, GPTQ, etc.)
vllm-bench run experiments/configs/qwen_comparison.yaml

# Enable verbose logging
vllm-bench run experiments/configs/qwen_single.yaml --verbose

Output artifacts:

artifacts/runs/{experiment_name}_{timestamp}/
  └── {experiment_name}/
      ├── config.yaml
      ├── config_hash.txt
      ├── env_snapshot.json
      ├── log.out              # Full execution logs
      ├── trace.jsonl          # Per-request metrics
      ├── summary.json         # Aggregated statistics
      └── quality.json         # Quality metrics (if computed)
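The per-request trace and the aggregated summary are plain JSON, so they load directly into pandas or similar tools for downstream analysis. Below is a minimal, illustrative sketch; the run directory and the column names (concurrency, ttft_s, e2e_s) are assumptions for demonstration, not the exact schema.

import json
from pathlib import Path

import pandas as pd

# Point this at one run directory produced by `vllm-bench run ...`
run_dir = Path("artifacts/runs/qwen_single_20250101T000000/qwen_single")

# trace.jsonl holds one JSON object per request; the column names used here
# (concurrency, ttft_s, e2e_s) are assumed for illustration.
trace = pd.read_json(run_dir / "trace.jsonl", lines=True)
print(trace.groupby("concurrency")[["ttft_s", "e2e_s"]].quantile([0.5, 0.95]))

# summary.json holds the aggregated statistics written by the runner.
summary = json.loads((run_dir / "summary.json").read_text())
print(json.dumps(summary, indent=2))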

Modal Cloud Execution

Run experiments on Modal cloud GPUs with automatic container caching and persistent storage:

# Default GPU (A10G - 24GB)
vllm-bench run experiments/configs/qwen_comparison.yaml --modal

# Specific GPU type
GPU=A100 vllm-bench run experiments/configs/qwen_comparison.yaml --modal

# Available GPUs: T4, A10G, A100, V100, L40, etc.

Modal benefits:

  • Access to powerful GPUs without local hardware
  • Automatic container image caching (the first run builds the image; subsequent runs reuse it)
  • Persistent volume storage with per-experiment commits
  • Automatic log capture and download

Auto-download: Results are automatically downloaded to results/from_modal/ after completion.

GPU selection:

  • Default: A10G (24GB)
  • Set via: GPU=<type>
  • Common: T4 (16GB), A100 (40GB/80GB), V100 (16GB)
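Under the hood, cloud execution follows the usual Modal pattern: a GPU-backed function built on a cached image, writing artifacts to a persistent volume. The snippet below is an illustrative sketch of that pattern only; the function name run_benchmark, the image contents, and the mount path are assumptions, not the repo's actual modal.py.

import os

import modal

app = modal.App("vllm-bench")

# GPU type comes from the GPU environment variable, defaulting to A10G.
gpu_type = os.environ.get("GPU", "A10G")

# Persistent volume holding per-experiment artifacts.
artifacts = modal.Volume.from_name("vllm-bench-artifacts", create_if_missing=True)

# Modal caches the image, so only the first run pays the build cost.
image = modal.Image.debian_slim(python_version="3.11").pip_install("vllm")

@app.function(gpu=gpu_type, image=image, volumes={"/artifacts": artifacts}, timeout=60 * 60)
def run_benchmark(config_yaml: str) -> None:
    # Hypothetical entry point: start the vLLM server, run the workload,
    # write artifacts under /artifacts, then commit the volume.
    ...
    artifacts.commit()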

Benchmarking Multiple Variants

Compare different model configurations (FP16 vs AWQ vs GPTQ, different context windows, etc.) in a single run using the benchmark config format:

name: "qwen_comparison"
defaults:
  model:
    model_id: "Qwen/Qwen2.5-0.5B-Instruct"
    max_model_len: 2048
  # ... vllm, quant, workload, runner settings ...
variants:
  - {name: "official_fp16"}
  - name: "community_awq"
    model: {model_id: "casperhansen/qwen2-0.5b-awq"}
    quant: {name: "awq"}

See Configuration Guide for complete benchmark examples with all variants.

Run it:

vllm-bench run experiments/configs/qwen_comparison.yaml

Output includes comparison tables and plots:

artifacts/benchmarks/qwen_comparison_{timestamp}/
  ├── qwen_comparison_official_fp16/
  ├── qwen_comparison_community_awq/
  ├── comparison.md
  └── plots/
      ├── latency_vs_concurrency.png
      ├── throughput_vs_concurrency.png
      └── vram_vs_concurrency.png
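The plots are ordinary matplotlib figures built from each variant's summary statistics, so they are easy to regenerate or customize. A rough sketch follows; the benchmark directory name and the summary.json fields (levels, concurrency, e2e_p95_s) are assumptions for illustration.

import json
from pathlib import Path

import matplotlib.pyplot as plt

bench_dir = Path("artifacts/benchmarks/qwen_comparison_20250101T000000")

# One summary.json per variant directory; the per-concurrency fields are assumed names.
for variant_dir in sorted(p for p in bench_dir.iterdir() if p.is_dir() and p.name != "plots"):
    summary = json.loads((variant_dir / "summary.json").read_text())
    levels = [lvl["concurrency"] for lvl in summary["levels"]]
    p95 = [lvl["e2e_p95_s"] for lvl in summary["levels"]]
    plt.plot(levels, p95, marker="o", label=variant_dir.name)

plt.xlabel("Concurrency")
plt.ylabel("E2E latency p95 (s)")
plt.legend()
plt.savefig("latency_vs_concurrency_custom.png", dpi=150)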

Config Validation

Validate configs before running:

# Check validity
vllm-bench validate experiments/configs/qwen_comparison.yaml

# Show full config dump
vllm-bench validate experiments/configs/qwen_comparison.yaml --verbose
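Validation is schema-based (see src/vllm_bench/configs/schema.py). As a mental model, the example config shown earlier maps onto a Pydantic-style schema roughly like the sketch below, assuming a Pydantic implementation; the actual field set and defaults live in the repo, so treat this as illustrative only.

import yaml
from pydantic import BaseModel

class ModelCfg(BaseModel):
    model_id: str
    max_model_len: int

class VllmCfg(BaseModel):
    port: int
    gpu_memory_utilization: float

class QuantCfg(BaseModel):
    name: str

class WorkloadCfg(BaseModel):
    name: str
    args: dict

class RunnerCfg(BaseModel):
    concurrency_schedule: list[int]
    num_requests_per_level: int

class ArtifactsCfg(BaseModel):
    out_dir: str

class ExperimentCfg(BaseModel):
    experiment_name: str
    model: ModelCfg
    vllm: VllmCfg
    quant: QuantCfg
    workload: WorkloadCfg
    runner: RunnerCfg
    artifacts: ArtifactsCfg

# Load and validate a YAML config (illustrative; `vllm-bench validate` does this for you).
with open("experiments/configs/my_experiment.yaml") as f:
    cfg = ExperimentCfg.model_validate(yaml.safe_load(f))
print(cfg.model.model_id, cfg.runner.concurrency_schedule)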

Accessing Modal Results

# List artifacts on Modal volume
modal volume ls vllm-bench-artifacts

# Manually download results
modal volume get vllm-bench-artifacts <run_name>/ local_results/

See Modal volume documentation for more details.

Development

# Run linter
make lint

# Run formatter
make format

# Run tests
make test

How It Works

  1. Config validation - YAML schemas validate model, vLLM, workload, and runner settings
  2. Server startup - Local vLLM server or Modal container with your configuration
  3. Workload execution - Closed-loop client sends requests at configured concurrency levels (see the sketch after this list)
  4. Metrics collection - Latency (TTFT, TPOT, E2E), throughput (tokens/s), VRAM usage
  5. Artifact writing - Config hash, environment snapshot, request traces, aggregated summaries
  6. Report generation - Comparison tables and plots for benchmark runs
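To make the metrics concrete: TTFT is the time from sending a request to receiving its first streamed token, TPOT is the average time per output token after the first, and E2E is the total request latency. The sketch below shows a stripped-down closed-loop client against vLLM's OpenAI-compatible /v1/completions endpoint; it is not the repo's client.py, and counting SSE chunks as tokens is an approximation.

import asyncio
import statistics
import time

import httpx

async def one_request(client: httpx.AsyncClient, prompt: str) -> dict:
    start = time.perf_counter()
    ttft, tokens = None, 0
    # Stream the completion so the first chunk gives us TTFT.
    async with client.stream(
        "POST",
        "http://localhost:8000/v1/completions",
        json={"model": "Qwen/Qwen2.5-0.5B-Instruct", "prompt": prompt,
              "max_tokens": 100, "stream": True},
    ) as resp:
        async for line in resp.aiter_lines():
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            tokens += 1  # each SSE chunk ~ one output token (approximation)
            if ttft is None:
                ttft = time.perf_counter() - start
    e2e = time.perf_counter() - start
    ttft = e2e if ttft is None else ttft
    tpot = (e2e - ttft) / max(tokens - 1, 1)  # average time per token after the first
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e, "output_tokens": tokens}

async def run_level(prompts: list[str], concurrency: int) -> None:
    # Closed loop: at most `concurrency` requests in flight at any time.
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient(timeout=None) as client:
        async def guarded(p: str) -> dict:
            async with sem:
                return await one_request(client, p)
        records = await asyncio.gather(*(guarded(p) for p in prompts))
    p95 = statistics.quantiles([r["e2e_s"] for r in records], n=20)[18]
    print(f"concurrency={concurrency} e2e_p95={p95:.3f}s")

asyncio.run(run_level(["Hello, world!"] * 20, concurrency=4))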

License

MIT License - see LICENSE for details.
