🏎️ Dead-simple LLM benchmarking CLI - latency, cost, and quality metrics
🏎️ bench-my-llm

New here? Start with the Getting Started Guide.

PyPI · Python 3.10+ · License: MIT · CI

Stop guessing which model is faster. Measure it.

Point bench-my-llm at any OpenAI-compatible API and get latency, throughput, cost, and quality metrics in seconds. Compare models side by side. Get a beautiful terminal report. Ship with confidence.

✨ Features

  • 🔥 TTFT Measurement - Time to first token via streaming
  • ⚡ Tokens per Second - Real throughput numbers
  • 📊 p50 / p95 / p99 Latencies - Production-grade percentiles
  • 💰 Cost Estimation - Know what you're spending
  • 🎯 Quality Scoring - Compare responses against reference answers
  • 🏁 Model Comparison - Side-by-side with winner highlights
  • 📦 Built-in Prompt Suites - Reasoning, coding, creative, factual
  • 🔌 Any OpenAI-compatible API - OpenAI, Anthropic, Ollama, vLLM, Together, and more
  • 💾 Export to JSON - Pipe into CI, dashboards, or your own tools
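For intuition, TTFT and tokens-per-second boil down to timestamping a streamed response. The sketch below is illustrative only: it uses a simulated token stream instead of a real API call, and `measure_ttft_and_tps` / `fake_stream` are hypothetical names, not part of bench-my-llm.

```python
import time

def measure_ttft_and_tps(token_stream):
    """Time-to-first-token (seconds) and throughput for any token iterator."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_stream:
        if first is None:
            first = time.perf_counter()  # first token arrived
        count += 1
    end = time.perf_counter()
    ttft = (first - start) if first is not None else float("nan")
    elapsed = end - start
    tps = count / elapsed if elapsed > 0 else 0.0
    return ttft, tps

def fake_stream(n=20, delay=0.005):
    """Stand-in for a streaming API response: one token every `delay` seconds."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps = measure_ttft_and_tps(fake_stream())
```

The same loop works unchanged over a real streaming response, since it only needs an iterator of tokens.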

🚀 Quick Start

pip install bench-my-llm

Single Model Benchmark

bench-my-llm run --model gpt-4o --suite reasoning
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  🏎️  Benchmark Report                                    β”‚
β”‚  bench-my-llm results for gpt-4o                         β”‚
β”‚  Suite: reasoning | Prompts: 5 | Cost: $0.0043           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

          Latency Summary
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Metric β”‚ TTFT (ms)  β”‚ Total Latency (ms) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ p50    β”‚ 234.1      β”‚ 1,523.4            β”‚
β”‚ p95    β”‚ 312.7      β”‚ 2,187.9            β”‚
β”‚ p99    β”‚ 348.2      β”‚ 2,401.3            β”‚
β”‚ Mean   β”‚ 251.3      β”‚ 1,687.2            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

       Throughput & Quality
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Metric            β”‚ Value       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Mean TPS          β”‚ 67.3 tok/s  β”‚
β”‚ Median TPS        β”‚ 64.8 tok/s  β”‚
β”‚ Quality Score     β”‚ 82%         β”‚
β”‚ Estimated Cost    β”‚ $0.0043     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
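The p50/p95/p99 rows are plain percentiles over the per-prompt latency samples. A minimal nearest-rank version (an illustrative sketch with made-up samples, not necessarily the tool's exact method) looks like:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (p in [0, 100])."""
    ordered = sorted(samples)
    # Map p onto an index into the sorted samples.
    k = round(p / 100 * (len(ordered) - 1))
    return ordered[k]

latencies_ms = [1498.0, 1523.4, 1612.5, 2187.9, 2401.3]  # illustrative samples
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

With only five samples p95 and p99 collapse onto the slowest request, which is why percentile reports get more meaningful as the prompt count grows.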

Model Comparison

bench-my-llm compare gpt-4o gpt-4o-mini --suite reasoning
┌──────────────────────────────────────────────────────────┐
│  🏁 Model Comparison                                     │
│  gpt-4o vs gpt-4o-mini                                   │
└──────────────────────────────────────────────────────────┘

              Head-to-Head
┌────────────────────────┬─────────┬─────────────┐
│ Metric                 │ gpt-4o  │ gpt-4o-mini │
├────────────────────────┼─────────┼─────────────┤
│ TTFT p50 (ms)          │ 234.1   │ 142.3  🏆   │
│ TTFT p95 (ms)          │ 312.7   │ 198.4  🏆   │
│ Total Latency p50 (ms) │ 1523.4  │ 876.2  🏆   │
│ Mean TPS               │ 67.3 🏆 │ 54.1        │
│ Cost (USD)             │ $0.0043 │ $0.0008 🏆  │
│ Quality Score          │ 0.82 🏆 │ 0.71        │
└────────────────────────┴─────────┴─────────────┘

πŸ† Winner: gpt-4o-mini (4/6 metrics)

📖 Usage

Custom Prompts

Pass your own prompts file (JSON array):

[
  {"text": "Explain quantum computing", "category": "factual", "reference": "...", "max_tokens": 256}
]
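If you build prompt sets programmatically, the JSON-array schema above is easy to emit. Only the fields shown in the example are known here; which of them are optional is an assumption, and the second prompt is a made-up illustration.

```python
import json

# Emit a prompts file matching the JSON-array schema shown above.
prompts = [
    {"text": "Explain quantum computing", "category": "factual",
     "reference": "...", "max_tokens": 256},
    {"text": "Summarize the CAP theorem", "category": "factual",
     "reference": "...", "max_tokens": 256},
]
with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```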

Prompt Suites

Suite      Description                        Prompts
reasoning  Logic, math, step-by-step          5
coding     Code generation and explanation    5
creative   Writing, storytelling, metaphors   5
factual    Knowledge recall, definitions      5
all        Everything combined                20

Export Results

bench-my-llm run --model gpt-4o --suite all --output results.json
bench-my-llm report results.json

Local Models (Ollama)

bench-my-llm run --model llama3 --base-url http://localhost:11434/v1 --api-key ollama

CI Integration

Add to your GitHub Actions workflow:

- name: Benchmark LLM
  run: |
    pip install bench-my-llm
    bench-my-llm run --model gpt-4o-mini --suite reasoning --output benchmark.json
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- name: Upload results
  uses: actions/upload-artifact@v4
  with:
    name: benchmark-results
    path: benchmark.json
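A common follow-up is to gate the build on the exported numbers. The field names below (`latency` → `p95_ms`) are an assumed schema for illustration only; check them against the actual structure of your results JSON before wiring this into CI.

```python
import json

BUDGET_P95_MS = 2500.0  # fail the build past this latency budget

def within_budget(results: dict, budget_ms: float = BUDGET_P95_MS) -> bool:
    """True if the run's p95 total latency is within budget.

    Assumes results["latency"]["p95_ms"] exists; adjust to the real schema.
    """
    return results["latency"]["p95_ms"] <= budget_ms

# In CI, after the benchmark step:
# results = json.load(open("benchmark.json"))
# if not within_budget(results):
#     raise SystemExit("p95 latency over budget")
```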

πŸ› οΈ Development

git clone https://github.com/manasvardhan/bench-my-llm.git
cd bench-my-llm
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest

📄 License

MIT. See LICENSE.
