New here? Start with the Getting Started Guide.
Stop guessing which model is faster. Measure it.
Point bench-my-llm at any OpenAI-compatible API and get latency, throughput, cost, and quality metrics in seconds. Compare models side by side. Get a beautiful terminal report. Ship with confidence.
- **TTFT Measurement** - Time to first token via streaming
- **Tokens per Second** - Real throughput numbers
- **p50 / p95 / p99 Latencies** - Production-grade percentiles
- **Cost Estimation** - Know what you're spending
- **Quality Scoring** - Compare responses against reference answers
- **Model Comparison** - Side-by-side with winner highlights
- **Built-in Prompt Suites** - Reasoning, coding, creative, factual
- **Any OpenAI-compatible API** - OpenAI, Anthropic, Ollama, vLLM, Together, and more
- **Export to JSON** - Pipe into CI, dashboards, or your own tools
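The p50 / p95 / p99 numbers are computed over the per-request latency samples collected during a run. As an illustration of what those percentiles mean (a nearest-rank sketch, not necessarily the exact interpolation bench-my-llm uses):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample >= pct% of all samples."""
    ordered = sorted(samples)
    # ceil(n * pct / 100) as a 1-based rank into the sorted list
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[int(rank) - 1]

# Hypothetical per-request TTFT samples in milliseconds
latencies_ms = [234.1, 251.3, 312.7, 198.6, 348.2, 221.4, 240.0, 265.9]
p50 = percentile(latencies_ms, 50)  # 240.0
p95 = percentile(latencies_ms, 95)  # 348.2
```

The p95/p99 tail matters more than the mean in production: a handful of slow requests dominates user-perceived latency even when the average looks fine.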
Install and run:

```bash
pip install bench-my-llm
bench-my-llm run --model gpt-4o --suite reasoning
```
```
┌──────────────────────────────────────────────────┐
│ Benchmark Report                                 │
│ bench-my-llm results for gpt-4o                  │
│ Suite: reasoning | Prompts: 5 | Cost: $0.0043    │
└──────────────────────────────────────────────────┘

Latency Summary
┌────────┬───────────┬────────────────────┐
│ Metric │ TTFT (ms) │ Total Latency (ms) │
├────────┼───────────┼────────────────────┤
│ p50    │ 234.1     │ 1,523.4            │
│ p95    │ 312.7     │ 2,187.9            │
│ p99    │ 348.2     │ 2,401.3            │
│ Mean   │ 251.3     │ 1,687.2            │
└────────┴───────────┴────────────────────┘

Throughput & Quality
┌────────────────┬────────────┐
│ Metric         │ Value      │
├────────────────┼────────────┤
│ Mean TPS       │ 67.3 tok/s │
│ Median TPS     │ 64.8 tok/s │
│ Quality Score  │ 82%        │
│ Estimated Cost │ $0.0043    │
└────────────────┴────────────┘
```
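The quality score grades each response against the prompt's reference answer. One simple way to do that is token overlap, sketched below as an assumed heuristic (the function name and scoring rule are this sketch's own; bench-my-llm's actual scorer may differ):

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def overlap_score(response: str, reference: str) -> float:
    """Fraction of reference tokens that appear in the response (0.0 to 1.0)."""
    ref = _tokens(reference)
    if not ref:
        return 0.0
    return len(ref & _tokens(response)) / len(ref)

score = overlap_score(
    "Quantum computers use qubits and superposition.",
    "qubits superposition entanglement",
)  # 2 of 3 reference tokens found
```

Reference-based overlap is crude but cheap and deterministic, which makes scores comparable across runs and models.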
```bash
bench-my-llm compare gpt-4o gpt-4o-mini --suite reasoning
```

```
┌──────────────────────────────────────────────────┐
│ Model Comparison                                 │
│ gpt-4o vs gpt-4o-mini                            │
└──────────────────────────────────────────────────┘

Head-to-Head
┌────────────────────────┬──────────┬─────────────┐
│ Metric                 │ gpt-4o   │ gpt-4o-mini │
├────────────────────────┼──────────┼─────────────┤
│ TTFT p50 (ms)          │ 234.1    │ 142.3 🏆    │
│ TTFT p95 (ms)          │ 312.7    │ 198.4 🏆    │
│ Total Latency p50 (ms) │ 1523.4   │ 876.2 🏆    │
│ Mean TPS               │ 67.3 🏆  │ 54.1        │
│ Cost (USD)             │ $0.0043  │ $0.0008 🏆  │
│ Quality Score          │ 0.82 🏆  │ 0.71        │
└────────────────────────┴──────────┴─────────────┘

🏆 Winner: gpt-4o-mini (4/6 metrics)
```
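The winner line is a per-metric tally: a model scores a point for each metric it leads, where lower is better for latency and cost and higher is better for throughput and quality. A sketch of that logic with the numbers from the table above (an illustration of the idea, not the tool's exact code):

```python
# metric -> (value_a, value_b, True if higher is better)
metrics = {
    "ttft_p50_ms":    (234.1,  142.3,  False),
    "ttft_p95_ms":    (312.7,  198.4,  False),
    "latency_p50_ms": (1523.4, 876.2,  False),
    "mean_tps":       (67.3,   54.1,   True),
    "cost_usd":       (0.0043, 0.0008, False),
    "quality":        (0.82,   0.71,   True),
}

def tally(metrics):
    """Count metric wins for each model; ties score for neither."""
    wins_a = sum(1 for a, b, hi in metrics.values() if a != b and (a > b) == hi)
    wins_b = sum(1 for a, b, hi in metrics.values() if a != b and (b > a) == hi)
    return wins_a, wins_b

tally(metrics)  # gpt-4o wins 2, gpt-4o-mini wins 4
```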
Pass your own prompts file (JSON array):
```json
[
  {"text": "Explain quantum computing", "category": "factual", "reference": "...", "max_tokens": 256}
]
```

| Suite | Description | Prompts |
|---|---|---|
| `reasoning` | Logic, math, step-by-step | 5 |
| `coding` | Code generation and explanation | 5 |
| `creative` | Writing, storytelling, metaphors | 5 |
| `factual` | Knowledge recall, definitions | 5 |
| `all` | Everything combined | 20 |
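A custom prompts file can be sanity-checked before burning API credits on a run. A minimal sketch, using the field names from the JSON example above (the validation rules themselves are this sketch's assumption, not bench-my-llm's schema):

```python
import json

REQUIRED = {"text"}
OPTIONAL = {"category", "reference", "max_tokens"}

def load_prompts(raw: str) -> list[dict]:
    """Parse a prompts file and reject entries with missing or unknown fields."""
    prompts = json.loads(raw)
    if not isinstance(prompts, list):
        raise ValueError("prompts file must be a JSON array")
    for i, p in enumerate(prompts):
        missing = REQUIRED - p.keys()
        unknown = p.keys() - REQUIRED - OPTIONAL
        if missing or unknown:
            raise ValueError(f"prompt {i}: missing={missing or None}, unknown={unknown or None}")
    return prompts

prompts = load_prompts('[{"text": "Explain quantum computing", "category": "factual"}]')
```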
Export results and render a report later:

```bash
bench-my-llm run --model gpt-4o --suite all --output results.json
bench-my-llm report results.json
```

Point it at a local server (here, Ollama's OpenAI-compatible endpoint):

```bash
bench-my-llm run --model llama3 --base-url http://localhost:11434/v1 --api-key ollama
```

Add to your GitHub Actions workflow:
```yaml
- name: Benchmark LLM
  run: |
    pip install bench-my-llm
    bench-my-llm run --model gpt-4o-mini --suite reasoning --output benchmark.json
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- name: Upload results
  uses: actions/upload-artifact@v4
  with:
    name: benchmark-results
    path: benchmark.json
```

To work on bench-my-llm itself:

```bash
git clone https://github.com/manasvardhan/bench-my-llm.git
cd bench-my-llm
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
```

MIT. See LICENSE.