Repowise is the codebase intelligence layer for AI coding agents. It indexes repositories into five intelligence layers — dependency graphs, git analytics, auto-generated docs, architectural decisions, and code health scores — and exposes them through nine MCP tools. The result: fewer tool calls, fewer file reads, lower LLM costs, and health scores that predict real-world defects.
This repo proves those claims with reproducible benchmarks on public codebases.
| Benchmark | Status | Headline | Report |
|---|---|---|---|
| SWE-QA | Complete | 49-70% fewer tool calls, 29-36% lower cost, comparable judge scores | flask48 · sklearn48 |
| health-defect | Complete | 10-75x defect ratio, ROC AUC 0.70-0.74 | README · full report |
A paired benchmark comparing two coding-agent configurations on
SWE-QA tasks drawn from
pallets/flask and
scikit-learn/scikit-learn.
What is compared:
| Configuration | Tools available to the agent |
|---|---|
| C0_bare | Read, Grep, Glob, Bash, Agent (built-in coding-agent toolkit) |
| C2_full | All of the above plus four MCP tools (get_answer, get_symbol, get_context, search_codebase) backed by a precomputed documentation index of the repository |
Both configurations use the same model (claude-sonnet-4-6), the same SWE-QA
prompt scaffolding, the same per-task budget cap, and the same LLM judge. The
only variable is the tool surface presented to the agent.
| Metric | C0 (baseline) | C2 (doc-augmented) | Δ |
|---|---|---|---|
| Cost / task (mean) | $0.1396 | $0.0890 | -36.2 % |
| Wall / task (mean) | 41.7 s | 33.9 s | -18.6 % |
| Tool calls (mean) | 7.4 | 3.8 | -49.2 % |
| Files read (mean) | 1.9 | 0.2 | -89.0 % |
| Score (0-10, mean) | 8.82 | 8.81 | tied |
32 / 48 (67 %) tasks are cheaper under C2; quality is at parity.
Full report: BENCHMARK_REPORT_FLASK48.md
| Metric | C0 (baseline) | C2 (doc-augmented) | Δ |
|---|---|---|---|
| Cost / task (mean) | $0.1180 | $0.0834 | -29.3 % |
| Wall / task (mean) | 39.7 s | 28.6 s | -27.9 % |
| Tool calls (mean) | 8.1 | 2.4 | -70.5 % |
| Files read (mean) | 1.8 | 0.6 | -69.3 % |
| Score (0-10, mean) | 8.72 | 8.23 | -0.49 (similar on this sample) |
33 / 48 (69 %) tasks are cheaper under C2; 28 / 48 (58 %) are faster.
Full report: BENCHMARK_REPORT_SKLEARN48.md
How many tokens does each strategy require for a model to understand a commit,
measured on the 30 most recent non-merge commits of pallets/flask?
| Strategy | Tokens / commit |
|---|---|
| naive (full contents of changed files) | 64,039 |
| git diff only | 14,888 |
| get_context | 2,391 |
Reduction vs naive: 209x mean, 26.8x pooled, 12.6x median, 1,214x best case.
Reduction vs git diff: 41.7x mean, 6.2x pooled.
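The mean, pooled, and median reductions weight commits differently, which is why they can sit far apart. The sketch below shows the arithmetic on made-up token counts; `reduction_stats` and the sample lists are hypothetical names for illustration, not part of the harness.

```python
from statistics import mean, median


def reduction_stats(baseline_tokens, context_tokens):
    """Summarise per-commit token reductions (all counts assumed > 0)."""
    ratios = [b / c for b, c in zip(baseline_tokens, context_tokens)]
    return {
        "mean": mean(ratios),  # average of per-commit ratios
        "pooled": sum(baseline_tokens) / sum(context_tokens),  # ratio of totals
        "median": median(ratios),
        "best": max(ratios),
    }


# Illustrative token counts only; these are not rows from results.csv.
naive = [1_000, 4_000, 90_000]
context = [500, 1_000, 300]
stats = reduction_stats(naive, context)
print(stats)
```

A single commit that touches a large file but needs little context dominates the mean, while the pooled figure is simply the ratio of the two table rows (64,039 / 2,391 ≈ 26.8).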
Reproduce:

    .venv/bin/python harness/token_efficiency_bench.py \
        --repo repos/pallets/flask --last 30 --min-repowise-tokens 0

Raw data: results/token_efficiency/results.csv.
A reproducible benchmark proving that deterministic code health scores predict real-world defects in open-source Python projects. Health scores are collected at a historical snapshot (T0); bug-fixing commits are counted over the following 6 months (T0 -> T1); the two are correlated.
Across three public repositories (862 source files, 6-month defect window):
| Repo | Files | Spearman ρ | p-value | Defect ratio | ROC AUC | Precision@20 |
|---|---|---|---|---|---|---|
| Django | 542 | -0.337 | <0.0001 | 12x | 0.698 | 70 % |
| Pydantic | 216 | -0.229 | 0.0007 | 10x | 0.742 | 30 % |
| FastAPI | 104 | -0.272 | 0.0053 | 75x | 0.715 | 35 % |
Files scoring below 4.0 have 10-75x more bug-fixing commits than files scoring above 8.0. The correlation is statistically significant (p < 0.01) across all three codebases.
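For readers who want to see the two headline statistics spelled out, here is a minimal stdlib-only sketch. It assumes the defect ratio is the mean number of bug-fixing commits per file in the low-score group divided by the same mean in the high-score group; that definition, the function names, and the sample data are illustrative assumptions, and the benchmark's own implementation may differ.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def defect_ratio(scores, bug_fixes, low=4.0, high=8.0):
    """Assumed definition: mean bug fixes below `low` over mean above `high`."""
    low_group = [b for s, b in zip(scores, bug_fixes) if s < low]
    high_group = [b for s, b in zip(scores, bug_fixes) if s > high]
    return mean(low_group) / mean(high_group)  # fails if high group has no fixes


# Made-up per-file data, not benchmark output.
scores = [2.0, 3.5, 5.0, 6.5, 8.5, 9.0, 9.5]
bug_fixes = [9, 6, 4, 3, 1, 0, 2]
rho = spearman_rho(scores, bug_fixes)
ratio = defect_ratio(scores, bug_fixes)
print(round(rho, 3), ratio)
```

A negative ρ, as in the table above, means that files with higher health scores tend to receive fewer bug-fixing commits.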
Top biomarker predictors (by Cliff's delta effect size):
- `developer_congestion`: δ = +0.78 (Django)
- `untested_hotspot`: δ = +0.69 (Django), +0.67 (FastAPI)
- `brain_method`: δ = +0.62 (Pydantic), +0.43 (Django)
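Cliff's delta compares two groups by counting, over all cross-group pairs, how often a value from one group exceeds a value from the other. The sketch below assumes the two groups are bug-fix counts for files with and without a given biomarker, so a positive δ would mean flagged files tend to have more fixes; that grouping and the sample data are assumptions for illustration.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical bug-fix counts per file, not benchmark output.
flagged = [5, 7, 3, 6]        # files showing the biomarker
unflagged = [1, 2, 4, 0, 3]   # files without it
delta = cliffs_delta(flagged, unflagged)
print(delta)
```

The statistic ranges from -1 to +1, with 0 meaning neither group tends to exceed the other.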
Full report: health-defect/BENCHMARK_REPORT.md

Reproduction steps: health-defect/README.md
repowise-bench/
├── README.md — this file (index of all benchmarks)
├── requirements.txt — shared Python dependencies
│
├── harness/ — shared runner infrastructure (SWE-QA)
│ ├── run_experiment.py — entry point: orchestrates a paired run
│ ├── swe_qa_runner.py — per-task runner + LLM-as-judge
│ ├── metrics.py — RunMetrics, stream parser, BudgetTracker
│ └── token_efficiency_bench.py — token-efficiency mini-benchmark
│
├── configs/ — benchmark configuration files (SWE-QA)
│ └── swe_qa_flask48.yaml — canonical SWE-QA / Flask configuration
│
├── data/ — static benchmark datasets
│ └── swe_qa/tasks.json — full SWE-QA task corpus
│
├── analysis/ — aggregation scripts (SWE-QA)
│ └── aggregate_flask48.py
│
├── scripts/ — shared utility scripts
│ └── download_benchmarks.py — fetches SWE-QA dataset and clones repos
│
├── results/ — all benchmark outputs (gitignored except baselines)
│ ├── swe_qa_flask48/ — SWE-QA Flask results
│ ├── swe_qa_sklearn48/ — SWE-QA scikit-learn results
│ ├── token_efficiency/ — token-efficiency results
│ └── health_defect_{repo}/ — one directory per health-defect repo
│ ├── correlation.json
│ ├── defect_counts.json
│ ├── joined_data.json
│ ├── health_scores.json
│ └── charts/
│
├── BENCHMARK_REPORT_FLASK48.md — SWE-QA full report: Flask
├── BENCHMARK_REPORT_SKLEARN48.md — SWE-QA full report: scikit-learn
│
├── health-defect/ — self-contained health-defect benchmark
│ ├── README.md — benchmark overview and reproduction steps
│ ├── BENCHMARK_REPORT.md — full statistical report
│ ├── config.yaml — per-repo configuration
│ ├── run_benchmark.py — entry point
│ └── lib/ — benchmark library modules
│
├── mcp_configs/ — generated MCP server configs (gitignored)
├── indexes/ — generated documentation indexes (gitignored)
├── repos/ — cloned target repositories (gitignored)
└── logs/ — per-run logs (gitignored)
Each benchmark gets its own directory. Convention:
- Create a directory at `repowise-bench/<benchmark-name>/`
- Add a `README.md` with methodology, headline numbers, and reproduction steps
- Add a `run_benchmark.py` (or equivalent entry point) runnable from within the directory
- Write results to `../results/<benchmark_name>_{variant}/` so outputs land in the shared `results/` tree
- Update this README: add a row to the Benchmarks table
Shared repos and indexes can be reused from ../repos/ and ../indexes/. New Python dependencies go in the top-level requirements.txt.
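As a rough starting point, the output-path part of the convention could look like the sketch below. The helper names, the `summary.json` file name, and the summary payload are hypothetical; only the `results/<benchmark_name>_{variant}/` layout comes from the convention above.

```python
import json
from pathlib import Path


def results_dir(benchmark_name: str, variant: str, root: Path) -> Path:
    """Return <root>/results/<benchmark_name>_<variant>/, creating it if needed."""
    out = root / "results" / f"{benchmark_name}_{variant}"
    out.mkdir(parents=True, exist_ok=True)
    return out


def write_summary(out_dir: Path, summary: dict) -> Path:
    """Write a small JSON summary; the file name is a placeholder."""
    path = out_dir / "summary.json"
    path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return path
```

Inside a real `run_benchmark.py`, `root` would typically be `Path(__file__).resolve().parent.parent`, since the script lives one level below the shared `results/` tree.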
Every task is run under both conditions, and every metric is computed per-task before being aggregated. We never compare a C0 mean against a C2 mean drawn from a different subset of tasks. If a task fails to complete under one condition, it is re-run under both conditions and the new pair replaces the old one in full.
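The pairing rule can be sketched as follows, using the `task_id`, `condition`, and `estimated_cost_usd` fields from the output schema. The function names and sample rows are illustrative assumptions, and this sketch simply drops unpaired tasks where the real pipeline re-runs them.

```python
from statistics import mean


def paired_values(rows, metric):
    """Return {task_id: (c0_value, c2_value)} for tasks seen in both conditions."""
    by_task = {}
    for row in rows:
        by_task.setdefault(row["task_id"], {})[row["condition"]] = row[metric]
    return {
        task: (vals["C0_bare"], vals["C2_full"])
        for task, vals in by_task.items()
        if "C0_bare" in vals and "C2_full" in vals
    }


def summarise(pairs):
    """Aggregate over the paired tasks only."""
    c0 = mean(a for a, _ in pairs.values())
    c2 = mean(b for _, b in pairs.values())
    return {
        "c0_mean": c0,
        "c2_mean": c2,
        "delta_pct": 100 * (c2 - c0) / c0,  # relative change of the means
        "c2_wins": sum(1 for a, b in pairs.values() if b < a),
        "n": len(pairs),
    }


# Made-up rows; t3 has no C2 run and is excluded from both means.
rows = [
    {"task_id": "t1", "condition": "C0_bare", "estimated_cost_usd": 0.20},
    {"task_id": "t1", "condition": "C2_full", "estimated_cost_usd": 0.10},
    {"task_id": "t2", "condition": "C0_bare", "estimated_cost_usd": 0.10},
    {"task_id": "t2", "condition": "C2_full", "estimated_cost_usd": 0.14},
    {"task_id": "t3", "condition": "C0_bare", "estimated_cost_usd": 0.30},
]
summary = summarise(paired_values(rows, "estimated_cost_usd"))
print(summary)
```

Because both means are taken over the same task set, a difference between them cannot come from one arm having been given easier tasks.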
Cost is read directly from each task's estimated_cost_usd field, populated
from the agent runtime's per-model billing roll-up. This sums cost across
every model invoked — both the parent session and any subagents dispatched
via the Agent tool. Token-based recomputation is intentionally avoided
because it can miss subagent spend not surfaced in the parent stream's
usage blocks.
Each (task, configuration) pair is scored by an LLM judge using a fixed five-dimension rubric (correctness, completeness, relevance, clarity, reasoning) on a 0-10 scale. The judge does not see the configuration label and is the same model in both arms.
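A minimal sketch of how the rubric might be handled in code is shown below. It assumes the per-task score is the unweighted mean of the five dimensions and that blinding amounts to passing the judge only the task identifier and answer; both are assumptions, as are the function names and the sample row.

```python
from statistics import mean

RUBRIC = ("correctness", "completeness", "relevance", "clarity", "reasoning")


def blind_for_judge(row):
    """Keep only the fields the judge needs; the condition label is dropped."""
    return {"task_id": row["task_id"], "answer": row["answer"]}


def overall_score(judge_scores):
    """Unweighted mean over the five rubric dimensions (an assumption)."""
    for dim in RUBRIC:
        if dim not in judge_scores:
            raise ValueError(f"missing rubric dimension: {dim}")
        if not 0 <= judge_scores[dim] <= 10:
            raise ValueError(f"{dim} score out of range: {judge_scores[dim]}")
    return mean(judge_scores[dim] for dim in RUBRIC)


# Made-up row for illustration.
row = {
    "task_id": "flask_017",
    "condition": "C2_full",
    "answer": "Example answer text.",
    "judge_scores": {
        "correctness": 9.0,
        "completeness": 8.0,
        "relevance": 10.0,
        "clarity": 9.0,
        "reasoning": 9.0,
    },
}
payload = blind_for_judge(row)
score = overall_score(row["judge_scores"])
print(payload, score)
```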
Runs are deterministic up to LLM nondeterminism. Model versions, prompt templates, and the SWE-QA task corpus are pinned in this repository. The only external dependencies are the repository checkouts (pinned by commit hash in the documentation index metadata) and the Anthropic API.
The full pipeline takes about 30 minutes of wall-clock time per arm and costs approximately $5-10 per arm at list prices, depending on retry behavior.
- Python 3.11+
- Claude Code CLI (`claude`) installed and authenticated (OAuth or `ANTHROPIC_API_KEY`)
- repowise CLI installed and discoverable on `$PATH`, or a local checkout of repowise in a sibling directory
- ~5 GB free disk space for the checkout, index, and run logs
    pip install -r requirements.txt
    python scripts/download_benchmarks.py --benchmark swe_qa
    repowise init repos/pallets/flask --output-dir indexes
    PYTHONIOENCODING=utf-8 python harness/run_experiment.py \
        --config configs/swe_qa_flask48.yaml

Results are written incrementally to results/swe_qa_flask48/swe_qa.jsonl; the run is safe to interrupt and resume.

    python analysis/aggregate_flask48.py

For health-defect reproduction steps, see health-defect/README.md.
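One plausible way an append-only JSONL file supports interrupt-and-resume for the SWE-QA run is to skip every (task, condition) pair already on disk. The sketch below illustrates that idea only; the function names are hypothetical and the harness may implement resumption differently.

```python
import json


def completed_pairs(jsonl_lines):
    """Collect (task_id, condition) pairs already present in a results file."""
    done = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated last line from an interrupted write
        done.add((row["task_id"], row["condition"]))
    return done


def pending_pairs(task_ids, conditions, done):
    """List the (task_id, condition) pairs that still need to run."""
    return [(t, c) for t in task_ids for c in conditions if (t, c) not in done]


# Simulated file contents: two finished rows and one truncated row.
lines = [
    '{"task_id": "flask_001", "condition": "C0_bare"}',
    '{"task_id": "flask_001", "condition": "C2_full"}',
    '{"task_id": "flask_002", "condition": "C0_ba',
]
done = completed_pairs(lines)
todo = pending_pairs(["flask_001", "flask_002"], ["C0_bare", "C2_full"], done)
print(todo)
```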
Each row of results/swe_qa_flask48/swe_qa.jsonl contains:
| Field | Type | Description |
|---|---|---|
| task_id | string | Unique task identifier (e.g. flask_017) |
| benchmark | string | Always swe_qa |
| condition | string | C0_bare or C2_full |
| repo | string | Source repository (e.g. pallets/flask) |
| question_type | string | SWE-QA question category (What / Where / How / Why) |
| answer | string | The agent's final answer |
| judge_scores | dict[str,float] | Judge dimension scores in [0, 10] |
| estimated_cost_usd | float | Total dollar cost across all models invoked |
| wall_clock_seconds | float | End-to-end wall-clock duration |
| num_tool_calls | int | Total tool invocations made by the agent |
| files_explored | list[str] | Distinct file paths opened via Read |
For the health-defect output schema, see health-defect/README.md.
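When loading swe_qa.jsonl for your own analysis, a typed record makes the SWE-QA row schema explicit. The sketch below mirrors the field table; the `SweQaRow` class, the `parse_row` helper, and the example values are illustrative and not part of the harness.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SweQaRow:
    task_id: str
    benchmark: str
    condition: str
    repo: str
    question_type: str
    answer: str
    judge_scores: dict[str, float]
    estimated_cost_usd: float
    wall_clock_seconds: float
    num_tool_calls: int
    files_explored: list[str]


def parse_row(line: str) -> SweQaRow:
    """Parse one JSONL line, rejecting rows with missing fields or bad conditions."""
    raw = json.loads(line)
    names = [f.name for f in fields(SweQaRow)]
    missing = [name for name in names if name not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    row = SweQaRow(**{name: raw[name] for name in names})  # extra keys are ignored
    if row.condition not in ("C0_bare", "C2_full"):
        raise ValueError(f"unknown condition: {row.condition}")
    return row


# Example line with made-up values.
example = json.dumps({
    "task_id": "flask_017",
    "benchmark": "swe_qa",
    "condition": "C2_full",
    "repo": "pallets/flask",
    "question_type": "Where",
    "answer": "Example answer text.",
    "judge_scores": {"correctness": 9.0},
    "estimated_cost_usd": 0.08,
    "wall_clock_seconds": 30.5,
    "num_tool_calls": 3,
    "files_explored": [],
})
row = parse_row(example)
print(row.task_id, row.condition)
```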
If you use these benchmarks or their results, please cite the relevant report:
Repowise on SWE-QA: A Benchmark Study of Documentation-Augmented Code
Question Answering on Flask. 2026.
Repowise health-defect Benchmark: Code Health Scores as Defect Predictors
Across Django, FastAPI, and Pydantic. 2026.
This benchmark harness is released under the Apache 2.0 license. The repository checkouts used as targets are owned by their respective projects and licensed separately. The SWE-QA task corpus is the property of its original authors.