Commit 5c8495a
docs: add benchmark results - Sonnet + agentsys vs raw Opus (#315)
* docs: add benchmark results showing Sonnet + agentsys vs raw Opus

  Real benchmarks on /can-i-help and /onboard against glide-mq repo.
  Sonnet + agentsys: $0.66, 6,084 tokens, specific recommendations.
  Raw Opus: $1.10, 2,841 tokens, generic recommendations.
  Model switch savings: 73-83% with equivalent quality.

* docs: remove GEN markers from README, add benchmarks to website

* fix: update tests for ast-grep to agent-analyzer migration

  Delete 5 test files for removed modules (runner, queries, usage-analyzer).
  Rewrite repo-map-updater tests for remaining checkStaleness() export.
  Rewrite repo-map-installer tests for agent-analyzer binary checks.
  Fix generate-docs test to use surviving GEN markers.
1 parent a4091b0 commit 5c8495a

10 files changed: +202 / -1726 lines

README.md

Lines changed: 38 additions & 4 deletions
```diff
@@ -73,9 +73,46 @@ This came from testing on 1,000+ repositories.
 
 ---
 
+## Benchmarks
+
+Structured prompts and enriched context do more for output quality than model tier. Benchmarked on real tasks (`/can-i-help` and `/onboard` against [glide-mq](https://github.com/avifenesh/glide-mq)), measured with `claude -p --output-format json`.
+
+### Sonnet + AgentSys vs raw Opus
+
+Same task, same repo, same prompt ("I want to improve docs"):
+
+| Configuration | Cost | Output tokens | Result quality |
+|---------------|------|---------------|----------------|
+| Opus, no agentsys | $1.10 | 2,841 | Generic recommendations, no project-specific context |
+| Opus + agentsys | $1.95 | 5,879 | Specific recommendations with effort estimates, convention awareness, breaking change detection |
+| **Sonnet + agentsys** | **$0.66** | **6,084** | **Comparable to Opus + agentsys: specific, actionable, project-aware** |
+
+Sonnet + agentsys produced more output with higher specificity than raw Opus - at 40% lower cost.
+
+### With agentsys, model tier matters less
+
+Once the pipeline provides structured prompts, enriched repo-intel data, and phase-gated workflows, the model does less heavy lifting. The gap between Sonnet and Opus narrows:
+
+| Plugin | Opus | Sonnet | Savings |
+|--------|------|--------|---------|
+| /onboard | $1.10 | $0.30 | 73% |
+| /can-i-help | $1.34 | $0.23 | 83% |
+
+Both models reached the same outcome quality - Sonnet just costs less to get there. The structured pipeline captures most of the gains that would otherwise require a more expensive model.
+
+### What this means
+
+| Scenario | Model cost | Quality |
+|----------|-----------|---------|
+| Without agentsys | Need Opus for good results | Depends on model capability |
+| **With agentsys** | **Sonnet is sufficient** | **Pipeline handles the structure, model handles judgment** |
+
+The investment shifts from model spend to pipeline design. Better prompts, richer context, enforced phases - these compound in ways that model upgrades alone don't.
+
+---
+
 ## Commands
 
-<!-- GEN:START:readme-commands -->
 | Command | What it does |
 |---------|--------------|
 | [`/next-task`](#next-task) | Task workflow: discovery, implementation, PR, merge |
```
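The savings column in the benchmark tables is just the relative cost drop per plugin run. A quick sanity check of those percentages, using the per-run costs from the table:

```javascript
// Recompute the per-plugin savings from the benchmarked costs.
const runs = [
  { plugin: '/onboard', opus: 1.10, sonnet: 0.30 },
  { plugin: '/can-i-help', opus: 1.34, sonnet: 0.23 },
];

for (const { plugin, opus, sonnet } of runs) {
  const savings = Math.round((1 - sonnet / opus) * 100);
  console.log(`${plugin}: ${savings}% cheaper on Sonnet`);
}
// → /onboard: 73% cheaper on Sonnet
// → /can-i-help: 83% cheaper on Sonnet
```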
```diff
@@ -96,15 +133,13 @@ This came from testing on 1,000+ repositories.
 | [`/skillers`](#skillers) | Workflow pattern learning and automation |
 | [`/onboard`](#onboard) | Codebase orientation for newcomers |
 | [`/can-i-help`](#can-i-help) | Match contributor skills to project needs |
-<!-- GEN:END:readme-commands -->
 
 Each command works standalone. Together, they compose into end-to-end pipelines.
 
 ---
 
 ## Skills
 
-<!-- GEN:START:readme-skills -->
 38 skills included across the plugins:
 
 | Category | Skills |
@@ -120,7 +155,6 @@ Each command works standalone. Together, they compose into end-to-end pipelines.
 | **Web** | `web-auth`, `web-browse` |
 | **Release** | `release` |
 | **Analysis** | `drift-analysis`, `repo-intel` |
-<!-- GEN:END:readme-skills -->
 
 **External skill plugins** (standalone repos, installed separately):
 
```

__tests__/generate-docs.test.js

Lines changed: 7 additions & 7 deletions
```diff
@@ -224,24 +224,24 @@ describe('generate-docs', () => {
   });
 
   test('returns stale when docs are tampered with', () => {
-    const readmePath = path.join(REPO_ROOT, 'README.md');
-    const original = fs.readFileSync(readmePath, 'utf8');
+    const agentsPath = path.join(REPO_ROOT, 'AGENTS.md');
+    const original = fs.readFileSync(agentsPath, 'utf8');
 
     try {
       // Tamper with generated section
       const tampered = original.replace(
-        /<!-- GEN:START:readme-skills -->\n[\s\S]*?\n<!-- GEN:END:readme-skills -->/,
-        '<!-- GEN:START:readme-skills -->\ntampered content\n<!-- GEN:END:readme-skills -->'
+        /<!-- GEN:START:claude-architecture -->\n[\s\S]*?\n<!-- GEN:END:claude-architecture -->/,
+        '<!-- GEN:START:claude-architecture -->\ntampered content\n<!-- GEN:END:claude-architecture -->'
       );
-      fs.writeFileSync(readmePath, tampered);
+      fs.writeFileSync(agentsPath, tampered);
       discovery.invalidateCache();
 
       const result = genDocs.checkFreshness();
       expect(result.status).toBe('stale');
-      expect(result.staleFiles).toContain('README.md');
+      expect(result.staleFiles).toContain('AGENTS.md');
     } finally {
       // Restore original
-      fs.writeFileSync(readmePath, original);
+      fs.writeFileSync(agentsPath, original);
     }
   });
 });
```

__tests__/repo-map-batch.test.js

Lines changed: 0 additions & 110 deletions
This file was deleted.
