Commit 5c8495a
docs: add benchmark results - Sonnet + agentsys vs raw Opus (#315)
* docs: add benchmark results showing Sonnet + agentsys vs raw Opus

  Real benchmarks on /can-i-help and /onboard against glide-mq repo.
  Sonnet + agentsys: $0.66, 6,084 tokens, specific recommendations.
  Raw Opus: $1.10, 2,841 tokens, generic recommendations.
  Model switch savings: 73-83% with equivalent quality.

* docs: remove GEN markers from README, add benchmarks to website

* fix: update tests for ast-grep to agent-analyzer migration

  Delete 5 test files for removed modules (runner, queries, usage-analyzer).
  Rewrite repo-map-updater tests for remaining checkStaleness() export.
  Rewrite repo-map-installer tests for agent-analyzer binary checks.
  Fix generate-docs test to use surviving GEN markers.
1 parent a4091b0 commit 5c8495a

10 files changed: +202 / -1726 lines

README.md

Lines changed: 38 additions & 4 deletions
```diff
@@ -73,9 +73,46 @@ This came from testing on 1,000+ repositories.
 
 ---
 
+## Benchmarks
+
+Structured prompts and enriched context do more for output quality than model tier. Benchmarked on real tasks (`/can-i-help` and `/onboard` against [glide-mq](https://github.com/avifenesh/glide-mq)), measured with `claude -p --output-format json`.
+
+### Sonnet + AgentSys vs raw Opus
+
+Same task, same repo, same prompt ("I want to improve docs"):
+
+| Configuration | Cost | Output tokens | Result quality |
+|---------------|------|---------------|----------------|
+| Opus, no agentsys | $1.10 | 2,841 | Generic recommendations, no project-specific context |
+| Opus + agentsys | $1.95 | 5,879 | Specific recommendations with effort estimates, convention awareness, breaking change detection |
+| **Sonnet + agentsys** | **$0.66** | **6,084** | **Comparable to Opus + agentsys: specific, actionable, project-aware** |
+
+Sonnet + agentsys produced more output with higher specificity than raw Opus - at 40% lower cost.
+
+### With agentsys, model tier matters less
+
+Once the pipeline provides structured prompts, enriched repo-intel data, and phase-gated workflows, the model does less heavy lifting. The gap between Sonnet and Opus narrows:
+
+| Plugin | Opus | Sonnet | Savings |
+|--------|------|--------|---------|
+| /onboard | $1.10 | $0.30 | 73% |
+| /can-i-help | $1.34 | $0.23 | 83% |
+
+Both models reached the same outcome quality - Sonnet just costs less to get there. The structured pipeline captures most of the gains that would otherwise require a more expensive model.
+
+### What this means
+
+| Scenario | Model cost | Quality |
+|----------|-----------|---------|
+| Without agentsys | Need Opus for good results | Depends on model capability |
+| **With agentsys** | **Sonnet is sufficient** | **Pipeline handles the structure, model handles judgment** |
+
+The investment shifts from model spend to pipeline design. Better prompts, richer context, enforced phases - these compound in ways that model upgrades alone don't.
+
+---
+
 ## Commands
 
-<!-- GEN:START:readme-commands -->
 | Command | What it does |
 |---------|--------------|
 | [`/next-task`](#next-task) | Task workflow: discovery, implementation, PR, merge |
```
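The savings column in the benchmark tables is just the relative cost drop per plugin run. A quick sanity check of those percentages, using the per-run costs from the table:

```javascript
// Recompute the per-plugin savings from the benchmarked costs.
const runs = [
  { plugin: '/onboard', opus: 1.10, sonnet: 0.30 },
  { plugin: '/can-i-help', opus: 1.34, sonnet: 0.23 },
];

for (const { plugin, opus, sonnet } of runs) {
  const savings = Math.round((1 - sonnet / opus) * 100);
  console.log(`${plugin}: ${savings}% cheaper on Sonnet`);
}
// → /onboard: 73% cheaper on Sonnet
// → /can-i-help: 83% cheaper on Sonnet
```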
```diff
@@ -96,15 +133,13 @@ This came from testing on 1,000+ repositories.
 | [`/skillers`](#skillers) | Workflow pattern learning and automation |
 | [`/onboard`](#onboard) | Codebase orientation for newcomers |
 | [`/can-i-help`](#can-i-help) | Match contributor skills to project needs |
-<!-- GEN:END:readme-commands -->
 
 Each command works standalone. Together, they compose into end-to-end pipelines.
 
 ---
 
 ## Skills
 
-<!-- GEN:START:readme-skills -->
 38 skills included across the plugins:
 
 | Category | Skills |
@@ -120,7 +155,6 @@ Each command works standalone. Together, they compose into end-to-end pipelines.
 | **Web** | `web-auth`, `web-browse` |
 | **Release** | `release` |
 | **Analysis** | `drift-analysis`, `repo-intel` |
-<!-- GEN:END:readme-skills -->
 
 **External skill plugins** (standalone repos, installed separately):
 
```

__tests__/generate-docs.test.js

Lines changed: 7 additions & 7 deletions
```diff
@@ -224,24 +224,24 @@ describe('generate-docs', () => {
   });
 
   test('returns stale when docs are tampered with', () => {
-    const readmePath = path.join(REPO_ROOT, 'README.md');
-    const original = fs.readFileSync(readmePath, 'utf8');
+    const agentsPath = path.join(REPO_ROOT, 'AGENTS.md');
+    const original = fs.readFileSync(agentsPath, 'utf8');
 
     try {
       // Tamper with generated section
       const tampered = original.replace(
-        /<!-- GEN:START:readme-skills -->\n[\s\S]*?\n<!-- GEN:END:readme-skills -->/,
-        '<!-- GEN:START:readme-skills -->\ntampered content\n<!-- GEN:END:readme-skills -->'
+        /<!-- GEN:START:claude-architecture -->\n[\s\S]*?\n<!-- GEN:END:claude-architecture -->/,
+        '<!-- GEN:START:claude-architecture -->\ntampered content\n<!-- GEN:END:claude-architecture -->'
       );
-      fs.writeFileSync(readmePath, tampered);
+      fs.writeFileSync(agentsPath, tampered);
       discovery.invalidateCache();
 
       const result = genDocs.checkFreshness();
       expect(result.status).toBe('stale');
-      expect(result.staleFiles).toContain('README.md');
+      expect(result.staleFiles).toContain('AGENTS.md');
     } finally {
       // Restore original
-      fs.writeFileSync(readmePath, original);
+      fs.writeFileSync(agentsPath, original);
     }
   });
 });
```

__tests__/repo-map-batch.test.js

Lines changed: 0 additions & 110 deletions
This file was deleted.
