---
slug: ai-agent-performance
title: Why Your AI Agent Gets Dumber with Large Specs (And How to Fix It)
authors: [marvin]
tags: [ai, context-engineering, best-practices]
---

Your spec fits in the context window. So why does your AI agent make mistakes, ignore instructions, and produce worse code?

You paste a detailed 2,000-line architecture document into Cursor. The context window can handle it: 200K tokens, plenty of room. But something's off. The AI suggests an approach you explicitly ruled out on page 3. It asks questions you already answered. The code it generates contradicts the design decisions you documented.

**The problem isn't context size. It's context quality.**

{/* truncate */}

## The Real Problem: Performance Degradation

Modern AI models have massive context windows: Claude has 200K tokens, GPT has 128K, and newer models are pushing toward 1M+. But here's what the marketing doesn't tell you: **AI performance degrades significantly as context grows**, even when you're nowhere near the limit.

The research is clear:

**Databricks found** that Llama 3.1 405B shows quality degradation starting around 32K tokens, far below its theoretical limit. Smaller models degrade even earlier.

**Berkeley's Function-Calling Leaderboard** confirmed that ALL models perform worse when given more tools or options to choose from. More context = more confusion = lower accuracy.

**Microsoft and Salesforce research** showed a 39% performance drop when models need to gather information across multiple context turns or conflicting sources.

### Why This Happens

It comes down to fundamental constraints:

1. **Attention dilution** - Transformer self-attention scales quadratically (O(N²)) with sequence length. More tokens make it harder to focus on what matters.

2. **Context rot** - With a large context, models start ignoring their training and just repeat patterns from the context history. They become less intelligent, not more.

3. **Option overload** - Too many choices (tools, patterns, approaches) lead to wrong selections. This isn't unique to AI; it's a cognitive constraint.

4. **Token economics** - Every extra token costs money and time. A 2,000-line spec costs roughly 6-7x more to process than a 300-line spec.

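The token-economics arithmetic is easy to sanity-check. A minimal Python sketch, assuming roughly 10 tokens per line and an illustrative price of $3 per million input tokens (both numbers are assumptions for illustration, not measured values):

```python
TOKENS_PER_LINE = 10    # rough assumption for prose-heavy specs
PRICE_PER_MTOK = 3.00   # illustrative USD per 1M input tokens

def spec_cost(lines: int, requests: int = 100) -> float:
    """Input-token cost of attaching a spec to `requests` AI calls."""
    tokens = lines * TOKENS_PER_LINE
    return tokens * requests * PRICE_PER_MTOK / 1_000_000

large = spec_cost(2_000)  # the full architecture doc
small = spec_cost(300)    # a focused partition
print(f"large: ${large:.2f}, small: ${small:.2f}, ratio: {large / small:.1f}x")
# → large: $6.00, small: $0.90, ratio: 6.7x
```

Whatever the real per-token price, the ratio is what matters: cost scales linearly with spec length, so the 2,000-line spec costs about 6.7x as much per request as the 300-line one.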
### What This Means For You

When you're using AI coding assistants:

- **Cursor, Copilot, Claude** start making basic mistakes they wouldn't make with smaller context
- **Code generation** becomes less accurate and more likely to contradict your requirements
- **Responses slow down** as the model processes more irrelevant information
- **Costs scale up** linearly with context size
- **You spend more time** fixing AI mistakes than you save from AI assistance

The irony: you write detailed specs to help the AI, but the detail makes the AI worse.

## The Solution: Context Engineering

Context engineering is the practice of managing AI working memory to maximize effectiveness. It's not about squeezing into context limits; it's about **maintaining AI performance** at any scale.

Here are four strategies that actually work, backed by research and real-world usage:

### 1. Partitioning - Split and Load Selectively

**What it is**: Break content into focused chunks, and load only what's needed for the current task.

**Example**:
```
# Instead of one 1,200-line spec:
specs/dashboard/README.md          (200 lines - overview)
specs/dashboard/DESIGN.md          (350 lines - architecture)
specs/dashboard/IMPLEMENTATION.md  (150 lines - plan)
specs/dashboard/TESTING.md         (180 lines - tests)

# AI loads only what it needs
# Working on architecture? Read DESIGN.md only
# Writing tests? Read TESTING.md only
```

**The benefit**: AI processes 200-350 lines instead of 1,200. Faster, more focused, fewer mistakes.

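The "load only what's needed" step can be sketched as a small helper. This illustrates the pattern only; it is not LeanSpec's actual implementation, and the task names and file mapping are assumptions mirroring the layout above:

```python
from pathlib import Path

# Hypothetical task -> partition mapping, mirroring the split above.
PARTITIONS = {
    "overview": "README.md",
    "architecture": "DESIGN.md",
    "implementation": "IMPLEMENTATION.md",
    "testing": "TESTING.md",
}

def load_context(spec_dir: str, task: str) -> str:
    """Return only the spec partition the current task needs."""
    filename = PARTITIONS.get(task)
    if filename is None:
        raise ValueError(f"unknown task: {task!r}")
    return (Path(spec_dir) / filename).read_text()

# Working on architecture? Only DESIGN.md (~350 lines) enters the
# context window, not the full 1,200-line spec.
```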
### 2. Compaction - Remove Redundancy

**What it is**: Eliminate duplicate or inferable content.

**Before**:
```markdown
## Authentication
The authentication system uses JWT tokens. JWT tokens are
industry-standard and provide stateless authentication. The
benefit of JWT tokens is that they don't require server-side
session storage...

## Implementation
We'll implement JWT authentication. JWT was chosen because...
[repeats same rationale]
```

**After**:
```markdown
## Authentication
Uses JWT tokens (stateless, no session storage).

## Implementation
[links to Authentication section for rationale]
```

**The benefit**: Higher signal-to-noise ratio. AI focuses on unique information, not repetition.

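A rough way to spot compaction candidates is to flag sentences that repeat nearly verbatim. A minimal sketch (not part of the LeanSpec tooling):

```python
import re
from collections import Counter

def repeated_sentences(text: str) -> list[str]:
    """Return normalized sentences that occur more than once."""
    parts = re.split(r"[.!?]\s+", text)
    sentences = [p.strip(" .!?\n").lower() for p in parts if p.strip(" .!?\n")]
    counts = Counter(sentences)
    return [s for s, n in counts.items() if n > 1]

spec = "JWT tokens are stateless. We use JWT. JWT tokens are stateless."
print(repeated_sentences(spec))  # → ['jwt tokens are stateless']
```

Anything this flags is a candidate for stating once and linking to, as in the before/after above.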
### 3. Compression - Summarize What's Done

**What it is**: Condense completed work while preserving essential decisions.

**Before**:
```markdown
## Phase 1: Infrastructure Setup
Set up project structure:
- Create src/ directory
- Create tests/ directory
- Configure TypeScript with tsconfig.json
- Set up ESLint with .eslintrc
[50 lines of detailed steps...]
```

**After** (once completed):
```markdown
## ✅ Phase 1: Infrastructure (Completed 2025-10-15)
Project structure established with TypeScript, testing, and CI.
See commit abc123 for details.
```

**The benefit**: Keep project history without bloat. AI knows what happened without drowning in details.

### 4. Isolation - Separate Unrelated Concerns

**What it is**: Move independent features into separate specs with clear relationships.

**Before**: One 1,200-line spec covering dashboard UI, metrics API, health scoring algorithm, and chart library evaluation.

**After**: Four focused specs, each under 400 lines:
- `dashboard-ui` - User interface and interactions
- `metrics-api` - Data endpoint design
- `health-scoring` - Algorithm details
- `chart-evaluation` - Library comparison (can be archived after the decision)

**The benefit**: Independent evolution. When the algorithm changes, the UI spec stays untouched.

### The Key Insight

**Keep context dense (high signal), not just small.**

It's not about arbitrary line limits. It's about removing anything that doesn't directly inform the current decision. Every word that doesn't help the AI make better choices is making it worse.

## Real Results from Dogfooding

We built LeanSpec using LeanSpec itself, the ultimate test of whether this methodology actually works.

**The velocity**: 6 days from zero to production
- Full-featured CLI with 15+ commands
- MCP server for Claude Desktop integration
- Documentation site with comprehensive guides
- 54 specs written and implemented with AI agents

**Then we violated our own principles**: Some specs grew to 1,166 lines. We hit the exact problems we were solving:
- AI agents started corrupting specs during edits
- Code generation became less reliable
- Responses slowed down noticeably
- We spent more time fixing mistakes

**We applied context engineering**: Split large specs, removed redundancy, compressed historical sections.
- The biggest spec shrank from 1,166 lines to a largest partition of 378 lines
- AI agents work reliably again
- Faster iterations, accurate output
- We can confidently say: "We practice what we preach"

### Concrete Benefits You'll See

When you apply context engineering to your specs:

✅ **Fewer AI mistakes** - Focused context produces accurate, consistent output
✅ **Faster iterations** - Less processing time per AI request
✅ **Lower costs** - Fewer tokens mean cheaper API calls (roughly 6-7x savings when a 2,000-line spec becomes 300 lines)
✅ **Better understanding** - AI actually follows your requirements instead of hallucinating
✅ **Maintainable by humans** - Specs you can read in 5-10 minutes stay in sync with code

### Works With Your Tools

This isn't about a specific AI tool; it's about how all transformer-based models handle context:

- **Cursor** - Reads markdown specs for context
- **GitHub Copilot** - Uses workspace files for suggestions
- **Claude** - Via MCP server integration
- **Aider** - Processes project documentation
- **Windsurf** - Analyzes codebase context

Any AI coding assistant benefits from well-engineered context.

## Getting Started

LeanSpec gives you both the **methodology** and the **tooling** to apply context engineering to your specs.

### The Methodology

Five principles guide decision-making:

1. **Context Economy** - Fit in working memory (human + AI)
2. **Signal-to-Noise** - Every word informs decisions
3. **Progressive Disclosure** - Add structure when needed
4. **Intent Over Implementation** - Capture why, not just how
5. **Bridge the Gap** - Both human and AI understand

These aren't arbitrary rules; they're derived from real constraints (transformer attention, cognitive limits, token costs).

### The Tooling

CLI commands help you detect and fix context issues:

```bash
# Install
npm install -g lean-spec

# Initialize in your project
cd your-project
lean-spec init

# Detect issues
lean-spec validate           # Check for problems
lean-spec complexity <spec>  # Analyze size/structure

# Fix issues
lean-spec split <spec>       # Guided splitting workflow

# Track progress
lean-spec board              # Kanban view of all specs
```

### Start Simple, Grow as Needed

**Solo developer?** Just use `status` and `created` fields. Keep specs focused.

**Small team?** Add `tags` and `priority`. Use the CLI for visibility.

**Enterprise?** Add custom fields (`epic`, `sprint`, `assignee`). Integrate with your workflow.

The structure adapts to your needs; you never add complexity "just in case."

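As a sketch, those tiers might look like this in a spec's frontmatter (the field names come from the tiers above; the values and exact syntax are illustrative assumptions):

```yaml
---
status: in-progress      # solo: status + created are enough
created: 2025-11-20
tags: [auth, backend]    # small team: add tags + priority
priority: high
epic: platform-auth      # enterprise: custom fields as needed
sprint: 2025-48
assignee: marvin
---
```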
### Try It Today

```bash
npm install -g lean-spec
cd your-project
lean-spec init
lean-spec create user-authentication
```

Your AI coding assistant will thank you.

## The Bottom Line

**Your AI tools are only as good as the context you give them.**

A 2,000-line spec that fits in the context window will still produce worse results than a 300-line spec with the same essential information. It's not about limits; it's about performance.

Context engineering isn't optimization. It's fundamental to making AI-assisted development work reliably.

LeanSpec is a context engineering methodology for human-AI collaboration on software specs. It gives you:
- Principles derived from real constraints
- Patterns that scale from solo to enterprise
- Tools that detect and prevent context problems
- Proof from building the tool with the methodology

**The choice**: Keep writing large specs and fighting unreliable AI output, or engineer your context for the tools you actually use.

---

**Learn more**:
- GitHub: [github.com/codervisor/lean-spec](https://github.com/codervisor/lean-spec)
- Docs: [lean-spec.dev](https://lean-spec.dev)
- Research: [Context Engineering Guide](/docs/guide/context-engineering)