
Commit 6104c56

olaservo and claude committed
Update README with improved performance results documentation
- Restructure example results section with clearer subsections
- Add detailed performance metrics tables for both large and small datasets
- Include explanations of why each approach succeeded or failed
- Improve clarity and formatting throughout the results section

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent 9554579 commit 6104c56

File tree

1 file changed (+94, -11 lines)


README.md

Lines changed: 94 additions & 11 deletions
@@ -63,27 +63,110 @@ One hypothesis is that code execution should significantly reduce token consumption

This repository provides both modes so you can test and compare actual results for your use case.

-## Example Results: Analyzing a high volume of issues with the GitHub MCP Server

## Example Performance Results: Analyzing GitHub Issues

-### Where this approach works better than direct MCP: Processing ~5k GitHub Issues
-The [Claude Code repo](https://github.com/anthropics/claude-code) is a useful test case for processing a large number of [GitHub issues](https://github.com/anthropics/claude-code/issues) that would overwhelm the model if using the direct tool calling approach.

I tested both approaches across two real-world repositories:

_Shoutout to [@johncburns1](https://github.com/johncburns1) for coming up with the GitHub issues use case idea!_

-In preliminary experiments, Claude was consistently able to create a reusable skill and use the correct tool to successfully complete the task in code execution mode.

### Large Dataset Example: Claude Code Repository (5,205 open issues)

The [Claude Code repo](https://github.com/anthropics/claude-code) provided a good stress test with [~5k open issues](https://github.com/anthropics/claude-code/issues) at the time of my test, a dataset large enough to expose the fundamental architectural differences between the two approaches.

**Results across 5 runs per approach:**

Code execution succeeded on all 5 runs. Claude consistently created reusable skills and used the `list_issues` tool wrapper to successfully fetch and process all issues.

Direct MCP failed on all 5 runs, hitting context overflow errors after attempting to load all issues into context simultaneously.

| Approach | Success Rate | Avg Duration | Avg Cost | Output |
|----------|--------------|--------------|----------|--------|
| **Code Execution** | 100% (5/5) | 343s (5.7 min) | $0.34 | Complete reports + JSON |
| **Direct MCP** | 0% (0/5) | 31s to failure | N/A | N/A |

**Why Direct MCP Failed:**

- Attempted to load all 5,205 issues into context simultaneously
- Reached ~600-800 issues before hitting the 200K token limit (on turn 3); see the rough arithmetic below
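
As a back-of-the-envelope check (my extrapolation from the numbers above, not a measured figure): if roughly 600-800 issues filled the ~200K-token context, each issue averages somewhere around 250-330 tokens, so all 5,205 open issues would need on the order of 1.3-1.7M tokens, several times more than the window can hold no matter how the calls are ordered.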

**Why Code Execution Succeeded:**

- Wrote TypeScript to fetch issues in batches using the `list_issues` tool wrapper (see the sketch below)
- Processed and aggregated the data in memory (~520K tokens' worth)
- Passed only summary statistics to the model (~10K tokens)
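
To make the pattern concrete, here is a minimal sketch of what such a script might look like. It assumes a paginated `list_issues` wrapper and a simplified issue shape (`number`, `title`, `labels`); the wrapper actually generated in this repo may use different parameter names and fields.

```typescript
// Hypothetical issue shape; the real wrapper may return richer fields.
interface Issue {
  number: number;
  title: string;
  labels: string[];
}

// Assumed signature for the generated list_issues tool wrapper (paginated).
declare function list_issues(args: {
  owner: string;
  repo: string;
  state: "open";
  page: number;
  per_page: number;
}): Promise<Issue[]>;

async function summarizeOpenIssues(owner: string, repo: string) {
  const labelCounts = new Map<string, number>();
  let total = 0;

  // Fetch in batches so the raw issue data never has to fit in the model's context.
  for (let page = 1; ; page++) {
    const batch = await list_issues({ owner, repo, state: "open", page, per_page: 100 });
    if (batch.length === 0) break;

    total += batch.length;
    for (const issue of batch) {
      for (const label of issue.labels) {
        labelCounts.set(label, (labelCounts.get(label) ?? 0) + 1);
      }
    }
  }

  // Only this small summary is handed back to the model, not the ~5K raw issues.
  return {
    total,
    topLabels: [...labelCounts.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, 10),
  };
}
```

The important part is that the pagination loop and the aggregation run in the execution sandbox; only the small summary object ever reaches the model's context.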

### Small Dataset Example: Anthropic SDK Python (45 issues)

When testing with the [`anthropic-sdk-python`](https://github.com/anthropics/anthropic-sdk-python) repository (~45 open issues), both approaches succeeded, but code execution still significantly outperformed direct MCP:

| Metric | Code Execution | Direct MCP | Advantage |
|--------|----------------|------------|-----------|
| Success Rate | 100% (5/5) | 100% (5/5) | Tie |
| Avg Duration | 76s | 269s | **3.5× faster** |
| Avg Cost | $0.18 | $0.54 | **67% cheaper** |
| Avg Turns | 11.6 | 4.8 | 2.4× more turns |
| Output Tokens | 2,053 | 15,287 | **7.4× fewer** |
| Output Quality | Structured, concise | Narrative, detailed | Different styles |

**Output Style Differences:**

The two approaches produced notably different report styles:

- **Code Execution:** Generated structured, data-focused reports (~9.7KB) optimized for quick scanning and machine parsing.
- **Direct MCP:** Generated richer narrative reports (~15.8-19.5KB) with more context and explanation.

Both captured the same data accurately (all 45 issues, correct categorizations), but direct MCP's 7.4× higher output token count reflected more verbose, explanatory text rather than additional insights.

**Complexity Trade-offs:**

Code execution required 2.4× more agent turns (11.6 vs 4.8), reflecting the overhead of:

- Writing and debugging TypeScript code
- Executing scripts and reviewing outputs
- Iterating on data processing logic

However, this overhead was more than offset by the execution speed and cost savings: the additional turns completed quickly because the agent was delegating computation to code rather than generating extensive analytical text.

Even where direct MCP could technically succeed, code execution delivered dramatically better performance. The main trade-off in this example was the structured vs. narrative output style.

I was genuinely surprised by this result, since I expected direct MCP to have an advantage on small datasets where context limits aren't a concern. Instead, code execution's efficiency gains from out-of-context data processing outweighed its additional complexity overhead, even at small scale.

### Future Improvements

**Skill Optimization:** The code execution approach's turn count (11.6 avg) could likely be reduced by optimizing the Skill implementation. The agent sometimes writes and debugs code iteratively when a more efficient workflow could generate working code in fewer turns.

### Full Experimental Data

You can find full zip archives of the logs, metrics, and workspace files generated by these experiments in the Releases tab of this repo.

## What These Results Suggest

Based on these GitHub issue analysis experiments:

**For this specific use case (analyzing repository issues):**

- Code execution was the only viable approach for large datasets (5K+ issues)
- Code execution outperformed direct MCP even on small datasets (45 issues), by 3.5× on speed and 67% on cost
- The main trade-off was structured vs narrative output style, with code execution requiring more agent turns

**Open Questions:**

-In direct MCP mode, Claude reliably called the right tools and failed with a 400 error after attempting to stuff too much data into context.

These results only cover one type of task. More testing is needed for:
- Different data types (structured vs unstructured)
- Different operations (CRUD vs analysis)
- Different MCP servers (filesystem, databases, APIs)
- Real-time vs batch workloads
- Tasks requiring back-and-forth iteration

-### Example 2: Processing a smaller set of issues

**Hypotheses Worth Testing:**

-Originally I assumed that direct MCP would still be a better choice for smaller-scale tasks, and thought that this was the case in a few early examples I tried. However I thought that I should confirm this using the exact same setup as the example above. I ran the same task as above, but just swapped out the `claude-code` repo for the `anthropic-sdk-python` repo (which at the time of this test had 45 issues).

Code execution may be beneficial whenever:

- Dataset size is unknown or potentially large
- Data needs filtering/transformation before analysis
- Multiple data sources need to be joined
- The same workflow will be repeated

-I'm glad that I ran this 2nd comparison, since even though direct MCP versions needed fewer turns to complete this task than code execution, it did use a lot more tokens and took longer to run overall.

But these are still hypotheses, not proven conclusions. Your mileage may vary.

-### Download Full Example Results

**Help Build the Evidence:**

-You can find a full zip archive of the logs, metrics, and workspace files generated by these initial experiments in the Releases tab of this repo.

If you test this pattern with different use cases, please share your findings in the [MCP community discussion](https://github.com/modelcontextprotocol/modelcontextprotocol/discussions/1780) or open an issue in this repo. More data points will help the community understand when this approach makes sense.

## Execution Flow
