
Commit 6104c56

olaservo and claude committed
Update README with improved performance results documentation
- Restructure example results section with clearer subsections
- Add detailed performance metrics tables for both large and small datasets
- Include explanations of why each approach succeeded or failed
- Improve clarity and formatting throughout the results section

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent 9554579 commit 6104c56

File tree

1 file changed (+94, -11 lines)


README.md

Lines changed: 94 additions & 11 deletions
@@ -63,27 +63,110 @@ One hypothesis is that code execution should significantly reduce token consumption

This repository provides both modes so you can test and compare actual results for your use case.

-## Example Results: Analyzing a high volume of issues with the GitHub MCP Server

## Example Performance Results: Analyzing GitHub Issues

-### Where this approach works better than direct MCP: Processing ~5k GitHub Issues
-The [Claude Code repo](https://github.com/anthropics/claude-code) is a useful test case for processing a large number of [GitHub issues](https://github.com/anthropics/claude-code/issues) that would overwhelm the model if using the direct tool calling approach.

I tested both approaches across two real-world repositories:

_Shoutout to [@johncburns1](https://github.com/johncburns1) for coming up with the GitHub issues use case idea!_

-In preliminary experiments, Claude was consistently able to create a reusable skill and use the correct tool to successfully complete the task in code execution mode.

### Large Dataset Example: Claude Code Repository (5,205 open issues)

The [Claude Code repo](https://github.com/anthropics/claude-code) provided a good stress test with [~5k open issues](https://github.com/anthropics/claude-code/issues) at the time of my test, a dataset large enough to expose the fundamental architectural differences between the two approaches.

**Results across 5 runs per approach:**

Code execution succeeded on all 5 runs. Claude consistently created reusable skills and used the `list_issues` tool wrapper to successfully fetch and process all issues.

Direct MCP failed on all 5 runs, hitting context overflow errors after attempting to load all issues into context simultaneously.

| Approach | Success Rate | Avg Duration | Avg Cost | Output |
|----------|--------------|--------------|----------|--------|
| **Code Execution** | 100% (5/5) | 343s (5.7 min) | $0.34 | Complete reports + JSON |
| **Direct MCP** | 0% (0/5) | 31s to failure | N/A | N/A |

**Why Direct MCP Failed:**

- Attempted to load all 5,205 issues into context simultaneously
- Reached ~600-800 issues before hitting the 200K token limit (on turn 3); see the rough arithmetic below
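
As a back-of-the-envelope check (my extrapolation from the numbers above, not a measured figure): if roughly 600-800 issues filled the ~200K-token context, each issue averages somewhere around 250-330 tokens, so all 5,205 open issues would need on the order of 1.3-1.7M tokens, several times more than the window can hold no matter how the calls are ordered.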

**Why Code Execution Succeeded:**

- Wrote TypeScript to fetch issues in batches using the `list_issues` tool wrapper (see the sketch below)
- Processed and aggregated the data in memory (~520K tokens' worth)
- Passed only summary statistics to the model (~10K tokens)
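
To make the pattern concrete, here is a minimal sketch of what such a script might look like. It assumes a paginated `list_issues` wrapper and a simplified issue shape (`number`, `title`, `labels`); the wrapper actually generated in this repo may use different parameter names and fields.

```typescript
// Hypothetical issue shape; the real wrapper may return richer fields.
interface Issue {
  number: number;
  title: string;
  labels: string[];
}

// Assumed signature for the generated list_issues tool wrapper (paginated).
declare function list_issues(args: {
  owner: string;
  repo: string;
  state: "open";
  page: number;
  per_page: number;
}): Promise<Issue[]>;

async function summarizeOpenIssues(owner: string, repo: string) {
  const labelCounts = new Map<string, number>();
  let total = 0;

  // Fetch in batches so the raw issue data never has to fit in the model's context.
  for (let page = 1; ; page++) {
    const batch = await list_issues({ owner, repo, state: "open", page, per_page: 100 });
    if (batch.length === 0) break;

    total += batch.length;
    for (const issue of batch) {
      for (const label of issue.labels) {
        labelCounts.set(label, (labelCounts.get(label) ?? 0) + 1);
      }
    }
  }

  // Only this small summary is handed back to the model, not the ~5K raw issues.
  return {
    total,
    topLabels: [...labelCounts.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, 10),
  };
}
```

The important part is that the pagination loop and the aggregation run in the execution sandbox; only the small summary object ever reaches the model's context.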

### Small Dataset Example: Anthropic SDK Python (45 issues)

When testing with the [`anthropic-sdk-python`](https://github.com/anthropics/anthropic-sdk-python) repository (~45 open issues), both approaches succeeded, but code execution still significantly outperformed direct MCP:

| Metric | Code Execution | Direct MCP | Advantage |
|--------|----------------|------------|-----------|
| Success Rate | 100% (5/5) | 100% (5/5) | Tie |
| Avg Duration | 76s | 269s | **3.5× faster** |
| Avg Cost | $0.18 | $0.54 | **67% cheaper** |
| Avg Turns | 11.6 | 4.8 | 2.4× more turns |
| Output Tokens | 2,053 | 15,287 | **7.4× fewer** |
| Output Quality | Structured, concise | Narrative, detailed | Different styles |

**Output Style Differences:**

The two approaches produced notably different report styles:

- **Code Execution:** Generated structured, data-focused reports (~9.7KB) optimized for quick scanning and machine parsing.
- **Direct MCP:** Generated richer narrative reports (~15.8-19.5KB) with more context and explanation.

Both captured the same data accurately (all 45 issues, correct categorizations), but direct MCP's 7.4× higher output token count reflected more verbose, explanatory text rather than additional insights.

**Complexity Trade-offs:**

Code execution required 2.4× more agent turns (11.6 vs 4.8), reflecting the overhead of:

- Writing and debugging TypeScript code
- Executing scripts and reviewing outputs
- Iterating on data processing logic

However, this overhead was more than offset by the execution speed and cost savings: the additional turns completed quickly because the agent was delegating computation to code rather than generating extensive analytical text.

Even where direct MCP could technically succeed, code execution delivered dramatically better performance. The main trade-off in this example was the structured vs. narrative output style.

I was genuinely surprised by this result, since I expected direct MCP to have an advantage on small datasets where context limits aren't a concern. Instead, code execution's efficiency gains from out-of-context data processing outweighed its additional complexity overhead, even at small scale.

### Future Improvements

**Skill Optimization:** The code execution approach's turn count (11.6 avg) could likely be reduced by optimizing the Skill implementation. The agent sometimes writes and debugs code iteratively when a more efficient workflow could generate working code in fewer turns.

### Full Experimental Data

You can find full zip archives of the logs, metrics, and workspace files generated by these experiments in the Releases tab of this repo.

## What These Results Suggest

Based on these GitHub issue analysis experiments:

**For this specific use case (analyzing repository issues):**

- Code execution was the only viable approach for large datasets (5K+ issues)
- Code execution outperformed direct MCP even on small datasets (45 issues), by 3.5× on speed and 67% on cost
- The main trade-off was structured vs narrative output style, with code execution requiring more agent turns

**Open Questions:**

-In direct MCP mode, Claude reliably called the right tools and failed with a 400 error after attempting to stuff too much data into context.

These results only cover one type of task. More testing is needed for:
- Different data types (structured vs unstructured)
- Different operations (CRUD vs analysis)
- Different MCP servers (filesystem, databases, APIs)
- Real-time vs batch workloads
- Tasks requiring back-and-forth iteration

-### Example 2: Processing a smaller set of issues

**Hypotheses Worth Testing:**

-Originally I assumed that direct MCP would still be a better choice for smaller-scale tasks, and thought that this was the case in a few early examples I tried. However I thought that I should confirm this using the exact same setup as the example above. I ran the same task as above, but just swapped out the `claude-code` repo for the `anthropic-sdk-python` repo (which at the time of this test had 45 issues).

Code execution may be beneficial whenever:

- Dataset size is unknown or potentially large
- Data needs filtering/transformation before analysis
- Multiple data sources need to be joined
- The same workflow will be repeated

-I'm glad that I ran this 2nd comparison, since even though direct MCP versions needed fewer turns to complete this task than code execution, it did use a lot more tokens and took longer to run overall.

But these are still hypotheses, not proven conclusions. Your mileage may vary.

-### Download Full Example Results

**Help Build the Evidence:**

-You can find a full zip archive of the logs, metrics, and workspace files generated by these initial experiments in the Releases tab of this repo.

If you test this pattern with different use cases, please share your findings in the [MCP community discussion](https://github.com/modelcontextprotocol/modelcontextprotocol/discussions/1780) or open an issue in this repo. More data points will help the community understand when this approach makes sense.

## Execution Flow
