feat: Generic MCP evaluation framework #1377
Conversation
Reusable framework for evaluating ANY MCP server integration with amplihack.

## Purpose

Empirically measure MCP tool value through controlled comparisons of baseline vs tool-enhanced coding workflows.

## Features

**Generic Design**:
- Works with ANY MCP tool (not specific to one tool)
- Tool adapter pattern for extensibility
- Pluggable: Add new tools with config + adapter only

**Comprehensive Metrics**:
- Quality: Correctness, completeness, code quality
- Efficiency: Tokens, time, file ops, tool calls
- Tool-specific: Feature usage, effectiveness

**3 Generic Test Scenarios**:
1. Cross-File Navigation - Finding code across files
2. Code Understanding - Analyzing structure and dependencies
3. Targeted Modification - Making precise edits

**Automated Reporting**:
- Executive summaries with recommendations
- Detailed comparisons
- Statistical analysis

## Architecture

Framework (tests/mcp_evaluation/):
- framework/ - 6 core modules (1,367 lines)
- scenarios/ - 3 generic tests with realistic codebase
- tools/ - Tool adapter base class
- Documentation - Complete specs and guides

## Usage

```bash
# Run framework tests
python tests/mcp_evaluation/test_framework.py

# Evaluate a tool (requires adapter)
python tests/mcp_evaluation/run_evaluation.py --tool <name>
```

## Extensibility

Add a new MCP tool (see the adapter sketch after this description):
1. Create config: `tools/<tool>_config.yaml`
2. Create adapter: `tools/<tool>_adapter.py`
3. Run the evaluation with the same scenarios

## Testing

- 6 framework tests: ALL PASSING
- Mock evaluation: Working
- No test results committed (results/ in .gitignore)

## Philosophy

- Ruthless Simplicity: Core framework < 1,500 lines
- Brick Design: Self-contained, regeneratable
- Zero-BS: No stubs or placeholders
- Measurement-Driven: Real execution data

This framework provides evidence-based guidance for MCP integration decisions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
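To make the adapter pattern concrete, here is a minimal sketch of what a tool adapter could look like. This is not the interface defined in `framework/adapter.py`; the class name `ToolAdapter`, the hook names (`enable`, `disable`, `collect_metrics`), and the metrics shape are all illustrative assumptions.

```python
"""Illustrative-only adapter sketch; names and hooks are assumptions,
not the framework's actual interface."""
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolAdapter:
    """Hypothetical base class: one adapter per MCP tool."""

    name: str
    config: dict[str, Any] = field(default_factory=dict)

    def enable(self) -> None:
        """Start/register the MCP server before a tool-enhanced run."""
        raise NotImplementedError

    def disable(self) -> None:
        """Stop the MCP server after the run (baseline runs skip enable)."""
        raise NotImplementedError

    def collect_metrics(self) -> dict[str, Any]:
        """Return tool-specific metrics (feature usage, effectiveness)."""
        return {}


class ExampleToolAdapter(ToolAdapter):
    """Adapter for a hypothetical 'exampletool' MCP server."""

    def enable(self) -> None:
        # e.g. launch the server process described in exampletool_config.yaml
        self._calls = 0

    def disable(self) -> None:
        pass

    def collect_metrics(self) -> dict[str, Any]:
        return {"tool_calls": getattr(self, "_calls", 0)}
```

Under this assumed design, the evaluation engine only ever talks to the base-class hooks, which is what lets the same scenarios run unchanged for every tool.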
**Automation Script Available**

Automated evaluation script created and tested with real Serena evaluation!

Location:
Real A/B testing automation for MCP tool evaluation. Proven with Serena: 16.6% average speedup.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
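For context on what an "average speedup" figure of this kind means, the sketch below shows one way such a number could be derived from paired baseline vs tool-enhanced timings. The timings are made up and the function is not taken from the automation script; only the arithmetic is illustrated.

```python
"""Sketch of deriving an average-speedup figure; values are hypothetical."""


def average_speedup(baseline_s: list[float], with_tool_s: list[float]) -> float:
    """Mean per-scenario speedup: (baseline - with_tool) / baseline."""
    speedups = [(b - t) / b for b, t in zip(baseline_s, with_tool_s)]
    return sum(speedups) / len(speedups)


if __name__ == "__main__":
    # Hypothetical per-scenario wall-clock times (seconds).
    baseline = [120.0, 95.0, 80.0]
    with_tool = [100.0, 80.0, 66.0]
    print(f"Average speedup: {average_speedup(baseline, with_tool):.1%}")
```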
Add architectural design documents for power-steering mode that were created during implementation but never committed to the repository.

Background:
- Power-steering mode was implemented in PR #1351 (issue #1350)
- These architectural specs were created during the design phase
- Never committed, leaving a knowledge gap for future maintainers

Documentation Added:
- POWER_STEERING_SUMMARY.md - Overview and key design decisions
- power_steering_architecture.md - Complete system architecture
- considerations_format.md - Structure for 21 considerations
- control_mechanisms.md - Enable/disable control system
- edge_cases.md - Edge case handling and error scenarios
- implementation_phases.md - Implementation phases and rollout
- power_steering_checker.md - Checker implementation details
- power_steering_config.md - Configuration file format
- stop_py_integration.md - Integration with stop hook

Value:
✅ Preserves architectural knowledge for future maintainers
✅ Documents design decisions and rationale
✅ Explains implementation phases and evolution
✅ Provides configuration and customization guide

Related:
- Original issue: #1350 (closed)
- Implementation PR: #1351 (merged)
- Follow-up fix: #1384 (merged)

Fixes #1390

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…1377)

Creates comprehensive user documentation to make the MCP Evaluation Framework discoverable and accessible to end users. Without these docs, users cannot find or effectively use the framework introduced in PR #1377.

## New Documentation

- docs/mcp_evaluation/README.md: Entry point with quick start guide
- docs/mcp_evaluation/USER_GUIDE.md: Complete end-to-end user journey (400+ lines)
- README.md: Added MCP Tool Evaluation section with link to docs

## Key Features

- Discovery: Main README links to MCP evaluation docs
- Orientation: Entry point explains what, why, and who
- Tutorial: Step-by-step guide from setup through decision-making
- Pirate style: Follows user communication preferences
- Philosophy-aligned: Ruthless simplicity and clarity

## Additional Changes

Includes pre-commit auto-fixes (formatting, whitespace, end-of-file) applied across the codebase during commit validation.

## Dependencies

Documentation references framework code from PR #1377 (feat/mcp-evaluation-framework). Both PRs should be merged together, or this PR should wait for #1377.

Resolves #1400

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…valuation-docs

# Conflicts:
#	Specs/POWER_STEERING_SUMMARY.md
#	Specs/considerations_format.md
#	Specs/control_mechanisms.md
#	Specs/edge_cases.md
#	Specs/implementation_phases.md
#	Specs/power_steering_architecture.md
#	Specs/power_steering_checker.md
#	Specs/power_steering_config.md
#	Specs/stop_py_integration.md
…ons only

The pirate communication style should only apply to conversational interactions with the user, NOT to documentation or other end-user artifacts.

## Changes

- Updated USER_PREFERENCES.md to clarify the scope of the pirate style
- Rewrote docs/mcp_evaluation/README.md in professional language
- Rewrote docs/mcp_evaluation/USER_GUIDE.md in professional language

## What Was Changed

Removed pirate phrases and replaced them with professional equivalents:
- "Ahoy, matey!" → "Welcome" or removed
- "ye/yer" → "you/your"
- "be" → "is"
- "fer" → "for"
- "Arr!" → removed

## What Was Preserved

✓ All technical content and accuracy
✓ Complete structure and organization
✓ All examples, commands, and code blocks
✓ All metrics, tables, and workflows

The documentation is now professional and suitable for all users, while conversational interactions remain in pirate style per user preference.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
This PR needs more testing before it can be merged. Please ensure:
Moving to draft status until testing is complete.
…ocs' into feat/mcp-evaluation-framework
**Documentation Added**

This PR now includes comprehensive user-facing documentation (merged from PR #1401):

**New Documentation Files**
**Why Combined**

Framework + documentation ship together as one complete, atomic feature:
Related: PR #1401 (now closed - merged into this PR)
## Summary

Reusable framework for evaluating ANY MCP server integration with amplihack through empirical measurement.

## Purpose

Provides evidence-based guidance for MCP integration decisions by measuring real performance improvements (or lack thereof) through controlled comparisons.

## Features

### Generic Design
- Works with ANY MCP tool (Serena, GitHub Copilot MCP, future tools)
- Tool adapter pattern - tool-specific logic isolated to adapters
- Same test scenarios work for all tools

### Comprehensive Metrics
- **Quality**: Correctness, completeness, code quality
- **Efficiency**: Tokens, time, file operations, tool calls
- **Tool-Specific**: Feature usage, effectiveness

### 3 Generic Test Scenarios
1. **Cross-File Navigation** - Finding code across files
2. **Code Understanding** - Analyzing structure and dependencies
3. **Targeted Modification** - Making precise edits

### Automated Reporting
- Executive summaries with recommendations
- Detailed per-scenario comparisons
- Statistical analysis

## Architecture

```
tests/mcp_evaluation/
├── framework/        # 6 core modules (1,367 lines)
│   ├── types.py      # Data structures
│   ├── adapter.py    # Tool adapter interface
│   ├── metrics.py    # Metrics collection
│   ├── evaluator.py  # Main engine
│   └── reporter.py   # Report generation
├── scenarios/        # 3 generic tests + realistic codebase
├── tools/            # Tool adapter base class
└── docs/             # Complete specs and guides
```

(A hypothetical usage sketch follows this description.)

## Changes

**40 files, 8,249 insertions**:
- Framework core (6 modules)
- 3 generic test scenarios
- Realistic test codebase (16 files)
- Design specifications (5 docs in Specs/)
- Documentation (README, Quick Start, Implementation Summary)

## Philosophy

- **Ruthless Simplicity**: Core < 1,500 lines, focused on one problem
- **Brick Design**: Self-contained, regeneratable framework
- **Zero-BS**: No stubs or placeholders
- **Measurement-Driven**: Real execution data, not estimates

## Note

This is a clean replacement for PR #1377, which had 193 files due to branch merge issues. This PR contains ONLY the 40 MCP-related files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
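Given the module layout above, a driver script might wire the pieces together roughly as sketched below. This is a rough illustration under stated assumptions, not the framework's actual API: `ScenarioResult`, `run_scenario`, and the scenario identifiers are invented for the example.

```python
"""Hypothetical end-to-end driver; names and structure are assumptions."""
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class ScenarioResult:
    scenario: str
    with_tool: bool
    seconds: float
    correct: bool


def run_scenario(name: str, with_tool: bool) -> ScenarioResult:
    """Placeholder for executing one coding scenario and scoring its output."""
    start = time.perf_counter()
    # ... run the agent against the realistic test codebase here ...
    return ScenarioResult(name, with_tool, time.perf_counter() - start, True)


def evaluate(scenarios: list[str]) -> list[ScenarioResult]:
    """Run each scenario twice: baseline first, then with the MCP tool enabled."""
    results: list[ScenarioResult] = []
    for name in scenarios:
        results.append(run_scenario(name, with_tool=False))
        results.append(run_scenario(name, with_tool=True))
    return results


if __name__ == "__main__":
    names = ["cross_file_navigation", "code_understanding", "targeted_modification"]
    print(json.dumps([asdict(r) for r in evaluate(names)], indent=2))
```

The key property the sketch tries to capture is the controlled comparison: every scenario is executed both with and without the tool so the reporter can compare like with like.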
Closing this PR in favor of #1533, which contains only the MCP evaluation framework files.

Problem with this PR:

Clean replacement:

See DISCOVERIES.md entry "PR Scope Creep from Branch Merging" for prevention patterns.
Summary
Reusable framework for evaluating ANY MCP server integration with amplihack through empirical measurement.
Purpose
Provides evidence-based guidance for MCP integration decisions by measuring real performance improvements (or lack thereof) through controlled comparisons.
Features
Generic Design
Comprehensive Metrics
3 Generic Test Scenarios
Automated Reporting
Architecture
Usage
Extensibility
Add new MCP tool evaluation:
1. Create config: tools/<tool>_config.yaml (describes capabilities)
2. Create adapter: tools/<tool>_adapter.py (enable/disable/metrics)
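As a rough illustration of how those two files could plug together, the snippet below loads a hypothetical YAML config and instantiates the matching adapter module by naming convention. The directory path, config keys, module layout, and the `Adapter` class name are assumptions, not the framework's actual wiring.

```python
"""Hypothetical wiring for a new tool; paths and names are assumptions."""
import importlib
from pathlib import Path

import yaml  # PyYAML, assumed to be available


def load_adapter(tool: str, tools_dir: Path = Path("tests/mcp_evaluation/tools")):
    """Read <tool>_config.yaml and instantiate the adapter from <tool>_adapter.py."""
    config = yaml.safe_load((tools_dir / f"{tool}_config.yaml").read_text())
    module = importlib.import_module(f"tools.{tool}_adapter")
    # Assumed convention: each adapter module exposes a class named Adapter.
    return module.Adapter(name=tool, config=config)


if __name__ == "__main__":
    adapter = load_adapter("exampletool")  # hypothetical tool name
    adapter.enable()
```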
Changes

38 files, 6,830 insertions
New Files:
Key Decisions:
Testing
Philosophy
Future Use
This framework enables data-driven MCP integration decisions for:
🤖 Generated with Claude Code
Co-Authored-By: Claude <[email protected]>