This document outlines a systematic test plan to validate that Claude Code (and other MCP agents) properly understand and use the codemode-unified MCP server.
The tests verify that agents can:
- ✅ Discover available MCP tools
- ✅ Read TypeScript declarations from resources
- ✅ Write type-safe code using MCP proxy
- ✅ Handle both namespace styles (dot notation and bracket notation)
- ✅ Successfully execute code with MCP tool calls
| Mode | Description | Token Cost | Agent Visibility |
|---|---|---|---|
| on-demand | Types via resource only | ~300 tokens | Must read resource |
| auto-include | Types in tool description | ~1500 tokens | Immediate visibility |
Goal: Verify that the agent can discover and understand the server's capabilities
Prompt: "What tools are available from codemode-unified MCP server?"
Expected Behavior:
- Agent lists tools: execute_code, list_runtimes, get_runtime_capabilities, runtime_health_check
- Agent mentions MCP tool access via global mcp object
- Agent mentions resource mcp://types/declarations (on-demand mode)
- OR agent lists available MCP tools directly (auto-include mode)
Success Criteria:
✅ Agent identifies all 4 tools
✅ Agent understands MCP integration concept
✅ Agent knows how to get type information
Failure Modes:
❌ Agent doesn't mention resource in on-demand mode
❌ Agent doesn't understand mcp.* proxy concept
❌ Agent skips tool discovery
Prompt: "What resources does codemode-unified provide?"
Expected Behavior:
- Agent discovers mcp://types/declarations resource
- Agent explains it contains TypeScript type definitions
- Agent understands that it must read this resource BEFORE writing code
Success Criteria:
✅ Agent finds the resource
✅ Agent understands its purpose
✅ Agent knows when to use it
Failure Modes:
❌ Agent doesn't discover resources capability
❌ Agent confuses resource with tool
Prompt: "What code templates are available from codemode-unified?"
Expected Behavior:
- Agent discovers 3 prompts: mcp-tool-example, async-handler, mcp-batch-operations
- Agent can retrieve and show prompt content
- Agent understands prompts provide code patterns
Success Criteria:
✅ Agent finds all 3 prompts
✅ Agent can retrieve prompt content
✅ Agent explains what each provides
Failure Modes:
❌ Agent doesn't explore prompts capability
❌ Agent confuses prompts with tools
Prompt: "Show me the TypeScript types for MCP tools available in codemode-unified"
Expected Behavior:
- Agent reads mcp://types/declarations resource
- Agent displays TypeScript interfaces
- Agent identifies available namespaces (automem, sequential-thinking, context7, claude-code)
Success Criteria:
✅ Agent reads resource without prompting
✅ Agent displays type declarations
✅ Agent understands namespace structure
Failure Modes:
❌ Agent tries to generate types instead of reading resource
❌ Agent doesn't know how to read resources
❌ Agent reads resource but doesn't parse content
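For reference, the declarations resource is expected to look roughly like the excerpt below. The interface and field names here are illustrative assumptions, not the server's actual generated output:

```typescript
// Hypothetical excerpt of mcp://types/declarations (names are illustrative;
// the real resource is generated from the connected servers' schemas).
interface StoreMemoryArgs {
  content: string;      // required
  importance?: number;  // 0.0 - 1.0
  tags?: string[];
}

interface AutomemNamespace {
  store_memory(args: StoreMemoryArgs): Promise<unknown>;
}

// Hyphenated server names become quoted keys, which is why bracket
// notation is mandatory for namespaces like "sequential-thinking".
interface McpProxy {
  automem: AutomemNamespace;
  "sequential-thinking": {
    sequentialthinking(args: Record<string, unknown>): Promise<unknown>;
  };
}
```

Seeing the quoted `"sequential-thinking"` key in the declarations is what should cue the agent to use bracket notation later.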
Prompt: "What MCP tools can I use when executing code?"
Expected Behavior:
- Agent lists MCP tools from tool description
- Agent shows namespace.method format
- Agent explains available operations
Success Criteria:
✅ Agent extracts tool list from description
✅ Agent correctly shows namespace syntax
✅ Agent understands what each tool does
Failure Modes:
❌ Agent doesn't parse tool summary in description
❌ Agent incorrectly formats namespace access
❌ Agent misses tools in the list
Prompt: "Write code to store a memory using automem with content 'Test' and importance 0.8"
Expected Behavior (on-demand):
1. Agent reads mcp://types/declarations resource first
2. Agent writes type-safe code using mcp.automem.store_memory()
3. Agent includes all required parameters
4. Agent uses correct runtime (bun or deno for TypeScript)
Expected Behavior (auto-include):
1. Agent sees tool list in description
2. Agent writes code directly (may still read resource for types)
3. Same correct code as on-demand mode
Success Criteria:
✅ Agent reads types before coding (on-demand)
✅ Agent uses correct namespace syntax: mcp.automem
✅ Agent includes all required parameters with correct types
✅ Agent chooses TypeScript-aware runtime
✅ Code executes successfully
Failure Modes:
❌ Agent skips reading resource (on-demand)
❌ Agent uses wrong namespace: automem.store_memory() instead of mcp.automem
❌ Agent provides wrong argument types
❌ Agent uses QuickJS (doesn't support async)
❌ Execution fails due to syntax errors
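The code the agent is expected to produce looks roughly like the sketch below. Since the real `mcp` proxy is injected by the sandbox, a local stub stands in for it here so the snippet is self-contained; the stub's return shape is an assumption:

```typescript
// Stub standing in for the sandbox-injected global `mcp` proxy
// (illustrative only; codemode-unified provides the real object).
const mcp = {
  automem: {
    async store_memory(args: { content: string; importance?: number }) {
      if (!args.content) throw new Error("'content' is required");
      return { stored: true, ...args };
    },
  },
};

// The shape of code the agent should write for this test:
const result = await mcp.automem.store_memory({
  content: "Test",
  importance: 0.8, // a number, not a string like "high"
});
console.log(result);
```

Note the top-level `await`, which is why a QuickJS runtime without async support would fail this test.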
Prompt: "Use sequential-thinking MCP tool to analyze the problem of code organization"
Expected Behavior:
- Agent recognizes hyphenated namespace requires bracket notation
- Agent writes: mcp["sequential-thinking"].sequentialthinking({...})
- Agent provides all required parameters
- Agent understands the tool's purpose
Success Criteria:
✅ Agent uses bracket notation: mcp["sequential-thinking"]
✅ Agent quotes the namespace correctly
✅ Agent provides proper arguments
✅ Code executes successfully
Failure Modes:
❌ Agent uses dot notation: mcp.sequential-thinking (syntax error)
❌ Agent doesn't quote namespace: mcp[sequential-thinking] (error)
❌ Agent skips hyphens: mcp.sequentialthinking (wrong namespace)
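A sketch of the expected call shape, again with a local stub in place of the injected proxy. The parameter names follow the public sequential-thinking tool schema, but treat them as assumptions here:

```typescript
// Stub for the hyphenated namespace (illustrative only; the real proxy
// is injected by the sandbox).
const mcp = {
  "sequential-thinking": {
    async sequentialthinking(args: {
      thought: string;
      thoughtNumber: number;
      totalThoughts: number;
      nextThoughtNeeded: boolean;
    }) {
      return { thoughtNumber: args.thoughtNumber };
    },
  },
};

// "sequential-thinking" contains a hyphen, so dot notation is a syntax
// error: the quoted bracket form is required.
const step = await mcp["sequential-thinking"].sequentialthinking({
  thought: "How should this codebase be organized?",
  thoughtNumber: 1,
  totalThoughts: 3,
  nextThoughtNeeded: true,
});
console.log(step);
```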
Prompt: "Store two memories and then use sequential-thinking to analyze them"
Expected Behavior:
- Agent combines multiple MCP tool calls
- Agent uses correct namespace for each
- Agent handles async operations properly
- Agent includes error handling
Success Criteria:
✅ Agent calls multiple MCP tools in sequence
✅ Agent mixes different namespace styles correctly
✅ Agent uses async/await properly
✅ Code executes all operations successfully
Failure Modes:
❌ Agent mixes up namespaces
❌ Agent forgets await on async calls
❌ Agent doesn't handle errors
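A sketch of the composition this test expects: dot and bracket notation mixed correctly, `await` on every call, and basic error handling. The stubs below stand in for the sandbox-injected proxy and their shapes are assumptions:

```typescript
// Stubs for both namespaces (illustrative; injected by the sandbox in practice).
const mcp = {
  automem: {
    async store_memory(args: { content: string; importance?: number }) {
      return { id: args.content, stored: true };
    },
  },
  "sequential-thinking": {
    async sequentialthinking(args: {
      thought: string;
      thoughtNumber: number;
      totalThoughts: number;
      nextThoughtNeeded: boolean;
    }) {
      return { analysis: `analyzed: ${args.thought}` };
    },
  },
};

try {
  // Dot notation for the plain namespace:
  const first = await mcp.automem.store_memory({ content: "Note A", importance: 0.6 });
  const second = await mcp.automem.store_memory({ content: "Note B", importance: 0.7 });

  // Bracket notation for the hyphenated namespace:
  const analysis = await mcp["sequential-thinking"].sequentialthinking({
    thought: `Compare ${first.id} and ${second.id}`,
    thoughtNumber: 1,
    totalThoughts: 1,
    nextThoughtNeeded: false,
  });
  console.log(analysis);
} catch (err) {
  console.error("MCP call failed:", err);
}
```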
Prompt: "Store 5 memories in parallel using Promise.allSettled"
Expected Behavior:
- Agent uses batch operations pattern
- Agent may reference mcp-batch-operations prompt
- Agent handles success/failure cases
- Agent reports results properly
Success Criteria:
✅ Agent uses parallel execution pattern
✅ Agent uses Promise.allSettled (not Promise.all)
✅ Agent handles both success and failure
✅ Code executes efficiently
Failure Modes:
❌ Agent uses sequential calls (slow)
❌ Agent uses Promise.all (fails on first error)
❌ Agent doesn't report which operations failed
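The batch pattern this test looks for can be sketched as follows, with a stub in place of the injected proxy. Unlike `Promise.all`, `Promise.allSettled` never short-circuits, so one failed store cannot hide the results of the others:

```typescript
// Stub (illustrative): rejects on empty content so mixed outcomes are possible.
const mcp = {
  automem: {
    async store_memory(args: { content: string }) {
      if (!args.content) throw new Error("'content' is required");
      return { stored: true, content: args.content };
    },
  },
};

const contents = ["alpha", "beta", "gamma", "delta", "epsilon"];

// Fire all five stores in parallel and wait for every one to settle:
const results = await Promise.allSettled(
  contents.map((content) => mcp.automem.store_memory({ content })),
);

// Report which operations succeeded and which failed:
results.forEach((r, i) => {
  if (r.status === "fulfilled") {
    console.log(`stored: ${contents[i]}`);
  } else {
    console.error(`failed: ${contents[i]} (${r.reason})`);
  }
});
```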
Prompt: "Store a memory with importance set to 'high'"
Expected Behavior:
- Agent reads types first
- Agent realizes importance must be number, not string
- Agent corrects to numeric value (e.g., 0.8)
- OR agent explains the type error
Success Criteria:
✅ Agent catches type mismatch before execution
✅ Agent provides correct type
✅ Code executes successfully
Failure Modes:
❌ Agent writes code with type error
❌ Execution fails with type error
❌ Agent doesn't consult types
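One way an agent might correct the mismatch after seeing that `importance` is typed as `number` is to map the label to a numeric value. The 0.8-for-"high" mapping below is an assumption for illustration, not a server-defined convention:

```typescript
// Rough label-to-number mapping an agent might apply (values are assumptions):
const importanceByLabel: Record<string, number> = {
  low: 0.2,
  medium: 0.5,
  high: 0.8,
};

const requested = "high";
const importance = importanceByLabel[requested] ?? 0.5; // fall back to medium
console.assert(typeof importance === "number"); // now satisfies the declared type
```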
Prompt: "Call automem.store_memory without any arguments"
Expected Behavior:
- Agent reads types and sees 'content' is required
- Agent either refuses or adds a placeholder value
- Agent explains parameter requirements
Success Criteria:
✅ Agent identifies missing required parameter
✅ Agent provides complete arguments
✅ Code is valid
Failure Modes:
❌ Agent writes incomplete code
❌ Execution fails due to missing parameter
Prompt: "Execute code that calls MCP tools"
Expected Behavior:
- Agent chooses Bun or Deno (TypeScript-aware)
- Agent avoids QuickJS (no async support)
- Agent explains runtime choice if asked
Success Criteria:
✅ Agent selects Bun or Deno automatically
✅ Code executes with async/await support
✅ MCP tools work correctly
Failure Modes:
❌ Agent uses QuickJS (no async)
❌ Code fails due to runtime limitations
Prompt: "Use Bun runtime to execute code with MCP tools"
Expected Behavior:
- Agent respects runtime choice
- Agent writes code appropriate for Bun
- Code executes successfully
Success Criteria:
✅ Agent uses specified runtime
✅ Code compatible with runtime
✅ Execution successful
Failure Modes:
❌ Agent ignores runtime request
❌ Agent uses wrong runtime
1. Setup: Start the server in the appropriate configuration mode
2. Execute: Run the prompt with Claude Code
3. Observe: Record agent behavior and outcomes
4. Score: Rate understanding (0-100%)
5. Document: Note any failures or improvements needed
| Score | Meaning | Action |
|---|---|---|
| 90-100% | Excellent | Prompt is clear and effective |
| 75-89% | Good | Minor refinements may help |
| 50-74% | Needs Work | Significant prompt improvements needed |
| 0-49% | Poor | Major redesign required |
**Target:** 99% comprehension on critical tests (Phase 1, 3.1, 3.2)
**Acceptable:** 95%+ on all other tests
For each test run, document:
### Test X.Y: [Test Name]
**Date:** YYYY-MM-DD
**Mode:** on-demand | auto-include
**Model:** Claude 3.5 Sonnet | GPT-4 | etc.
**Prompt:**
[Exact prompt used]
**Agent Behavior:**
[What the agent did step-by-step]
**Outcome:**
✅ Success | ❌ Failure | ⚠️ Partial
**Score:** X%
**Issues Identified:**
- [List any problems]
**Prompt Improvements:**
- [Proposed changes to tool descriptions]

- Run all tests in the current mode
- Calculate the average score across all tests
- If < 99% on critical tests:
  - Identify failure patterns
  - Revise tool descriptions
  - Re-test affected scenarios
  - Repeat until the 99% threshold is met
- Switch modes and re-run the tests
- Compare results between modes
If tests reveal comprehension issues:
- Tool descriptions:
  - Make instructions more explicit
  - Add examples inline
  - Clarify namespace syntax rules
  - Emphasize the resource-reading requirement
- Resource:
  - Clarify when to read the resource
  - Explain the resource format
  - Show example usage
- Prompts:
  - Add more detailed examples
  - Include common error patterns
  - Show best practices
Future: Create automated test harness that:
- Starts MCP server
- Runs prompt suite
- Validates agent responses
- Scores comprehension automatically
- Generates improvement recommendations
- ✅ Complete initial test run in on-demand mode
- ✅ Complete initial test run in auto-include mode
- ✅ Compare results and identify optimal configuration
- ✅ Refine tool descriptions based on findings
- ✅ Achieve 99% comprehension threshold
- ✅ Document final configuration recommendations