⚠️ Still Under Development - APIs may change. Use with caution in production.
Server-focused evaluation framework for MCP (Model Context Protocol) servers.
🎯 Test your MCP server capabilities, not LLM conversation patterns.
"Are my MCP server's tools working correctly and being used as expected?"
PyMCPEvals separates what you can control (your server) from what you cannot (LLM behavior).

What it evaluates (server-side):

- Tool implementation correctness
- Tool parameter validation
- Error handling and recovery
- Tool result formatting
- Multi-turn state management

What it deliberately ignores (LLM-side):

- LLM conversation patterns
- How LLMs choose to use tools
- LLM response formatting
- Whether LLMs provide intermediate responses
- 🚫 Manual Tool Testing: Automated assertions verify exact tool calls
- ❌ Multi-step Failures: Track tool chaining across conversation turns
- 🔍 Silent Tool Errors: Instant feedback when expected tools aren't called
- 📋 CI/CD Integration: JUnit XML output for automated testing pipelines
Quick start:

```bash
pip install pymcpevals
pymcpevals init            # Create template config
pymcpevals run evals.yaml  # Run evaluations
```
Example evals.yaml:

```yaml
model:
  provider: openai
  name: gpt-4

server:
  command: ["python", "my_server.py"]

evaluations:
  - name: "weather_check"
    prompt: "What's the weather in Boston?"
    expected_tools: ["get_weather"]  # ✅ Validates tool usage
    expected_result: "Should call weather API and return conditions"
    threshold: 3.5

  - name: "multi_step"
    turns:
      - role: "user"
        content: "What's the weather in London?"
        expected_tools: ["get_weather"]
      - role: "user"
        content: "And in Paris?"
        expected_tools: ["get_weather"]
    expected_result: "Should provide weather for both cities"
    threshold: 4.0
```
Output: Pass/fail status, tool validation, execution metrics, and server-focused scoring.

For each evaluation, PyMCPEvals will:
- Connect to your MCP server via FastMCP
- Execute prompts and track tool calls
- Validate expected tools are called (instant feedback)
- Evaluate server performance (ignores LLM style)
- Report results with actionable insights
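Because the connection layer is plain FastMCP, you can also poke at your server directly with a FastMCP client to see the same tool calls the evaluator issues. A minimal sketch (the `my_server.py` path and the `get_weather` tool are placeholders for your own server):

```python
import asyncio

from fastmcp import Client


async def main():
    # Connect to a local MCP server script (placeholder path)
    async with Client("my_server.py") as client:
        # Discover which tools the server exposes
        tools = await client.list_tools()
        print("Available tools:", [tool.name for tool in tools])

        # Call a tool directly, the way the evaluator does during a run
        # ("get_weather" and its arguments are placeholders)
        result = await client.call_tool("get_weather", {"city": "Boston"})
        print("Tool result:", result)


asyncio.run(main())
```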
Precise Tool Assertions: Unlike traditional evaluations that judge LLM responses, PyMCPEvals validates:
- ✅ Exact tool calls: `assert_tools_called(result, ["add", "multiply"])`
- ✅ Tool execution success: `assert_no_tool_errors(result)`
- ✅ Multi-turn trajectories: Test tool chaining across conversation steps
- ✅ Instant failure detection: No expensive LLM evaluation for obvious failures
```bash
# Basic usage
pymcpevals run evals.yaml

# Override server/model
pymcpevals run evals.yaml --server "node server.js" --model gpt-4

# Different output formats
pymcpevals run evals.yaml --output table  # Simple table
pymcpevals run evals.yaml --output json   # Full JSON
pymcpevals run evals.yaml --output junit  # CI/CD format
```
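Most CI systems can ingest a JUnit XML report directly. If you want an explicit gate in your pipeline instead, a short script over the standard JUnit schema is enough. The sketch below assumes you have captured the JUnit output to a `results.xml` file; the filename and how you capture it are assumptions about your pipeline, not pymcpevals flags:

```python
# Hypothetical CI gate: exit non-zero if the JUnit report contains failures.
# Assumes the pymcpevals JUnit output has been saved to results.xml.
import sys
import xml.etree.ElementTree as ET

root = ET.parse("results.xml").getroot()

tests = failures = errors = 0
# JUnit reports use either a <testsuites> root or a single <testsuite>.
for suite in root.iter("testsuite"):
    tests += int(suite.get("tests", 0))
    failures += int(suite.get("failures", 0))
    errors += int(suite.get("errors", 0))

print(f"{tests} tests, {failures} failures, {errors} errors")
sys.exit(1 if failures or errors else 0)
```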
Pytest integration:

```python
import pytest

from pymcpevals import (
    assert_tools_called,
    assert_evaluation_passed,
    assert_min_score,
    assert_no_tool_errors,
    ConversationTurn,
)


# Simple marker-based test
@pytest.mark.mcp_eval(
    prompt="What is 15 + 27?",
    expected_tools=["add"],
    min_score=4.0
)
async def test_basic_addition(mcp_result):
    assert_evaluation_passed(mcp_result)
    assert_tools_called(mcp_result, ["add"])
    assert "42" in mcp_result.server_response


# Multi-turn trajectory testing
async def test_math_sequence(mcp_evaluator):
    turns = [
        ConversationTurn(role="user", content="What is 10 + 5?", expected_tools=["add"]),
        ConversationTurn(role="user", content="Now multiply by 2", expected_tools=["multiply"]),
    ]
    result = await mcp_evaluator.evaluate_trajectory(turns, min_score=4.0)

    # Rich assertions
    assert_evaluation_passed(result)
    assert_tools_called(result, ["add", "multiply"])
    assert_no_tool_errors(result)
    assert_min_score(result, 4.0, dimension="accuracy")
    assert "30" in str(result.conversation_history)


# Run with: pytest -m mcp_eval
```
Check out the `examples/` directory for:

- `calculator_server.py` - Simple MCP server for testing
- `local_server_basic.yaml` - Basic evaluation configuration examples
- `trajectory_evaluation.yaml` - Multi-turn conversation examples
- `test_simple_plugin_example.py` - Pytest integration examples
Run the examples:
```bash
# Test with the example calculator server
pymcpevals run examples/local_server_basic.yaml

# Run pytest examples
cd examples && pytest test_simple_plugin_example.py
```
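The bundled calculator server is an ordinary FastMCP server. If you want to experiment without cloning the repo, a rough stand-in might look like the sketch below (this is illustrative, not a copy of `examples/calculator_server.py`):

```python
# minimal_calculator.py - illustrative FastMCP server, not the shipped example
from fastmcp import FastMCP

mcp = FastMCP("Calculator")


@mcp.tool()
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b


@mcp.tool()
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b


@mcp.tool()
def divide(a: float, b: float) -> float:
    """Divide a by b. Raising on zero lets error handling be evaluated."""
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b


if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```

Point `server.command` in your evaluation YAML at a file like this and prompts such as "What is 15 + 27?" will exercise the `add` tool.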
Setup:

```bash
pip install pymcpevals
export OPENAI_API_KEY="sk-..."  # or ANTHROPIC_API_KEY
export GEMINI_API_KEY="..."     # for Gemini models
```
Sample output:

```
┌───────────────────────────────────┬────────┬─────┬──────┬─────┬──────┬──────┬──────┬───────┐
│ Name                              │ Status │ Acc │ Comp │ Rel │ Clar │ Reas │ Avg  │ Tools │
├───────────────────────────────────┼────────┼─────┼──────┼─────┼──────┼──────┼──────┼───────┤
│ What is 15 + 27?                  │ PASS   │ 4.5 │ 4.2  │ 5.0 │ 4.8  │ 4.1  │ 4.52 │ ✅    │
│ What happens if I divide 10 by 0? │ PASS   │ 4.0 │ 4.1  │ 4.5 │ 4.2  │ 3.8  │ 4.12 │ ✅    │
│ Multi-turn test                   │ PASS   │ 4.2 │ 4.5  │ 4.8 │ 4.1  │ 4.3  │ 4.38 │ ✅    │
└───────────────────────────────────┴────────┴─────┴──────┴─────┴──────┴──────┴──────┴───────┘

Summary: 3/3 passed (100.0%) - Average: 4.34/5.0
```
The detailed report adds expected vs. actual tools, timing, and error handling:

```
┌──────────────────────────┬────────┬───────┬────────────────┬────────────────┬──────┬────────┬───────────────────────────┐
│ Test                     │ Status │ Score │ Expected Tools │ Tools Used     │ Time │ Errors │ Notes                     │
├──────────────────────────┼────────┼───────┼────────────────┼────────────────┼──────┼────────┼───────────────────────────┤
│ What is 15 + 27?         │ PASS   │ 4.5   │ add            │ add            │ 12ms │ 0      │ OK                        │
│ What happens if I div... │ PASS   │ 4.1   │ divide         │ divide         │ 8ms  │ 1      │ Handled error correctly   │
│ Multi-turn test          │ PASS   │ 4.4   │ add, multiply  │ add, multiply  │ 23ms │ 0      │ Tool chaining successful  │
└──────────────────────────┴────────┴───────┴────────────────┴────────────────┴──────┴────────┴───────────────────────────┘

🔧 Tool Execution Details:
• add: Called 2 times, avg 10ms, 100% success rate
• divide: Called 1 time, 8ms, handled error gracefully
• multiply: Called 1 time, 13ms, 100% success rate

Summary: 3/3 passed (100.0%) - Average: 4.33/5.0
```
- 🎯 Server-Focused Testing: Test your server capabilities, not LLM behavior
- ✅ Instant Tool Validation: Get immediate feedback if wrong tools are called (no LLM needed)
- 🔧 Tool Execution Insights: See success rates, timing, and error handling
- 🔄 Multi-turn Validation: Test tool chaining and state management
- 📊 Capability Scoring: LLM judges server tool performance, ignoring conversation style
- 🛠️ Easy Integration: Works with any MCP server via FastMCP
- 📋 CI/CD Integration: JUnit XML output for automated testing pipelines
- 📈 Progress Tracking: Monitor improvement over time with consistent scoring
- 🔁 Regression Testing: Ensure new changes don't break existing functionality
- ⚖️ Model Comparison: Test across different LLM providers
🙏 Huge kudos to mcp-evals - This Python package was heavily inspired by the excellent Node.js implementation by @mclenhard.
If you're working in a Node.js environment, definitely check out the original mcp-evals project, which also includes GitHub Action integration and monitoring capabilities.
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
MIT - see LICENSE file.