A uv tool for testing MCP servers using an LLM-as-a-Judge approach.

Many MCP servers fail in practice because of poorly engineered prompts, tool descriptions, and usage examples. This causes LLMs to confuse tools, pass wrong parameters, and ultimately fail at their tasks. The problem is magnified in complex workflows that combine multiple MCP servers, where clear and precise prompts are crucial to avoid tool confusion.

MCP-Judge addresses this by automating testing with an LLM-as-a-Judge approach. It evaluates how well your LLM understands and uses your tools, so you can check that your MCP server is robust and reliable before real-world use. It also supports evaluating workflows in which several MCP servers are used together.
- Multi-Model Testing: Test multiple LLM models simultaneously
- Judge-Based Evaluation: Uses an LLM judge to evaluate tool usage and response quality
- MCP Tools Integration: Supports loading custom MCP tools or built-in pharmacology tools
- Rich Output: Beautiful terminal output with progress tracking and detailed results
- Flexible Configuration: Customizable system prompts, judge prompts, and model settings
- Export Results: Save detailed results to JSON for further analysis
cd mcp-judge
uv sync

# Test with default settings (uses Gemini 2.5 Pro)
uv run mcp-judge run examples/sample_questions.json

Each question in the JSON file has the following fields:
- question: The question/task to be performed
- expected_tools: List of tool names that should be used
- expected_parameters: Dictionary of expected parameters for the tools
- answer: Reference answer for comparison
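As a rough illustration of the question fields, the sketch below builds a one-question file in Python and checks that each entry carries all four fields. The top-level layout (a JSON list of objects), the `query` parameter name, and all values are assumptions made for illustration; examples/sample_questions.json in the repository remains the authoritative reference for the format.

```python
import json

# Hypothetical question entry. The four field names come from this README;
# the values, the "query" parameter name, and the list-of-objects layout
# are illustrative assumptions.
questions = [
    {
        "question": "Find targets related to dopamine and save them to a file.",
        "expected_tools": ["search_targets_to_file"],
        "expected_parameters": {"query": "dopamine"},
        "answer": "A file containing dopamine-related targets was created.",
    }
]

REQUIRED_FIELDS = {"question", "expected_tools", "expected_parameters", "answer"}


def validate(entries):
    """Return the indices of entries that lack any of the documented fields."""
    return [i for i, entry in enumerate(entries) if not REQUIRED_FIELDS <= entry.keys()]


# Serialize and parse again to confirm the structure is valid JSON.
text = json.dumps(questions, indent=2)
loaded = json.loads(text)
print(text)
```

Running `validate(loaded)` before a test run is a cheap way to catch a misspelled field name before any model calls are made.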
The judge evaluates responses based on:
- Correct Tool Usage: Whether the right MCP tools were called
- Parameter Accuracy: Whether tool parameters are appropriate and well-formed
- Logical Workflow: Whether tools are used in a sensible sequence
- Information Retrieval: Whether essential information was retrieved correctly
The judge returns either "PASS" or "FAIL [reason]".
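If you post-process judge output yourself, the verdict string can be split into a pass flag and a reason. The following is a minimal sketch, not MCP-Judge's own parser: `Verdict` and `parse_verdict` are hypothetical names, and the sketch assumes the reason may or may not be wrapped in literal brackets.

```python
from typing import NamedTuple, Optional


class Verdict(NamedTuple):
    passed: bool
    reason: Optional[str]


def parse_verdict(raw: str) -> Verdict:
    """Parse a judge reply of the form 'PASS' or 'FAIL [reason]'."""
    text = raw.strip()
    upper = text.upper()
    if upper.startswith("PASS"):
        return Verdict(True, None)
    if upper.startswith("FAIL"):
        # Drop the 'FAIL' prefix, then trim separators and optional brackets.
        reason = text[4:].strip(" :-[]") or None
        return Verdict(False, reason)
    raise ValueError(f"Unrecognized judge verdict: {raw!r}")


print(parse_verdict("FAIL [wrong tool called]"))
```

Raising on anything that is neither PASS nor FAIL keeps a malformed judge reply from being silently counted as a failure of the model under test.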
Use --pharmacology-tools to load pharmacology MCP tools:
- search_targets_to_file
- search_ligands_to_file
- get_target_interactions_to_file
- get_ligand_interactions_to_file
Create a Python module with your MCP tools and use --tools-module:
# my_tools.py
def my_search_tool(query: str, filename: str) -> str:
    """Search and save results to file."""
    # Implementation here
    return "Search completed and saved to " + filename

def my_analysis_tool(data_id: int) -> str:
    """Analyze data by ID."""
    # Implementation here
    return f"Analysis completed for ID {data_id}"

Then load specific tools:
uv run mcp-judge run questions.json \
    --tools-module my_tools.py \
    --tool-name my_search_tool \
    --tool-name my_analysis_tool

Example output from a run on examples/sample_questions.json with the pharmacology tools loaded:
✅ Loaded 3 questions from examples/sample_questions.json
✅ Loaded 4 pharmacology tools
✅ Available tools: search_targets_to_file, search_ligands_to_file, get_target_interactions_to_file, get_ligand_interactions_to_file
✅ Generated default system prompt from available tools
✅ Initialized judge agent with model: openai/gpt-4o
✅ Initialized test runner
🔄 Testing model: openai/gpt-4o-mini
Testing openai/gpt-4o-mini ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 3/3
📊 Results: 3/3 passed (100.0%)
📋 Overall Results:
Model Comparison
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Model                ┃ Passed ┃ Total ┃ Pass Rate ┃ Status       ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ openai/gpt-4o-mini   │ 3      │ 3     │ 100.0%    │ 🟢 Excellent │
└──────────────────────┴────────┴───────┴───────────┴──────────────┘
💾 Detailed results saved to results.json
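The pass-rate figures in the output above are simple arithmetic over per-question outcomes. The sketch below reproduces that calculation; `summarize` is a hypothetical helper, and representing outcomes as a list of booleans is an assumption, since the schema of results.json is not described here.

```python
def summarize(model: str, outcomes: list[bool]) -> str:
    """Format a pass-rate line from per-question outcomes (True means PASS)."""
    passed = sum(outcomes)
    total = len(outcomes)
    # Guard against an empty question set to avoid division by zero.
    rate = 100.0 * passed / total if total else 0.0
    return f"{model}: {passed}/{total} passed ({rate:.1f}%)"


print(summarize("openai/gpt-4o-mini", [True, True, True]))
# openai/gpt-4o-mini: 3/3 passed (100.0%)
```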
Set up your API keys by copying the template file and filling in your actual keys:
# Copy the template file
cp .env.template .env
# Edit .env and add your actual API keys
# OPENAI_API_KEY=your-actual-openai-key
# ANTHROPIC_API_KEY=your-actual-anthropic-key
# GOOGLE_API_KEY=your-actual-google-key

Alternatively, you can set environment variables directly:
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"

Dependencies:
- just-agents>=0.8.2 - LLM agent framework
- typer>=0.16.0 - CLI framework
- eliot>=1.17.5 - Structured logging
- rich>=13.0.0 - Rich terminal output
- pydantic>=2.11.7 - Data validation
- python-dotenv>=1.1.1 - Environment configuration
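Before starting a run, it can help to confirm that the API keys from the setup steps are actually visible to the process. This is a minimal sketch using only the standard library; `missing_keys` is a hypothetical helper, and treating all three keys as required is an assumption, since you may only need the keys for the providers you test.

```python
import os

# Key names taken from the setup steps in this README.
PROVIDER_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env=None, required=PROVIDER_KEYS):
    """Return the names of required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_keys({"OPENAI_API_KEY": "sk-example"}))
# ['ANTHROPIC_API_KEY', 'GOOGLE_API_KEY']
```

Calling `missing_keys()` with no arguments checks the current process environment. Keys stored only in a .env file will not appear there unless the file has been loaded first.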