@andrewginns commented Jun 16, 2025

Initial release of Merbench, a comprehensive evaluation dashboard and benchmarking toolkit for LLM agents using Model Context Protocol (MCP) integration.

Problem:

The project lacked systematic evaluation capabilities for MCP-enabled agents. There was no way to benchmark multiple LLMs on complex multi-server tasks, compare cost/performance trade-offs, or validate agent behaviour with real-world MCP server interactions. Additionally, AWS Bedrock model support was missing, and the existing cost tracking was incomplete.

Solution:

Built Merbench, a production-ready evaluation platform that tests LLM agents on Mermaid diagram generation tasks, using MCP servers for validation and error correction. Added comprehensive multi-model support including AWS Bedrock, created an interactive Streamlit dashboard with leaderboards and Pareto analysis, and implemented cost tracking with real pricing data across providers.
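
For context, a rough sketch of what the MCP-backed evaluation agent can look like with pydantic-ai is shown below. The server command, model string, and prompt are placeholders rather than the actual evals_pydantic_mcp.py code, and the exact client API may differ between pydantic-ai versions.

```python
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Placeholder launch command; the real setup uses the project's Mermaid
# validation MCP server(s), which are not reproduced here.
validator = MCPServerStdio("python", args=["-m", "my_mermaid_validator_server"])

agent = Agent(
    "openai:gpt-4o",  # any supported provider:model string
    mcp_servers=[validator],
    system_prompt="Generate a Mermaid diagram and use the available tools to validate it.",
)


async def generate_and_validate(task: str):
    # The context manager starts and stops the MCP server processes around the run.
    async with agent.run_mcp_servers():
        return await agent.run(task)
```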

Unlocks:

  • Multi-Model Benchmarking: Compare OpenAI, Gemini, and Bedrock models on identical MCP tasks
  • Cost/Performance Analysis: Make data-driven model selection decisions with integrated pricing
  • MCP Server Validation: Test agent tool usage patterns and multi-server orchestration
  • Extensible Framework: Easy addition of new models, evaluation protocols, and MCP servers
  • Production Monitoring: Real-world agent performance tracking and optimisation insights

Detailed breakdown of changes:

  • AWS Bedrock Integration: Added full Claude model family support with region/profile configuration via .env variables (AWS_REGION, AWS_PROFILE); a configuration sketch follows this list
  • Enhanced Cost Tracking: Migrated from unsafe eval() parsing to a secure JSON format in costs.json, and added friendly model names and multi-tier pricing structures; an example structure follows this list
  • Dashboard Improvements: Upgraded merbench_ui.py with smart label positioning for Pareto plots, richer UI descriptions, and better cost/performance visualisation options; a Pareto-frontier sketch follows this list
  • Evaluation Engine: Enhanced evals_pydantic_mcp.py with Bedrock model creation logic and improved error handling; updated run_multi_evals.py with refined parallelism and new default model configurations (a concurrency sketch follows this list)
  • Content Quality: Fixed Mermaid diagram syntax errors and improved the example diagrams in mermaid_diagrams.py
  • Developer Experience: Added a make leaderboard command, updated dependencies for Bedrock support (boto3, botocore, s3transfer), and improved schema validation in dashboard_config.py
  • Documentation Cleanup: Removed references to the non-existent eval_basic_mcp_use directory from README.md so the documentation matches the actual codebase structure
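
A minimal sketch of the .env-driven Bedrock configuration described above, assuming boto3 session wiring and python-dotenv for loading the file; the helper name and fallback region are illustrative, not the actual implementation.

```python
import os

import boto3
from dotenv import load_dotenv  # assumes python-dotenv is how .env is read


def create_bedrock_client():
    """Build a Bedrock runtime client from AWS_PROFILE / AWS_REGION in .env."""
    load_dotenv()
    session = boto3.Session(
        profile_name=os.getenv("AWS_PROFILE"),  # None falls back to the default credential chain
        region_name=os.getenv("AWS_REGION", "us-east-1"),  # fallback is illustrative
    )
    return session.client("bedrock-runtime")
```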
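
The costs.json schema is not spelled out in this PR; the snippet below shows one plausible shape with friendly names and tiered pricing, plus a loader that uses the json module instead of eval(). Field names and prices are illustrative assumptions.

```python
import json
from pathlib import Path

# Hypothetical costs.json entry (keys and values are illustrative, not real pricing data):
# {
#   "gemini-2.5-pro": {
#     "friendly_name": "Gemini 2.5 Pro",
#     "tiers": [
#       {"max_input_tokens": 200000, "input_per_million": 1.25, "output_per_million": 10.0}
#     ]
#   }
# }


def load_model_costs(path: str = "costs.json") -> dict:
    """Parse pricing data with json.loads rather than eval() on file contents."""
    return json.loads(Path(path).read_text())
```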
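
Merbench's cost/performance view centres on Pareto analysis; the helper below is a generic sketch of how non-dominated runs (lower cost, higher score) could be selected before plotting. The run dictionary keys are assumptions, not the dashboard's actual schema.

```python
def pareto_frontier(runs: list[dict]) -> list[dict]:
    """Return runs not dominated by any other run.

    A run is dominated if another run costs no more, scores no less, and is
    strictly better on at least one axis. Each run is assumed to look like
    {"model": ..., "cost": float, "score": float}; these keys are illustrative.
    """
    frontier = []
    for candidate in runs:
        dominated = any(
            other["cost"] <= candidate["cost"]
            and other["score"] >= candidate["score"]
            and (other["cost"] < candidate["cost"] or other["score"] > candidate["score"])
            for other in runs
        )
        if not dominated:
            frontier.append(candidate)
    return sorted(frontier, key=lambda r: r["cost"])
```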
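
run_multi_evals.py's actual concurrency controls are not reproduced here; this is a generic bounded-parallelism sketch using asyncio.Semaphore, with hypothetical helper names and an illustrative model list and cap.

```python
import asyncio

DEFAULT_MODELS = ["gemini-2.5-pro", "o4-mini", "claude-sonnet-4"]  # illustrative defaults
MAX_CONCURRENCY = 4  # illustrative cap, not the value used in run_multi_evals.py


async def run_single_eval(model: str) -> dict:
    """Placeholder for one model's evaluation run (agent + MCP servers + scoring)."""
    await asyncio.sleep(0)  # stand-in for the real evaluation call
    return {"model": model, "score": None}


async def run_all(models: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(model: str) -> dict:
        async with sem:  # limit how many evaluations run at once
            return await run_single_eval(model)

    return await asyncio.gather(*(bounded(m) for m in models))


if __name__ == "__main__":
    print(asyncio.run(run_all(DEFAULT_MODELS)))
```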

- Remove outdated eval_basic_mcp_use section from README
- Fix naming inconsistency in adk_mcp.py (pydantic -> adk)
- Improve model costs parsing safety with restricted eval
- Fix mermaid diagram string formatting
- Add new evaluation scripts and results directory
… from CSV to JSON format

- Update merbench_ui.py to use the new JSON format
@andrewginns merged commit fc556f5 into main Jun 16, 2025
1 check passed
@andrewginns deleted the initial-merbench-release branch June 29, 2025 12:25