@andrewginns commented Jun 16, 2025

Initial release of Merbench, a comprehensive evaluation dashboard and benchmarking toolkit for LLM agents using Model Context Protocol (MCP) integration.

Problem:

The project lacked systematic evaluation capabilities for MCP-enabled agents. There was no way to benchmark multiple LLMs on complex multi-server tasks, compare cost/performance trade-offs, or validate agent behaviour with real-world MCP server interactions. Additionally, AWS Bedrock model support was missing, and the existing cost tracking was incomplete.

Solution:

Built Merbench, a production-ready evaluation platform that tests LLM agents on Mermaid diagram generation tasks, using MCP servers for validation and error correction. Added comprehensive multi-model support including AWS Bedrock, created an interactive Streamlit dashboard with leaderboards and Pareto analysis, and implemented cost tracking with real pricing data across providers.
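
For context, a rough sketch of what the MCP-backed evaluation agent can look like with pydantic-ai is shown below. The server command, model string, and prompt are placeholders rather than the actual evals_pydantic_mcp.py code, and the exact client API may differ between pydantic-ai versions.

```python
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Placeholder launch command; the real setup uses the project's Mermaid
# validation MCP server(s), which are not reproduced here.
validator = MCPServerStdio("python", args=["-m", "my_mermaid_validator_server"])

agent = Agent(
    "openai:gpt-4o",  # any supported provider:model string
    mcp_servers=[validator],
    system_prompt="Generate a Mermaid diagram and use the available tools to validate it.",
)


async def generate_and_validate(task: str):
    # The context manager starts and stops the MCP server processes around the run.
    async with agent.run_mcp_servers():
        return await agent.run(task)
```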

Unlocks:

  • Multi-Model Benchmarking: Compare OpenAI, Gemini, and Bedrock models on identical MCP tasks
  • Cost/Performance Analysis: Make data-driven model selection decisions with integrated pricing
  • MCP Server Validation: Test agent tool usage patterns and multi-server orchestration
  • Extensible Framework: Easy addition of new models, evaluation protocols, and MCP servers
  • Production Monitoring: Real-world agent performance tracking and optimisation insights

Detailed breakdown of changes:

  • AWS Bedrock Integration: Added full Claude model family support with region/profile configuration via .env variables (AWS_REGION, AWS_PROFILE); a configuration sketch follows this list
  • Enhanced Cost Tracking: Migrated from unsafe eval() parsing to a secure JSON format in costs.json, and added friendly model names and multi-tier pricing structures; an example structure follows this list
  • Dashboard Improvements: Upgraded merbench_ui.py with smart label positioning for Pareto plots, richer UI descriptions, and better cost/performance visualisation options; a Pareto-frontier sketch follows this list
  • Evaluation Engine: Enhanced evals_pydantic_mcp.py with Bedrock model creation logic and improved error handling; updated run_multi_evals.py with refined parallelism and new default model configurations (a concurrency sketch follows this list)
  • Content Quality: Fixed Mermaid diagram syntax errors and improved the example diagrams in mermaid_diagrams.py
  • Developer Experience: Added a make leaderboard command, updated dependencies for Bedrock support (boto3, botocore, s3transfer), and improved schema validation in dashboard_config.py
  • Documentation Cleanup: Removed references to the non-existent eval_basic_mcp_use directory from README.md so the documentation matches the actual codebase structure
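
A minimal sketch of the .env-driven Bedrock configuration described above, assuming boto3 session wiring and python-dotenv for loading the file; the helper name and fallback region are illustrative, not the actual implementation.

```python
import os

import boto3
from dotenv import load_dotenv  # assumes python-dotenv is how .env is read


def create_bedrock_client():
    """Build a Bedrock runtime client from AWS_PROFILE / AWS_REGION in .env."""
    load_dotenv()
    session = boto3.Session(
        profile_name=os.getenv("AWS_PROFILE"),  # None falls back to the default credential chain
        region_name=os.getenv("AWS_REGION", "us-east-1"),  # fallback is illustrative
    )
    return session.client("bedrock-runtime")
```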
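
The costs.json schema is not spelled out in this PR; the snippet below shows one plausible shape with friendly names and tiered pricing, plus a loader that uses the json module instead of eval(). Field names and prices are illustrative assumptions.

```python
import json
from pathlib import Path

# Hypothetical costs.json entry (keys and values are illustrative, not real pricing data):
# {
#   "gemini-2.5-pro": {
#     "friendly_name": "Gemini 2.5 Pro",
#     "tiers": [
#       {"max_input_tokens": 200000, "input_per_million": 1.25, "output_per_million": 10.0}
#     ]
#   }
# }


def load_model_costs(path: str = "costs.json") -> dict:
    """Parse pricing data with json.loads rather than eval() on file contents."""
    return json.loads(Path(path).read_text())
```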
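
Merbench's cost/performance view centres on Pareto analysis; the helper below is a generic sketch of how non-dominated runs (lower cost, higher score) could be selected before plotting. The run dictionary keys are assumptions, not the dashboard's actual schema.

```python
def pareto_frontier(runs: list[dict]) -> list[dict]:
    """Return runs not dominated by any other run.

    A run is dominated if another run costs no more, scores no less, and is
    strictly better on at least one axis. Each run is assumed to look like
    {"model": ..., "cost": float, "score": float}; these keys are illustrative.
    """
    frontier = []
    for candidate in runs:
        dominated = any(
            other["cost"] <= candidate["cost"]
            and other["score"] >= candidate["score"]
            and (other["cost"] < candidate["cost"] or other["score"] > candidate["score"])
            for other in runs
        )
        if not dominated:
            frontier.append(candidate)
    return sorted(frontier, key=lambda r: r["cost"])
```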
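
run_multi_evals.py's actual concurrency controls are not reproduced here; this is a generic bounded-parallelism sketch using asyncio.Semaphore, with hypothetical helper names and an illustrative model list and cap.

```python
import asyncio

DEFAULT_MODELS = ["gemini-2.5-pro", "o4-mini", "claude-sonnet-4"]  # illustrative defaults
MAX_CONCURRENCY = 4  # illustrative cap, not the value used in run_multi_evals.py


async def run_single_eval(model: str) -> dict:
    """Placeholder for one model's evaluation run (agent + MCP servers + scoring)."""
    await asyncio.sleep(0)  # stand-in for the real evaluation call
    return {"model": model, "score": None}


async def run_all(models: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(model: str) -> dict:
        async with sem:  # limit how many evaluations run at once
            return await run_single_eval(model)

    return await asyncio.gather(*(bounded(m) for m in models))


if __name__ == "__main__":
    print(asyncio.run(run_all(DEFAULT_MODELS)))
```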

- Remove outdated eval_basic_mcp_use section from README
- Fix naming inconsistency in adk_mcp.py (pydantic -> adk)
- Improve model costs parsing safety with restricted eval
- Fix mermaid diagram string formatting
- Add new evaluation scripts and results directory
… from CSV to JSON format

- Update merbench_ui.py to use the new JSON format
@andrewginns merged commit fc556f5 into main Jun 16, 2025
1 check passed
@andrewginns deleted the initial-merbench-release branch June 29, 2025 12:25