Initial merbench release #6
Merged
Initial release of Merbench, a comprehensive evaluation dashboard and benchmarking toolkit for LLM agents using Model Context Protocol (MCP) integration.
Problem:
The project lacked systematic evaluation capabilities for MCP-enabled agents. There was no way to benchmark multiple LLMs on complex multi-server tasks, compare cost/performance trade-offs, or validate agent behaviour with real-world MCP server interactions. Additionally, AWS Bedrock model support was missing, and the existing cost tracking was incomplete.
Solution:
Built Merbench, a production-ready evaluation platform that tests LLM agents on Mermaid diagram generation tasks using MCP servers for validation and error correction. Added comprehensive multi-model support including AWS Bedrock, created an interactive Streamlit dashboard with leaderboards and Pareto analysis, and implemented sophisticated cost tracking with real pricing data across providers.
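As an illustration of that flow, here is a minimal sketch of an MCP-backed eval agent, assuming pydantic-ai's MCP client API (`MCPServerStdio`, `mcp_servers=`); the server script name and prompts are invented, not taken from this PR:

```python
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Hypothetical MCP server exposing a Mermaid validation tool.
server = MCPServerStdio("python", args=["mermaid_validator_server.py"])

agent = Agent(
    "openai:gpt-4o",  # any supported model id, including Bedrock ones
    mcp_servers=[server],
    system_prompt="Generate valid Mermaid source for the requested diagram.",
)

async def main() -> None:
    # run_mcp_servers() keeps the stdio server alive for the duration of the run,
    # so the agent can call its validation tool to check and repair diagrams.
    async with agent.run_mcp_servers():
        result = await agent.run("Draw a flowchart of a CI pipeline.")
        print(result.output)  # `result.data` on older pydantic-ai releases

asyncio.run(main())
```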
Unlocks:
Detailed breakdown of changes:
- Added `.env` variables (`AWS_REGION`, `AWS_PROFILE`) for AWS configuration
- Converted `eval()` parsing to a secure JSON format in `costs.json`; added friendly model names and multi-tier pricing structures (see the loading sketch after this list)
- Enhanced `merbench_ui.py` with smart label positioning for Pareto plots, richer UI descriptions, and better cost/performance visualisation options (a Pareto-frontier sketch follows below)
- Extended `evals_pydantic_mcp.py` with Bedrock model creation logic and improved error handling; updated `run_multi_evals.py` with refined parallelism and new default model configurations (see the Bedrock sketch below)
- Updated `mermaid_diagrams.py`
- Added a `make leaderboard` command, updated dependencies for Bedrock support (boto3, botocore, s3transfer), and improved schema validation in `dashboard_config.py`
- Removed the `eval_basic_mcp_use` directory reference from README.md to align documentation with the actual codebase structure
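For the `costs.json` change, a minimal sketch of the safer parsing; the schema in the comment (friendly names plus tiered input/output prices) is an assumption about the new format, not the file's actual layout:

```python
import json
from pathlib import Path

def load_costs(path: str = "costs.json") -> dict:
    """Parse the pricing table with json.loads instead of eval().

    eval() executes arbitrary Python found in the file; json.loads only
    accepts data, which is the point of the migration.
    """
    return json.loads(Path(path).read_text())

# Hypothetical entry illustrating friendly names and multi-tier pricing
# (USD per million tokens; values invented for illustration):
# {
#   "anthropic.claude-3-5-sonnet": {
#     "friendly_name": "Claude 3.5 Sonnet",
#     "tiers": [{"input": 3.0, "output": 15.0}]
#   }
# }
```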
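The dashboard's Pareto analysis amounts to keeping models that no cheaper model outscores; a minimal sketch of that computation (the function and tuple layout are hypothetical, not lifted from `merbench_ui.py`):

```python
def pareto_frontier(runs: list[tuple[float, float, str]]) -> list[tuple[float, float, str]]:
    """Keep (cost, score, model) points with no cheaper-and-better alternative.

    After sorting by cost, a point is on the frontier iff its score beats
    every point that costs less.
    """
    frontier: list[tuple[float, float, str]] = []
    best_score = float("-inf")
    for cost, score, model in sorted(runs):
        if score > best_score:
            frontier.append((cost, score, model))
            best_score = score
    return frontier

# Example: the mid-priced model is dominated (costs more, scores less) and drops out.
runs = [(0.10, 0.62, "small"), (0.45, 0.58, "mid"), (0.90, 0.81, "large")]
assert [m for _, _, m in pareto_frontier(runs)] == ["small", "large"]
```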
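The Bedrock wiring in `evals_pydantic_mcp.py` presumably resolves the new `.env` variables into a boto3 session; a sketch under that assumption, using pydantic-ai's Bedrock support (the helper name and model id are invented):

```python
import os

import boto3
from pydantic_ai.models.bedrock import BedrockConverseModel
from pydantic_ai.providers.bedrock import BedrockProvider

def make_bedrock_model(model_id: str) -> BedrockConverseModel:
    """Build a pydantic-ai Bedrock model from the AWS_* settings in .env."""
    session = boto3.Session(
        profile_name=os.getenv("AWS_PROFILE"),  # e.g. "default"
        region_name=os.getenv("AWS_REGION"),    # e.g. "us-east-1"
    )
    client = session.client("bedrock-runtime")  # Bedrock's inference endpoint
    return BedrockConverseModel(model_id, provider=BedrockProvider(bedrock_client=client))

# model = make_bedrock_model("anthropic.claude-3-5-sonnet-20240620-v1:0")
```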