- Remove outdated eval_basic_mcp_use section from README
- Fix naming inconsistency in adk_mcp.py (pydantic -> adk)
- Improve model costs parsing safety with restricted eval
- Fix mermaid diagram string formatting
- Add new evaluation scripts and results directory
- … from CSV to JSON format
- Update merbench_ui.py to use the new JSON format
Initial release of Merbench, a comprehensive evaluation dashboard and benchmarking toolkit for LLM agents using Model Context Protocol (MCP) integration.
Problem:
The project lacked systematic evaluation capabilities for MCP-enabled agents. There was no way to benchmark multiple LLMs on complex multi-server tasks, compare cost/performance trade-offs, or validate agent behaviour with real-world MCP server interactions. Additionally, AWS Bedrock model support was missing, and the existing cost tracking was incomplete.
Solution:
Built Merbench - a production-ready evaluation platform that tests LLM agents on Mermaid diagram generation tasks using MCP servers for validation and error correction. Added comprehensive multi-model support including AWS Bedrock, created an interactive Streamlit dashboard with leaderboards and Pareto analysis, and implemented sophisticated cost tracking with real pricing data across providers.
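The Pareto analysis mentioned above boils down to selecting runs that no other run beats on both cost and score. A minimal sketch of that selection is below; it assumes per-run `cost` and `score` fields and is not the dashboard's actual data model.

```python
def pareto_frontier(runs: list[dict]) -> list[dict]:
    """Return runs that are Pareto-optimal: no other run is both cheaper and higher-scoring.

    Assumes each run dict has a 'cost' (lower is better) and a 'score' (higher is better).
    """
    frontier = []
    best_score = float("-inf")
    # Sort cheapest first; break cost ties by higher score so dominated ties are skipped.
    for run in sorted(runs, key=lambda r: (r["cost"], -r["score"])):
        if run["score"] > best_score:  # strictly improves on everything cheaper
            frontier.append(run)
            best_score = run["score"]
    return frontier
```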
Unlocks:
Detailed breakdown of changes:
- Added `.env` variables (`AWS_REGION`, `AWS_PROFILE`) for AWS Bedrock configuration (a Bedrock session sketch follows this list)
- Replaced `eval()` parsing with a secure JSON format in `costs.json`, added friendly model names and multi-tier pricing structures (see the cost-loading sketch below)
- Updated `merbench_ui.py` with smart label positioning for Pareto plots, richer UI descriptions, and better cost/performance visualisation options
- Updated `evals_pydantic_mcp.py` with Bedrock model creation logic and improved error handling; updated `run_multi_evals.py` with refined parallelism and new default model configurations (a concurrency sketch follows the list)
- Fixed string formatting in `mermaid_diagrams.py`
- Added a `make leaderboard` command, updated dependencies for Bedrock support (boto3, botocore, s3transfer), and improved schema validation in `dashboard_config.py`
- Removed references to the outdated `eval_basic_mcp_use` directory from README.md to align documentation with the actual codebase structure
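For the Bedrock configuration, a minimal sketch of wiring the `.env` variables into a boto3 session is shown below; how this session is actually handed to the eval script's model factory is an assumption on my part.

```python
import os

import boto3
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # picks up AWS_REGION and AWS_PROFILE from the local .env file

# Build a session from the configured profile/region; both fall back to boto3
# defaults when the variables are unset.
session = boto3.Session(
    profile_name=os.environ.get("AWS_PROFILE"),
    region_name=os.environ.get("AWS_REGION"),
)
bedrock_client = session.client("bedrock-runtime")
```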
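As a rough illustration of the `eval()`-to-JSON change, loading and applying the pricing data might look like this; the exact `costs.json` schema (field names, per-million-token units, tier handling) is assumed here, not taken from the file.

```python
import json
from pathlib import Path


def load_model_costs(path: str = "costs.json") -> dict:
    # json.load replaces the old eval()-based parsing, so the file can never execute code.
    with Path(path).open() as f:
        return json.load(f)


def estimate_cost(costs: dict, model: str, input_tokens: int, output_tokens: int) -> float:
    # Assumed entry shape: {"display_name": "...", "input": 3.0, "output": 15.0},
    # with prices in USD per 1M tokens; multi-tier entries would add a lookup step here.
    pricing = costs[model]
    return (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000
```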
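The multi-model parallelism in `run_multi_evals.py` can be pictured as bounded fan-out over (model, case) pairs; this is a stdlib-only sketch with a placeholder `run_eval`, not the script's real implementation.

```python
import asyncio


async def run_eval(model: str, case: str) -> dict:
    # Placeholder: in the real script this would run one agent/eval case end to end.
    await asyncio.sleep(0)
    return {"model": model, "case": case, "score": 0.0}


async def run_all(models: list[str], cases: list[str], max_concurrency: int = 4) -> list[dict]:
    # Bound concurrency so many models/cases don't hit provider rate limits at once.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(model: str, case: str) -> dict:
        async with sem:
            return await run_eval(model, case)

    return await asyncio.gather(*(bounded(m, c) for m in models for c in cases))


if __name__ == "__main__":
    results = asyncio.run(run_all(["model-a", "model-b"], ["flowchart", "sequence"]))
    print(results)
```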