Add various difficulty evals and a UI for comparison #5
This PR introduces evaluation modules for Mermaid diagram fixing across multiple difficulty levels, adds robust multi-model benchmarking, and provides a Streamlit dashboard for interactive comparison. It expands the repo's ability to evaluate, compare, and visualise agentic model performance using multiple MCP servers and tool integrations.
Problem:

Solution: Adds a multi-model evaluation runner (`run_multi_evals.py`), a Streamlit dashboard (`merbench_ui.py`), and a configurable UI backend for result visualisation, filtering, cost analysis, and leaderboard generation.

Unlocks:
Detailed breakdown of changes:

- Added `agents_mcp_usage/multi_mcp/eval_multi_mcp/` with:
  - `evals_pydantic_mcp.py`: Core evaluation logic supporting multi-level difficulty, retry logic, rich metrics, and robust error handling.
  - `run_multi_evals.py`: Script to run evaluations across multiple models; supports both parallel and sequential execution and writes results to CSV (see the runner sketch after this list).
  - `dashboard_config.py`: Configuration for the new UI dashboard, specifying metrics, grouping, cost breakdown, and plot settings.
  - `merbench_ui.py`: Streamlit-based dashboard for visualising and comparing evaluation results (see the dashboard sketch after this list).
  - `costs.csv`: Pricing table for supported models, used for cost calculation in the dashboard and reports (see the cost sketch after this list).
  - `README.md`: Comprehensive documentation for the new evaluation system, covering usage, error handling, output schema, and available options.
- Updated the basic MCP usage examples (`basic_mcp_use/*.py`) with improved docstrings and argument clarity.
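
The parallel/sequential behaviour of the runner could look roughly like the sketch below. This is a minimal illustration, not the actual `run_multi_evals.py` implementation: the `run_model_eval` coroutine, the result fields, and the model names are all hypothetical.

```python
# Hypothetical sketch of a multi-model runner; not the real run_multi_evals.py API.
import asyncio
import csv


async def run_model_eval(model: str, difficulty: str) -> dict:
    # Placeholder for one model's evaluation run; the real script would
    # drive the agent against the MCP servers here.
    await asyncio.sleep(0)  # stand-in for the actual agent call
    return {"model": model, "difficulty": difficulty, "passed": True}


async def run_all(models: list[str], difficulty: str, parallel: bool) -> list[dict]:
    if parallel:
        # Launch every model's evaluation concurrently and wait for all of them.
        return await asyncio.gather(*(run_model_eval(m, difficulty) for m in models))
    # Sequential mode: one model at a time, gentler on rate limits.
    return [await run_model_eval(m, difficulty) for m in models]


def write_results(rows: list[dict], path: str) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["model", "difficulty", "passed"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    results = asyncio.run(run_all(["model-a", "model-b"], "easy", parallel=True))
    write_results(results, "results.csv")
```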
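For the cost calculation, a pricing table keyed by model name is enough to price a run from its token counts. The column names and per-million-token convention below are assumptions about `costs.csv`, not its confirmed schema.

```python
# Illustrative cost calculation; costs.csv column names are assumed, not confirmed.
import csv


def load_prices(path: str = "costs.csv") -> dict[str, tuple[float, float]]:
    """Map model name -> (input $/1M tokens, output $/1M tokens)."""
    prices = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            prices[row["model"]] = (float(row["input_price"]), float(row["output_price"]))
    return prices


def run_cost(model: str, input_tokens: int, output_tokens: int,
             prices: dict[str, tuple[float, float]]) -> float:
    """Cost of a single evaluation run given its token usage."""
    inp, out = prices[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000
```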
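And the dashboard concept, reduced to its essentials: load the results CSV, filter by model, and render a leaderboard. The column names and layout here are illustrative only; the real `merbench_ui.py` is configured via `dashboard_config.py`.

```python
# Minimal Streamlit leaderboard sketch; column names ("model", "passed", "cost")
# are assumptions, not the actual results schema.
import pandas as pd
import streamlit as st

df = pd.read_csv("results.csv")

# Sidebar filter for which models to compare.
all_models = sorted(df["model"].unique())
selected = st.sidebar.multiselect("Models", all_models, default=all_models)
view = df[df["model"].isin(selected)]

# Leaderboard: success rate and total cost per model.
leaderboard = (
    view.groupby("model")
    .agg(success_rate=("passed", "mean"), total_cost=("cost", "sum"))
    .sort_values("success_rate", ascending=False)
)
st.title("Merbench leaderboard")
st.dataframe(leaderboard)
st.bar_chart(leaderboard["success_rate"])
```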