Conversation

@andrewginns commented on Jun 9, 2025

This PR introduces evaluation modules for Mermaid diagram fixing across multiple difficulty levels, adds robust multi-model benchmarking, and provides a Streamlit dashboard for interactive comparison. It expands the repo’s ability to evaluate, compare, and visualise agentic model performance using multiple MCP servers and tool integrations.

Problem:

  • Evaluation functionality was limited to basic or single-case scenarios.
  • There was no systematic way to benchmark multiple models or difficulty levels.
  • There was no user interface for visualising evaluation results and resource usage.
  • Cost calculation, error handling, and metrics aggregation were inconsistent or manual.

Solution:

  • Added a new evaluation suite covering easy, medium, and hard cases for Mermaid diagram correction using multiple MCP servers.
  • Implemented parallel and sequential evaluation runners for multi-model benchmarking (run_multi_evals.py).
  • Introduced comprehensive error/retry handling and granular metrics collection (usage, tokens, tool calls, failure reasons); see the sketch after this list.
  • Developed a Streamlit dashboard (merbench_ui.py) and a configurable UI backend for result visualisation, filtering, cost analysis, and leaderboard generation.
  • Added cost schema/configuration and CSV result exporting for downstream analysis.
  • Improved docs and in-code comments for clarity and extensibility.
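
A minimal sketch of the retry-and-metrics pattern referenced above, in Python. The names (`run_with_retries`, `EvalMetrics`, the shape of the mapping returned by `case_fn`) are illustrative assumptions, not the actual identifiers in `evals_pydantic_mcp.py`:

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class EvalMetrics:
    """Granular per-case metrics collected during an evaluation run."""
    model: str
    difficulty: str
    success: bool = False
    attempts: int = 0
    duration_s: float = 0.0
    tool_calls: int = 0
    total_tokens: int = 0
    failure_reason: str | None = None


async def run_with_retries(case_fn, *, model: str, difficulty: str,
                           max_retries: int = 3, backoff_s: float = 2.0) -> EvalMetrics:
    """Run one evaluation case, retrying failures with linear backoff.

    `case_fn` is assumed to be an async callable that returns a mapping
    with usage information (tool calls, token counts).
    """
    metrics = EvalMetrics(model=model, difficulty=difficulty)
    start = time.monotonic()
    for attempt in range(1, max_retries + 1):
        metrics.attempts = attempt
        try:
            result = await case_fn(model=model, difficulty=difficulty)
            metrics.success = True
            metrics.tool_calls = result.get("tool_calls", 0)
            metrics.total_tokens = result.get("total_tokens", 0)
            break
        except Exception as exc:  # record the failure reason, then retry
            metrics.failure_reason = f"{type(exc).__name__}: {exc}"
            if attempt < max_retries:
                await asyncio.sleep(backoff_s * attempt)
    metrics.duration_s = time.monotonic() - start
    return metrics


if __name__ == "__main__":
    async def fake_case(model: str, difficulty: str) -> dict:
        return {"tool_calls": 3, "total_tokens": 850}

    print(asyncio.run(run_with_retries(fake_case, model="demo-model", difficulty="easy")))
```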

Unlocks:

  • Enables robust, repeatable benchmarking across LLM models and agent frameworks.
  • Provides a rich dashboard for model comparison by accuracy, cost, speed, and resource usage.
  • Simplifies diagnosing agent/tool failures and supports deeper analysis of model/tool interactions.
  • Facilitates future extension to other agent frameworks, evaluation types, or visualisation needs.

Detailed breakdown of changes:

  • Added new directory: agents_mcp_usage/multi_mcp/eval_multi_mcp/ with:
    • evals_pydantic_mcp.py: Core evaluation logic supporting multi-level difficulty, retry logic, rich metrics, and robust error handling.
    • run_multi_evals.py: Script to run evaluations across multiple models, supporting both parallel and sequential execution and writing results to CSV (a runner sketch follows this breakdown).
    • dashboard_config.py: Configuration for the new UI dashboard, specifying metrics, grouping, cost breakdown, and plot settings.
    • merbench_ui.py: Streamlit-based user dashboard for visualising and comparing evaluation results.
    • costs.csv: Pricing table for supported models (used for cost calculation in dashboard and reports).
    • README.md: Comprehensive documentation for the new evaluation system, covering usage, error handling, output schema, and available options.
  • Enhanced agent scripts (basic_mcp_use/*.py) with improved docstrings and argument clarity.
  • Added and refined extensive inline documentation and error-handling comments throughout the new modules.
  • All new features and files are modular, extensible, and designed to support both current and future MCP evaluation scenarios.
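
A sketch of the multi-model runner flow (fan out over models and difficulty levels, optionally in parallel; price the token usage; export CSV). Function names, the `costs.csv` column layout, and the result columns are assumptions for illustration, not the exact schema used by `run_multi_evals.py` or `costs.csv`:

```python
import asyncio
import csv


def load_costs(path: str = "costs.csv") -> dict[str, float]:
    """Read per-million-token pricing keyed by model name.

    Assumed layout: model,usd_per_1m_tokens (the real costs.csv may differ).
    """
    with open(path, newline="") as f:
        return {row["model"]: float(row["usd_per_1m_tokens"]) for row in csv.DictReader(f)}


async def evaluate(model: str, case: str) -> dict:
    """Placeholder for the real evaluation call in evals_pydantic_mcp.py."""
    await asyncio.sleep(0)  # the real version drives the agent and MCP servers
    return {"model": model, "case": case, "success": True, "total_tokens": 1200}


async def run_all(models: list[str], cases: list[str], parallel: bool = True) -> list[dict]:
    """Evaluate every (model, case) pair, either concurrently or one at a time."""
    pairs = [(m, c) for m in models for c in cases]
    if parallel:
        return list(await asyncio.gather(*(evaluate(m, c) for m, c in pairs)))
    return [await evaluate(m, c) for m, c in pairs]


def export_csv(rows: list[dict], costs: dict[str, float], out_path: str = "results.csv") -> None:
    """Attach an estimated cost column and write all rows to CSV for the dashboard."""
    for row in rows:
        rate = costs.get(row["model"], 0.0)
        row["cost_usd"] = row["total_tokens"] / 1_000_000 * rate
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    results = asyncio.run(run_all(["model-a", "model-b"], ["easy", "medium", "hard"]))
    # In the real flow the pricing would come from load_costs("costs.csv").
    export_csv(results, costs={"model-a": 0.15, "model-b": 0.60})
```

The sequential path is presumably there for rate-limited providers, where a parallel fan-out across models and cases would trip quota errors.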

@andrewginns merged commit 77725bb into main on Jun 9, 2025
1 check passed
@andrewginns deleted the add-levels-of-eval-difficulty branch on June 11, 2025 at 08:37