Conversation

@andrewginns commented on Jun 9, 2025

This PR introduces evaluation modules for Mermaid diagram fixing across multiple difficulty levels, adds robust multi-model benchmarking, and provides a Streamlit dashboard for interactive comparison. It expands the repo’s ability to evaluate, compare, and visualise agentic model performance using multiple MCP servers and tool integrations.

Problem:

  • Evaluation functionality was limited to basic or single-case scenarios.
  • There was no systematic way to benchmark multiple models or difficulty levels.
  • There was no user interface for visualising evaluation results and resource usage.
  • Cost calculation, error handling, and metrics aggregation were inconsistent or manual.

Solution:

  • Added a new evaluation suite covering easy, medium, and hard cases for Mermaid diagram correction using multiple MCP servers.
  • Implemented parallel and sequential evaluation runners for multi-model benchmarking (run_multi_evals.py).
  • Introduced comprehensive error/retry handling and granular metrics collection (usage, tokens, tool calls, failure reasons); see the sketch after this list.
  • Developed a Streamlit dashboard (merbench_ui.py) and a configurable UI backend for result visualisation, filtering, cost analysis, and leaderboard generation.
  • Added cost schema/configuration and CSV result exporting for downstream analysis.
  • Improved docs and in-code comments for clarity and extensibility.
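
A minimal sketch of the retry-and-metrics pattern referenced above, in Python. The names (`run_with_retries`, `EvalMetrics`, the shape of the mapping returned by `case_fn`) are illustrative assumptions, not the actual identifiers in `evals_pydantic_mcp.py`:

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class EvalMetrics:
    """Granular per-case metrics collected during an evaluation run."""
    model: str
    difficulty: str
    success: bool = False
    attempts: int = 0
    duration_s: float = 0.0
    tool_calls: int = 0
    total_tokens: int = 0
    failure_reason: str | None = None


async def run_with_retries(case_fn, *, model: str, difficulty: str,
                           max_retries: int = 3, backoff_s: float = 2.0) -> EvalMetrics:
    """Run one evaluation case, retrying failures with linear backoff.

    `case_fn` is assumed to be an async callable that returns a mapping
    with usage information (tool calls, token counts).
    """
    metrics = EvalMetrics(model=model, difficulty=difficulty)
    start = time.monotonic()
    for attempt in range(1, max_retries + 1):
        metrics.attempts = attempt
        try:
            result = await case_fn(model=model, difficulty=difficulty)
            metrics.success = True
            metrics.tool_calls = result.get("tool_calls", 0)
            metrics.total_tokens = result.get("total_tokens", 0)
            break
        except Exception as exc:  # record the failure reason, then retry
            metrics.failure_reason = f"{type(exc).__name__}: {exc}"
            if attempt < max_retries:
                await asyncio.sleep(backoff_s * attempt)
    metrics.duration_s = time.monotonic() - start
    return metrics


if __name__ == "__main__":
    async def fake_case(model: str, difficulty: str) -> dict:
        return {"tool_calls": 3, "total_tokens": 850}

    print(asyncio.run(run_with_retries(fake_case, model="demo-model", difficulty="easy")))
```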

Unlocks:

  • Enables robust, repeatable benchmarking across LLM models and agent frameworks.
  • Provides a rich dashboard for model comparison by accuracy, cost, speed, and resource usage.
  • Simplifies diagnosing agent/tool failures and supports deeper analysis of model/tool interactions.
  • Facilitates future extension to other agent frameworks, evaluation types, or visualisation needs.

Detailed breakdown of changes:

  • Added new directory: agents_mcp_usage/multi_mcp/eval_multi_mcp/ with:
    • evals_pydantic_mcp.py: Core evaluation logic supporting multi-level difficulty, retry logic, rich metrics, and robust error handling.
    • run_multi_evals.py: Script to run evaluations across multiple models, supporting both parallel and sequential execution and writing results to CSV (a runner sketch follows this breakdown).
    • dashboard_config.py: Configuration for the new UI dashboard, specifying metrics, grouping, cost breakdown, and plot settings.
    • merbench_ui.py: Streamlit-based user dashboard for visualising and comparing evaluation results.
    • costs.csv: Pricing table for supported models (used for cost calculation in dashboard and reports).
    • README.md: Comprehensive documentation for the new evaluation system, covering usage, error handling, output schema, and available options.
  • Enhanced agent scripts (basic_mcp_use/*.py) with improved docstrings and argument clarity.
  • Added and refined extensive inline documentation and error-handling comments throughout the new modules.
  • All new features and files are modular, extensible, and designed to support both current and future MCP evaluation scenarios.
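
A sketch of the multi-model runner flow (fan out over models and difficulty levels, optionally in parallel; price the token usage; export CSV). Function names, the `costs.csv` column layout, and the result columns are assumptions for illustration, not the exact schema used by `run_multi_evals.py` or `costs.csv`:

```python
import asyncio
import csv


def load_costs(path: str = "costs.csv") -> dict[str, float]:
    """Read per-million-token pricing keyed by model name.

    Assumed layout: model,usd_per_1m_tokens (the real costs.csv may differ).
    """
    with open(path, newline="") as f:
        return {row["model"]: float(row["usd_per_1m_tokens"]) for row in csv.DictReader(f)}


async def evaluate(model: str, case: str) -> dict:
    """Placeholder for the real evaluation call in evals_pydantic_mcp.py."""
    await asyncio.sleep(0)  # the real version drives the agent and MCP servers
    return {"model": model, "case": case, "success": True, "total_tokens": 1200}


async def run_all(models: list[str], cases: list[str], parallel: bool = True) -> list[dict]:
    """Evaluate every (model, case) pair, either concurrently or one at a time."""
    pairs = [(m, c) for m in models for c in cases]
    if parallel:
        return list(await asyncio.gather(*(evaluate(m, c) for m, c in pairs)))
    return [await evaluate(m, c) for m, c in pairs]


def export_csv(rows: list[dict], costs: dict[str, float], out_path: str = "results.csv") -> None:
    """Attach an estimated cost column and write all rows to CSV for the dashboard."""
    for row in rows:
        rate = costs.get(row["model"], 0.0)
        row["cost_usd"] = row["total_tokens"] / 1_000_000 * rate
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    results = asyncio.run(run_all(["model-a", "model-b"], ["easy", "medium", "hard"]))
    # In the real flow the pricing would come from load_costs("costs.csv").
    export_csv(results, costs={"model-a": 0.15, "model-b": 0.60})
```

The sequential path is presumably there for rate-limited providers, where a parallel fan-out across models and cases would trip quota errors.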

@andrewginns merged commit 77725bb into main on Jun 9, 2025
1 check passed
@andrewginns deleted the add-levels-of-eval-difficulty branch on June 11, 2025 at 08:37