Commit b97b853

docs: Update with mention of evals and dashboard

1 parent 09cf03a commit b97b853

File tree: 2 files changed, +155 -10 lines changed


README.md

Lines changed: 58 additions & 6 deletions
@@ -1,15 +1,18 @@
-# Model Context Protocol (MCP) Agent Frameworks Demo
+# Model Context Protocol (MCP) Agent Frameworks Demo & Benchmarking Platform
 
 This repository demonstrates LLM Agents using tools from Model Context Protocol (MCP) servers with several frameworks:
 - Google Agent Development Kit (ADK)
 - LangGraph Agents
 - OpenAI Agents
 - Pydantic-AI Agents
 
-Both single and multiple MCP server examples are demonstrated
-- [Agent with a single MCP server](agents_mcp_usage/basic_mcp/README.md)
-- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md)
-- Also includes Agent evaluations
+## Repository Structure
+
+- [Agent with a single MCP server](agents_mcp_usage/basic_mcp/README.md) - Learning examples and basic patterns
+- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md) - Advanced usage with a comprehensive evaluation suite
+- **Evaluation Dashboard**: Interactive Streamlit UI for model comparison
+- **Multi-Model Benchmarking**: Parallel/sequential evaluation across multiple LLMs
+- **Rich Metrics**: Usage analysis, cost comparison, and performance leaderboards
 
 The repo also includes Python MCP Servers:
 - [`example_server.py`](mcp_servers/example_server.py) based on [MCP Python SDK Quickstart](https://github.com/modelcontextprotocol/python-sdk/blob/b4c7db6a50a5c88bae1db5c1f7fba44d16eebc6e/README.md?plain=1#L104) - Modified to include a datetime tool and run as a server invoked by Agents
@@ -217,10 +220,59 @@ uv run agents_mcp_usage/multi_mcp/multi_mcp_use/pydantic_mcp.py
 
 # Run the multi-MCP evaluation
 uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+
+# Run multi-model benchmarking
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro,gemini-2.0-flash" --runs 5 --parallel
+
+# Launch the evaluation dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
 ```
 
 More details on multi-MCP implementation can be found in the [multi_mcp README](agents_mcp_usage/multi_mcp/README.md).
 
+## Evaluation Suite & Benchmarking Dashboard
+
+This repository includes a comprehensive evaluation system for benchmarking LLM agent performance across multiple frameworks and models. The evaluation suite tests agents on mermaid diagram correction tasks using multiple MCP servers, providing rich metrics and analysis capabilities.
+
+### Key Evaluation Features
+
+- **Multi-Level Difficulty**: Easy, medium, and hard test cases for comprehensive assessment
+- **Multi-Model Benchmarking**: Parallel or sequential evaluation across multiple LLM models
+- **Interactive Dashboard**: Streamlit-based UI for visualising results, cost analysis, and model comparison
+- **Rich Metrics Collection**: Token usage, cost analysis, success rates, and failure categorisation
+- **Robust Error Handling**: Comprehensive retry logic and detailed failure analysis
+- **Export Capabilities**: CSV results for downstream analysis and reporting
+
+### Dashboard Features
+
+The included Streamlit dashboard (`merbench_ui.py`) provides:
+
+- **Model Leaderboards**: Performance rankings by accuracy, cost efficiency, and speed
+- **Cost Analysis**: Detailed cost breakdowns and cost-per-success metrics
+- **Failure Analysis**: Categorised failure reasons with debugging insights
+- **Performance Trends**: Visualisation of model behaviour across difficulty levels
+- **Resource Usage**: Token consumption and API call patterns
+- **Comparative Analysis**: Side-by-side model performance comparison
+
+### Quick Evaluation Commands
+
+```bash
+# Single model evaluation
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+
+# Multi-model parallel benchmarking
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+  --models "gemini-2.5-pro,gemini-2.0-flash,gemini-2.5-flash-preview-04-17" \
+  --runs 5 \
+  --parallel \
+  --output-dir ./results
+
+# Launch interactive dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+```
+
+The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research and production model selection decisions.
+
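Because the suite exports CSV results for downstream analysis, a leaderboard of the kind the dashboard shows can also be rebuilt by hand. The sketch below uses pandas; the file name and the `model`, `success`, and `total_cost` column names are assumptions for illustration, not the actual export schema.

```python
# Rough sketch: rebuild a success/cost leaderboard from an exported results CSV.
# File name and column names (model, success, total_cost) are assumed for illustration.
import pandas as pd

df = pd.read_csv("results/2025-01-01_12-00-00_combined_results.csv")

leaderboard = (
    df.groupby("model")
    .agg(
        runs=("success", "size"),
        success_rate=("success", "mean"),
        total_cost=("total_cost", "sum"),
    )
    .sort_values(["success_rate", "total_cost"], ascending=[False, True])
)
# Guard against division by zero when a model never succeeds.
leaderboard["cost_per_success"] = leaderboard["total_cost"] / (
    (leaderboard["runs"] * leaderboard["success_rate"]).clip(lower=1)
)
print(leaderboard)
```
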
 ## What is MCP?
 
 The Model Context Protocol allows applications to provide context for LLMs in a standardised way, separating the concerns of providing context from the actual LLM interaction.
@@ -258,4 +310,4 @@ A key advantage highlighted is flexibility; MCP allows developers to more easily
 - OpenTelemetry support for leveraging existing tooling
 - Pydantic integration for analytics on validations
 
-Logfire gives you visibility into how your code is running, which is especially valuable for LLM applications where understanding model behaviour is critical.
+Logfire gives you visibility into how your code is running, which is especially valuable for LLM applications where understanding model behaviour is critical.

agents_mcp_usage/multi_mcp/README.md

Lines changed: 97 additions & 4 deletions
@@ -1,8 +1,9 @@
-# Multi-MCP Usage
+# Multi-MCP Usage & Evaluation Suite
 
-This directory contains examples demonstrating the integration of tools from multiple Model Context Protocol (MCP) servers with various LLM agent frameworks.
+This directory contains examples demonstrating the integration of tools from multiple Model Context Protocol (MCP) servers with various LLM agent frameworks, along with a comprehensive evaluation and benchmarking system.
+
+Agents utilising multiple MCP servers can be dramatically more complex than an Agent using a single server: as the number of servers grows, so does the number of tools the Agent must reason about when deciding which to use and how. As a result, this component not only demonstrates an Agent's use of multiple MCP servers, but also includes a production-ready evaluation suite to validate performance, analyse costs, and compare models across multiple difficulty levels.
 
-Agents utilising multiple MCP servers can be dramatically more complex than an Agent using a single server. This is because as the number of servers grow the number of tools that the Agent must reason on when and how to use increases. As a result this component not only demonstrates an Agent's use of multiple MCP servers, but also includes evaluations to validate that they are being used to successfully accomplish the task according to various evaluation criterias.
 
 
 ## Quickstart
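For orientation, this is roughly what "an Agent using multiple MCP servers" looks like in code. It is a minimal sketch assuming pydantic-ai's MCP client interface (`MCPServerStdio`, `mcp_servers=...`); the second server path, the model name, and the prompt are illustrative rather than taken from the repository's own scripts.

```python
# Minimal sketch: one pydantic-ai Agent attached to two stdio MCP servers.
# Server paths, model name, and prompt are illustrative.
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

time_server = MCPServerStdio("uv", args=["run", "mcp_servers/example_server.py"])
diagram_server = MCPServerStdio("uv", args=["run", "mcp_servers/mermaid_server.py"])  # hypothetical path

agent = Agent(
    "google-gla:gemini-2.0-flash",
    mcp_servers=[time_server, diagram_server],
    system_prompt="Use the available MCP tools to answer.",
)


async def main() -> None:
    # Starts both servers and exposes their tools to the agent for the duration of the block.
    async with agent.run_mcp_servers():
        result = await agent.run("What is the current time?")
        print(result.output)  # older pydantic-ai releases expose this as `.data`


asyncio.run(main())
```

With two servers attached, every tool from both is offered to the model on each step, which is exactly the tool-selection pressure the evaluation suite below is designed to measure.
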
@@ -22,9 +23,15 @@ Agents utilising multiple MCP servers can be dramatically more complex than an A
 
 # Run the multi-MCP evaluation
 uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+
+# Run multi-model benchmarking
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro,gemini-2.0-flash" --runs 5 --parallel
+
+# Launch the evaluation dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
 ```
 
-5. Check the console output or Logfire for results.
+5. Check the console output, Logfire, or dashboard for results.
 
 
 ### Multi-MCP Architecture
@@ -180,6 +187,92 @@ Research in LLM agent development has identified tool overload as a significant
 
 The evaluation framework included in this component is essential for validating that agents can effectively navigate the increased complexity of multiple MCP servers. By measuring success against specific evaluation criteria, developers can ensure that the benefits of tool specialisation outweigh the potential pitfalls of tool overload.
 
+## Comprehensive Evaluation Suite
+
+The `eval_multi_mcp/` directory contains a production-ready evaluation system for benchmarking LLM agent performance across multiple frameworks and models. The suite tests agents on mermaid diagram correction tasks using multiple MCP servers, providing rich metrics and analysis capabilities.
+
+### Evaluation Components
+
+#### Core Modules
+- **`evals_pydantic_mcp.py`** - Single-model evaluation with comprehensive metrics collection
+- **`run_multi_evals.py`** - Multi-model parallel/sequential benchmarking with CSV export
+- **`merbench_ui.py`** - Interactive Streamlit dashboard for visualisation and analysis
+- **`dashboard_config.py`** - Configuration-driven UI setup for flexible dashboard customisation
+- **`costs.csv`** - Pricing integration for cost analysis and budget planning
+
+#### Test Difficulty Levels
+The evaluation includes three test cases of increasing complexity:
+1. **Easy** - Simple syntax errors in mermaid diagrams
+2. **Medium** - More complex structural issues requiring deeper reasoning
+3. **Hard** - Advanced mermaid syntax problems testing sophisticated tool usage
+
+#### Evaluation Metrics
+The system captures five key performance indicators:
+- **UsedBothMCPTools** - Validates proper coordination between multiple MCP servers
+- **UsageLimitNotExceeded** - Monitors resource consumption and efficiency
+- **MermaidDiagramValid** - Assesses technical correctness of outputs
+- **LLMJudge (Format)** - Evaluates response formatting and structure
+- **LLMJudge (Structure)** - Measures preservation of original diagram intent
+
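For a sense of what one of these indicators looks like in code, here is a hedged sketch in the spirit of `MermaidDiagramValid`, assuming the pydantic-evals `Evaluator`/`EvaluatorContext` interface; the repository's actual evaluators may differ and are likely richer.

```python
# Hedged sketch of a pass/fail indicator in the spirit of MermaidDiagramValid,
# assuming the pydantic-evals Evaluator interface; the repo's real evaluators may differ.
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class LooksLikeMermaid(Evaluator):
    """Scores 1.0 when the output begins with a recognised mermaid diagram header."""

    def evaluate(self, ctx: EvaluatorContext) -> float:
        output = str(ctx.output or "").strip()
        return 1.0 if output.startswith(("graph", "flowchart", "sequenceDiagram")) else 0.0
```

In pydantic-evals, checks of this shape sit on dataset cases alongside built-in judges such as `LLMJudge`.
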
+## Interactive Dashboard & Visualisation
+
+The Streamlit-based dashboard (`merbench_ui.py`) provides comprehensive analysis and comparison capabilities:
+
+### Dashboard Features
+- **Model Leaderboards** - Performance rankings by accuracy, cost efficiency, and execution speed
+- **Cost Analysis** - Detailed cost breakdowns with cost-per-success metrics and budget projections
+- **Failure Analysis** - Categorised failure reasons with debugging insights and error patterns
+- **Performance Trends** - Visualisation of model behaviour across difficulty levels and test iterations
+- **Resource Usage** - Token consumption patterns and API call efficiency metrics
+- **Comparative Analysis** - Side-by-side model performance comparison with statistical significance
+
+### Dashboard Quick Launch
+```bash
+# Launch the interactive evaluation dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+```
+
+The dashboard automatically loads evaluation results from the `mermaid_eval_results/` directory, providing immediate insights into model performance and cost efficiency.
+
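Conceptually, that auto-loading step amounts to globbing the results directory and concatenating whatever CSVs it finds; a simplified sketch follows (the real dashboard's loading and schema handling is more involved).

```python
# Simplified sketch of "load every results CSV in the output directory".
from pathlib import Path

import pandas as pd

results_dir = Path("mermaid_eval_results")
frames = [pd.read_csv(path) for path in sorted(results_dir.glob("*.csv"))]
all_results = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(f"Loaded {len(all_results)} rows from {len(frames)} files")
```
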
+## Multi-Model Benchmarking
+
+The `run_multi_evals.py` script enables systematic comparison across multiple LLM models with flexible execution options:
+
+### Benchmarking Features
+- **Parallel Execution** - Simultaneous evaluation across models for faster results
+- **Sequential Mode** - Conservative execution for resource-constrained environments
+- **Configurable Runs** - Multiple iterations per model for statistical reliability
+- **Comprehensive Error Handling** - Robust retry logic with exponential backoff
+- **CSV Export** - Structured results for downstream analysis and reporting
+
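The parallel/sequential switch can be pictured as one evaluation task per model run under an optional concurrency limit. The sketch below is a generic asyncio illustration of that idea, not the script's actual implementation; retry and backoff are omitted.

```python
# Generic asyncio illustration of the parallel vs sequential switch:
# one evaluation task per model, gated by a semaphore. Not the actual run_multi_evals.py code.
import asyncio


async def evaluate_model(model: str, runs: int) -> dict:
    await asyncio.sleep(0.1)  # stand-in for a full evaluation of `runs` iterations
    return {"model": model, "runs": runs, "success_rate": 1.0}


async def run_all(models: list[str], runs: int, parallel: bool) -> list[dict]:
    # Parallel mode allows all models at once; sequential mode limits concurrency to one.
    gate = asyncio.Semaphore(len(models) if parallel else 1)

    async def bounded(model: str) -> dict:
        async with gate:
            return await evaluate_model(model, runs)

    return await asyncio.gather(*(bounded(m) for m in models))


print(asyncio.run(run_all(["gemini-2.5-pro", "gemini-2.0-flash"], runs=5, parallel=True)))
```
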
+### Example Benchmarking Commands
+
+```bash
+# Parallel benchmarking across multiple models
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+  --models "gemini-2.5-pro,gemini-2.0-flash,gemini-2.5-flash-preview-04-17" \
+  --runs 5 \
+  --parallel \
+  --timeout 600 \
+  --output-dir ./benchmark_results
+
+# Sequential execution with custom judge model
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+  --models "gemini-2.5-pro,claude-3-opus" \
+  --runs 3 \
+  --sequential \
+  --judge-model "gemini-2.5-pro" \
+  --output-dir ./comparative_analysis
+```
+
+### Output Structure
+Results are organised with timestamped files:
+- **Individual model results** - `YYYY-MM-DD_HH-MM-SS_individual_{model}.csv`
+- **Combined analysis** - `YYYY-MM-DD_HH-MM-SS_combined_results.csv`
+- **Dashboard integration** - Automatic loading into visualisation interface
+
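The naming convention above maps directly onto `datetime.strftime("%Y-%m-%d_%H-%M-%S")`. A small sketch of writing per-model and combined files in that shape follows; the rows are placeholders, not real evaluation output.

```python
# Small sketch of the timestamped output naming described above; rows are placeholders.
from datetime import datetime
from pathlib import Path

import pandas as pd

stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
out_dir = Path("benchmark_results")
out_dir.mkdir(exist_ok=True)

per_model = {
    "gemini-2.5-pro": pd.DataFrame([{"case": "easy", "success": True}]),
    "gemini-2.0-flash": pd.DataFrame([{"case": "easy", "success": False}]),
}

# One CSV per model, plus a combined file that keeps the model name as a column.
for model, frame in per_model.items():
    frame.to_csv(out_dir / f"{stamp}_individual_{model}.csv", index=False)

combined = pd.concat(
    [frame.assign(model=model) for model, frame in per_model.items()],
    ignore_index=True,
)
combined.to_csv(out_dir / f"{stamp}_combined_results.csv", index=False)
```
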
+The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research investigations and production model selection decisions.
+
 ## Example Files
 
 ### Pydantic-AI Multi-MCP
