diff --git a/Makefile b/Makefile
index cab622a..477824e 100644
--- a/Makefile
+++ b/Makefile
@@ -6,7 +6,7 @@ lint:
uv run ruff check .
leaderboard:
- uv run -- streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+ uv run -- streamlit run agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py
adk_basic_ui:
uv run adk web agents_mcp_usage/basic_mcp
diff --git a/README.md b/README.md
index 8e73b20..25fe2ed 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,8 @@ This repository demonstrates LLM Agents using tools from Model Context Protocol
## Repository Structure
- [Agent with a single MCP server](agents_mcp_usage/basic_mcp/README.md) - Learning examples and basic patterns
-- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md) - Advanced usage with comprehensive evaluation suite
+- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md) - Advanced usage with MCP server coordination
+- [Evaluation suite](agents_mcp_usage/evaluations/mermaid_evals/README.md) - Comprehensive benchmarking tools
- **Evaluation Dashboard**: Interactive Streamlit UI for model comparison
- **Multi-Model Benchmarking**: Parallel/sequential evaluation across multiple LLMs
- **Rich Metrics**: Usage analysis, cost comparison, and performance leaderboards
@@ -67,8 +68,12 @@ This project aims to teach:
- **[agents_mcp_usage/multi_mcp/](agents_mcp_usage/multi_mcp/)** - Advanced multi-MCP server integration examples
- **multi_mcp_use/** - Contains examples of using multiple MCP servers simultaneously:
- `pydantic_mcp.py` - Example of using multiple MCP servers with Pydantic-AI Agent
- - **eval_multi_mcp/** - Contains evaluation examples for multi-MCP usage:
- - `evals_pydantic_mcp.py` - Example of evaluating the use of multiple MCP servers with Pydantic-AI
+
+- **[agents_mcp_usage/evaluations/](agents_mcp_usage/evaluations/)** - Evaluation modules for benchmarking
+ - **mermaid_evals/** - Comprehensive evaluation suite for mermaid diagram fixing tasks
+ - `evals_pydantic_mcp.py` - Core evaluation module for single-model testing
+ - `run_multi_evals.py` - Multi-model benchmarking with parallel execution
+ - `merbench_ui.py` - Interactive dashboard for result visualization
- **Demo Python MCP Servers**
- `mcp_servers/example_server.py` - Simple MCP server that runs locally, implemented in Python
@@ -221,13 +226,13 @@ graph LR
uv run agents_mcp_usage/multi_mcp/multi_mcp_use/pydantic_mcp.py
# Run the multi-MCP evaluation
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py
# Run multi-model benchmarking
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash" --runs 5 --parallel
+uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py --models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash" --runs 5 --parallel
# Launch the evaluation dashboard
-uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+uv run streamlit run agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py
```
More details on multi-MCP implementation can be found in the [multi_mcp README](agents_mcp_usage/multi_mcp/README.md).
@@ -260,17 +265,17 @@ The included Streamlit dashboard (`merbench_ui.py`) provides:
```bash
# Single model evaluation
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py
# Multi-model parallel benchmarking
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py \
--models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash,gemini-2.5-flash" \
--runs 5 \
--parallel \
--output-dir ./results
# Launch interactive dashboard
-uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+uv run streamlit run agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py
```
The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research and production model selection decisions.
diff --git a/agents_mcp_usage/evaluations/__init__.py b/agents_mcp_usage/evaluations/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/agents_mcp_usage/multi_mcp/eval_multi_mcp/README.md b/agents_mcp_usage/evaluations/mermaid_evals/README.md
similarity index 95%
rename from agents_mcp_usage/multi_mcp/eval_multi_mcp/README.md
rename to agents_mcp_usage/evaluations/mermaid_evals/README.md
index d105ddb..5808c5c 100644
--- a/agents_mcp_usage/multi_mcp/eval_multi_mcp/README.md
+++ b/agents_mcp_usage/evaluations/mermaid_evals/README.md
@@ -1,4 +1,4 @@
-# Multi-MCP Mermaid Diagram Evaluation System
+# Mermaid Diagram Evaluation System
This directory contains evaluation modules for testing LLM agents on mermaid diagram fixing tasks using multiple MCP (Model Context Protocol) servers. The system evaluates how well language models can fix invalid mermaid diagrams while utilizing multiple external tools.
@@ -21,7 +21,7 @@ The system tests LLM agents on their ability to:
The evaluation includes three test cases of increasing difficulty:
1. **Easy** - Simple syntax errors in mermaid diagrams
-2. **Medium** - More complex structural issues
+2. **Medium** - More complex structural issues
3. **Hard** - Advanced mermaid syntax problems
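+
+The test fixtures live in `mermaid_diagrams.py` as module-level strings and can be imported for ad-hoc inspection; a minimal sketch:
+
+```python
+from agents_mcp_usage.evaluations.mermaid_evals.mermaid_diagrams import (
+    invalid_mermaid_diagram_easy,
+    invalid_mermaid_diagram_medium,
+    invalid_mermaid_diagram_hard,
+    valid_mermaid_diagram,
+)
+
+# Each fixture wraps a fenced mermaid block in a plain string; the easy case
+# contains two deliberate syntax errors and the medium case seven.
+print(invalid_mermaid_diagram_easy)
+```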
## Output Schema
@@ -164,18 +164,18 @@ Results are exported to CSV files with the following columns:
```bash
# Run evaluation with default model
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py
# Customize model and judge
AGENT_MODEL="gemini-2.5-pro-preview-06-05" JUDGE_MODEL="gemini-2.0-flash" \
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py
```
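+
+The single-model evaluation can also be driven from Python rather than the CLI; a minimal sketch using the `run_evaluations(model, judge_model, export_csv, output_dir)` entry point from `evals_pydantic_mcp.py`:
+
+```python
+import asyncio
+
+from agents_mcp_usage.evaluations.mermaid_evals.evals_pydantic_mcp import run_evaluations
+
+
+async def main() -> None:
+    # Evaluate one agent model, using a second model as the LLM judge, and
+    # export a timestamped CSV to the default results directory.
+    report = await run_evaluations(
+        model="gemini-2.5-pro-preview-06-05",
+        judge_model="gemini-2.0-flash",
+        export_csv=True,
+        output_dir="./mermaid_eval_results",
+    )
+    print(report)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```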
### Multi-Model Evaluation
```bash
# Run evaluation across multiple models
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py \
--models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash" \
--runs 5 \
--parallel \
@@ -183,7 +183,7 @@ uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
--output-dir ./results
# Sequential execution with custom judge
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py \
--models "gemini-2.5-pro-preview-06-05,claude-3-opus" \
--runs 3 \
--sequential \
@@ -247,4 +247,4 @@ The system implements robust error handling:
- **pydantic-evals** - Evaluation framework and metrics
- **logfire** - Logging and monitoring
- **rich** - Console output and progress bars
-- **asyncio** - Asynchronous evaluation execution
\ No newline at end of file
+- **asyncio** - Asynchronous evaluation execution
\ No newline at end of file
diff --git a/agents_mcp_usage/evaluations/mermaid_evals/__init__.py b/agents_mcp_usage/evaluations/mermaid_evals/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/agents_mcp_usage/multi_mcp/eval_multi_mcp/costs.json b/agents_mcp_usage/evaluations/mermaid_evals/costs.json
similarity index 100%
rename from agents_mcp_usage/multi_mcp/eval_multi_mcp/costs.json
rename to agents_mcp_usage/evaluations/mermaid_evals/costs.json
diff --git a/agents_mcp_usage/multi_mcp/eval_multi_mcp/dashboard_config.py b/agents_mcp_usage/evaluations/mermaid_evals/dashboard_config.py
similarity index 99%
rename from agents_mcp_usage/multi_mcp/eval_multi_mcp/dashboard_config.py
rename to agents_mcp_usage/evaluations/mermaid_evals/dashboard_config.py
index 24cf37d..39e10d1 100644
--- a/agents_mcp_usage/multi_mcp/eval_multi_mcp/dashboard_config.py
+++ b/agents_mcp_usage/evaluations/mermaid_evals/dashboard_config.py
@@ -126,4 +126,4 @@
# The default configuration to use when the dashboard starts.
# You can change this to point to a different configuration.
-DEFAULT_CONFIG = MERBENCH_CONFIG
+DEFAULT_CONFIG = MERBENCH_CONFIG
\ No newline at end of file
diff --git a/agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py b/agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py
similarity index 99%
rename from agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
rename to agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py
index 68b19b2..f596298 100644
--- a/agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+++ b/agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py
@@ -31,7 +31,7 @@
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, LLMJudge
from pydantic_evals.reporting import EvaluationReport
-from agents_mcp_usage.multi_mcp.mermaid_diagrams import (
+from agents_mcp_usage.evaluations.mermaid_evals.mermaid_diagrams import (
invalid_mermaid_diagram_easy,
invalid_mermaid_diagram_medium,
invalid_mermaid_diagram_hard,
@@ -646,7 +646,7 @@ def get_timestamp_prefix() -> str:
def write_mermaid_results_to_csv(
- report: EvaluationReport, model: str, output_dir: str = "./mermaid_results"
+ report: EvaluationReport, model: str, output_dir: str = "./mermaid_eval_results"
) -> str:
"""Writes mermaid evaluation results with metrics to a CSV file.
@@ -750,7 +750,7 @@ async def run_evaluations(
model: str = DEFAULT_MODEL,
judge_model: str = DEFAULT_MODEL,
export_csv: bool = True,
- output_dir: str = "./mermaid_results",
+ output_dir: str = "./mermaid_eval_results",
) -> EvaluationReport:
"""Runs the evaluations on the mermaid diagram fixing task.
@@ -804,4 +804,4 @@ async def run_all():
model=agent_model, judge_model=judge_model, export_csv=True
)
- asyncio.run(run_all())
+ asyncio.run(run_all())
\ No newline at end of file
diff --git a/agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py b/agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py
similarity index 99%
rename from agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
rename to agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py
index 20e2864..bfbf855 100644
--- a/agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+++ b/agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py
@@ -8,10 +8,10 @@
import re
from pydantic import ValidationError
-from agents_mcp_usage.multi_mcp.eval_multi_mcp.dashboard_config import (
+from agents_mcp_usage.evaluations.mermaid_evals.dashboard_config import (
DEFAULT_CONFIG,
)
-from agents_mcp_usage.multi_mcp.eval_multi_mcp.schemas import DashboardConfig
+from agents_mcp_usage.evaluations.mermaid_evals.schemas import DashboardConfig
# Load and validate the configuration
try:
@@ -841,7 +841,7 @@ def main() -> None:
# Cost configuration in sidebar
st.sidebar.subheader("💰 Cost Configuration")
- cost_file_path = os.path.join(os.path.dirname(__file__), "costs.csv")
+ cost_file_path = os.path.join(os.path.dirname(__file__), "costs.json")
model_costs, friendly_names = load_model_costs(cost_file_path)
available_models = sorted(df_initial["Model"].unique())
@@ -1033,4 +1033,4 @@ def main() -> None:
if __name__ == "__main__":
- main()
+ main()
\ No newline at end of file
diff --git a/agents_mcp_usage/evaluations/mermaid_evals/mermaid_diagrams.py b/agents_mcp_usage/evaluations/mermaid_evals/mermaid_diagrams.py
new file mode 100644
index 0000000..041865f
--- /dev/null
+++ b/agents_mcp_usage/evaluations/mermaid_evals/mermaid_diagrams.py
@@ -0,0 +1,265 @@
+invalid_mermaid_diagram_hard = """
+```mermaid
+graph LR
+ User((User)) --> |"Run script
+ (e.g., pydantic_mcp.py)"| Agent
+
+ # Agent Frameworks
+ subgraph "Agent"
+ direction TD
+ Agent[Agent]
+ ADK["Google ADK
+ (adk_mcp.py)"]
+ LG["LangGraph
+ (langgraph_mcp.py)"]
+ OAI["OpenAI Agents
+ (oai-agent_mcp.py)"]
+ PYD["Pydantic-AI
+ (pydantic_mcp.py)"]
+
+ Agent --> ADK
+ Agent --> LG
+ Agent --> OAI
+ Agent --> PYD
+ end
+
+ # MCP Server
+ subgraph "MCP"
+ direction TD
+ MCP["Model Context Protocol Server
+ (mcp_servers/example_server.py)"]
+ Tools["Tools
- add(a, b)
+ - get_current_time() {current_time}"]
+ Resources["Resources
+ - greeting://{{name}}"]
+ MCP --- Tools
+ MCP --- Resources
+ end
+
+ # LLM Providers
+ subgraph "LLM Providers"
+ direction TD
+ OAI_LLM["OpenAI Models"]
+ GEM["Google Gemini Models"]
+ OTHER["Other LLM Providers..."]
+ end
+
+ Logfire[("Logfire
Tracing")]
+
+ ADK --> MCP
+ LG -- > MCP
+ OAI --> MCP
+ PYD --> MCP
+
+ MCP --> OAI_LLM
+ MCP --> GEM
+ MCP --> OTHER
+
+ ADK --> Logfire
+ LG -- > Logfire
+ OAI --> Logfire
+ PYD --> Logfire
+
+ LLM_Response[("Response")] --> User
+ OAI_LLM --> LLM_Response
+ GEM --> LLM_Response
+ OTHER --> LLM_Response
+
+ style MCP fill:#f9f,stroke:#333,stroke-width:2px
+ style User fill:#bbf,stroke:#338,stroke-width:2px
+ style Logfire fill:#bfb,stroke:#383,stroke-width:2px
+ style LLM_Response fill:#fbb,stroke:#833,stroke-width:2px
+```
+"""
+
+# 7 syntax errors
+invalid_mermaid_diagram_medium = """
+```mermaid
+graph LR
+ User((User)) --> |"Run script
+ (e.g., pydantic_mcp.py)"| Agent
+
+ # Agent Frameworks
+ subgraph "Agent"
+ direction TB
+ Agent[Agent]
+ ADK["Google ADK
+ (adk_mcp.py)"]
+ LG["LangGraph
+ (langgraph_mcp.py)"]
+ OAI["OpenAI Agents
+ (oai-agent_mcp.py)"]
+ PYD["Pydantic-AI
+ (pydantic_mcp.py)"]
+
+ Agent --> ADK
+ Agent --> LG
+ Agent --> OAI
+ Agent --> PYD
+ end
+
+ # MCP Server
+ subgraph "MCP"
+ direction TB
+ MCP["Model Context Protocol Server
+ (mcp_servers/example_server.py)"]
+ Tools["Tools
- add(a, b)
+ - get_current_time() {current_time}"]
+ Resources["Resources
+ - greeting://{{name}}"]
+ MCP --- Tools
+ MCP --- Resources
+ end
+
+ # LLM Providers
+ subgraph "LLM Providers"
+ direction TB
+ OAI_LLM["OpenAI Models"]
+ GEM["Google Gemini Models"]
+ OTHER["Other LLM Providers..."]
+ end
+
+ Logfire[("Logfire
Tracing")]
+
+ ADK --> MCP
+ LG -- > MCP
+ OAI --> MCP
+ PYD --> MCP
+
+ MCP --> OAI_LLM
+ MCP --> GEM
+ MCP --> OTHER
+
+ ADK --> Logfire
+ LG -- > Logfire
+ OAI --> Logfire
+ PYD --> Logfire
+
+ LLM_Response[("Response")] --> User
+ OAI_LLM --> LLM_Response
+ GEM --> LLM_Response
+ OTHER --> LLM_Response
+
+ style MCP fill:#f9f,stroke:#333,stroke-width:2px
+ style User fill:#bbf,stroke:#338,stroke-width:2px
+ style Logfire fill:#bfb,stroke:#383,stroke-width:2px
+ style LLM_Response fill:#fbb,stroke:#833,stroke-width:2px
+```
+"""
+
+# 2 syntax errors
+invalid_mermaid_diagram_easy = """
+```mermaid
+graph LR
+ User((User)) --> |"Run script
+ (e.g., pydantic_mcp.py)"| Agent
+
+ %% Agent Frameworks
+ subgraph "Agent Frameworks"
+ direction TB
+ Agent[Agent]
+ ADK["Google ADK
+ (adk_mcp.py)"]
+ LG["LangGraph
+ (langgraph_mcp.py)"]
+ OAI["OpenAI Agents
+ (oai-agent_mcp.py)"]
+ PYD["Pydantic-AI
+ (pydantic_mcp.py)"]
+
+ Agent --> ADK
+ Agent --> LG
+ Agent --> OAI
+ Agent --> PYD
+ end
+
+ %% MCP Server
+ subgraph "MCP"
+ direction TB
+ MCP["Model Context Protocol Server
+ (mcp_servers/example_server.py)"]
+ Tools["Tools
- add(a, b)
+ - get_current_time() {current_time}"]
+ Resources["Resources
+ - greeting://{{name}}"]
+ MCP --- Tools
+ MCP --- Resources
+ end
+
+ %% LLM Providers
+ subgraph "LLM Providers"
+ direction TB
+ OAI_LLM["OpenAI Models"]
+ GEM["Google Gemini Models"]
+ OTHER["Other LLM Providers..."]
+ end
+
+ Logfire[("Logfire
Tracing")]
+
+ ADK --> MCP
+ LG --> MCP
+ OAI --> MCP
+ PYD --> MCP
+
+ MCP --> OAI_LLM
+ MCP --> GEMINI
+ MCP --> OTHER
+
+ ADK --> Logfire
+ LG --> Logfire
+ OAI --> Logfire
+ PYD --> Logfire
+
+ LLM_Response[("Response")] --> User
+ OAI_LLM --> LLM_Response
+ GEM --> LLM_Response
+ OTHER --> LLM_Response
+
+ style MCP fill:#f9f,stroke:#333,stroke-width:2px
+ style User fill:#bbf,stroke:#338,stroke-width:2px
+ style Logfire fill:#bfb,stroke:#383,stroke-width:2px
+ style LLM_Response fill:#fbb,stroke:#833,stroke-width:2px
+```
+"""
+
+valid_mermaid_diagram = """
+```mermaid
+graph LR
+ User((User)) --> |"Run script
+ (e.g., pydantic_mcp.py)"| Agent
+
+ %% Agent Frameworks
+ subgraph "Agent Frameworks"
+ direction TB
+ Agent[Agent]
+ ADK["Google ADK
+ (adk_mcp.py)"]
+ LG["LangGraph
+ (langgraph_mcp.py)"]
+ OAI["OpenAI Agents
+ (oai-agent_mcp.py)"]
+ PYD["Pydantic-AI
+ (pydantic_mcp.py)"]
+
+ Agent --> ADK
+ Agent --> LG
+ Agent --> OAI
+ Agent --> PYD
+ end
+
+ %% MCP Server
+ subgraph "MCP Server"
+ direction TB
+ MCP["Model Context Protocol Server
+ (mcp_servers/example_server.py)"]
+ Tools["Tools
- add(a, b)
+ - get_current_time() {current_time}"]
+ Resources["Resources
+ - greeting://{{name}}"]
+ MCP --- Tools
+ MCP --- Resources
+ end
+
+ %% LLM Providers
+ subgraph "LLM Providers"
+ direction TB
+ OAI_LLM["OpenAI Models"]
+ GEM["Google Gemini Models"]
+ OTHER["Other LLM Providers..."]
+ end
+
+ Logfire[("Logfire
Tracing")]
+
+ ADK --> MCP
+ LG --> MCP
+ OAI --> MCP
+ PYD --> MCP
+
+ MCP --> OAI_LLM
+ MCP --> GEM
+ MCP --> OTHER
+
+ ADK --> Logfire
+ LG --> Logfire
+ OAI --> Logfire
+ PYD --> Logfire
+
+ LLM_Response[("Response")] --> User
+ OAI_LLM --> LLM_Response
+ GEM --> LLM_Response
+ OTHER --> LLM_Response
+
+ style MCP fill:#f9f,stroke:#333,stroke-width:2px
+ style User fill:#bbf,stroke:#338,stroke-width:2px
+ style Logfire fill:#bfb,stroke:#383,stroke-width:2px
+ style LLM_Response fill:#fbb,stroke:#833,stroke-width:2px
+```
+"""
\ No newline at end of file
diff --git a/agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py b/agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py
similarity index 97%
rename from agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py
rename to agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py
index 29516b7..cdb8b01 100644
--- a/agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py
+++ b/agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py
@@ -33,7 +33,7 @@
from rich.table import Table
# Import shared functionality from the improved evals module
-from agents_mcp_usage.multi_mcp.eval_multi_mcp.evals_pydantic_mcp import (
+from agents_mcp_usage.evaluations.mermaid_evals.evals_pydantic_mcp import (
MermaidInput,
MermaidOutput,
fix_mermaid_diagram,
@@ -44,18 +44,20 @@
load_dotenv()
DEFAULT_MODELS = [
- # "gemini-2.5-pro-preview-06-05",
- # "gemini-2.5-pro-preview-05-06",
- # "gemini-2.5-pro-preview-03-25",
- "gemini-2.0-flash",
- "gemini-2.5-flash-preview-04-17",
+ "gemini-2.5-pro-preview-06-05",
+ "gemini-2.5-pro-preview-05-06",
+ "gemini-2.5-pro-preview-03-25",
+ "gemini-2.5-pro",
+ # "gemini-2.5-flash",
+ # "gemini-2.5-flash-preview-04-17",
# "openai:o4-mini",
# "openai:gpt-4.1",
# "openai:gpt-4.1-mini",
# "openai:gpt-4.1-nano",
# "bedrock:us.anthropic.claude-sonnet-4-20250514-v1:0",
# "bedrock:us.anthropic.claude-opus-4-20250514-v1:0",
- # "bedrock:us.anthropic.claude-3-7-sonnet-20250219-v1:0",
+ # "gemini-2.5-flash-lite-preview-06-17"
+ # "bedrock:us.anthropic.claude-3-7-sonnet-20240219-v1:0",
# "bedrock:us.anthropic.claude-3-5-sonnet-20240620-v1:0",
# "bedrock:us.anthropic.claude-3-5-haiku-20241022-v1:0",
]
@@ -506,7 +508,7 @@ async def main() -> None:
parser.add_argument(
"--runs",
type=int,
- default=3,
+ default=5,
help="Number of evaluation runs per model",
)
parser.add_argument(
@@ -557,4 +559,4 @@ async def main() -> None:
if __name__ == "__main__":
- asyncio.run(main())
+ asyncio.run(main())
\ No newline at end of file
diff --git a/agents_mcp_usage/multi_mcp/eval_multi_mcp/schemas.py b/agents_mcp_usage/evaluations/mermaid_evals/schemas.py
similarity index 98%
rename from agents_mcp_usage/multi_mcp/eval_multi_mcp/schemas.py
rename to agents_mcp_usage/evaluations/mermaid_evals/schemas.py
index cbc2f1c..5c194f5 100644
--- a/agents_mcp_usage/multi_mcp/eval_multi_mcp/schemas.py
+++ b/agents_mcp_usage/evaluations/mermaid_evals/schemas.py
@@ -97,4 +97,4 @@ class DashboardConfig(BaseModel):
primary_metric: PrimaryMetricConfig
grouping: GroupingConfig
plots: PlotConfig
- cost_calculation: CostCalculationConfig
+ cost_calculation: CostCalculationConfig
\ No newline at end of file
diff --git a/agents_mcp_usage/multi_mcp/README.md b/agents_mcp_usage/multi_mcp/README.md
index df9deee..f12ad09 100644
--- a/agents_mcp_usage/multi_mcp/README.md
+++ b/agents_mcp_usage/multi_mcp/README.md
@@ -1,8 +1,8 @@
-# Multi-MCP Usage & Evaluation Suite
+# Multi-MCP Usage
-This directory contains examples demonstrating the integration of tools from multiple Model Context Protocol (MCP) servers with various LLM agent frameworks, along with a comprehensive evaluation and benchmarking system.
+This directory contains examples demonstrating the integration of tools from multiple Model Context Protocol (MCP) servers with various LLM agent frameworks.
-Agents utilising multiple MCP servers can be dramatically more complex than an Agent using a single server. This is because as the number of servers grow the number of tools that the Agent must reason on when and how to use increases. As a result this component not only demonstrates an Agent's use of multiple MCP servers, but also includes a production-ready evaluation suite to validate performance, analyse costs, and compare models across multiple difficulty levels.
+Agents utilising multiple MCP servers can be dramatically more complex than an Agent using a single server: as the number of servers grows, so does the number of tools the Agent must reason about, deciding when and how to use each one. For evaluating and benchmarking these agents, please see the [evaluation suite](../evaluations/mermaid_evals/README.md).
@@ -24,14 +24,14 @@ Agents utilising multiple MCP servers can be dramatically more complex than an A
## Launch ADK web UI for visual interaction
make adk_multi_ui
- # Run the multi-MCP evaluation
- uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+ # Run the evaluation suite
+ uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py
# Run multi-model benchmarking
- uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash" --runs 5 --parallel
+ uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py --models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash" --runs 5 --parallel
# Launch the evaluation dashboard
- uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+ uv run streamlit run agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py
```
5. Check the console output, Logfire, or dashboard for results.
@@ -176,105 +176,9 @@ sequenceDiagram
The sequence diagram shows how the agent coordinates between multiple specialised MCP servers. It highlights the parallel connection establishment, selective tool usage based on need, and proper connection management.
-## Agent Evaluations
+## Evaluations
-Research in LLM agent development has identified tool overload as a significant challenge for agent performance. When faced with too many tools, agents often struggle with:
-
-1. **Tool Selection Complexity**: Determining which tool from which server is most appropriate for a given subtask becomes exponentially more difficult as the number of available tools increases.
-
-2. **Context Management**: Maintaining awareness of which server provides which capabilities and how to properly format requests for each server adds cognitive load to the agent.
-
-3. **Error Recovery**: When tool usage fails, diagnosing whether the issue stems from incorrect tool selection, improper input formatting, or server-specific limitations becomes more challenging with multiple servers.
-
-4. **Reasoning Overhead**: The agent must dedicate more of its context window and reasoning capacity to tool management rather than task completion.
-
-The evaluation framework included in this component is essential for validating that agents can effectively navigate the increased complexity of multiple MCP servers. By measuring success against specific evaluation criteria, developers can ensure that the benefits of tool specialisation outweigh the potential pitfalls of tool overload.
-
-## Comprehensive Evaluation Suite
-
-The `eval_multi_mcp/` directory contains a production-ready evaluation system for benchmarking LLM agent performance across multiple frameworks and models. The suite tests agents on mermaid diagram correction tasks using multiple MCP servers, providing rich metrics and analysis capabilities.
-
-### Evaluation Components
-
-#### Core Modules
-- **`evals_pydantic_mcp.py`** - Single-model evaluation with comprehensive metrics collection
-- **`run_multi_evals.py`** - Multi-model parallel/sequential benchmarking with CSV export
-- **`merbench_ui.py`** - Interactive Streamlit dashboard for visualisation and analysis
-- **`dashboard_config.py`** - Configuration-driven UI setup for flexible dashboard customisation
-- **`costs.csv`** - Pricing integration for cost analysis and budget planning
-
-#### Test Difficulty Levels
-The evaluation includes three test cases of increasing complexity:
-1. **Easy** - Simple syntax errors in mermaid diagrams
-2. **Medium** - More complex structural issues requiring deeper reasoning
-3. **Hard** - Advanced mermaid syntax problems testing sophisticated tool usage
-
-#### Evaluation Metrics
-The system captures five key performance indicators:
-- **UsedBothMCPTools** - Validates proper coordination between multiple MCP servers
-- **UsageLimitNotExceeded** - Monitors resource consumption and efficiency
-- **MermaidDiagramValid** - Assesses technical correctness of outputs
-- **LLMJudge (Format)** - Evaluates response formatting and structure
-- **LLMJudge (Structure)** - Measures preservation of original diagram intent
-
-## Interactive Dashboard & Visualisation
-
-The Streamlit-based dashboard (`merbench_ui.py`) provides comprehensive analysis and comparison capabilities:
-
-### Dashboard Features
-- **Model Leaderboards** - Performance rankings by accuracy, cost efficiency, and execution speed
-- **Cost Analysis** - Detailed cost breakdowns with cost-per-success metrics and budget projections
-- **Failure Analysis** - Categorised failure reasons with debugging insights and error patterns
-- **Performance Trends** - Visualisation of model behaviour across difficulty levels and test iterations
-- **Resource Usage** - Token consumption patterns and API call efficiency metrics
-- **Comparative Analysis** - Side-by-side model performance comparison with statistical significance
-
-### Dashboard Quick Launch
-```bash
-# Launch the interactive evaluation dashboard
-uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
-```
-
-The dashboard automatically loads evaluation results from the `mermaid_eval_results/` directory, providing immediate insights into model performance and cost efficiency.
-
-## Multi-Model Benchmarking
-
-The `run_multi_evals.py` script enables systematic comparison across multiple LLM models with flexible execution options:
-
-### Benchmarking Features
-- **Parallel Execution** - Simultaneous evaluation across models for faster results
-- **Sequential Mode** - Conservative execution for resource-constrained environments
-- **Configurable Runs** - Multiple iterations per model for statistical reliability
-- **Comprehensive Error Handling** - Robust retry logic with exponential backoff
-- **CSV Export** - Structured results for downstream analysis and reporting
-
-### Example Benchmarking Commands
-
-```bash
-# Parallel benchmarking across multiple models
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
- --models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash,gemini-2.5-flash" \
- --runs 5 \
- --parallel \
- --timeout 600 \
- --output-dir ./benchmark_results
-
-# Sequential execution with custom judge model
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
- --models "gemini-2.5-pro-preview-06-05,claude-3-opus" \
- --runs 3 \
- --sequential \
- --judge-model "gemini-2.5-pro-preview-06-05" \
- --output-dir ./comparative_analysis
-```
-
-### Output Structure
-Results are organised with timestamped files:
-- **Individual model results** - `YYYY-MM-DD_HH-MM-SS_individual_{model}.csv`
-- **Combined analysis** - `YYYY-MM-DD_HH-MM-SS_combined_results.csv`
-- **Dashboard integration** - Automatic loading into visualisation interface
-
-The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research investigations and production model selection decisions.
+The evaluation suite, located at [agents_mcp_usage/evaluations/mermaid_evals](../evaluations/mermaid_evals/README.md), provides comprehensive benchmarking for LLM agents that use multiple MCP servers.
## Example Files