
Commit 77725bb

Merge pull request #5 from andrewginns/add-levels-of-eval-difficulty
2 parents: 6a48db8 + b97b853

19 files changed: 3863 additions, 772 deletions


README.md

Lines changed: 58 additions & 6 deletions
@@ -1,15 +1,18 @@
-# Model Context Protocol (MCP) Agent Frameworks Demo
+# Model Context Protocol (MCP) Agent Frameworks Demo & Benchmarking Platform
 
 This repository demonstrates LLM Agents using tools from Model Context Protocol (MCP) servers with several frameworks:
 - Google Agent Development Kit (ADK)
 - LangGraph Agents
 - OpenAI Agents
 - Pydantic-AI Agents
 
-Both single and multiple MCP server examples are demonstrated
-- [Agent with a single MCP server](agents_mcp_usage/basic_mcp/README.md)
-- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md)
-  - Also includes Agent evaluations
+## Repository Structure
+
+- [Agent with a single MCP server](agents_mcp_usage/basic_mcp/README.md) - Learning examples and basic patterns
+- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md) - Advanced usage with comprehensive evaluation suite
+  - **Evaluation Dashboard**: Interactive Streamlit UI for model comparison
+  - **Multi-Model Benchmarking**: Parallel/sequential evaluation across multiple LLMs
+  - **Rich Metrics**: Usage analysis, cost comparison, and performance leaderboards
 
 The repo also includes Python MCP Servers:
 - [`example_server.py`](mcp_servers/example_server.py) based on [MCP Python SDK Quickstart](https://github.com/modelcontextprotocol/python-sdk/blob/b4c7db6a50a5c88bae1db5c1f7fba44d16eebc6e/README.md?plain=1#L104) - Modified to include a datetime tool and run as a server invoked by Agents
@@ -217,10 +220,59 @@ uv run agents_mcp_usage/multi_mcp/multi_mcp_use/pydantic_mcp.py
 
 # Run the multi-MCP evaluation
 uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+
+# Run multi-model benchmarking
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro,gemini-2.0-flash" --runs 5 --parallel
+
+# Launch the evaluation dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
 ```
 
 More details on multi-MCP implementation can be found in the [multi_mcp README](agents_mcp_usage/multi_mcp/README.md).
 
+## Evaluation Suite & Benchmarking Dashboard
+
+This repository includes a comprehensive evaluation system for benchmarking LLM agent performance across multiple frameworks and models. The evaluation suite tests agents on mermaid diagram correction tasks using multiple MCP servers, providing rich metrics and analysis capabilities.
+
+### Key Evaluation Features
+
+- **Multi-Level Difficulty**: Easy, medium, and hard test cases for comprehensive assessment
+- **Multi-Model Benchmarking**: Parallel or sequential evaluation across multiple LLM models
+- **Interactive Dashboard**: Streamlit-based UI for visualising results, cost analysis, and model comparison
+- **Rich Metrics Collection**: Token usage, cost analysis, success rates, and failure categorisation
+- **Robust Error Handling**: Comprehensive retry logic and detailed failure analysis
+- **Export Capabilities**: CSV results for downstream analysis and reporting
+
+### Dashboard Features
+
+The included Streamlit dashboard (`merbench_ui.py`) provides:
+
+- **Model Leaderboards**: Performance rankings by accuracy, cost efficiency, and speed
+- **Cost Analysis**: Detailed cost breakdowns and cost-per-success metrics
+- **Failure Analysis**: Categorised failure reasons with debugging insights
+- **Performance Trends**: Visualisation of model behaviour across difficulty levels
+- **Resource Usage**: Token consumption and API call patterns
+- **Comparative Analysis**: Side-by-side model performance comparison
+
+### Quick Evaluation Commands
+
+```bash
+# Single model evaluation
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+
+# Multi-model parallel benchmarking
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+  --models "gemini-2.5-pro,gemini-2.0-flash,gemini-2.5-flash-preview-04-17" \
+  --runs 5 \
+  --parallel \
+  --output-dir ./results
+
+# Launch interactive dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+```
+
+The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research and production model selection decisions.
+
 ## What is MCP?
 
 The Model Context Protocol allows applications to provide context for LLMs in a standardised way, separating the concerns of providing context from the actual LLM interaction.
@@ -258,4 +310,4 @@ A key advantage highlighted is flexibility; MCP allows developers to more easily
 - OpenTelemetry support for leveraging existing tooling
 - Pydantic integration for analytics on validations
 
-Logfire gives you visibility into how your code is running, which is especially valuable for LLM applications where understanding model behaviour is critical.
+Logfire gives you visibility into how your code is running, which is especially valuable for LLM applications where understanding model behaviour is critical.
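
The README above describes `example_server.py` as the MCP Python SDK quickstart server, modified to add a datetime tool and launched over stdio by the agents. For orientation, a minimal sketch of a server in that shape is shown below; the tool names and bodies are illustrative assumptions, not the repo's actual code.

```python
# Hypothetical sketch of a quickstart-style MCP server with an added datetime tool.
# Tool names and behaviour are assumptions for illustration, not the repo's code.
from datetime import datetime

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Example Server")


@mcp.tool()
def greet(name: str) -> str:
    """Return a short greeting for the given name."""
    return f"Hello, {name}!"


@mcp.tool()
def get_current_time() -> str:
    """Return the current local date and time as an ISO 8601 string."""
    return datetime.now().isoformat()


if __name__ == "__main__":
    # The agents in this repo launch the server with `uv run mcp_servers/example_server.py stdio`
    mcp.run(transport="stdio")
```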

agents_mcp_usage/basic_mcp/basic_mcp_use/adk_mcp.py

Lines changed: 6 additions & 3 deletions
@@ -20,11 +20,14 @@
 
 
 async def main(query: str = "Greet Andrew and give him the current time") -> None:
-    """
-    Main function to run the agent
+    """Runs the agent with a given query.
+
+    This function sets up the MCP server, creates an LLM agent, and runs it
+    with a specified query. It also handles the cleanup of the MCP server
+    connection.
 
     Args:
-        query (str): The query to run the agent with
+        query: The query to run the agent with.
     """
     # Set up MCP server connection
     server_params = StdioServerParameters(

agents_mcp_usage/basic_mcp/basic_mcp_use/langgraph_mcp.py

Lines changed: 6 additions & 4 deletions
@@ -21,7 +21,7 @@
 # Create server parameters for stdio connection
 server = StdioServerParameters(
     command="uv",
-    args=["run", "mcp_servers/example_server.py", "stdio"],
+    args=["run", "mcp_servers/example_server.py", "stdio"],
 )
 
 model = ChatGoogleGenerativeAI(
@@ -30,11 +30,13 @@
 
 
 async def main(query: str = "Greet Andrew and give him the current time") -> None:
-    """
-    Main function to run the agent
+    """Runs the LangGraph agent with a given query.
+
+    This function connects to the MCP server, loads the tools, creates a
+    LangGraph agent, and invokes it with the provided query.
 
     Args:
-        query (str): The query to run the agent with
+        query: The query to run the agent with.
     """
     async with stdio_client(server) as (read, write):
         async with ClientSession(read, write) as session:
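
The updated docstring summarises the flow: connect to the MCP server, load its tools, build a LangGraph agent, and invoke it with the query. A self-contained sketch of that pattern is given below, assuming the `langchain-mcp-adapters` helper and LangGraph's prebuilt ReAct agent; the model name and message format are illustrative assumptions.

```python
# Hedged sketch of the connect -> load tools -> create agent -> invoke flow described above.
import asyncio

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_mcp_adapters.tools import load_mcp_tools
from langgraph.prebuilt import create_react_agent
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(
    command="uv",
    args=["run", "mcp_servers/example_server.py", "stdio"],
)
model = ChatGoogleGenerativeAI(model="gemini-2.0-flash")  # model name is an assumption


async def main(query: str = "Greet Andrew and give him the current time") -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()              # MCP handshake
            tools = await load_mcp_tools(session)   # expose MCP tools as LangChain tools
            agent = create_react_agent(model, tools)
            result = await agent.ainvoke({"messages": [("user", query)]})
            print(result["messages"][-1].content)


if __name__ == "__main__":
    asyncio.run(main())
```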

agents_mcp_usage/basic_mcp/basic_mcp_use/oai-agent_mcp.py

Lines changed: 5 additions & 3 deletions
@@ -14,11 +14,13 @@
 
 
 async def main(query: str = "Greet Andrew and give him the current time") -> None:
-    """
-    Main function to run the agent
+    """Runs the OpenAI agent with a given query.
+
+    This function creates an MCP server, initializes an OpenAI agent with the
+    server, and runs the agent with the provided query.
 
     Args:
-        query (str): The query to run the agent with
+        query: The query to run the agent with.
     """
     # Create and use the MCP server in an async context
     async with MCPServerStdio(
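
The hunk ends just as the `MCPServerStdio` context manager opens; a hedged sketch of how such a block typically continues with the OpenAI Agents SDK is shown below (agent name, instructions, and server parameters are assumptions).

```python
# Hedged sketch of an OpenAI Agents SDK agent using an MCP server over stdio.
# Names, instructions, and parameters are illustrative assumptions.
import asyncio

from agents import Agent, Runner
from agents.mcp import MCPServerStdio


async def main(query: str = "Greet Andrew and give him the current time") -> None:
    # Create and use the MCP server in an async context
    async with MCPServerStdio(
        params={"command": "uv", "args": ["run", "mcp_servers/example_server.py", "stdio"]},
    ) as server:
        agent = Agent(
            name="Assistant",
            instructions="Use the MCP tools to answer the user's request.",
            mcp_servers=[server],
        )
        result = await Runner.run(agent, query)
        print(result.final_output)


if __name__ == "__main__":
    asyncio.run(main())
```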

agents_mcp_usage/basic_mcp/basic_mcp_use/pydantic_mcp.py

Lines changed: 5 additions & 3 deletions
@@ -25,11 +25,13 @@
 
 
 async def main(query: str = "Greet Andrew and give him the current time") -> None:
-    """
-    Main function to run the agent
+    """Runs the Pydantic agent with a given query.
+
+    This function runs the Pydantic agent with the provided query and prints the
+    output.
 
     Args:
-        query (str): The query to run the agent with
+        query: The query to run the agent with.
     """
     async with agent.run_mcp_servers():
         result = await agent.run(query)
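
For context, `agent.run_mcp_servers()` implies an agent configured with MCP servers elsewhere in the file; a hedged, self-contained sketch of that Pydantic-AI pattern follows, with the model string and server arguments as assumptions.

```python
# Hedged sketch of the Pydantic-AI + MCP pattern shown in this diff.
# The model identifier and server arguments are assumptions for illustration.
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

server = MCPServerStdio("uv", args=["run", "mcp_servers/example_server.py", "stdio"])
agent = Agent("google-gla:gemini-2.0-flash", mcp_servers=[server])


async def main(query: str = "Greet Andrew and give him the current time") -> None:
    # run_mcp_servers() starts the configured stdio server(s) for the duration of the block
    async with agent.run_mcp_servers():
        result = await agent.run(query)
    print(result.output)  # attribute name may differ by pydantic-ai version (e.g. result.data)


if __name__ == "__main__":
    asyncio.run(main())
```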

agents_mcp_usage/multi_mcp/README.md

Lines changed: 97 additions & 4 deletions
@@ -1,8 +1,9 @@
-# Multi-MCP Usage
+# Multi-MCP Usage & Evaluation Suite
 
-This directory contains examples demonstrating the integration of tools from multiple Model Context Protocol (MCP) servers with various LLM agent frameworks.
+This directory contains examples demonstrating the integration of tools from multiple Model Context Protocol (MCP) servers with various LLM agent frameworks, along with a comprehensive evaluation and benchmarking system.
+
+Agents utilising multiple MCP servers can be dramatically more complex than an Agent using a single server: as the number of servers grows, so does the number of tools the Agent must reason about when and how to use. As a result, this component not only demonstrates an Agent's use of multiple MCP servers, but also includes a production-ready evaluation suite to validate performance, analyse costs, and compare models across multiple difficulty levels.
 
-Agents utilising multiple MCP servers can be dramatically more complex than an Agent using a single server. This is because as the number of servers grow the number of tools that the Agent must reason on when and how to use increases. As a result this component not only demonstrates an Agent's use of multiple MCP servers, but also includes evaluations to validate that they are being used to successfully accomplish the task according to various evaluation criterias.
 
 
 ## Quickstart
@@ -22,9 +23,15 @@ Agents utilising multiple MCP servers can be dramatically more complex than an A
 
 # Run the multi-MCP evaluation
 uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+
+# Run multi-model benchmarking
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro,gemini-2.0-flash" --runs 5 --parallel
+
+# Launch the evaluation dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
 ```
 
-5. Check the console output or Logfire for results.
+5. Check the console output, Logfire, or dashboard for results.
 
 
 ### Multi-MCP Architecture
@@ -180,6 +187,92 @@ Research in LLM agent development has identified tool overload as a significant
 
 The evaluation framework included in this component is essential for validating that agents can effectively navigate the increased complexity of multiple MCP servers. By measuring success against specific evaluation criteria, developers can ensure that the benefits of tool specialisation outweigh the potential pitfalls of tool overload.
 
+## Comprehensive Evaluation Suite
+
+The `eval_multi_mcp/` directory contains a production-ready evaluation system for benchmarking LLM agent performance across multiple frameworks and models. The suite tests agents on mermaid diagram correction tasks using multiple MCP servers, providing rich metrics and analysis capabilities.
+
+### Evaluation Components
+
+#### Core Modules
+- **`evals_pydantic_mcp.py`** - Single-model evaluation with comprehensive metrics collection
+- **`run_multi_evals.py`** - Multi-model parallel/sequential benchmarking with CSV export
+- **`merbench_ui.py`** - Interactive Streamlit dashboard for visualisation and analysis
+- **`dashboard_config.py`** - Configuration-driven UI setup for flexible dashboard customisation
+- **`costs.csv`** - Pricing integration for cost analysis and budget planning
+
+#### Test Difficulty Levels
+The evaluation includes three test cases of increasing complexity:
+1. **Easy** - Simple syntax errors in mermaid diagrams
+2. **Medium** - More complex structural issues requiring deeper reasoning
+3. **Hard** - Advanced mermaid syntax problems testing sophisticated tool usage
+
+#### Evaluation Metrics
+The system captures five key performance indicators:
+- **UsedBothMCPTools** - Validates proper coordination between multiple MCP servers
+- **UsageLimitNotExceeded** - Monitors resource consumption and efficiency
+- **MermaidDiagramValid** - Assesses technical correctness of outputs
+- **LLMJudge (Format)** - Evaluates response formatting and structure
+- **LLMJudge (Structure)** - Measures preservation of original diagram intent
+
+## Interactive Dashboard & Visualisation
+
+The Streamlit-based dashboard (`merbench_ui.py`) provides comprehensive analysis and comparison capabilities:
+
+### Dashboard Features
+- **Model Leaderboards** - Performance rankings by accuracy, cost efficiency, and execution speed
+- **Cost Analysis** - Detailed cost breakdowns with cost-per-success metrics and budget projections
+- **Failure Analysis** - Categorised failure reasons with debugging insights and error patterns
+- **Performance Trends** - Visualisation of model behaviour across difficulty levels and test iterations
+- **Resource Usage** - Token consumption patterns and API call efficiency metrics
+- **Comparative Analysis** - Side-by-side model performance comparison with statistical significance
+
+### Dashboard Quick Launch
+```bash
+# Launch the interactive evaluation dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+```
+
+The dashboard automatically loads evaluation results from the `mermaid_eval_results/` directory, providing immediate insights into model performance and cost efficiency.
+
+## Multi-Model Benchmarking
+
+The `run_multi_evals.py` script enables systematic comparison across multiple LLM models with flexible execution options:
+
+### Benchmarking Features
+- **Parallel Execution** - Simultaneous evaluation across models for faster results
+- **Sequential Mode** - Conservative execution for resource-constrained environments
+- **Configurable Runs** - Multiple iterations per model for statistical reliability
+- **Comprehensive Error Handling** - Robust retry logic with exponential backoff
+- **CSV Export** - Structured results for downstream analysis and reporting
+
+### Example Benchmarking Commands
+
+```bash
+# Parallel benchmarking across multiple models
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+  --models "gemini-2.5-pro,gemini-2.0-flash,gemini-2.5-flash-preview-04-17" \
+  --runs 5 \
+  --parallel \
+  --timeout 600 \
+  --output-dir ./benchmark_results
+
+# Sequential execution with custom judge model
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+  --models "gemini-2.5-pro,claude-3-opus" \
+  --runs 3 \
+  --sequential \
+  --judge-model "gemini-2.5-pro" \
+  --output-dir ./comparative_analysis
+```
+
+### Output Structure
+Results are organised with timestamped files:
+- **Individual model results** - `YYYY-MM-DD_HH-MM-SS_individual_{model}.csv`
+- **Combined analysis** - `YYYY-MM-DD_HH-MM-SS_combined_results.csv`
+- **Dashboard integration** - Automatic loading into visualisation interface
+
+The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research investigations and production model selection decisions.
+
 ## Example Files
 
 ### Pydantic-AI Multi-MCP
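
Since results are exported to the timestamped CSVs listed above, they can also be analysed outside the dashboard. A small sketch is shown below; the `model`, `success`, and `cost` column names are assumptions rather than the actual schema.

```python
# Hedged sketch: load the most recent combined benchmarking CSV and build a simple leaderboard.
# The file name pattern comes from the README above; column names are assumptions.
from pathlib import Path

import pandas as pd

results_dir = Path("benchmark_results")
# e.g. 2025-01-01_12-00-00_combined_results.csv
latest = sorted(results_dir.glob("*_combined_results.csv"))[-1]
df = pd.read_csv(latest)

# Example aggregation: success rate and mean cost per model (assumed columns)
leaderboard = (
    df.groupby("model")
    .agg(success_rate=("success", "mean"), mean_cost=("cost", "mean"))
    .sort_values("success_rate", ascending=False)
)
print(leaderboard)
```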
