This repository demonstrates LLM Agents using tools from Model Context Protocol (MCP) servers with several frameworks:

- Google Agent Development Kit (ADK)
- LangGraph Agents
- OpenAI Agents
- Pydantic-AI Agents

## Repository Structure

- [Agent with a single MCP server](agents_mcp_usage/basic_mcp/README.md) - Learning examples and basic patterns
- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md) - Advanced usage with comprehensive evaluation suite
  - **Evaluation Dashboard**: Interactive Streamlit UI for model comparison
  - **Multi-Model Benchmarking**: Parallel/sequential evaluation across multiple LLMs
  - **Rich Metrics**: Usage analysis, cost comparison, and performance leaderboards

The repo also includes Python MCP Servers:

- [`example_server.py`](mcp_servers/example_server.py) based on [MCP Python SDK Quickstart](https://github.com/modelcontextprotocol/python-sdk/blob/b4c7db6a50a5c88bae1db5c1f7fba44d16eebc6e/README.md?plain=1#L104) - Modified to include a datetime tool and run as a server invoked by Agents
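
As a rough illustration (not the repo's exact code), such a server can be sketched with the MCP Python SDK's `FastMCP` class; the tool name and details below are assumptions:

```python
# Hypothetical sketch of an MCP server exposing a datetime tool, modelled on the
# MCP Python SDK quickstart; the repo's example_server.py may differ in detail.
from datetime import datetime

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Example Server")


@mcp.tool()
def get_current_datetime() -> str:
    """Return the current date and time as an ISO 8601 string."""
    return datetime.now().isoformat()


if __name__ == "__main__":
    # Runs over stdio by default, so an Agent can launch this file as a subprocess.
    mcp.run()
```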

```bash
uv run agents_mcp_usage/multi_mcp/multi_mcp_use/pydantic_mcp.py

# Run the multi-MCP evaluation
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py

# Run multi-model benchmarking
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro,gemini-2.0-flash" --runs 5 --parallel

# Launch the evaluation dashboard
uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
```

More details on multi-MCP implementation can be found in the [multi_mcp README](agents_mcp_usage/multi_mcp/README.md).

## Evaluation Suite & Benchmarking Dashboard

This repository includes a comprehensive evaluation system for benchmarking LLM agent performance across multiple frameworks and models. The evaluation suite tests agents on mermaid diagram correction tasks using multiple MCP servers, providing rich metrics and analysis capabilities.

### Key Evaluation Features

- **Multi-Level Difficulty**: Easy, medium, and hard test cases for comprehensive assessment
- **Multi-Model Benchmarking**: Parallel or sequential evaluation across multiple LLM models
- **Interactive Dashboard**: Streamlit-based UI for visualising results, cost analysis, and model comparison

The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research and production model selection decisions.

## What is MCP?

The Model Context Protocol allows applications to provide context for LLMs in a standardised way, separating the concerns of providing context from the actual LLM interaction.
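
As a rough sketch of that separation in this repo's terms: an MCP server (like `example_server.py` above) defines the tools, and an agent framework connects to it as a client. The snippet below assumes pydantic-ai's MCP client support (`MCPServerStdio`); exact class and method names vary between versions, so treat it as indicative rather than the repo's actual code.

```python
# Indicative sketch only: pydantic-ai's MCP client API has changed across versions.
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Launch the example MCP server as a subprocess and expose its tools to the agent.
server = MCPServerStdio("uv", args=["run", "mcp_servers/example_server.py"])
agent = Agent("google-gla:gemini-2.0-flash", mcp_servers=[server])


async def main() -> None:
    async with agent.run_mcp_servers():
        result = await agent.run("What is the current date and time?")
        # Older pydantic-ai releases expose this as result.data instead of result.output.
        print(result.output)


asyncio.run(main())
```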
- OpenTelemetry support for leveraging existing tooling
- Pydantic integration for analytics on validations
Logfire gives you visibility into how your code is running, which is especially valuable for LLM applications where understanding model behaviour is critical.
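
A minimal setup sketch, assuming the `logfire` package and a configured Logfire project (instrumentation helper names can differ between versions):

```python
# Minimal Logfire setup sketch; requires a Logfire project/token to send traces.
import logfire

logfire.configure()

# Instrument pydantic-ai so agent runs, model calls, and tool calls appear as spans.
# (Helper name is version-dependent; some releases use Agent.instrument_all() instead.)
logfire.instrument_pydantic_ai()
```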

---

The sections below are from `agents_mcp_usage/multi_mcp/README.md`.

# Multi-MCP Usage & Evaluation Suite
This directory contains examples demonstrating the integration of tools from multiple Model Context Protocol (MCP) servers with various LLM agent frameworks, along with a comprehensive evaluation and benchmarking system.

Agents utilising multiple MCP servers can be dramatically more complex than an Agent using a single server: as the number of servers grows, so does the number of tools the Agent must reason about when and how to use. As a result, this component not only demonstrates an Agent's use of multiple MCP servers, but also includes a production-ready evaluation suite to validate performance, analyse costs, and compare models across multiple difficulty levels.
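
The wiring itself stays small even as the reasoning burden grows. A hedged sketch of an agent attached to two MCP servers follows (the second server path is hypothetical and the pydantic-ai MCP client API may differ by version; the real implementation lives in `multi_mcp_use/pydantic_mcp.py`):

```python
# Sketch only: the second server path is hypothetical and the client API is version-dependent.
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Each MCP server contributes its own tools; the agent must decide which to call and when.
example_server = MCPServerStdio("uv", args=["run", "mcp_servers/example_server.py"])
diagram_server = MCPServerStdio("uv", args=["run", "mcp_servers/diagram_server.py"])  # hypothetical

agent = Agent(
    "google-gla:gemini-2.0-flash",
    mcp_servers=[example_server, diagram_server],
)
```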
## Quickstart

```bash
# Run the multi-MCP evaluation
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py

# Run multi-model benchmarking
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro,gemini-2.0-flash" --runs 5 --parallel

# Launch the evaluation dashboard
uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
```

5. Check the console output, Logfire, or dashboard for results.

### Multi-MCP Architecture
The evaluation framework included in this component is essential for validating that agents can effectively navigate the increased complexity of multiple MCP servers. By measuring success against specific evaluation criteria, developers can ensure that the benefits of tool specialisation outweigh the potential pitfalls of tool overload.

## Comprehensive Evaluation Suite

The `eval_multi_mcp/` directory contains a production-ready evaluation system for benchmarking LLM agent performance across multiple frameworks and models. The suite tests agents on mermaid diagram correction tasks using multiple MCP servers, providing rich metrics and analysis capabilities.

### Evaluation Components

#### Core Modules

- **`evals_pydantic_mcp.py`** - Single-model evaluation with comprehensive metrics collection
- **`run_multi_evals.py`** - Multi-model parallel/sequential benchmarking with CSV export
- **`merbench_ui.py`** - Interactive Streamlit dashboard for visualisation and analysis
- **`dashboard_config.py`** - Configuration-driven UI setup for flexible dashboard customisation
- **`costs.csv`** - Pricing integration for cost analysis and budget planning

#### Test Difficulty Levels

The evaluation includes three test cases of increasing complexity:

1. **Easy** - Simple syntax errors in mermaid diagrams
2. **Medium** - More complex structural issues requiring deeper reasoning

The system captures five key performance indicators:

- **UsedBothMCPTools** - Validates proper coordination between multiple MCP servers
- **UsageLimitNotExceeded** - Monitors resource consumption and efficiency
- **MermaidDiagramValid** - Assesses technical correctness of outputs (see the sketch below)
- **LLMJudge (Format)** - Evaluates response formatting and structure
- **LLMJudge (Structure)** - Measures preservation of original diagram intent

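As an illustration of how one of these checks can be expressed, here is a hedged sketch of a `MermaidDiagramValid`-style custom evaluator in the pydantic-evals style; the context fields and scoring used by the real evaluators in `eval_multi_mcp/evals_pydantic_mcp.py` may differ.

```python
# Sketch of a pydantic-evals style custom evaluator; the real MermaidDiagramValid
# check in this repo may validate the diagram via an MCP tool rather than a heuristic.
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class MermaidDiagramValidSketch(Evaluator):
    """Score 1.0 when the output looks like mermaid source (heuristic only)."""

    def evaluate(self, ctx: EvaluatorContext) -> float:
        output = str(ctx.output).strip()
        # Crude heuristic: mermaid diagrams typically start with a diagram keyword.
        return 1.0 if output.startswith(("graph", "flowchart", "sequenceDiagram")) else 0.0
```
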
## Interactive Dashboard & Visualisation

The Streamlit-based dashboard (`merbench_ui.py`) provides comprehensive analysis and comparison capabilities:

### Dashboard Features

- **Model Leaderboards** - Performance rankings by accuracy, cost efficiency, and execution speed
- **Cost Analysis** - Detailed cost breakdowns with cost-per-success metrics and budget projections
- **Failure Analysis** - Categorised failure reasons with debugging insights and error patterns
- **Performance Trends** - Visualisation of model behaviour across difficulty levels and test iterations
- **Resource Usage** - Token consumption patterns and API call efficiency metrics
- **Comparative Analysis** - Side-by-side model performance comparison with statistical significance

### Dashboard Quick Launch

```bash
# Launch the interactive evaluation dashboard
uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
```

The dashboard automatically loads evaluation results from the `mermaid_eval_results/` directory, providing immediate insights into model performance and cost efficiency.

## Multi-Model Benchmarking

The `run_multi_evals.py` script enables systematic comparison across multiple LLM models with flexible execution options:

### Benchmarking Features

- **Parallel Execution** - Simultaneous evaluation across models for faster results
- **Sequential Mode** - Conservative execution for resource-constrained environments
- **Configurable Runs** - Multiple iterations per model for statistical reliability
- **Comprehensive Error Handling** - Robust retry logic with exponential backoff
- **CSV Export** - Structured results for downstream analysis and reporting (see the sketch below)

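Because results land in CSV, downstream analysis can use ordinary dataframe tooling. A hedged sketch follows (the file name and column names below are assumptions, not the exporter's actual schema):

```python
# Post-hoc analysis sketch; check the real CSV produced by run_multi_evals.py for
# its actual file name and columns before relying on these assumed ones.
import pandas as pd

df = pd.read_csv("mermaid_eval_results/results.csv")  # hypothetical path

# Example: mean score and run count per model, assuming "model" and "score" columns.
summary = df.groupby("model")["score"].agg(["mean", "count"])
print(summary.sort_values("mean", ascending=False))
```
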
### Example Benchmarking Commands

```bash
# Parallel benchmarking across multiple models
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
  --models "gemini-2.5-pro,gemini-2.0-flash" --runs 5 --parallel
```

- **Dashboard integration** - Automatic loading into visualisation interface

The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research investigations and production model selection decisions.