Commit b97b853

docs: Update with mention of evals and dashboard

1 parent 09cf03a commit b97b853

File tree: 2 files changed, +155 -10 lines changed


README.md

Lines changed: 58 additions & 6 deletions
@@ -1,15 +1,18 @@
-# Model Context Protocol (MCP) Agent Frameworks Demo
+# Model Context Protocol (MCP) Agent Frameworks Demo & Benchmarking Platform
 
 This repository demonstrates LLM Agents using tools from Model Context Protocol (MCP) servers with several frameworks:
 - Google Agent Development Kit (ADK)
 - LangGraph Agents
 - OpenAI Agents
 - Pydantic-AI Agents
 
-Both single and multiple MCP server examples are demonstrated
-- [Agent with a single MCP server](agents_mcp_usage/basic_mcp/README.md)
-- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md)
-- Also includes Agent evaluations
+## Repository Structure
+
+- [Agent with a single MCP server](agents_mcp_usage/basic_mcp/README.md) - Learning examples and basic patterns
+- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md) - Advanced usage with a comprehensive evaluation suite
+- **Evaluation Dashboard**: Interactive Streamlit UI for model comparison
+- **Multi-Model Benchmarking**: Parallel/sequential evaluation across multiple LLMs
+- **Rich Metrics**: Usage analysis, cost comparison, and performance leaderboards
 
 The repo also includes Python MCP Servers:
 - [`example_server.py`](mcp_servers/example_server.py) based on [MCP Python SDK Quickstart](https://github.com/modelcontextprotocol/python-sdk/blob/b4c7db6a50a5c88bae1db5c1f7fba44d16eebc6e/README.md?plain=1#L104) - Modified to include a datetime tool and run as a server invoked by Agents
@@ -217,10 +220,59 @@ uv run agents_mcp_usage/multi_mcp/multi_mcp_use/pydantic_mcp.py
 
 # Run the multi-MCP evaluation
 uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+
+# Run multi-model benchmarking
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro,gemini-2.0-flash" --runs 5 --parallel
+
+# Launch the evaluation dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
 ```
 
 More details on multi-MCP implementation can be found in the [multi_mcp README](agents_mcp_usage/multi_mcp/README.md).
 
+## Evaluation Suite & Benchmarking Dashboard
+
+This repository includes a comprehensive evaluation system for benchmarking LLM agent performance across multiple frameworks and models. The evaluation suite tests agents on mermaid diagram correction tasks using multiple MCP servers, providing rich metrics and analysis capabilities.
+
+### Key Evaluation Features
+
+- **Multi-Level Difficulty**: Easy, medium, and hard test cases for comprehensive assessment
+- **Multi-Model Benchmarking**: Parallel or sequential evaluation across multiple LLM models
+- **Interactive Dashboard**: Streamlit-based UI for visualising results, cost analysis, and model comparison
+- **Rich Metrics Collection**: Token usage, cost analysis, success rates, and failure categorisation
+- **Robust Error Handling**: Comprehensive retry logic and detailed failure analysis
+- **Export Capabilities**: CSV results for downstream analysis and reporting
+
+### Dashboard Features
+
+The included Streamlit dashboard (`merbench_ui.py`) provides:
+
+- **Model Leaderboards**: Performance rankings by accuracy, cost efficiency, and speed
+- **Cost Analysis**: Detailed cost breakdowns and cost-per-success metrics
+- **Failure Analysis**: Categorised failure reasons with debugging insights
+- **Performance Trends**: Visualisation of model behaviour across difficulty levels
+- **Resource Usage**: Token consumption and API call patterns
+- **Comparative Analysis**: Side-by-side model performance comparison
+
+### Quick Evaluation Commands
+
+```bash
+# Single model evaluation
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+
+# Multi-model parallel benchmarking
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+  --models "gemini-2.5-pro,gemini-2.0-flash,gemini-2.5-flash-preview-04-17" \
+  --runs 5 \
+  --parallel \
+  --output-dir ./results
+
+# Launch interactive dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+```
+
+The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research and production model selection decisions.
+
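Because the suite exports CSV results for downstream analysis, a leaderboard of the kind the dashboard shows can also be rebuilt by hand. The sketch below uses pandas; the file name and the `model`, `success`, and `total_cost` column names are assumptions for illustration, not the actual export schema.

```python
# Rough sketch: rebuild a success/cost leaderboard from an exported results CSV.
# File name and column names (model, success, total_cost) are assumed for illustration.
import pandas as pd

df = pd.read_csv("results/2025-01-01_12-00-00_combined_results.csv")

leaderboard = (
    df.groupby("model")
    .agg(
        runs=("success", "size"),
        success_rate=("success", "mean"),
        total_cost=("total_cost", "sum"),
    )
    .sort_values(["success_rate", "total_cost"], ascending=[False, True])
)
# Guard against division by zero when a model never succeeds.
leaderboard["cost_per_success"] = leaderboard["total_cost"] / (
    (leaderboard["runs"] * leaderboard["success_rate"]).clip(lower=1)
)
print(leaderboard)
```
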
 ## What is MCP?
 
 The Model Context Protocol allows applications to provide context for LLMs in a standardised way, separating the concerns of providing context from the actual LLM interaction.
@@ -258,4 +310,4 @@ A key advantage highlighted is flexibility; MCP allows developers to more easily
 - OpenTelemetry support for leveraging existing tooling
 - Pydantic integration for analytics on validations
 
-Logfire gives you visibility into how your code is running, which is especially valuable for LLM applications where understanding model behaviour is critical.
+Logfire gives you visibility into how your code is running, which is especially valuable for LLM applications where understanding model behaviour is critical.

agents_mcp_usage/multi_mcp/README.md

Lines changed: 97 additions & 4 deletions
@@ -1,8 +1,9 @@
-# Multi-MCP Usage
+# Multi-MCP Usage & Evaluation Suite
 
-This directory contains examples demonstrating the integration of tools from multiple Model Context Protocol (MCP) servers with various LLM agent frameworks.
+This directory contains examples demonstrating the integration of tools from multiple Model Context Protocol (MCP) servers with various LLM agent frameworks, along with a comprehensive evaluation and benchmarking system.
+
+Agents utilising multiple MCP servers can be dramatically more complex than an Agent using a single server: as the number of servers grows, so does the number of tools the Agent must reason about when deciding which to use and how. As a result, this component not only demonstrates an Agent's use of multiple MCP servers, but also includes a production-ready evaluation suite to validate performance, analyse costs, and compare models across multiple difficulty levels.
 
-Agents utilising multiple MCP servers can be dramatically more complex than an Agent using a single server. This is because as the number of servers grow the number of tools that the Agent must reason on when and how to use increases. As a result this component not only demonstrates an Agent's use of multiple MCP servers, but also includes evaluations to validate that they are being used to successfully accomplish the task according to various evaluation criterias.
 
 
 ## Quickstart
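For orientation, this is roughly what "an Agent using multiple MCP servers" looks like in code. It is a minimal sketch assuming pydantic-ai's MCP client interface (`MCPServerStdio`, `mcp_servers=...`); the second server path, the model name, and the prompt are illustrative rather than taken from the repository's own scripts.

```python
# Minimal sketch: one pydantic-ai Agent attached to two stdio MCP servers.
# Server paths, model name, and prompt are illustrative.
import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

time_server = MCPServerStdio("uv", args=["run", "mcp_servers/example_server.py"])
diagram_server = MCPServerStdio("uv", args=["run", "mcp_servers/mermaid_server.py"])  # hypothetical path

agent = Agent(
    "google-gla:gemini-2.0-flash",
    mcp_servers=[time_server, diagram_server],
    system_prompt="Use the available MCP tools to answer.",
)


async def main() -> None:
    # Starts both servers and exposes their tools to the agent for the duration of the block.
    async with agent.run_mcp_servers():
        result = await agent.run("What is the current time?")
        print(result.output)  # older pydantic-ai releases expose this as `.data`


asyncio.run(main())
```

With two servers attached, every tool from both is offered to the model on each step, which is exactly the tool-selection pressure the evaluation suite below is designed to measure.
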
@@ -22,9 +23,15 @@ Agents utilising multiple MCP servers can be dramatically more complex than an A
 
 # Run the multi-MCP evaluation
 uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+
+# Run multi-model benchmarking
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro,gemini-2.0-flash" --runs 5 --parallel
+
+# Launch the evaluation dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
 ```
 
-5. Check the console output or Logfire for results.
+5. Check the console output, Logfire, or dashboard for results.
 
 
 ### Multi-MCP Architecture
@@ -180,6 +187,92 @@ Research in LLM agent development has identified tool overload as a significant
 
 The evaluation framework included in this component is essential for validating that agents can effectively navigate the increased complexity of multiple MCP servers. By measuring success against specific evaluation criteria, developers can ensure that the benefits of tool specialisation outweigh the potential pitfalls of tool overload.
 
+## Comprehensive Evaluation Suite
+
+The `eval_multi_mcp/` directory contains a production-ready evaluation system for benchmarking LLM agent performance across multiple frameworks and models. The suite tests agents on mermaid diagram correction tasks using multiple MCP servers, providing rich metrics and analysis capabilities.
+
+### Evaluation Components
+
+#### Core Modules
+- **`evals_pydantic_mcp.py`** - Single-model evaluation with comprehensive metrics collection
+- **`run_multi_evals.py`** - Multi-model parallel/sequential benchmarking with CSV export
+- **`merbench_ui.py`** - Interactive Streamlit dashboard for visualisation and analysis
+- **`dashboard_config.py`** - Configuration-driven UI setup for flexible dashboard customisation
+- **`costs.csv`** - Pricing integration for cost analysis and budget planning
+
+#### Test Difficulty Levels
+The evaluation includes three test cases of increasing complexity:
+1. **Easy** - Simple syntax errors in mermaid diagrams
+2. **Medium** - More complex structural issues requiring deeper reasoning
+3. **Hard** - Advanced mermaid syntax problems testing sophisticated tool usage
+
+#### Evaluation Metrics
+The system captures five key performance indicators:
+- **UsedBothMCPTools** - Validates proper coordination between multiple MCP servers
+- **UsageLimitNotExceeded** - Monitors resource consumption and efficiency
+- **MermaidDiagramValid** - Assesses technical correctness of outputs
+- **LLMJudge (Format)** - Evaluates response formatting and structure
+- **LLMJudge (Structure)** - Measures preservation of original diagram intent
+
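For a sense of what one of these indicators looks like in code, here is a hedged sketch in the spirit of `MermaidDiagramValid`, assuming the pydantic-evals `Evaluator`/`EvaluatorContext` interface; the repository's actual evaluators may differ and are likely richer.

```python
# Hedged sketch of a pass/fail indicator in the spirit of MermaidDiagramValid,
# assuming the pydantic-evals Evaluator interface; the repo's real evaluators may differ.
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class LooksLikeMermaid(Evaluator):
    """Scores 1.0 when the output begins with a recognised mermaid diagram header."""

    def evaluate(self, ctx: EvaluatorContext) -> float:
        output = str(ctx.output or "").strip()
        return 1.0 if output.startswith(("graph", "flowchart", "sequenceDiagram")) else 0.0
```

In pydantic-evals, checks of this shape sit on dataset cases alongside built-in judges such as `LLMJudge`.
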
+## Interactive Dashboard & Visualisation
+
+The Streamlit-based dashboard (`merbench_ui.py`) provides comprehensive analysis and comparison capabilities:
+
+### Dashboard Features
+- **Model Leaderboards** - Performance rankings by accuracy, cost efficiency, and execution speed
+- **Cost Analysis** - Detailed cost breakdowns with cost-per-success metrics and budget projections
+- **Failure Analysis** - Categorised failure reasons with debugging insights and error patterns
+- **Performance Trends** - Visualisation of model behaviour across difficulty levels and test iterations
+- **Resource Usage** - Token consumption patterns and API call efficiency metrics
+- **Comparative Analysis** - Side-by-side model performance comparison with statistical significance
+
+### Dashboard Quick Launch
+```bash
+# Launch the interactive evaluation dashboard
+uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+```
+
+The dashboard automatically loads evaluation results from the `mermaid_eval_results/` directory, providing immediate insights into model performance and cost efficiency.
+
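Conceptually, that auto-loading step amounts to globbing the results directory and concatenating whatever CSVs it finds; a simplified sketch follows (the real dashboard's loading and schema handling is more involved).

```python
# Simplified sketch of "load every results CSV in the output directory".
from pathlib import Path

import pandas as pd

results_dir = Path("mermaid_eval_results")
frames = [pd.read_csv(path) for path in sorted(results_dir.glob("*.csv"))]
all_results = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(f"Loaded {len(all_results)} rows from {len(frames)} files")
```
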
+## Multi-Model Benchmarking
+
+The `run_multi_evals.py` script enables systematic comparison across multiple LLM models with flexible execution options:
+
+### Benchmarking Features
+- **Parallel Execution** - Simultaneous evaluation across models for faster results
+- **Sequential Mode** - Conservative execution for resource-constrained environments
+- **Configurable Runs** - Multiple iterations per model for statistical reliability
+- **Comprehensive Error Handling** - Robust retry logic with exponential backoff
+- **CSV Export** - Structured results for downstream analysis and reporting
+
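The parallel/sequential switch can be pictured as one evaluation task per model run under an optional concurrency limit. The sketch below is a generic asyncio illustration of that idea, not the script's actual implementation; retry and backoff are omitted.

```python
# Generic asyncio illustration of the parallel vs sequential switch:
# one evaluation task per model, gated by a semaphore. Not the actual run_multi_evals.py code.
import asyncio


async def evaluate_model(model: str, runs: int) -> dict:
    await asyncio.sleep(0.1)  # stand-in for a full evaluation of `runs` iterations
    return {"model": model, "runs": runs, "success_rate": 1.0}


async def run_all(models: list[str], runs: int, parallel: bool) -> list[dict]:
    # Parallel mode allows all models at once; sequential mode limits concurrency to one.
    gate = asyncio.Semaphore(len(models) if parallel else 1)

    async def bounded(model: str) -> dict:
        async with gate:
            return await evaluate_model(model, runs)

    return await asyncio.gather(*(bounded(m) for m in models))


print(asyncio.run(run_all(["gemini-2.5-pro", "gemini-2.0-flash"], runs=5, parallel=True)))
```
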
+### Example Benchmarking Commands
+
+```bash
+# Parallel benchmarking across multiple models
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+  --models "gemini-2.5-pro,gemini-2.0-flash,gemini-2.5-flash-preview-04-17" \
+  --runs 5 \
+  --parallel \
+  --timeout 600 \
+  --output-dir ./benchmark_results
+
+# Sequential execution with custom judge model
+uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+  --models "gemini-2.5-pro,claude-3-opus" \
+  --runs 3 \
+  --sequential \
+  --judge-model "gemini-2.5-pro" \
+  --output-dir ./comparative_analysis
+```
+
+### Output Structure
+Results are organised with timestamped files:
+- **Individual model results** - `YYYY-MM-DD_HH-MM-SS_individual_{model}.csv`
+- **Combined analysis** - `YYYY-MM-DD_HH-MM-SS_combined_results.csv`
+- **Dashboard integration** - Automatic loading into visualisation interface
+
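The naming convention above maps directly onto `datetime.strftime("%Y-%m-%d_%H-%M-%S")`. A small sketch of writing per-model and combined files in that shape follows; the rows are placeholders, not real evaluation output.

```python
# Small sketch of the timestamped output naming described above; rows are placeholders.
from datetime import datetime
from pathlib import Path

import pandas as pd

stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
out_dir = Path("benchmark_results")
out_dir.mkdir(exist_ok=True)

per_model = {
    "gemini-2.5-pro": pd.DataFrame([{"case": "easy", "success": True}]),
    "gemini-2.0-flash": pd.DataFrame([{"case": "easy", "success": False}]),
}

# One CSV per model, plus a combined file that keeps the model name as a column.
for model, frame in per_model.items():
    frame.to_csv(out_dir / f"{stamp}_individual_{model}.csv", index=False)

combined = pd.concat(
    [frame.assign(model=model) for model, frame in per_model.items()],
    ignore_index=True,
)
combined.to_csv(out_dir / f"{stamp}_combined_results.csv", index=False)
```
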
+The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research investigations and production model selection decisions.
+
 ## Example Files
 
 ### Pydantic-AI Multi-MCP
