64 changes: 58 additions & 6 deletions README.md
@@ -1,15 +1,18 @@
# Model Context Protocol (MCP) Agent Frameworks Demo
# Model Context Protocol (MCP) Agent Frameworks Demo & Benchmarking Platform

This repository demonstrates LLM Agents using tools from Model Context Protocol (MCP) servers with several frameworks:
- Google Agent Development Kit (ADK)
- LangGraph Agents
- OpenAI Agents
- Pydantic-AI Agents

Both single and multiple MCP server examples are demonstrated
- [Agent with a single MCP server](agents_mcp_usage/basic_mcp/README.md)
- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md)
- Also includes Agent evaluations
## Repository Structure

- [Agent with a single MCP server](agents_mcp_usage/basic_mcp/README.md) - Learning examples and basic patterns
- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md) - Advanced usage with comprehensive evaluation suite
- **Evaluation Dashboard**: Interactive Streamlit UI for model comparison
- **Multi-Model Benchmarking**: Parallel/sequential evaluation across multiple LLMs
- **Rich Metrics**: Usage analysis, cost comparison, and performance leaderboards

The repo also includes Python MCP Servers:
- [`example_server.py`](mcp_servers/example_server.py) based on [MCP Python SDK Quickstart](https://github.com/modelcontextprotocol/python-sdk/blob/b4c7db6a50a5c88bae1db5c1f7fba44d16eebc6e/README.md?plain=1#L104) - Modified to include a datetime tool and run as a server invoked by Agents
@@ -217,10 +220,59 @@ uv run agents_mcp_usage/multi_mcp/multi_mcp_use/pydantic_mcp.py

# Run the multi-MCP evaluation
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py

# Run multi-model benchmarking
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro,gemini-2.0-flash" --runs 5 --parallel

# Launch the evaluation dashboard
uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
```

More details on multi-MCP implementation can be found in the [multi_mcp README](agents_mcp_usage/multi_mcp/README.md).

## Evaluation Suite & Benchmarking Dashboard

This repository includes a comprehensive evaluation system for benchmarking LLM agent performance across multiple frameworks and models. The evaluation suite tests agents on mermaid diagram correction tasks using multiple MCP servers, providing rich metrics and analysis capabilities.

### Key Evaluation Features

- **Multi-Level Difficulty**: Easy, medium, and hard test cases for comprehensive assessment
- **Multi-Model Benchmarking**: Parallel or sequential evaluation across multiple LLM models
- **Interactive Dashboard**: Streamlit-based UI for visualising results, cost analysis, and model comparison
- **Rich Metrics Collection**: Token usage, cost analysis, success rates, and failure categorisation
- **Robust Error Handling**: Comprehensive retry logic and detailed failure analysis
- **Export Capabilities**: CSV results for downstream analysis and reporting

### Dashboard Features

The included Streamlit dashboard (`merbench_ui.py`) provides:

- **Model Leaderboards**: Performance rankings by accuracy, cost efficiency, and speed
- **Cost Analysis**: Detailed cost breakdowns and cost-per-success metrics
- **Failure Analysis**: Categorised failure reasons with debugging insights
- **Performance Trends**: Visualisation of model behaviour across difficulty levels
- **Resource Usage**: Token consumption and API call patterns
- **Comparative Analysis**: Side-by-side model performance comparison

### Quick Evaluation Commands

```bash
# Single model evaluation
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py

# Multi-model parallel benchmarking
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
--models "gemini-2.5-pro,gemini-2.0-flash,gemini-2.5-flash-preview-04-17" \
--runs 5 \
--parallel \
--output-dir ./results

# Launch interactive dashboard
uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
```

The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research and production model selection decisions.

## What is MCP?

The Model Context Protocol allows applications to provide context for LLMs in a standardised way, separating the concerns of providing context from the actual LLM interaction.
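A minimal server sketch illustrates the idea (assuming the MCP Python SDK's `FastMCP` interface used by [`example_server.py`](mcp_servers/example_server.py); the tool shown here is illustrative rather than the repository's exact implementation):

```python
# Minimal MCP server sketch (assumed FastMCP interface; illustrative tool).
from datetime import datetime

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Demo")


@mcp.tool()
def get_current_time() -> str:
    """Return the current time as an ISO-8601 string."""
    return datetime.now().isoformat()


if __name__ == "__main__":
    # Run over stdio so an agent framework can spawn this server as a subprocess.
    mcp.run(transport="stdio")
```

An agent framework then connects over stdio, discovers `get_current_time`, and can call it like any other tool.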
@@ -258,4 +310,4 @@ A key advantage highlighted is flexibility; MCP allows developers to more easily
- OpenTelemetry support for leveraging existing tooling
- Pydantic integration for analytics on validations

Logfire gives you visibility into how your code is running, which is especially valuable for LLM applications where understanding model behaviour is critical.
Logfire gives you visibility into how your code is running, which is especially valuable for LLM applications where understanding model behaviour is critical.
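As a rough illustration, instrumentation can be as light as configuring Logfire and wrapping the agent call in a span (a minimal sketch using only `logfire.configure()`, `logfire.span()`, and `logfire.info()`; the agent call itself is elided):

```python
# Hedged sketch: basic Logfire tracing around an agent invocation.
import asyncio

import logfire

logfire.configure()  # picks up credentials from the environment if present


async def traced_run(query: str) -> None:
    with logfire.span("agent run", query=query):
        # ... await your agent here, e.g. `result = await agent.run(query)` ...
        logfire.info("agent finished")


if __name__ == "__main__":
    asyncio.run(traced_run("Greet Andrew and give him the current time"))
```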
9 changes: 6 additions & 3 deletions agents_mcp_usage/basic_mcp/basic_mcp_use/adk_mcp.py
@@ -20,11 +20,14 @@


async def main(query: str = "Greet Andrew and give him the current time") -> None:
"""
Main function to run the agent
"""Runs the agent with a given query.

This function sets up the MCP server, creates an LLM agent, and runs it
with a specified query. It also handles the cleanup of the MCP server
connection.

Args:
query (str): The query to run the agent with
query: The query to run the agent with.
"""
# Set up MCP server connection
server_params = StdioServerParameters(
10 changes: 6 additions & 4 deletions agents_mcp_usage/basic_mcp/basic_mcp_use/langgraph_mcp.py
@@ -21,7 +21,7 @@
# Create server parameters for stdio connection
server = StdioServerParameters(
command="uv",
args=["run", "mcp_servers/example_server.py", "stdio"],
args=["run", "mcp_servers/example_server.py", "stdio"],
)

model = ChatGoogleGenerativeAI(
Expand All @@ -30,11 +30,13 @@


async def main(query: str = "Greet Andrew and give him the current time") -> None:
"""
Main function to run the agent
"""Runs the LangGraph agent with a given query.

This function connects to the MCP server, loads the tools, creates a
LangGraph agent, and invokes it with the provided query.

Args:
query (str): The query to run the agent with
query: The query to run the agent with.
"""
async with stdio_client(server) as (read, write):
async with ClientSession(read, write) as session:
8 changes: 5 additions & 3 deletions agents_mcp_usage/basic_mcp/basic_mcp_use/oai-agent_mcp.py
@@ -14,11 +14,13 @@


async def main(query: str = "Greet Andrew and give him the current time") -> None:
"""
Main function to run the agent
"""Runs the OpenAI agent with a given query.

This function creates an MCP server, initializes an OpenAI agent with the
server, and runs the agent with the provided query.

Args:
query (str): The query to run the agent with
query: The query to run the agent with.
"""
# Create and use the MCP server in an async context
async with MCPServerStdio(
8 changes: 5 additions & 3 deletions agents_mcp_usage/basic_mcp/basic_mcp_use/pydantic_mcp.py
@@ -25,11 +25,13 @@


async def main(query: str = "Greet Andrew and give him the current time") -> None:
"""
Main function to run the agent
"""Runs the Pydantic agent with a given query.

This function runs the Pydantic agent with the provided query and prints the
output.

Args:
query (str): The query to run the agent with
query: The query to run the agent with.
"""
async with agent.run_mcp_servers():
result = await agent.run(query)
101 changes: 97 additions & 4 deletions agents_mcp_usage/multi_mcp/README.md
@@ -1,8 +1,9 @@
# Multi-MCP Usage
# Multi-MCP Usage & Evaluation Suite

This directory contains examples demonstrating the integration of tools from multiple Model Context Protocol (MCP) servers with various LLM agent frameworks.
This directory contains examples demonstrating the integration of tools from multiple Model Context Protocol (MCP) servers with various LLM agent frameworks, along with a comprehensive evaluation and benchmarking system.

Agents utilising multiple MCP servers can be dramatically more complex than an Agent using a single server, because as the number of servers grows, so does the number of tools the Agent must decide when and how to use. As a result, this component not only demonstrates an Agent's use of multiple MCP servers, but also includes a production-ready evaluation suite to validate performance, analyse costs, and compare models across multiple difficulty levels.

Agents utilising multiple MCP servers can be dramatically more complex than an Agent using a single server. This is because as the number of servers grow the number of tools that the Agent must reason on when and how to use increases. As a result this component not only demonstrates an Agent's use of multiple MCP servers, but also includes evaluations to validate that they are being used to successfully accomplish the task according to various evaluation criterias.


## Quickstart
@@ -22,9 +23,15 @@ Agents utilising multiple MCP servers can be dramatically more complex than an A

# Run the multi-MCP evaluation
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py

# Run multi-model benchmarking
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro,gemini-2.0-flash" --runs 5 --parallel

# Launch the evaluation dashboard
uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
```

5. Check the console output or Logfire for results.
5. Check the console output, Logfire, or the dashboard for results.


### Multi-MCP Architecture
@@ -180,6 +187,92 @@ Research in LLM agent development has identified tool overload as a significant

The evaluation framework included in this component is essential for validating that agents can effectively navigate the increased complexity of multiple MCP servers. By measuring success against specific evaluation criteria, developers can ensure that the benefits of tool specialisation outweigh the potential pitfalls of tool overload.

## Comprehensive Evaluation Suite

The `eval_multi_mcp/` directory contains a production-ready evaluation system for benchmarking LLM agent performance across multiple frameworks and models. The suite tests agents on mermaid diagram correction tasks using multiple MCP servers, providing rich metrics and analysis capabilities.

### Evaluation Components

#### Core Modules
- **`evals_pydantic_mcp.py`** - Single-model evaluation with comprehensive metrics collection
- **`run_multi_evals.py`** - Multi-model parallel/sequential benchmarking with CSV export
- **`merbench_ui.py`** - Interactive Streamlit dashboard for visualisation and analysis
- **`dashboard_config.py`** - Configuration-driven UI setup for flexible dashboard customisation
- **`costs.csv`** - Pricing integration for cost analysis and budget planning

#### Test Difficulty Levels
The evaluation includes three test cases of increasing complexity:
1. **Easy** - Simple syntax errors in mermaid diagrams
2. **Medium** - More complex structural issues requiring deeper reasoning
3. **Hard** - Advanced mermaid syntax problems testing sophisticated tool usage
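
A hedged sketch of how such difficulty-levelled cases might be expressed with pydantic-evals (assumed `Case`/`Dataset` interface; the diagrams and names are illustrative, not the repository's actual test cases):

```python
# Hedged sketch: difficulty-levelled mermaid-correction cases (illustrative).
from pydantic_evals import Case, Dataset

cases = [
    Case(
        name="easy_invalid_arrow",
        inputs="graph TD; A -> B",  # '->' is not valid mermaid flowchart syntax
        metadata={"difficulty": "easy"},
    ),
    Case(
        name="medium_unclosed_subgraph",
        inputs="graph TD\nsubgraph One\nA-->B",  # missing closing 'end'
        metadata={"difficulty": "medium"},
    ),
]

dataset = Dataset(cases=cases)  # evaluators can be attached at dataset level
```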

#### Evaluation Metrics
The system captures five key performance indicators:
- **UsedBothMCPTools** - Validates proper coordination between multiple MCP servers
- **UsageLimitNotExceeded** - Monitors resource consumption and efficiency
- **MermaidDiagramValid** - Assesses technical correctness of outputs
- **LLMJudge (Format)** - Evaluates response formatting and structure
- **LLMJudge (Structure)** - Measures preservation of original diagram intent
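
A hedged sketch of what one such custom evaluator could look like, assuming pydantic-evals' `Evaluator`/`EvaluatorContext` interface; the output shape and server prefixes are assumptions, not the repository's actual implementation:

```python
# Hedged sketch of a custom evaluator in the style of the metrics above.
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class UsedBothMCPTools(Evaluator):
    """Passes only if the agent called tools from both MCP servers."""

    first_server_prefix: str = "example"   # assumed tool-name prefixes
    second_server_prefix: str = "mermaid"

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Assumes the task output records the names of the tools that were called.
        tool_names = getattr(ctx.output, "tool_names", [])
        used_first = any(n.startswith(self.first_server_prefix) for n in tool_names)
        used_second = any(n.startswith(self.second_server_prefix) for n in tool_names)
        return used_first and used_second
```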

## Interactive Dashboard & Visualisation

The Streamlit-based dashboard (`merbench_ui.py`) provides comprehensive analysis and comparison capabilities:

### Dashboard Features
- **Model Leaderboards** - Performance rankings by accuracy, cost efficiency, and execution speed
- **Cost Analysis** - Detailed cost breakdowns with cost-per-success metrics and budget projections
- **Failure Analysis** - Categorised failure reasons with debugging insights and error patterns
- **Performance Trends** - Visualisation of model behaviour across difficulty levels and test iterations
- **Resource Usage** - Token consumption patterns and API call efficiency metrics
- **Comparative Analysis** - Side-by-side model performance comparison with statistical significance

### Dashboard Quick Launch
```bash
# Launch the interactive evaluation dashboard
uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
```

The dashboard automatically loads evaluation results from the `mermaid_eval_results/` directory, providing immediate insights into model performance and cost efficiency.

## Multi-Model Benchmarking

The `run_multi_evals.py` script enables systematic comparison across multiple LLM models with flexible execution options:

### Benchmarking Features
- **Parallel Execution** - Simultaneous evaluation across models for faster results
- **Sequential Mode** - Conservative execution for resource-constrained environments
- **Configurable Runs** - Multiple iterations per model for statistical reliability
- **Comprehensive Error Handling** - Robust retry logic with exponential backoff
- **CSV Export** - Structured results for downstream analysis and reporting
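
A hedged sketch of the parallel/retry pattern behind these features, using only the standard library; the function names and result shapes are illustrative rather than the actual `run_multi_evals.py` implementation:

```python
# Hedged sketch: parallel multi-model runs with exponential-backoff retries.
import asyncio
import random


async def evaluate_model(model: str, runs: int, max_retries: int = 3) -> dict:
    """Run `runs` evaluations for one model, retrying failed runs with backoff."""
    results = []
    for run in range(runs):
        for attempt in range(max_retries):
            try:
                # ... call the single-model evaluation here ...
                results.append({"model": model, "run": run, "success": True})
                break
            except Exception:
                # Exponential backoff with jitter before the next attempt
                await asyncio.sleep(2**attempt + random.random())
    return {"model": model, "results": results}


async def benchmark(models: list[str], runs: int, parallel: bool) -> list[dict]:
    if parallel:
        return await asyncio.gather(*(evaluate_model(m, runs) for m in models))
    return [await evaluate_model(m, runs) for m in models]
```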

### Example Benchmarking Commands

```bash
# Parallel benchmarking across multiple models
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
--models "gemini-2.5-pro,gemini-2.0-flash,gemini-2.5-flash-preview-04-17" \
--runs 5 \
--parallel \
--timeout 600 \
--output-dir ./benchmark_results

# Sequential execution with custom judge model
uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
--models "gemini-2.5-pro,claude-3-opus" \
--runs 3 \
--sequential \
--judge-model "gemini-2.5-pro" \
--output-dir ./comparative_analysis
```

### Output Structure
Results are organised with timestamped files:
- **Individual model results** - `YYYY-MM-DD_HH-MM-SS_individual_{model}.csv`
- **Combined analysis** - `YYYY-MM-DD_HH-MM-SS_combined_results.csv`
- **Dashboard integration** - Automatic loading into visualisation interface
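
For downstream analysis outside the dashboard, the combined CSV can be loaded directly (a hedged sketch; the results path and `model` column name are assumptions, so check your `--output-dir` and the actual CSV headers):

```python
# Hedged sketch: loading the most recent combined-results CSV for ad-hoc analysis.
from pathlib import Path

import pandas as pd

results_dir = Path("mermaid_eval_results")  # adjust to your --output-dir
combined = sorted(results_dir.glob("*_combined_results.csv"))

if combined:
    df = pd.read_csv(combined[-1])          # most recent combined run
    print(df.groupby("model").size())       # e.g. number of rows per model
```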

The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research investigations and production model selection decisions.

## Example Files

### Pydantic-AI Multi-MCP