
Commit e60b380

Merge pull request #8 from andrewginns/refactor-evals
Refactor: Move evaluation suite to dedicated evaluations module
2 parents 0ba54fc + dc2077f commit e60b380

13 files changed: +317 −141 lines changed

Makefile

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ lint:
 	uv run ruff check .

 leaderboard:
-	uv run -- streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+	uv run -- streamlit run agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py

 adk_basic_ui:
 	uv run adk web agents_mcp_usage/basic_mcp

README.md

Lines changed: 14 additions & 9 deletions
@@ -9,7 +9,8 @@ This repository demonstrates LLM Agents using tools from Model Context Protocol
 ## Repository Structure

 - [Agent with a single MCP server](agents_mcp_usage/basic_mcp/README.md) - Learning examples and basic patterns
-- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md) - Advanced usage with comprehensive evaluation suite
+- [Agent with multiple MCP servers](agents_mcp_usage/multi_mcp/README.md) - Advanced usage with MCP server coordination
+- [Evaluation suite](agents_mcp_usage/evaluations/mermaid_evals/README.md) - Comprehensive benchmarking tools
   - **Evaluation Dashboard**: Interactive Streamlit UI for model comparison
   - **Multi-Model Benchmarking**: Parallel/sequential evaluation across multiple LLMs
   - **Rich Metrics**: Usage analysis, cost comparison, and performance leaderboards
@@ -67,8 +68,12 @@ This project aims to teach:
 - **[agents_mcp_usage/multi_mcp/](agents_mcp_usage/multi_mcp/)** - Advanced multi-MCP server integration examples
   - **multi_mcp_use/** - Contains examples of using multiple MCP servers simultaneously:
     - `pydantic_mcp.py` - Example of using multiple MCP servers with Pydantic-AI Agent
-  - **eval_multi_mcp/** - Contains evaluation examples for multi-MCP usage:
-    - `evals_pydantic_mcp.py` - Example of evaluating the use of multiple MCP servers with Pydantic-AI
+
+- **[agents_mcp_usage/evaluations/](agents_mcp_usage/evaluations/)** - Evaluation modules for benchmarking
+  - **mermaid_evals/** - Comprehensive evaluation suite for mermaid diagram fixing tasks
+    - `evals_pydantic_mcp.py` - Core evaluation module for single-model testing
+    - `run_multi_evals.py` - Multi-model benchmarking with parallel execution
+    - `merbench_ui.py` - Interactive dashboard for result visualization

 - **Demo Python MCP Servers**
   - `mcp_servers/example_server.py` - Simple MCP server that runs locally, implemented in Python
@@ -221,13 +226,13 @@ graph LR
 uv run agents_mcp_usage/multi_mcp/multi_mcp_use/pydantic_mcp.py

 # Run the multi-MCP evaluation
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py

 # Run multi-model benchmarking
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py --models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash" --runs 5 --parallel
+uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py --models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash" --runs 5 --parallel

 # Launch the evaluation dashboard
-uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+uv run streamlit run agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py
 ```

 More details on multi-MCP implementation can be found in the [multi_mcp README](agents_mcp_usage/multi_mcp/README.md).
@@ -260,17 +265,17 @@ The included Streamlit dashboard (`merbench_ui.py`) provides:

 ```bash
 # Single model evaluation
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py

 # Multi-model parallel benchmarking
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py \
   --models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash,gemini-2.5-flash" \
   --runs 5 \
   --parallel \
   --output-dir ./results

 # Launch interactive dashboard
-uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+uv run streamlit run agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py
 ```

 The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research and production model selection decisions.

agents_mcp_usage/evaluations/__init__.py

Whitespace-only changes.

agents_mcp_usage/multi_mcp/eval_multi_mcp/README.md renamed to agents_mcp_usage/evaluations/mermaid_evals/README.md

Lines changed: 7 additions & 7 deletions
@@ -1,4 +1,4 @@
-# Multi-MCP Mermaid Diagram Evaluation System
+# Mermaid Diagram Evaluation System

 This directory contains evaluation modules for testing LLM agents on mermaid diagram fixing tasks using multiple MCP (Model Context Protocol) servers. The system evaluates how well language models can fix invalid mermaid diagrams while utilizing multiple external tools.

@@ -21,7 +21,7 @@ The system tests LLM agents on their ability to:

 The evaluation includes three test cases of increasing difficulty:
 1. **Easy** - Simple syntax errors in mermaid diagrams
-2. **Medium** - More complex structural issues
+2. **Medium** - More complex structural issues
 3. **Hard** - Advanced mermaid syntax problems

 ## Output Schema
@@ -164,26 +164,26 @@ Results are exported to CSV files with the following columns:

 ```bash
 # Run evaluation with default model
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py

 # Customize model and judge
 AGENT_MODEL="gemini-2.5-pro-preview-06-05" JUDGE_MODEL="gemini-2.0-flash" \
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py
 ```

 ### Multi-Model Evaluation

 ```bash
 # Run evaluation across multiple models
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py \
   --models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash" \
   --runs 5 \
   --parallel \
   --timeout 600 \
   --output-dir ./results

 # Sequential execution with custom judge
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/run_multi_evals.py \
+uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py \
   --models "gemini-2.5-pro-preview-06-05,claude-3-opus" \
   --runs 3 \
   --sequential \
@@ -247,4 +247,4 @@ The system implements robust error handling:
 - **pydantic-evals** - Evaluation framework and metrics
 - **logfire** - Logging and monitoring
 - **rich** - Console output and progress bars
-- **asyncio** - Asynchronous evaluation execution
+- **asyncio** - Asynchronous evaluation execution

agents_mcp_usage/evaluations/mermaid_evals/__init__.py

Whitespace-only changes.
File renamed without changes.

agents_mcp_usage/multi_mcp/eval_multi_mcp/dashboard_config.py renamed to agents_mcp_usage/evaluations/mermaid_evals/dashboard_config.py

Lines changed: 1 addition & 1 deletion
@@ -126,4 +126,4 @@

 # The default configuration to use when the dashboard starts.
 # You can change this to point to a different configuration.
-DEFAULT_CONFIG = MERBENCH_CONFIG
+DEFAULT_CONFIG = MERBENCH_CONFIG

agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py renamed to agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py

Lines changed: 4 additions & 4 deletions
@@ -31,7 +31,7 @@
 from pydantic_evals.evaluators import Evaluator, EvaluatorContext, LLMJudge
 from pydantic_evals.reporting import EvaluationReport

-from agents_mcp_usage.multi_mcp.mermaid_diagrams import (
+from agents_mcp_usage.evaluations.mermaid_evals.mermaid_diagrams import (
     invalid_mermaid_diagram_easy,
     invalid_mermaid_diagram_medium,
     invalid_mermaid_diagram_hard,
@@ -646,7 +646,7 @@ def get_timestamp_prefix() -> str:


 def write_mermaid_results_to_csv(
-    report: EvaluationReport, model: str, output_dir: str = "./mermaid_results"
+    report: EvaluationReport, model: str, output_dir: str = "./mermaid_eval_results"
 ) -> str:
     """Writes mermaid evaluation results with metrics to a CSV file.

@@ -750,7 +750,7 @@ async def run_evaluations(
     model: str = DEFAULT_MODEL,
     judge_model: str = DEFAULT_MODEL,
     export_csv: bool = True,
-    output_dir: str = "./mermaid_results",
+    output_dir: str = "./mermaid_eval_results",
 ) -> EvaluationReport:
     """Runs the evaluations on the mermaid diagram fixing task.

@@ -804,4 +804,4 @@ async def run_all():
         model=agent_model, judge_model=judge_model, export_csv=True
     )

-    asyncio.run(run_all())
+    asyncio.run(run_all())

agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py renamed to agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py

Lines changed: 4 additions & 4 deletions
@@ -8,10 +8,10 @@
 import re
 from pydantic import ValidationError

-from agents_mcp_usage.multi_mcp.eval_multi_mcp.dashboard_config import (
+from agents_mcp_usage.evaluations.mermaid_evals.dashboard_config import (
     DEFAULT_CONFIG,
 )
-from agents_mcp_usage.multi_mcp.eval_multi_mcp.schemas import DashboardConfig
+from agents_mcp_usage.evaluations.mermaid_evals.schemas import DashboardConfig

 # Load and validate the configuration
 try:
@@ -841,7 +841,7 @@ def main() -> None:

     # Cost configuration in sidebar
     st.sidebar.subheader("💰 Cost Configuration")
-    cost_file_path = os.path.join(os.path.dirname(__file__), "costs.csv")
+    cost_file_path = os.path.join(os.path.dirname(__file__), "costs.json")
     model_costs, friendly_names = load_model_costs(cost_file_path)
     available_models = sorted(df_initial["Model"].unique())

@@ -1033,4 +1033,4 @@ def main() -> None:


 if __name__ == "__main__":
-    main()
+    main()
