```bash
-uv run streamlit run agents_mcp_usage/multi_mcp/eval_multi_mcp/merbench_ui.py
+uv run streamlit run agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py
```

The evaluation system enables robust, repeatable benchmarking across LLM models and agent frameworks, supporting both research and production model selection decisions.
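To make the model-selection use case concrete, here is a minimal sketch of comparing models from an exported results CSV with pandas. The file name and the `model`/`score` column names are placeholders assumed for illustration; the real column set is the one documented under the README's Output Schema section.

```python
# Minimal sketch, not part of the repo: rank models from an exported results CSV.
# "mermaid_eval_results.csv", "model", and "score" are assumed placeholder names.
import pandas as pd

results = pd.read_csv("mermaid_eval_results.csv")  # assumed export file name

# Average score and run count per model, best first.
leaderboard = (
    results.groupby("model")["score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(leaderboard)
```
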
agents_mcp_usage/evaluations/mermaid_evals/README.md (7 additions & 7 deletions)
@@ -1,4 +1,4 @@
-# Multi-MCP Mermaid Diagram Evaluation System
+# Mermaid Diagram Evaluation System

This directory contains evaluation modules for testing LLM agents on mermaid diagram fixing tasks using multiple MCP (Model Context Protocol) servers. The system evaluates how well language models can fix invalid mermaid diagrams while utilizing multiple external tools.

@@ -21,7 +21,7 @@ The system tests LLM agents on their ability to:
The evaluation includes three test cases of increasing difficulty:
1. **Easy** - Simple syntax errors in mermaid diagrams
-2. **Medium** - More complex structural issues
+2. **Medium** - More complex structural issues
3. **Hard** - Advanced mermaid syntax problems

## Output Schema
@@ -164,26 +164,26 @@ Results are exported to CSV files with the following columns:

```bash
# Run evaluation with default model
-uv run agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
+uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py
```
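As a closing illustration of the task the suite evaluates, the sketch below shows the kind of before/after pair the "Easy" tier (simple syntax errors) might involve: a flowchart with one unclosed node bracket. Both diagrams and the single-line-fix framing are invented for this note and are not taken from the actual test cases.

```python
# Purely illustrative; these are not the evaluation suite's real test cases.
BROKEN = """flowchart TD
    A[Start] --> B[Validate diagram
    B --> C[Done]
"""

FIXED = """flowchart TD
    A[Start] --> B[Validate diagram]
    B --> C[Done]
"""

# An "easy"-style repair touches a single line (closing the bracket above).
changed = [(b, f) for b, f in zip(BROKEN.splitlines(), FIXED.splitlines()) if b != f]
print(changed)
```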