Add levels of difficulty to evals #3

andrewginns · 2025-05-22T17:20:35Z

This PR introduces structured levels of difficulty for mermaid diagram evaluation cases, expands and refines the evaluation logic, and adds a new standalone MCP server tool for mermaid diagram validation.

Problem:

Previously, all mermaid diagram evaluation cases sat at a single, implicit difficulty level, and the validation logic was tightly coupled to the evaluation code. This made it hard to test more nuanced LLM capabilities, extend evaluation coverage, or debug diagram validation issues. Validation also depended on subprocess logic and lacked a dedicated, reusable server tool.

Solution:

- Added explicit “easy”, “medium”, and “hard” levels for invalid mermaid diagram cases.
- Incorporated a new mcp_servers/mermaid_validator.py MCP server tool, leveraging mermaid-cli via npx and structured logging with loguru.
- Refactored evaluation code to use the new server tool, support more robust string cleaning, and improve logging.
- Updated dependencies in pyproject.toml and uv.lock to include loguru for advanced logging.
- Improved test case metadata and expanded the evaluation rubric.
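
As a rough illustration of the difficulty levels described above, the sketch below tags each repair case with a level and groups cases by it. It is not the code from this PR: `Difficulty`, `RepairCase`, `group_by_difficulty`, the case names, and the placeholder diagram strings are all hypothetical and use only the standard library.

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(str, Enum):
    """The three levels named in this PR."""
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass(frozen=True)
class RepairCase:
    """One evaluation case: an invalid diagram the model must repair."""
    name: str               # hypothetical case name
    difficulty: Difficulty
    invalid_diagram: str    # placeholder text, not a diagram from the repo


def group_by_difficulty(cases):
    """Return a dict mapping each Difficulty to the cases tagged with it."""
    grouped = {level: [] for level in Difficulty}
    for case in cases:
        grouped[case.difficulty].append(case)
    return grouped


# Placeholder cases; the real diagrams live in mermaid_diagrams.py.
CASES = [
    RepairCase("repair_easy", Difficulty.EASY, "<easy invalid diagram>"),
    RepairCase("repair_medium", Difficulty.MEDIUM, "<medium invalid diagram>"),
    RepairCase("repair_hard", Difficulty.HARD, "<hard invalid diagram>"),
]

if __name__ == "__main__":
    for level, cases in group_by_difficulty(CASES).items():
        print(level.value, [case.name for case in cases])
```

Keeping the level as metadata on each case, rather than in separate test suites, would let results be reported per level without duplicating the evaluation logic.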

Unlocks:

- Enables targeted and incremental evaluation of LLMs’ ability to repair mermaid diagrams of varying complexity.
- Facilitates debugging and extension of mermaid diagram validation as a reusable, standalone service.
- Allows for richer analysis and troubleshooting with enhanced logging and separation of concerns.

Detailed breakdown of changes:

agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
- Refactored to add “easy”, “medium”, and “hard” difficulty levels for invalid diagram test cases.
- Switched from subprocess-based validation to using the new MCP server’s validation tool.
- Improved diagram string cleaning and validation logic.
- Enhanced logging for evaluation steps.
agents_mcp_usage/multi_mcp/mermaid_diagrams.py
- Added new invalid_mermaid_diagram_medium definition.
- Clarified and reordered diagram definitions.
mcp_servers/mermaid_validator.py (new)
- Implements an MCP server exposing mermaid diagram validation as a tool.
- Uses mermaid-cli via npx, with robust cleanup, error handling, and loguru-based logging.
- Includes both tool and prompt endpoints for validation, as well as a resource for an example diagram.
pyproject.toml and uv.lock
- Added loguru as a dependency for structured logging.
- Updated lockfile to reflect new and updated dependencies.
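
The string cleaning and mermaid-cli validation steps listed above could look roughly like the following. This is a minimal sketch, not the contents of mcp_servers/mermaid_validator.py: the function names are hypothetical, the MCP tool registration and loguru logging are omitted, and it assumes that `mmdc` (the mermaid-cli binary, invoked here via `npx -p @mermaid-js/mermaid-cli mmdc`) exits with a non-zero status when a diagram fails to render.

```python
import subprocess
import tempfile
from pathlib import Path


def clean_diagram(text: str) -> str:
    """Strip surrounding whitespace and an optional Markdown code fence."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]           # drop an opening fence such as ```mermaid
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]          # drop the closing fence
    return "\n".join(lines).strip()


def build_command(input_path, output_path):
    """Build the mermaid-cli invocation; mmdc ships with @mermaid-js/mermaid-cli."""
    return [
        "npx", "-p", "@mermaid-js/mermaid-cli", "mmdc",
        "-i", str(input_path),
        "-o", str(output_path),
    ]


def validate_diagram(text: str, run=subprocess.run):
    """Return (is_valid, error_message) for a mermaid diagram string.

    `run` is injectable so the function can be exercised without Node installed.
    Assumption: a non-zero exit code from mmdc means the diagram is invalid.
    """
    diagram = clean_diagram(text)
    # The temporary directory is removed on exit, so no files are left behind.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "diagram.mmd"
        output = Path(tmp) / "diagram.svg"
        source.write_text(diagram, encoding="utf-8")
        result = run(build_command(source, output), capture_output=True, text=True)
    if result.returncode == 0:
        return True, ""
    return False, (result.stderr or "").strip()
```

Calling `validate_diagram("graph LR\n  A --> B")` with the default `run` requires Node.js and network access for npx; in an MCP server the same function body would sit behind a tool decorator.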

feat: Add levels of mermaid fixes to perform. Convert npx MCP to python for consistent invocation and custom MCP decorated functions. Add logging

d07902e

andrewginns merged commit 3eb1d03 into main May 22, 2025
1 check passed
