Add levels of difficulty to evals #3
Merged
This PR introduces structured levels of difficulty for mermaid diagram evaluation cases, expands and refines the evaluation logic, and adds a new standalone MCP server tool for mermaid diagram validation.
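As a rough illustration of what structured difficulty levels could look like, here is a minimal sketch. Only the name `invalid_mermaid_diagram_medium` comes from this PR; the `Difficulty` enum, the `DiagramCase` dataclass, the `easy`/`hard` level names, the other case names, and the placeholder diagram bodies are all hypothetical and are not taken from `mermaid_diagrams.py`.

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(str, Enum):
    # Hypothetical level names; the PR text only confirms that "medium" exists.
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass(frozen=True)
class DiagramCase:
    """One evaluation case: a named diagram tagged with a difficulty level."""
    name: str
    difficulty: Difficulty
    diagram: str


# Placeholder diagrams, not the real definitions from the repository.
CASES = [
    DiagramCase("invalid_mermaid_diagram_easy", Difficulty.EASY, "graph TD\n  A --> B"),
    DiagramCase("invalid_mermaid_diagram_medium", Difficulty.MEDIUM, "graph TD\n  A --> B --> C"),
    DiagramCase("invalid_mermaid_diagram_hard", Difficulty.HARD, "graph TD\n  A --> B --> C --> D"),
]


def cases_by_difficulty(cases, level):
    """Return only the cases tagged with the requested difficulty level."""
    return [case for case in cases if case.difficulty is level]


medium_cases = cases_by_difficulty(CASES, Difficulty.MEDIUM)
print([case.name for case in medium_cases])
```

Tagging each case explicitly, rather than relying on a naming convention, would let an eval run select or report on one level at a time.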
Problem:
Previously, all mermaid diagram evaluation cases were at a single (implicit) difficulty level, and validation logic was embedded and tightly coupled in one place. This made it hard to test more nuanced LLM capabilities, extend evaluation coverage, or debug diagram validation issues. Additionally, mermaid validation was dependent on subprocess logic and lacked a dedicated, reusable server tool.
Solution:
mcp_servers/mermaid_validator.py: MCP server tool, leveraging mermaid-cli via npx and structured logging with loguru.
pyproject.toml and uv.lock: updated to include loguru for advanced logging.
Unlocks:
Detailed breakdown of changes:
agents_mcp_usage/multi_mcp/eval_multi_mcp/evals_pydantic_mcp.py
agents_mcp_usage/multi_mcp/mermaid_diagrams.py
invalid_mermaid_diagram_medium definition.
mcp_servers/mermaid_validator.py (new)
mermaid-cli via npx, with robust cleanup, error handling, and loguru-based logging.
pyproject.toml and uv.lock
loguru as a dependency for structured logging.
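For reviewers who want a feel for the validation approach, the following is a minimal sketch of checking a diagram by rendering it with mermaid-cli through npx. It is not the code in `mcp_servers/mermaid_validator.py`: the function name `validate_mermaid`, the `ValidationResult` type, the injectable `run` parameter, and the timeout value are assumptions made for illustration, the MCP server wiring is omitted, and the standard-library `logging` module stands in for loguru so the example has no third-party dependencies.

```python
import logging
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path

logger = logging.getLogger("mermaid_validator_sketch")


@dataclass
class ValidationResult:
    is_valid: bool
    error: str = ""


def validate_mermaid(diagram: str, run=subprocess.run, timeout: float = 60.0) -> ValidationResult:
    """Treat a diagram as valid if mermaid-cli (mmdc) can render it to SVG.

    `run` is injectable (hypothetical design choice) so the function can be
    exercised without Node.js installed.
    """
    # TemporaryDirectory deletes the input and output files on exit,
    # including when an exception is raised inside the block.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "diagram.mmd"
        out = Path(tmp) / "diagram.svg"
        src.write_text(diagram, encoding="utf-8")
        cmd = ["npx", "-p", "@mermaid-js/mermaid-cli", "mmdc", "-i", str(src), "-o", str(out)]
        try:
            proc = run(cmd, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            logger.error("mmdc timed out after %ss", timeout)
            return ValidationResult(False, f"timed out after {timeout}s")
        except FileNotFoundError:
            logger.error("npx not found on PATH")
            return ValidationResult(False, "npx not found on PATH")
        if proc.returncode != 0:
            message = proc.stderr.strip() or "mmdc exited with a non-zero status"
            logger.warning("diagram rejected: %s", message)
            return ValidationResult(False, message)
        logger.info("diagram rendered successfully")
        return ValidationResult(True)
```

Calling `validate_mermaid("graph TD\n  A --> B")` with the default `run` requires Node.js and network access the first time npx fetches the package; passing a stub for `run` avoids both in unit tests.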