diff --git a/agents_mcp_usage/evaluations/mermaid_evals/README.md b/agents_mcp_usage/evaluations/mermaid_evals/README.md
index 5808c5c..553e033 100644
--- a/agents_mcp_usage/evaluations/mermaid_evals/README.md
+++ b/agents_mcp_usage/evaluations/mermaid_evals/README.md
@@ -1,35 +1,155 @@
 # Mermaid Diagram Evaluation System
 
-This directory contains evaluation modules for testing LLM agents on mermaid diagram fixing tasks using multiple MCP (Model Context Protocol) servers. The system evaluates how well language models can fix invalid mermaid diagrams while utilizing multiple external tools.
+A benchmarking system for evaluating LLM agents' ability to fix syntactically incorrect Mermaid diagrams using Model Context Protocol (MCP) servers.
+
+**Key Features:**
+- 🎯 Tests three difficulty levels of invalid Mermaid diagrams
+- 🔧 Utilises multiple MCP servers for diagram validation and syntax-error feedback
+- 📊 Comprehensive evaluation metrics and scoring
+- 🏆 Live leaderboard at https://andrew.ginns.uk/merbench
+
+## Architecture
+
+```mermaid
+flowchart TD
+    Runner(("run_multi_evals.py
or
evals_pydantic_mcp.py")) + Agent["Pydantic-AI Agent"] + MCP["MCP Servers"] + LLM["LLM Provider"] + Evaluators["Evaluators"] + CSV["CSV Results"] + Dashboard["merbench_ui.py
(Local Dashboard)"] + Script["preprocess_merbench_data.py"] + JSON["JSON for Merbench"] + + Runner --> Agent + Agent --> MCP + Agent --> LLM + Runner --> Evaluators + Evaluators --> CSV + CSV --> Dashboard + CSV -.-> Script + Script -.-> JSON + + style Dashboard fill:#9cf,stroke:#36f,stroke-width:2px +``` + +## Quick Overview + +* **Test cases**: 3 invalid Mermaid diagrams (easy/medium/hard) +* **Success metric**: Diagram syntax validity (100% = all cases pass) +* **Evaluation runners**: + - `evals_pydantic_mcp.py` - Single-model evaluation + - `run_multi_evals.py` - Multi-model parallel evaluation +* **Results visualisation**: + - `merbench_ui.py` - Local Streamlit dashboard (recommended) + - CSV → JSON → Public web dashboard (optional) +* **Performance scoring**: Based solely on diagram validity (LLMJudges implemented but not scored) + +## Prerequisites & Setup + +### Requirements +- Python 3.11+ +- [`uv`](https://github.com/astral-sh/uv) for dependency management + +### Installation +```bash +# Install all dependencies +uv sync +``` + +### Configuration +The evaluation system uses Pydantic-AI and requires environment variables for API keys: +```bash +# Required for most models +export GEMINI_API_KEY="your-key-here" -## Overview +# Optional for specific models +export OPENAI_API_KEY="your-key-here" +``` -The evaluation system consists of two main components: +## Quick Start Guide -1. **`evals_pydantic_mcp.py`** - Core evaluation module for single-model testing -2. **`run_multi_evals.py`** - Multi-model evaluation runner with parallel execution +### 1. Run a Single-Model Evaluation +```bash +# Test a specific model (defaults to gemini-2.5-flash) +uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py +``` -## Evaluation Task +### 2. 
Run Multi-Model Benchmarking +```bash +# Run evaluation across multiple models in parallel +uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py +``` -The system tests LLM agents on their ability to: -- Fix syntactically invalid mermaid diagrams -- Use both MCP servers (example server for time, mermaid validator for validation) -- Handle errors gracefully with proper categorization -- Provide meaningful failure reasons for debugging +### 3. View Results in Local Dashboard (Recommended) +```bash +# Launch the interactive Streamlit dashboard to visualise your results +make leaderboard -### Test Cases +# Or run directly: +uv run streamlit run agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py +``` -The evaluation includes three test cases of increasing difficulty: -1. **Easy** - Simple syntax errors in mermaid diagrams -2. **Medium** - More complex structural issues -3. **Hard** - Advanced mermaid syntax problems +The local dashboard (`merbench_ui.py`) provides: +- 📊 Interactive leaderboards and performance metrics +- 📈 Pareto frontier visualisations (performance vs cost/tokens) +- 🔍 Deep dive analysis with success rates and failure patterns +- 💰 Configurable model costs +- 🎯 Advanced filtering by provider, model, and test case -## Output Schema +### 4. Export Results for Web Dashboard (Optional) +```bash +# Convert CSV results to JSON format for the public Merbench website +uv run agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py \ + mermaid_eval_results/_combined_results.csv \ + agents_mcp_usage/evaluations/mermaid_evals/results/_processed.json +``` -### MermaidOutput +## Evaluation Task & Test Cases + +The system challenges LLM agents to: +1. **Identify and fix syntax errors** in invalid Mermaid diagrams +2. **Utilise MCP servers** for validation and enhancement +3. **Handle errors gracefully** with proper categorisation +4. 
**Provide meaningful failure reasons** for debugging + +### Test Case Difficulty Levels + +1. **Easy** (2 syntax errors) + - Incorrect node references (`GEMINI` vs `GEM`) + - Minor naming inconsistencies + - Example: `MCP --> GEMINI` where `GEMINI` is undefined + +2. **Medium** (7 syntax errors) + - Invalid comment syntax (using `#` instead of `%%`) + - Malformed arrows with spaces (`-- >` instead of `-->`) + - Direction inconsistencies (`TB` vs `TD`) + - Example: `LG -- > MCP` has invalid spacing in arrow + +3. **Hard** (Complex structural errors) + - Multiple comment syntax errors + - Arrow spacing issues throughout + - Complex nested subgraph problems + - Requires understanding of overall diagram structure + +### Example Error Types +```mermaid +%% Invalid examples from actual test cases: +graph LR + %% Error 1: Space in arrow + A -- > B %% Should be: A --> B + + %% Error 2: Undefined node reference + C --> UNDEFINED_NODE %% Node doesn't exist + + %% Error 3: Wrong comment syntax + # This is wrong %% Should use %% +``` -The main output schema captures comprehensive information about each evaluation: +## Output Schema & Metrics +### MermaidOutput Schema ```python class MermaidOutput(BaseModel): fixed_diagram: str # The corrected mermaid diagram @@ -38,213 +158,157 @@ class MermaidOutput(BaseModel): tools_used: List[str] = [] # Which MCP tools were called ``` -### Metrics Captured - -The system automatically captures detailed usage metrics: - +### Captured Metrics - **`requests`** - Number of API requests made - **`request_tokens`** - Total tokens in requests - **`response_tokens`** - Total tokens in responses - **`total_tokens`** - Sum of request and response tokens - **`details`** - Additional model-specific usage details -## Failure Reasons - -The system provides meaningful failure categorization for debugging and analysis: +## Evaluation Criteria -### Agent-Level Failures (from `fix_mermaid_diagram`) +### 1. 
**MermaidDiagramValid** (Primary Score) +- **Weight**: 100% of final score +- **Purpose**: Validates diagram syntax using MCP server +- **Score**: 1.0 (valid) or 0.0 (invalid) -- **`usage_limit_exceeded`** - Agent hit configured usage limits -- **`response_validation_failed`** - Agent response failed Pydantic validation -- **`agent_timeout`** - Agent operation timed out -- **`http_error_{status_code}`** - HTTP errors (e.g., `http_error_502`, `http_error_503`) -- **`timeout_error`** - General timeout errors -- **`connection_error`** - Network/connection issues -- **`rate_limit_error`** - API rate limiting or quota exceeded -- **`error_{ExceptionType}`** - Other specific exceptions (fallback) - -### Evaluation-Level Failures (from `run_multi_evals`) - -- **`evaluation_timeout`** - Entire evaluation run timed out -- **`evaluation_validation_failed`** - Evaluation framework validation error -- **`model_api_error`** - Model API-specific errors -- **`network_error`** - Network-related evaluation failures -- **`evaluation_error_{ExceptionType}`** - Other evaluation framework errors - -## Evaluators - -The system uses five different evaluators to assess performance: - -### 1. UsedBothMCPTools -- **Score**: 0.0, 0.5, or 1.0 -- **Purpose**: Checks if the agent used tools from both MCP servers -- **Scoring**: - - 1.0: Used both example server and mermaid validator - - 0.5: Used only one MCP server - - 0.0: Used no MCP tools or only non-MCP tools - -### 2. UsageLimitNotExceeded -- **Score**: 0.0 or 1.0 -- **Purpose**: Detects if the case failed due to usage limits -- **Scoring**: - - 1.0: No usage limit failure - - 0.0: Failed due to `usage_limit_exceeded` - -### 3. MermaidDiagramValid -- **Score**: 0.0 or 1.0 -- **Purpose**: Validates the fixed diagram using the mermaid validator MCP server -- **Features**: - - Skips validation if there was a prior failure - - Strips markdown formatting and backticks - - Uses retry logic for transient validation errors +### 2. 
**UsedBothMCPTools** +- **Purpose**: Tracks MCP server utilisation - **Scoring**: - - 1.0: Diagram passes mermaid syntax validation - - 0.0: Diagram is invalid or validation failed + - 1.0: Used both servers (example + validator) + - 0.5: Used only one server + - 0.0: No MCP tool usage -### 4. LLMJudge (Format Check) -- **Score**: 0.0 to 1.0 (continuous) -- **Purpose**: Evaluates if response contains only a mermaid diagram -- **Rubric**: "The response only contains a mermaid diagram inside the fixed_diagram field, no other text" +### 3. **UsageLimitNotExceeded** +- **Purpose**: Detects token limit failures +- **Score**: 1.0 (success) or 0.0 (limit exceeded) -### 5. LLMJudge (Structure Check) -- **Score**: 0.0 to 1.0 (continuous) -- **Purpose**: Evaluates if the fixed diagram maintains original structure and intent -- **Rubric**: "The fixed_diagram field should maintain the same overall structure and intent as the expected output diagram while fixing any syntax errors" +### 4. **LLMJudge Evaluators** (Not scored) +- **Format Check**: Response contains only Mermaid diagram +- **Structure Check**: Maintains original diagram intent -## Retry Logic +## MCP Servers -The system includes robust retry logic for handling transient API failures: +### 1. Example Server (`mcp_servers/example_server.py`) +- **Purpose**: Provides utility tools +- **Features**: Time functions for adding timestamps -### Retryable Errors -- HTTP status codes: 429, 500, 502, 503, 504 -- Connection errors and network issues -- General `OSError` exceptions +### 2. 
Mermaid Validator (`mcp_servers/mermaid_validator.py`) +- **Purpose**: Validates Mermaid syntax +- **Features**: Returns detailed error messages -### Retry Configuration -- **Max attempts**: 3 -- **Base delay**: 1 second -- **Exponential backoff**: 1s → 2s → 4s -- **Max delay**: 30 seconds -- **Jitter**: ±50% randomization to prevent thundering herd +## Error Handling -### Non-Retryable Errors -- HTTP 4xx errors (except 429) -- Validation errors -- Authentication errors +### Failure Categories -## CSV Output Format +**Agent-level failures:** +- `usage_limit_exceeded` - Token limit reached +- `response_validation_failed` - Invalid output format +- `agent_timeout` - Execution timeout +- `http_error_*` - HTTP status errors +- `connection_error` - Network issues -Results are exported to CSV files with the following columns: +**Evaluation-level failures:** +- `evaluation_timeout` - Overall timeout +- `model_api_error` - API authentication/access +- `evaluation_error_{ExceptionType}` - Unexpected errors -### Basic Information -- **Model** - LLM model used -- **Run** - Run number (for multi-run evaluations) -- **Case** - Test case name (easy/medium/hard) -- **Duration** - Task execution time in seconds -- **Fixed_Diagram_Length** - Length of the output diagram -- **Failure_Reason** - Categorized failure reason (if any) -- **Tools_Used** - Pipe-separated list of MCP tools used +### Retry Strategy +- **Attempts**: 3 retries for transient errors +- **Backoff**: Exponential (1s → 2s → 4s) +- **Jitter**: ±50% randomisation +- **Max delay**: 30 seconds +- **Retryable**: HTTP 429/5xx, network errors, OSError +- **Non-retryable**: HTTP 4xx (except 429), validation errors -### Evaluator Scores -- **Score_UsedBothMCPTools** - MCP tool usage score -- **Score_UsageLimitNotExceeded** - Usage limit check score -- **Score_MermaidDiagramValid** - Diagram validity score -- **Score_LLMJudge** - Format evaluation scores (2 columns) +## Output Files -### Metrics -- **Metric_requests** - 
Number of API requests -- **Metric_request_tokens** - Input token count -- **Metric_response_tokens** - Output token count -- **Metric_total_tokens** - Total token usage -- **Metric_details** - Additional usage details +### Directory Structure +``` +mermaid_eval_results/ # Default output directory +├── YYYY-MM-DD_HH-MM-SS_mermaid_results_{model}.csv # Single model +├── YYYY-MM-DD_HH-MM-SS_individual_{model}.csv # Multi-model individual +└── YYYY-MM-DD_HH-MM-SS_combined_results.csv # Multi-model combined +``` -## Usage +The local dashboard (`merbench_ui.py`) automatically detects and loads these CSV files from the `mermaid_eval_results/` directory. -### Single Model Evaluation +### File Contents +- Test case details and inputs +- Fixed diagrams and validation results +- Performance metrics and scores +- Error messages and failure reasons -```bash -# Run evaluation with default model -uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py +## Monitoring & Debugging -# Customize model and judge -AGENT_MODEL="gemini-2.5-pro-preview-06-05" JUDGE_MODEL="gemini-2.0-flash" \ -uv run agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py -``` +All evaluation runs are traced with **Logfire** for comprehensive monitoring: +- Tool call traces +- Retry attempts and reasons +- Execution durations +- Categorised failure analysis -### Multi-Model Evaluation +## Troubleshooting -```bash -# Run evaluation across multiple models -uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py \ - --models "gemini-2.5-pro-preview-06-05,gemini-2.0-flash" \ - --runs 5 \ - --parallel \ - --timeout 600 \ - --output-dir ./results - -# Sequential execution with custom judge -uv run agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py \ - --models "gemini-2.5-pro-preview-06-05,claude-3-opus" \ - --runs 3 \ - --sequential \ - --judge-model "gemini-2.5-pro-preview-06-05" \ - --output-dir ./eval_results -``` +### Common Issues -### Available Options +1. 
**API Key Errors** + - Ensure environment variables are set correctly + - Check API key permissions and quotas -- **`--models`** - Comma-separated list of models to evaluate -- **`--runs`** - Number of evaluation runs per model (default: 3) -- **`--judge-model`** - Model for LLM judging (default: gemini-2.5-pro-preview-06-05) -- **`--parallel`** - Run evaluations in parallel (default: true) -- **`--sequential`** - Force sequential execution -- **`--timeout`** - Timeout in seconds per evaluation run (default: 600) -- **`--output-dir`** - Directory to save results (default: ./mermaid_eval_results) +2. **MCP Server Connection Issues** + - Verify MCP servers are properly installed + - Check server process logs -## MCP Servers +3. **Validation Failures** + - Review the fixed diagram for syntax errors + - Check Mermaid version compatibility -The evaluation uses two MCP servers: +4. **Performance Issues** + - Reduce parallel execution for rate limits + - Monitor token usage per model -1. **Example Server** (`mcp_servers/example_server.py`) - - Provides time-related tools - - Used to add timestamps to diagrams +## Advanced Usage -2. **Mermaid Validator** (`mcp_servers/mermaid_validator.py`) - - Validates mermaid diagram syntax - - Returns validation results with error details +### Custom Model Configuration -## Output Files +You can customise model parameters when running evaluations: -### Single Model -- `YYYY-MM-DD_HH-MM-SS_mermaid_results_{model}.csv` +```bash +# Single model +# Edit `evals_pydantic_mcp.py` to customise the `agent_model` string -### Multi-Model -- `YYYY-MM-DD_HH-MM-SS_individual_{model}.csv` - Per-model results -- `YYYY-MM-DD_HH-MM-SS_combined_results.csv` - All models combined +# Multi-model with specific model list +# Edit `run_multi_evals.py` to customise the `MODELS` list +``` -## Logging and Monitoring +### Running Specific Test Cases -The system integrates with Logfire for comprehensive monitoring: +To test individual cases during development: +1. 
Import specific test cases from `mermaid_diagrams.py` +2. Modify the test loop in `evals_pydantic_mcp.py` +3. Focus on debugging specific error types -- **Agent operations** - MCP server interactions, tool usage -- **Retry attempts** - Failure reasons, backoff delays -- **Evaluation progress** - Success rates, timing metrics -- **Error categorization** - Detailed failure analysis +### Adding New Test Cases -## Error Handling Best Practices +1. Add your invalid diagram to `mermaid_diagrams.py` +2. Document the specific errors it contains +3. Ensure it has a corresponding valid version +4. Update the difficulty categorisation -The system implements robust error handling: +## Contributing -1. **Graceful degradation** - Partial results rather than complete failure -2. **Meaningful categorization** - Specific failure reasons for debugging -3. **Retry logic** - Automatic recovery from transient issues -4. **Comprehensive logging** - Full context for error analysis -5. **Resource cleanup** - Proper MCP server lifecycle management +When adding new test cases or evaluators: +1. Follow the existing schema patterns +2. Include comprehensive error handling +3. Add appropriate retry logic +4. Document failure categories +5. Test with multiple models before submitting +6. 
Update costs.json with new model pricing if needed -## Dependencies +## Related Resources -- **pydantic-ai** - Core agent framework with MCP support -- **pydantic-evals** - Evaluation framework and metrics -- **logfire** - Logging and monitoring -- **rich** - Console output and progress bars -- **asyncio** - Asynchronous evaluation execution \ No newline at end of file +- [Mermaid Documentation](https://mermaid.js.org/) +- [MCP Protocol Specification](https://github.com/anthropics/mcp) +- [Pydantic-AI Documentation](https://github.com/pydantic/pydantic-ai) +- [Streamlit Documentation](https://streamlit.io/) (for local dashboard) +- [Public Leaderboard](https://andrew.ginns.uk/merbench) diff --git a/agents_mcp_usage/evaluations/mermaid_evals/costs.json b/agents_mcp_usage/evaluations/mermaid_evals/costs.json index df86d38..b3f7e0a 100644 --- a/agents_mcp_usage/evaluations/mermaid_evals/costs.json +++ b/agents_mcp_usage/evaluations/mermaid_evals/costs.json @@ -8,151 +8,427 @@ "gemini-2.5-pro-preview-03-25": { "friendly_name": "Gemini 2.5 Pro Preview (Mar)", "input": [ - {"up_to": 200000, "price": 1.25}, - {"up_to": "inf", "price": 2.50} + { + "up_to": 200000, + "price": 1.25 + }, + { + "up_to": "inf", + "price": 2.5 + } ], "output": { "default": [ - {"up_to": 200000, "price": 10.00}, - {"up_to": "inf", "price": 15.00} + { + "up_to": 200000, + "price": 10.0 + }, + { + "up_to": "inf", + "price": 15.0 + } ] } }, "gemini-2.5-pro-preview-05-06": { "friendly_name": "Gemini 2.5 Pro Preview (May)", "input": [ - {"up_to": 200000, "price": 1.25}, - {"up_to": "inf", "price": 2.50} + { + "up_to": 200000, + "price": 1.25 + }, + { + "up_to": "inf", + "price": 2.5 + } ], "output": { "default": [ - {"up_to": 200000, "price": 10.00}, - {"up_to": "inf", "price": 15.00} + { + "up_to": 200000, + "price": 10.0 + }, + { + "up_to": "inf", + "price": 15.0 + } ] } }, "gemini-2.5-pro-preview-06-05": { - "friendly_name": "Gemini 2.5 Pro Preview (Jun)", + "friendly_name": "Gemini 2.5 Pro", 
"input": [ - {"up_to": 200000, "price": 1.25}, - {"up_to": "inf", "price": 2.50} + { + "up_to": 200000, + "price": 1.25 + }, + { + "up_to": "inf", + "price": 2.5 + } ], "output": { "default": [ - {"up_to": 200000, "price": 10.00}, - {"up_to": "inf", "price": 15.00} + { + "up_to": 200000, + "price": 10.0 + }, + { + "up_to": "inf", + "price": 15.0 + } ] } }, "gemini-2.5-pro-preview": { "friendly_name": "Gemini 2.5 Pro Preview", "input": [ - {"up_to": 200000, "price": 1.25}, - {"up_to": "inf", "price": 2.50} + { + "up_to": 200000, + "price": 1.25 + }, + { + "up_to": "inf", + "price": 2.5 + } ], "output": { "default": [ - {"up_to": 200000, "price": 10.00}, - {"up_to": "inf", "price": 15.00} + { + "up_to": 200000, + "price": 10.0 + }, + { + "up_to": "inf", + "price": 15.0 + } ] } }, "gemini-1.5-pro": { "friendly_name": "Gemini 1.5 Pro", "input": [ - {"up_to": 128000, "price": 1.25}, - {"up_to": "inf", "price": 2.50} + { + "up_to": 128000, + "price": 1.25 + }, + { + "up_to": "inf", + "price": 2.5 + } ], "output": { "default": [ - {"up_to": 128000, "price": 5.00}, - {"up_to": "inf", "price": 10.00} + { + "up_to": 128000, + "price": 5.0 + }, + { + "up_to": "inf", + "price": 10.0 + } ] } }, "gemini-1.5-flash": { "friendly_name": "Gemini 1.5 Flash", "input": [ - {"up_to": 128000, "price": 0.075}, - {"up_to": "inf", "price": 0.15} + { + "up_to": 128000, + "price": 0.075 + }, + { + "up_to": "inf", + "price": 0.15 + } ], "output": { "default": [ - {"up_to": 128000, "price": 0.30}, - {"up_to": "inf", "price": 0.60} + { + "up_to": 128000, + "price": 0.3 + }, + { + "up_to": "inf", + "price": 0.6 + } ] } }, "gemini-2.0-flash": { "friendly_name": "Gemini 2.0 Flash", - "input": [{"up_to": "inf", "price": 0.10}], - "output": {"default": [{"up_to": "inf", "price": 0.40}]} + "input": [ + { + "up_to": "inf", + "price": 0.1 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 0.4 + } + ] + } }, "gemini-2.5-flash-preview-04-17": { "friendly_name": "Gemini 2.5 Flash 
Preview (Apr)", - "input": [{"up_to": "inf", "price": 0.15}], + "input": [ + { + "up_to": "inf", + "price": 0.15 + } + ], "output": { - "non_thinking": [{"up_to": "inf", "price": 0.60}], - "thinking": [{"up_to": "inf", "price": 3.50}] + "non_thinking": [ + { + "up_to": "inf", + "price": 0.6 + } + ], + "thinking": [ + { + "up_to": "inf", + "price": 3.5 + } + ] } }, "gemini-2.5-flash-preview": { "friendly_name": "Gemini 2.5 Flash Preview", - "input": [{"up_to": "inf", "price": 0.15}], + "input": [ + { + "up_to": "inf", + "price": 0.15 + } + ], "output": { - "non_thinking": [{"up_to": "inf", "price": 0.60}], - "thinking": [{"up_to": "inf", "price": 3.50}] + "non_thinking": [ + { + "up_to": "inf", + "price": 0.6 + } + ], + "thinking": [ + { + "up_to": "inf", + "price": 3.5 + } + ] + } + }, + "gemini-2.5-flash": { + "friendly_name": "Gemini 2.5 Flash", + "input": [ + { + "up_to": 200000, + "price": 0.15 + }, + { + "up_to": "inf", + "price": 0.3 + } + ], + "output": { + "default": [ + { + "up_to": 200000, + "price": 1.25 + }, + { + "up_to": "inf", + "price": 2.5 + } + ] + } + }, + "gemini-2.5-flash-lite-preview-06-17": { + "friendly_name": "Gemini 2.5 Flash Lite Preview (Jun)", + "input": [ + { + "up_to": "inf", + "price": 0.1 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 0.4 + } + ] } }, "openai:o4-mini": { "friendly_name": "OpenAI o4-mini", - "input": [{"up_to": "inf", "price": 1.10}], - "output": {"default": [{"up_to": "inf", "price": 4.40}]} + "input": [ + { + "up_to": "inf", + "price": 1.1 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 4.4 + } + ] + } }, "openai:o3": { "friendly_name": "OpenAI o3", - "input": [{"up_to": "inf", "price": 10.00}], - "output": {"default": [{"up_to": "inf", "price": 40.00}]} + "input": [ + { + "up_to": "inf", + "price": 10.0 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 40.0 + } + ] + } }, "openai:gpt-4.1": { "friendly_name": "GPT-4.1", - "input": [{"up_to": "inf", 
"price": 2.00}], - "output": {"default": [{"up_to": "inf", "price": 8.00}]} + "input": [ + { + "up_to": "inf", + "price": 2.0 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 8.0 + } + ] + } }, "openai:gpt-4.1-mini": { "friendly_name": "GPT-4.1 Mini", - "input": [{"up_to": "inf", "price": 0.40}], - "output": {"default": [{"up_to": "inf", "price": 1.60}]} + "input": [ + { + "up_to": "inf", + "price": 0.4 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 1.6 + } + ] + } }, "openai:gpt-4.1-nano": { "friendly_name": "GPT-4.1 Nano", - "input": [{"up_to": "inf", "price": 0.10}], - "output": {"default": [{"up_to": "inf", "price": 0.40}]} + "input": [ + { + "up_to": "inf", + "price": 0.1 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 0.4 + } + ] + } }, "bedrock:us.anthropic.claude-sonnet-4-20250514-v1:0": { "friendly_name": "Claude 4 Sonnet", - "input": [{"up_to": "inf", "price": 3.00}], - "output": {"default": [{"up_to": "inf", "price": 15.00}]} + "input": [ + { + "up_to": "inf", + "price": 3.0 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 15.0 + } + ] + } }, "bedrock:us.anthropic.claude-opus-4-20250514-v1:0": { "friendly_name": "Claude 4 Opus", - "input": [{"up_to": "inf", "price": 15.00}], - "output": {"default": [{"up_to": "inf", "price": 75.00}]} + "input": [ + { + "up_to": "inf", + "price": 15.0 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 75.0 + } + ] + } }, "bedrock:us.anthropic.claude-3-7-sonnet-20250219-v1:0": { "friendly_name": "Claude 3.7 Sonnet", - "input": [{"up_to": "inf", "price": 3.00}], - "output": {"default": [{"up_to": "inf", "price": 15.00}]} + "input": [ + { + "up_to": "inf", + "price": 3.0 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 15.0 + } + ] + } }, "bedrock:us.anthropic.claude-3-5-sonnet-20240620-v1:0": { "friendly_name": "Claude 3.5 Sonnet", - "input": [{"up_to": "inf", "price": 3.00}], - "output": 
{"default": [{"up_to": "inf", "price": 15.00}]} + "input": [ + { + "up_to": "inf", + "price": 3.0 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 15.0 + } + ] + } }, "bedrock:us.anthropic.claude-3-5-haiku-20241022-v1:0": { "friendly_name": "Claude 3.5 Haiku", - "input": [{"up_to": "inf", "price": 1.00}], - "output": {"default": [{"up_to": "inf", "price": 4.00}]} + "input": [ + { + "up_to": "inf", + "price": 1.0 + } + ], + "output": { + "default": [ + { + "up_to": "inf", + "price": 4.0 + } + ] + } } } } \ No newline at end of file diff --git a/agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py b/agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py index f596298..52f6baf 100644 --- a/agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py +++ b/agents_mcp_usage/evaluations/mermaid_evals/evals_pydantic_mcp.py @@ -796,7 +796,7 @@ async def fix_with_model(inputs: MermaidInput) -> MermaidOutput: # agent_model = os.getenv("AGENT_MODEL", DEFAULT_MODEL) # agent_model = "gemini-2.5-pro-preview-06-05" # agent_model = "openai:o4-mini" - agent_model = "gemini-2.5-flash-preview-04-17" + agent_model = "gemini-2.5-flash" judge_model = os.getenv("JUDGE_MODEL", DEFAULT_MODEL) async def run_all(): diff --git a/agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py b/agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py index bfbf855..bb5a1c8 100644 --- a/agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py +++ b/agents_mcp_usage/evaluations/mermaid_evals/merbench_ui.py @@ -94,7 +94,7 @@ def _convert_inf_strings(data): def find_all_combined_results_csvs(directory_path: str) -> list[str]: - """Finds all '*_combined_results.csv' files, sorted by modification time. + """Finds all '*_results.csv' files, sorted by modification time. Args: directory_path: The path to the directory to search. 
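The docstring and search-pattern change above widens the glob the dashboard uses to discover result CSVs, from `*_combined_results.csv` to `*_results.csv`. A minimal sketch of the practical difference, using illustrative filenames (not actual output of a run):

```python
import fnmatch

# Hypothetical filenames in a results directory, for illustration only
files = [
    "2025-06-01_12-00-00_combined_results.csv",
    "2025-06-01_12-00-00_single_model_results.csv",
    "costs.json",
]

# Old pattern: only the combined multi-model export is picked up
old_matches = [f for f in files if fnmatch.fnmatch(f, "*_combined_results.csv")]

# New pattern: any CSV ending in "_results.csv" is picked up,
# so single-model result files are loaded as well
new_matches = [f for f in files if fnmatch.fnmatch(f, "*_results.csv")]

print(old_matches)  # ['2025-06-01_12-00-00_combined_results.csv']
print(new_matches)  # both *_results.csv files; costs.json is excluded
```

The sorting by `os.path.getmtime` in `find_all_combined_results_csvs` is unchanged; only the set of candidate files grows.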
@@ -105,7 +105,7 @@ def find_all_combined_results_csvs(directory_path: str) -> list[str]: if not os.path.isdir(directory_path): return [] try: - search_pattern = os.path.join(directory_path, "*_combined_results.csv") + search_pattern = os.path.join(directory_path, "*_results.csv") files = glob.glob(search_pattern) return sorted(files, key=os.path.getmtime, reverse=True) except Exception as e: @@ -782,6 +782,33 @@ def create_cost_breakdown_plot( return fig +def extract_provider_from_model_name(model_name: str) -> str: + """Extract provider from model name based on common patterns. + + Args: + model_name: The model name string + + Returns: + The provider name + """ + if model_name.startswith("gemini-"): + return "Google" + elif model_name.startswith("openai:"): + return "OpenAI" + elif model_name.startswith("bedrock:"): + if "claude" in model_name: + return "Anthropic (Bedrock)" + return "Amazon Bedrock" + elif model_name.startswith("claude-"): + return "Anthropic" + elif "claude" in model_name.lower(): + return "Anthropic" + elif "gpt" in model_name.lower(): + return "OpenAI" + else: + return "Other" + + def main() -> None: """The main Streamlit application entrypoint.""" eval_config = EVAL_CONFIG # Use the validated config @@ -824,6 +851,9 @@ def main() -> None: st.error("No data loaded. 
             Please check the selected files.")
         return
 
+    # Add provider column to the dataframe
+    df_initial["provider"] = df_initial["Model"].apply(extract_provider_from_model_name)
+
     # Grouping filter
     grouping_config = eval_config.grouping
     st.sidebar.subheader(f"🎯 {grouping_config.label} Filter")
@@ -839,6 +869,39 @@ def main() -> None:
         default=available_groups,
     )
 
+    # Model filtering section
+    st.sidebar.subheader("🤖 Model Filters")
+
+    # Provider filter
+    available_providers = sorted(df_initial["provider"].unique())
+    selected_providers = st.sidebar.multiselect(
+        "Filter by provider:",
+        options=available_providers,
+        default=available_providers,
+        help="Select one or more providers to filter models"
+    )
+
+    # Filter models based on selected providers first
+    df_provider_filtered = df_initial[df_initial["provider"].isin(selected_providers)]
+    available_models = sorted(df_provider_filtered["Model"].unique())
+
+    # Advanced filters in expander
+    with st.sidebar.expander("⚙️ Advanced Filters", expanded=False):
+        # Individual model selection
+        selected_models = st.multiselect(
+            "Select specific models:",
+            options=available_models,
+            default=available_models,
+            help="Select individual models to include in the analysis",
+            key="model_selection"
+        )
+
+    # Get selected models from session state or use all available models
+    if "model_selection" in st.session_state:
+        selected_models = st.session_state.model_selection
+    else:
+        selected_models = available_models
+
     # Cost configuration in sidebar
     st.sidebar.subheader("💰 Cost Configuration")
     cost_file_path = os.path.join(os.path.dirname(__file__), "costs.json")
@@ -902,7 +965,14 @@ def main() -> None:
     final_cost_config = cost_config.copy()
     final_cost_config.update(user_cost_override)
 
-    df = process_data(df_initial, final_cost_config, eval_config)
+    # Apply model filter before processing data
+    df_model_filtered = df_initial[df_initial["Model"].isin(selected_models)]
+
+    if df_model_filtered.empty:
+        st.warning("No data available for the selected models. Please adjust your filters.")
+        return
+
+    df = process_data(df_model_filtered, final_cost_config, eval_config)
 
     # --- Main Panel ---
     st.header("📊 Overview")
@@ -914,9 +984,19 @@ def main() -> None:
     cols[2].metric("Test Cases", df[grouping_config.column].nunique())
     cols[3].metric("Files Loaded", len(selected_files))
 
-    st.info(
-        f"**Showing averaged results for {grouping_config.label.lower()}:** {', '.join(selected_groups) if selected_groups else 'None'}"
-    )
+    # Filter status information
+    filter_info = []
+    if selected_groups:
+        filter_info.append(f"**{grouping_config.label}:** {', '.join(selected_groups)}")
+    if len(selected_providers) < len(available_providers):
+        filter_info.append(f"**Providers:** {', '.join(selected_providers)}")
+    if len(selected_models) < len(available_models):
+        filter_info.append(f"**Models:** {len(selected_models)} of {len(available_models)} selected")
+
+    if filter_info:
+        st.info("🔍 Active filters: " + " | ".join(filter_info))
+    else:
+        st.info("📊 Showing all available data")
 
     # --- Leaderboard & Pareto ---
     st.header("🏅 Leaderboard")
@@ -964,6 +1044,7 @@ def main() -> None:
             df, selected_groups, x_axis_mode, eval_config.model_dump(), friendly_names
         ),
         use_container_width=True,
+        key="pareto_frontier_plot"
     )
 
     # --- Deep Dive Analysis ---
@@ -997,6 +1078,7 @@ def main() -> None:
                     df, selected_groups, eval_config.model_dump()
                 ),
                 use_container_width=True,
+                key="success_rates_plot"
             )
     if "Failure Analysis" in tab_map:
         with tab_map["Failure Analysis"]:
@@ -1005,6 +1087,7 @@ def main() -> None:
                     df, selected_groups, eval_config.model_dump()
                 ),
                 use_container_width=True,
+                key="failure_analysis_plot"
             )
     if "Resource Usage" in tab_map:
         with tab_map["Resource Usage"]:
@@ -1014,6 +1097,7 @@ def main() -> None:
                     df, selected_groups, eval_config.model_dump()
                 ),
                 use_container_width=True,
+                key="token_breakdown_plot"
             )
             if "cost_breakdown" in active_plots:
                 st.plotly_chart(
@@ -1021,6 +1105,7 @@ def main() -> None:
                     df, selected_groups, eval_config.model_dump()
                 ),
                 use_container_width=True,
+                key="cost_breakdown_plot"
             )
     if "Raw Data" in tab_map:
         with tab_map["Raw Data"]:
diff --git a/agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py b/agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py
index cdb8b01..42b0cbb 100644
--- a/agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py
+++ b/agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py
@@ -44,12 +44,11 @@ load_dotenv()
 
 DEFAULT_MODELS = [
-    "gemini-2.5-pro-preview-06-05",
-    "gemini-2.5-pro-preview-05-06",
-    "gemini-2.5-pro-preview-03-25",
-    "gemini-2.5-pro",
-    # "gemini-2.5-flash",
-    # "gemini-2.5-flash-preview-04-17",
+    # "gemini-2.5-pro-preview-06-05",
+    # "gemini-2.5-pro-preview-05-06",
+    # "gemini-2.5-pro-preview-03-25",
+    # "gemini-2.0-flash",
+    "gemini-2.5-flash",
     # "openai:o4-mini",
     # "openai:gpt-4.1",
     # "openai:gpt-4.1-mini",
diff --git a/agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py b/agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py
new file mode 100644
index 0000000..844f98d
--- /dev/null
+++ b/agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py
@@ -0,0 +1,182 @@
+#!/usr/bin/env python3
+import pandas as pd
+import json
+import sys
+from pathlib import Path
+
+# Add parent directory to path to import modules
+sys.path.append(str(Path(__file__).parent.parent))
+
+from agents_mcp_usage.evaluations.mermaid_evals.dashboard_config import DEFAULT_CONFIG
+from agents_mcp_usage.evaluations.mermaid_evals.schemas import DashboardConfig
+
+def parse_metric_details(metric_details_str):
+    """Safely parse JSON string from Metric_details column."""
+    if pd.isna(metric_details_str) or not metric_details_str:
+        return {}
+    try:
+        return json.loads(metric_details_str.replace("'", '"'))
+    except (json.JSONDecodeError, TypeError):
+        return {}
+
+def calculate_failure_analysis_data(df):
+    """Calculate failure counts by model and failure type."""
+    failure_series = [
+        {"name": "Invalid Diagram", "column": "Score_MermaidDiagramValid", "condition": "== 0"},
+        {"name": "MCP Tool Failure", "column": "Score_UsedBothMCPTools", "condition": "< 1"},
+        {"name": "Usage Limit Exceeded", "column": "Score_UsageLimitNotExceeded", "condition": "== 0"},
+    ]
+
+    models = sorted(df["Model"].unique())
+    failure_data = []
+
+    for model in models:
+        model_data = df[df["Model"] == model]
+        failure_counts = {"Model": model}
+
+        for series in failure_series:
+            condition_str = f"`{series['column']}` {series['condition']}"
+            count = model_data.eval(condition_str).sum()
+            failure_counts[series["name"]] = int(count)
+
+        failure_data.append(failure_counts)
+
+    return failure_data
+
+def process_csv_for_static_site(csv_path):
+    """Process CSV file and return data structure for static site."""
+    # Load configuration
+    config = DashboardConfig(**DEFAULT_CONFIG)
+
+    # Read CSV
+    df = pd.read_csv(csv_path)
+
+    # Replace NaN values with 0 for numeric columns
+    numeric_columns = ['Metric_request_tokens', 'Metric_response_tokens', 'Metric_total_tokens']
+    for col in numeric_columns:
+        if col in df.columns:
+            df[col] = df[col].fillna(0)
+
+    # Extract grouping column (test case types)
+    df['test_group'] = df['Case'].apply(lambda x: x.split('_')[-1] if '_' in x else 'other')
+
+    # Parse metric details to extract token information
+    if "Metric_details" in df.columns:
+        metric_details = df["Metric_details"].apply(parse_metric_details)
+        df["thinking_tokens"] = metric_details.apply(lambda x: x.get("thoughts_tokens", 0))
+        df["text_tokens"] = metric_details.apply(lambda x: x.get("text_prompt_tokens", 0))
+    else:
+        df["thinking_tokens"] = 0
+        df["text_tokens"] = 0
+
+    # Calculate total tokens
+    df["total_tokens"] = df["Metric_total_tokens"].fillna(0)
+
+    # Calculate success rate (primary metric)
+    df["Success_Rate"] = df["Score_MermaidDiagramValid"] * 100
+
+    # Extract provider from model name
+    def extract_provider(model_name):
+        if model_name.startswith("gemini-"):
+            return "Google"
+        elif "claude" in model_name.lower():
+            return "Anthropic"
+        elif "gpt" in model_name.lower():
+            return "OpenAI"
+        else:
+            return "Other"
+
+    df["provider"] = df["Model"].apply(extract_provider)
+
+    # Create leaderboard data
+    leaderboard = df.groupby("Model").agg({
+        "Success_Rate": "mean",
+        "Duration": "mean",
+        "total_tokens": "mean",
+        "Case": "count",  # Number of runs
+        "provider": "first"
+    }).reset_index()
+
+    leaderboard.columns = ["Model", "Success_Rate", "Avg_Duration", "Avg_Tokens", "Runs", "Provider"]
+    leaderboard = leaderboard.sort_values("Success_Rate", ascending=False)
+
+    # Create data for Pareto frontier plot
+    pareto_data = df.groupby("Model").agg({
+        "Success_Rate": "mean",
+        "Duration": "mean",
+        "total_tokens": "mean",
+        "Metric_request_tokens": lambda x: x[x > 0].mean() if any(x > 0) else 0,
+        "Metric_response_tokens": lambda x: x[x > 0].mean() if any(x > 0) else 0
+    }).reset_index()
+
+    # Fill any remaining NaN values with 0
+    pareto_data = pareto_data.fillna(0)
+
+    # Create test group performance data
+    test_groups_data = df.groupby(["Model", "test_group"]).agg({
+        "Score_MermaidDiagramValid": "mean",
+        "Score_UsageLimitNotExceeded": "mean",
+        "Score_UsedBothMCPTools": "mean"
+    }).reset_index()
+
+    # Calculate failure analysis data
+    failure_analysis_data = calculate_failure_analysis_data(df)
+
+    # Calculate aggregate statistics
+    stats = {
+        "total_runs": len(df),
+        "models_evaluated": df["Model"].nunique(),
+        "test_cases": df["Case"].nunique(),
+        "test_groups": sorted(df["test_group"].unique().tolist()),
+        "providers": sorted(df["provider"].unique().tolist()),
+        "models": sorted(df["Model"].unique().tolist())
+    }
+
+    # Create final data structure
+    output_data = {
+        "stats": stats,
+        "leaderboard": leaderboard.to_dict(orient="records"),
+        "pareto_data": pareto_data.to_dict(orient="records"),
+        "test_groups_data": test_groups_data.to_dict(orient="records"),
+        "failure_analysis_data": failure_analysis_data,
+        "raw_data": df[[
+            "Model", "Case", "test_group", "Duration",
+            "Score_MermaidDiagramValid", "Score_UsageLimitNotExceeded",
+            "Score_UsedBothMCPTools", "total_tokens", "provider",
+            "Metric_request_tokens", "Metric_response_tokens"
+        ]].to_dict(orient="records"),
+        "config": {
+            "title": config.title,
+            "description": config.description,
+            "primary_metric": {
+                "name": "Success_Rate",
+                "label": "Success Rate (%)"
+            }
+        }
+    }
+
+    return output_data
+
+def main():
+    csv_path = "/home/ubuntu/projects/agents-mcp-usage/mermaid_eval_results/Jun_gemini_results.csv"
+    output_path = "/home/ubuntu/projects/agents-mcp-usage/agents_mcp_usage/evaluations/mermaid_evals/results/Jun_gemini_results_processed.json"
+
+    print(f"Processing {csv_path}...")
+    data = process_csv_for_static_site(csv_path)
+
+    # Convert the data to JSON string, replacing NaN with null
+    json_str = json.dumps(data, indent=2)
+    # Replace NaN values with null for valid JSON
+    json_str = json_str.replace(": NaN", ": null")
+
+    # Write output
+    with open(output_path, 'w') as f:
+        f.write(json_str)
+
+    print(f"Data processed and saved to {output_path}")
+    print(f"- Total runs: {data['stats']['total_runs']}")
+    print(f"- Models evaluated: {data['stats']['models_evaluated']}")
+    print(f"- Test cases: {data['stats']['test_cases']}")
+
+if __name__ == "__main__":
+    main()
diff --git a/mermaid_eval_results/Jun_gemini_results.csv b/mermaid_eval_results/Jun_gemini_results.csv
new file mode 100644
index 0000000..7238878
--- /dev/null
+++ b/mermaid_eval_results/Jun_gemini_results.csv
@@ -0,0 +1,106 @@
+Model,Run,Case,Duration,Fixed_Diagram_Length,Failure_Reason,Tools_Used,Score_MermaidDiagramValid,Score_UsageLimitNotExceeded,Score_UsedBothMCPTools,Metric_details,Metric_request_tokens,Metric_requests,Metric_response_tokens,Metric_total_tokens
+gemini-2.5-pro-preview-06-05,1,fix_invalid_diagram_easy,25.314810254,1602,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 1109, 'text_prompt_tokens': 2999}",2999,2,1180,5288
+gemini-2.5-pro-preview-06-05,1,fix_invalid_diagram_medium,29.164007858,1508,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 1082, 'text_prompt_tokens': 3007}",3007,2,1155,5244
+gemini-2.5-pro-preview-06-05,1,fix_invalid_diagram_hard,88.240292399,1573,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 3882, 'text_prompt_tokens': 11479}",11479,4,2447,17808
+gemini-2.5-pro-preview-06-05,2,fix_invalid_diagram_easy,46.903850207,1658,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 2033, 'text_prompt_tokens': 3861}",3861,3,1196,7090
+gemini-2.5-pro-preview-06-05,2,fix_invalid_diagram_medium,60.831734772,1570,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 5420, 'text_prompt_tokens': 3007}",3007,2,1171,9598
+gemini-2.5-pro-preview-06-05,2,fix_invalid_diagram_hard,32.209129106,1572,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 1867, 'text_prompt_tokens': 3007}",3007,2,1173,6047
+gemini-2.5-pro-preview-06-05,3,fix_invalid_diagram_easy,31.297773585,1649,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 1671, 'text_prompt_tokens': 2999}",2999,2,1196,5866
+gemini-2.5-pro-preview-06-05,3,fix_invalid_diagram_medium,83.609651342,1569,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 5399, 'text_prompt_tokens': 7453}",7453,4,1820,14672
+gemini-2.5-pro-preview-06-05,3,fix_invalid_diagram_hard,53.065635183,1445,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 3100, 'text_prompt_tokens': 6507}",6507,3,1744,11351
+gemini-2.5-pro-preview-06-05,4,fix_invalid_diagram_easy,32.600817879,1573,,validate_mermaid_diagram|get_current_time,1.0,1.0,1.0,"{'thoughts_tokens': 1394, 'text_prompt_tokens': 3107}",3107,2,1192,5693
+gemini-2.5-pro-preview-06-05,4,fix_invalid_diagram_medium,33.96250988,1627,,validate_mermaid_diagram|get_current_time,1.0,1.0,1.0,"{'thoughts_tokens': 1552, 'text_prompt_tokens': 3107}",3107,2,1190,5849
+gemini-2.5-pro-preview-06-05,4,fix_invalid_diagram_hard,40.243743396,1509,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 2127, 'text_prompt_tokens': 3107}",3107,2,1154,6388
+gemini-2.5-pro-preview-06-05,5,fix_invalid_diagram_easy,27.064979554,1595,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 1159, 'text_prompt_tokens': 2999}",2999,2,1179,5337
+gemini-2.5-pro-preview-06-05,5,fix_invalid_diagram_medium,86.028303837,1562,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 4707, 'text_prompt_tokens': 11366}",11366,4,2399,18472
+gemini-2.5-pro-preview-06-05,5,fix_invalid_diagram_hard,32.780646607,1440,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 1563, 'text_prompt_tokens': 3007}",3007,2,1133,5703
+gemini-2.5-pro-preview-05-06,1,fix_invalid_diagram_easy,90.311101073,1636,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 190003, 'text_prompt_tokens': 6385}",6385,4,1480,197868
+gemini-2.5-pro-preview-05-06,1,fix_invalid_diagram_medium,142.709473835,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.5-pro-preview-05-06,1,fix_invalid_diagram_hard,59.031657247,0,response_validation_failed,,0.0,1.0,0.0,,,,,
+gemini-2.5-pro-preview-05-06,2,fix_invalid_diagram_easy,19.541418498,0,response_validation_failed,,0.0,1.0,0.0,,,,,
+gemini-2.5-pro-preview-05-06,2,fix_invalid_diagram_medium,125.222876626,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.5-pro-preview-05-06,2,fix_invalid_diagram_hard,36.623935689,0,response_validation_failed,,0.0,1.0,0.0,,,,,
+gemini-2.5-pro-preview-05-06,3,fix_invalid_diagram_easy,60.148244504,1658,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 15739, 'text_prompt_tokens': 54413}",54413,4,2191,72343
+gemini-2.5-pro-preview-05-06,3,fix_invalid_diagram_medium,123.194014333,1690,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 119869, 'text_prompt_tokens': 60324}",60324,5,3883,184076
+gemini-2.5-pro-preview-05-06,3,fix_invalid_diagram_hard,16.74454915,0,response_validation_failed,,0.0,1.0,0.0,,,,,
+gemini-2.5-pro-preview-05-06,4,fix_invalid_diagram_easy,64.164246072,1651,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 18452, 'text_prompt_tokens': 56413}",56413,4,3431,78296
+gemini-2.5-pro-preview-05-06,4,fix_invalid_diagram_medium,42.140703106,1509,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 85357, 'text_prompt_tokens': 4519}",4519,3,1342,91218
+gemini-2.5-pro-preview-05-06,4,fix_invalid_diagram_hard,95.642718199,0,response_validation_failed,,0.0,1.0,0.0,,,,,
+gemini-2.5-pro-preview-05-06,5,fix_invalid_diagram_easy,62.127767944,1647,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 8942, 'text_prompt_tokens': 56065}",56065,4,3177,68184
+gemini-2.5-pro-preview-05-06,5,fix_invalid_diagram_medium,112.559107875,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.5-pro-preview-05-06,5,fix_invalid_diagram_hard,112.154635461,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.5-pro-preview-03-25,1,fix_invalid_diagram_easy,67.505358126,1638,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 39337, 'text_prompt_tokens': 54862}",54862,4,2603,96802
+gemini-2.5-pro-preview-03-25,1,fix_invalid_diagram_medium,22.877179664,0,response_validation_failed,,0.0,1.0,0.0,,,,,
+gemini-2.5-pro-preview-03-25,1,fix_invalid_diagram_hard,46.778307508,1508,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 12915, 'text_prompt_tokens': 4004}",4004,3,1315,18234
+gemini-2.5-pro-preview-03-25,2,fix_invalid_diagram_easy,69.72560593,1601,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 68495, 'text_prompt_tokens': 39228}",39228,4,1996,109719
+gemini-2.5-pro-preview-03-25,2,fix_invalid_diagram_medium,107.98203716,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.5-pro-preview-03-25,2,fix_invalid_diagram_hard,136.260362709,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.5-pro-preview-03-25,3,fix_invalid_diagram_easy,80.179718025,1625,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 152940, 'text_prompt_tokens': 3874}",3874,3,1511,158325
+gemini-2.5-pro-preview-03-25,3,fix_invalid_diagram_medium,54.728993541,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.5-pro-preview-03-25,3,fix_invalid_diagram_hard,99.128847196,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.5-pro-preview-03-25,4,fix_invalid_diagram_easy,46.010087457,1581,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 25388, 'text_prompt_tokens': 3980}",3980,3,1295,30663
+gemini-2.5-pro-preview-03-25,4,fix_invalid_diagram_medium,90.896605166,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.5-pro-preview-03-25,4,fix_invalid_diagram_hard,249.374101535,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.5-pro-preview-03-25,5,fix_invalid_diagram_easy,115.493402964,1713,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 96056, 'text_prompt_tokens': 55991}",55991,4,3221,155268
+gemini-2.5-pro-preview-03-25,5,fix_invalid_diagram_medium,90.85619701,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.5-pro-preview-03-25,5,fix_invalid_diagram_hard,233.217439735,0,usage_limit_exceeded,,0.0,0.0,0.0,,,,,
+gemini-2.0-flash,1,fix_invalid_diagram_easy,8.909464048,1496,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 498}",736,1,498,1234
+gemini-2.0-flash,1,fix_invalid_diagram_medium,5.837458282,1487,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 503}",736,1,503,1239
+gemini-2.0-flash,1,fix_invalid_diagram_hard,6.413123275,1480,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 502}",736,1,502,1238
+gemini-2.0-flash,2,fix_invalid_diagram_easy,5.292221706,1496,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 498}",736,1,498,1234
+gemini-2.0-flash,2,fix_invalid_diagram_medium,12.366546526,1567,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'text_prompt_tokens': 2723, 'text_candidates_tokens': 1073}",2723,2,1073,3796
+gemini-2.0-flash,2,fix_invalid_diagram_hard,6.458367757,1491,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 506}",736,1,506,1242
+gemini-2.0-flash,3,fix_invalid_diagram_easy,7.122025352,1576,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 525}",736,1,525,1261
+gemini-2.0-flash,3,fix_invalid_diagram_medium,15.055406281,1567,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'text_prompt_tokens': 2723, 'text_candidates_tokens': 1073}",2723,2,1073,3796
+gemini-2.0-flash,3,fix_invalid_diagram_hard,6.581593788,1480,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 502}",736,1,502,1238
+gemini-2.0-flash,4,fix_invalid_diagram_easy,8.904104594,1561,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 520}",736,1,520,1256
+gemini-2.0-flash,4,fix_invalid_diagram_medium,3.709682467,1480,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 502}",736,1,502,1238
+gemini-2.0-flash,4,fix_invalid_diagram_hard,7.48106499,1487,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 503}",736,1,503,1239
+gemini-2.0-flash,5,fix_invalid_diagram_easy,4.313346779,1496,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 498}",736,1,498,1234
+gemini-2.0-flash,5,fix_invalid_diagram_medium,4.285199703,1487,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 503}",736,1,503,1239
+gemini-2.0-flash,5,fix_invalid_diagram_hard,3.759377617,1487,,,0.0,1.0,0.0,"{'text_prompt_tokens': 736, 'text_candidates_tokens': 503}",736,1,503,1239
+gemini-2.5-flash-preview-04-17,1,fix_invalid_diagram_easy,60.637950634,1594,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 58731, 'text_prompt_tokens': 15093, 'cached_content_tokens': 2900, 'text_cache_tokens': 2900}",15093,5,2470,76294
+gemini-2.5-flash-preview-04-17,1,fix_invalid_diagram_medium,31.479987348,1569,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 30019, 'text_prompt_tokens': 8696}",8696,4,1852,40567
+gemini-2.5-flash-preview-04-17,1,fix_invalid_diagram_hard,5.009089436,1562,,,0.0,1.0,0.0,"{'thoughts_tokens': 396, 'text_prompt_tokens': 810}",810,1,524,1730
+gemini-2.5-flash-preview-04-17,2,fix_invalid_diagram_easy,27.092670181,1631,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 63275, 'text_prompt_tokens': 7623}",7623,3,1833,72731
+gemini-2.5-flash-preview-04-17,2,fix_invalid_diagram_medium,15.495790935,1567,,get_current_time,0.0,1.0,0.5,"{'thoughts_tokens': 2538, 'text_prompt_tokens': 1669}",1669,2,552,4759
+gemini-2.5-flash-preview-04-17,2,fix_invalid_diagram_hard,15.96490713,1570,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 1481, 'text_prompt_tokens': 3007}",3007,2,1173,5661
+gemini-2.5-flash-preview-04-17,3,fix_invalid_diagram_easy,54.049785353,1581,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 5893, 'text_prompt_tokens': 4407, 'cached_content_tokens': 1795, 'text_cache_tokens': 1795}",4407,3,1179,11479
+gemini-2.5-flash-preview-04-17,3,fix_invalid_diagram_medium,21.902809368,1570,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 2130, 'text_prompt_tokens': 3868}",3868,3,1173,7171
+gemini-2.5-flash-preview-04-17,3,fix_invalid_diagram_hard,55.958218108,1563,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 18135, 'text_prompt_tokens': 3534}",3534,2,1685,23354
+gemini-2.5-flash-preview-04-17,4,fix_invalid_diagram_easy,11.505478102,1576,,,0.0,1.0,0.0,"{'thoughts_tokens': 9091, 'text_prompt_tokens': 811}",811,1,526,10428
+gemini-2.5-flash-preview-04-17,4,fix_invalid_diagram_medium,28.035491061,1568,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 12355, 'text_prompt_tokens': 6558}",6558,3,1804,20717
+gemini-2.5-flash-preview-04-17,4,fix_invalid_diagram_hard,21.009636819,1561,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 17356, 'text_prompt_tokens': 3902}",3902,3,1189,22447
+gemini-2.5-flash-preview-04-17,5,fix_invalid_diagram_easy,29.185720392,0,response_validation_failed,,0.0,1.0,0.0,,,,,
+gemini-2.5-flash-preview-04-17,5,fix_invalid_diagram_medium,7.030499633,51,,,0.0,1.0,0.0,"{'thoughts_tokens': 1282, 'text_prompt_tokens': 810}",810,1,17,2109
+gemini-2.5-flash-preview-04-17,5,fix_invalid_diagram_hard,28.157089277,1568,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 2803, 'text_prompt_tokens': 3868}",3868,3,1173,7844
+gemini-2.5-flash,1,fix_invalid_diagram_easy,16.271507577,1616,,validate_mermaid_diagram,1.0,1.0,0.5,"{'thoughts_tokens': 19715, 'text_prompt_tokens': 3060}",3060,2,1164,23939
+gemini-2.5-flash,1,fix_invalid_diagram_medium,15.337420676,1568,,validate_mermaid_diagram|get_current_time,0.0,1.0,1.0,"{'thoughts_tokens': 1114, 'text_prompt_tokens': 5361}",5361,3,1174,7649
+gemini-2.5-flash,1,fix_invalid_diagram_hard,6.332825299,1569,,get_current_time,0.0,1.0,0.5,"{'thoughts_tokens': 417, 'text_prompt_tokens': 1669}",1669,2,552,2638
+gemini-2.5-flash,2,fix_invalid_diagram_easy,5.406013529,1581,,get_current_time,0.0,1.0,0.5,"{'thoughts_tokens': 392, 'text_prompt_tokens': 1671}",1671,2,554,2617
+gemini-2.5-flash,2,fix_invalid_diagram_medium,4.596515261,1568,,get_current_time,0.0,1.0,0.5,"{'thoughts_tokens': 335, 'text_prompt_tokens': 1669}",1669,2,552,2556
+gemini-2.5-flash,2,fix_invalid_diagram_hard,10.369176736,1567,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 1767, 'text_prompt_tokens': 3981}",3981,3,1184,6932
+gemini-2.5-flash,3,fix_invalid_diagram_easy,6.68512552,1583,,get_current_time,0.0,1.0,0.5,"{'thoughts_tokens': 3177, 'text_prompt_tokens': 1671}",1671,2,554,5402
+gemini-2.5-flash,3,fix_invalid_diagram_medium,8.790923666,1569,,get_current_time,0.0,1.0,0.5,"{'thoughts_tokens': 652, 'text_prompt_tokens': 1669}",1669,2,554,2875
+gemini-2.5-flash,3,fix_invalid_diagram_hard,12.593662815,1569,,get_current_time,0.0,1.0,0.5,"{'thoughts_tokens': 1515, 'text_prompt_tokens': 1669}",1669,2,554,3738
+gemini-2.5-flash,4,fix_invalid_diagram_easy,23.406116351,1785,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 41536, 'text_prompt_tokens': 4005}",4005,3,1266,46807
+gemini-2.5-flash,4,fix_invalid_diagram_medium,5.594700624,1567,,get_current_time,0.0,1.0,0.5,"{'thoughts_tokens': 228, 'text_prompt_tokens': 1669}",1669,2,552,2449
+gemini-2.5-flash,4,fix_invalid_diagram_hard,21.817208931,1568,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,"{'thoughts_tokens': 36903, 'text_prompt_tokens': 3980}",3980,3,1185,42068
+gemini-2.5-flash,5,fix_invalid_diagram_easy,14.751506297,1713,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,"{'thoughts_tokens': 4346, 'text_prompt_tokens': 3970}",3970,3,1219,9535
+gemini-2.5-flash,5,fix_invalid_diagram_medium,26.279386281,1568,,validate_mermaid_diagram|get_current_time,0.0,1.0,1.0,"{'thoughts_tokens': 15068, 'text_prompt_tokens': 9113}",9113,4,1804,25985
+gemini-2.5-flash,5,fix_invalid_diagram_hard,14.502334274,1570,,validate_mermaid_diagram|get_current_time,0.0,1.0,1.0,"{'thoughts_tokens': 852, 'text_prompt_tokens': 5361}",5361,3,1174,7387
+gemini-2.5-flash-lite-preview-06-17,1,fix_invalid_diagram_easy,4.444876941,1576,,add|validate_mermaid_diagram,0.0,1.0,0.5,{'text_prompt_tokens': 3088},3088,2,1196,4284
+gemini-2.5-flash-lite-preview-06-17,1,fix_invalid_diagram_medium,4.627413202,1564,,add|validate_mermaid_diagram,0.0,1.0,0.5,{'text_prompt_tokens': 3089},3089,2,1202,4291
+gemini-2.5-flash-lite-preview-06-17,1,fix_invalid_diagram_hard,3.704063431,181,,add|validate_mermaid_diagram,0.0,1.0,0.5,{'text_prompt_tokens': 3091},3091,2,683,3774
+gemini-2.5-flash-lite-preview-06-17,2,fix_invalid_diagram_easy,4.447734786,1576,,add|validate_mermaid_diagram,0.0,1.0,0.5,{'text_prompt_tokens': 3091},3091,2,1191,4282
+gemini-2.5-flash-lite-preview-06-17,2,fix_invalid_diagram_medium,4.576908765,1558,,add|validate_mermaid_diagram,0.0,1.0,0.5,{'text_prompt_tokens': 3091},3091,2,1180,4271
+gemini-2.5-flash-lite-preview-06-17,2,fix_invalid_diagram_hard,5.032189281,1943,,add|get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,{'text_prompt_tokens': 3136},3136,2,1476,4612
+gemini-2.5-flash-lite-preview-06-17,3,fix_invalid_diagram_easy,4.472401128,1576,,add|validate_mermaid_diagram,0.0,1.0,0.5,{'text_prompt_tokens': 3088},3088,2,1186,4274
+gemini-2.5-flash-lite-preview-06-17,3,fix_invalid_diagram_medium,3.732016304,109,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,{'text_prompt_tokens': 3107},3107,2,671,3778
+gemini-2.5-flash-lite-preview-06-17,3,fix_invalid_diagram_hard,4.756348604,1837,,add|get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,{'text_prompt_tokens': 3136},3136,2,1435,4571
+gemini-2.5-flash-lite-preview-06-17,4,fix_invalid_diagram_easy,4.71480991,1583,,get_current_time|validate_mermaid_diagram,1.0,1.0,1.0,{'text_prompt_tokens': 3104},3104,2,1214,4318
+gemini-2.5-flash-lite-preview-06-17,4,fix_invalid_diagram_medium,3.797644523,101,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,{'text_prompt_tokens': 3107},3107,2,669,3776
+gemini-2.5-flash-lite-preview-06-17,4,fix_invalid_diagram_hard,4.755809993,1564,,add|validate_mermaid_diagram,0.0,1.0,0.5,{'text_prompt_tokens': 3092},3092,2,1204,4296
+gemini-2.5-flash-lite-preview-06-17,5,fix_invalid_diagram_easy,4.602118065,1576,,add|validate_mermaid_diagram,0.0,1.0,0.5,"{'text_prompt_tokens': 3088, 'cached_content_tokens': 1747, 'text_cache_tokens': 1747}",3088,2,1185,4273
+gemini-2.5-flash-lite-preview-06-17,5,fix_invalid_diagram_medium,4.466044834,1562,,get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,{'text_prompt_tokens': 3107},3107,2,1205,4312
+gemini-2.5-flash-lite-preview-06-17,5,fix_invalid_diagram_hard,4.130914105,224,,add|get_current_time|validate_mermaid_diagram,0.0,1.0,1.0,{'text_prompt_tokens': 3136},3136,2,725,3861
\ No newline at end of file