@andrewginns (Owner) commented Jul 2, 2025

This PR significantly updates Merbench, the evaluation system for benchmarking LLM agents on Mermaid diagram tasks. It refines the evaluation scripts, enhances the Streamlit dashboard, expands model cost tracking, and improves the setup and usage documentation.

Problem:

  • The Merbench evaluation scripts, dashboard, and documentation were outdated or incomplete.
  • Model cost tracking and configuration needed to be expanded for new models.
  • The evaluation workflow and results export process lacked clarity and robustness.
  • The documentation did not comprehensively guide users through setup, evaluation, and analysis.

Solution:

  • Rewrote and expanded the main evaluation README to clearly explain system architecture, usage, metrics, error handling, and troubleshooting.
  • Refactored and enhanced the Streamlit dashboard (merbench_ui.py) with improved filtering, provider/model breakdowns, and more robust data handling.
  • Updated model cost tracking (costs.json) to include new models and correct/expand pricing structures.
  • Improved evaluation runner scripts, setting gemini-2.5-flash as the new default and updating supported models.
  • Added a standalone CSV-to-JSON export script for public dashboard integration.
  • Made the evaluation code more robust and maintainable, including better error handling, retry logic, and configurability.

Unlocks:

  • Easier and more accurate benchmarking of LLM agents on diagram correction tasks.
  • Detailed cost analysis, supporting more models and providers.
  • Streamlined export to public leaderboard or external dashboards.
  • Simpler onboarding and troubleshooting for new users.

Detailed breakdown of changes:

  • Documentation:

    • Completely rewrote README.md for the Mermaid evaluation system.
    • Added architecture diagrams, usage examples, troubleshooting, and contributing guidelines.
  • Dashboard (merbench_ui.py):

    • Added provider and model filters, improved sidebar controls, and dynamic filter status.
    • Enhanced grouping and advanced filter UX.
    • Improved handling of empty data and error cases.
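
    A minimal sketch of the sidebar filtering pattern described above, assuming a results DataFrame with `provider` and `model` columns (the real merbench_ui.py schema and widget layout may differ):

    ```python
    import pandas as pd
    import streamlit as st

    def apply_sidebar_filters(df: pd.DataFrame) -> pd.DataFrame:
        """Filter the results DataFrame by provider and model via sidebar controls."""
        providers = sorted(df["provider"].dropna().unique())
        selected_providers = st.sidebar.multiselect("Providers", providers, default=providers)

        models = sorted(df.loc[df["provider"].isin(selected_providers), "model"].dropna().unique())
        selected_models = st.sidebar.multiselect("Models", models, default=models)

        filtered = df[df["provider"].isin(selected_providers) & df["model"].isin(selected_models)]

        # Dynamic filter status plus graceful handling of an empty selection.
        st.sidebar.caption(f"Showing {len(filtered)} of {len(df)} rows")
        if filtered.empty:
            st.warning("No results match the current filters.")
        return filtered
    ```
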
  • Model Cost Tracking (costs.json):

    • Added and updated model definitions, including Google Gemini variants, OpenAI GPT-4.1, and the Anthropic Claude series.
    • Updated pricing structures and token thresholds.
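
    To illustrate how tiered pricing with token thresholds can be applied, the snippet below assumes a hypothetical costs.json entry shape with illustrative prices; the actual schema and numbers in this PR may differ:

    ```python
    import json

    # Hypothetical costs.json entry shape with illustrative (not real) prices:
    # "gemini-2.5-flash": {
    #     "input":  [{"max_tokens": 128000, "per_million": 0.10},
    #                {"max_tokens": null,   "per_million": 0.20}],
    #     "output": [{"max_tokens": null,   "per_million": 0.40}]
    # }

    def estimate_cost(costs: dict, model: str, input_tokens: int, output_tokens: int) -> float:
        """Price a run by picking the tier whose threshold covers the token count."""
        def rate(tiers: list, tokens: int) -> float:
            for tier in tiers:
                if tier["max_tokens"] is None or tokens <= tier["max_tokens"]:
                    return tier["per_million"]
            return tiers[-1]["per_million"]

        entry = costs[model]
        return (input_tokens / 1e6) * rate(entry["input"], input_tokens) \
             + (output_tokens / 1e6) * rate(entry["output"], output_tokens)

    with open("costs.json") as f:
        costs = json.load(f)
    print(estimate_cost(costs, "gemini-2.5-flash", input_tokens=50_000, output_tokens=2_000))
    ```
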
  • Evaluation Scripts:

    • Set gemini-2.5-flash as the new default model for benchmarking.
    • Updated the multi-eval runner to include only currently supported models.
    • Improved script configurability and code comments.
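
    A rough sketch of how the new default could surface in the runner CLI; the flag names and model list are assumptions for illustration, not the actual script:

    ```python
    import argparse

    # Assumed model list for illustration only; see the runner scripts for the real set.
    SUPPORTED_MODELS = [
        "gemini-2.5-flash",
        "gemini-2.5-pro",
        "gpt-4.1",
    ]

    def parse_args() -> argparse.Namespace:
        parser = argparse.ArgumentParser(description="Run Merbench evaluations")
        parser.add_argument(
            "--model",
            default="gemini-2.5-flash",  # new default model for benchmarking
            choices=SUPPORTED_MODELS,
            help="Model to benchmark (default: gemini-2.5-flash)",
        )
        parser.add_argument("--output-dir", default="results", help="Directory for evaluation outputs")
        return parser.parse_args()
    ```
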
  • CSV/JSON Exporter:

    • Added preprocess_merbench_data.py for converting evaluation CSVs to JSON for public leaderboards or static dashboards.
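
    A minimal sketch of the CSV-to-JSON export step; the column names, aggregation, and file paths below are assumptions rather than the actual preprocess_merbench_data.py behaviour:

    ```python
    import json
    from pathlib import Path

    import pandas as pd

    def export_results(csv_path: str, json_path: str) -> None:
        """Aggregate per-run evaluation rows into a leaderboard-style JSON file."""
        df = pd.read_csv(csv_path)
        leaderboard = (
            df.groupby(["provider", "model"], as_index=False)
              .agg(mean_score=("score", "mean"),
                   total_cost=("cost_usd", "sum"),
                   runs=("score", "size"))
              .sort_values("mean_score", ascending=False)
        )
        Path(json_path).write_text(json.dumps(leaderboard.to_dict(orient="records"), indent=2))

    if __name__ == "__main__":
        export_results("results/merbench_results.csv", "merbench_data.json")  # hypothetical paths
    ```
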
  • General:

    • Improved error handling, retry logic, and output schema in evaluation code.
    • Refined output directory and file naming conventions for results.
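
The retry behaviour mentioned above could follow a standard exponential-backoff pattern like the sketch below; the real code's exception types and delays are not reproduced here:

```python
import time

def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # the actual code likely catches narrower errors
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```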

@andrewginns merged commit 4183572 into main Jul 2, 2025
2 checks passed