@andrewginns (Owner) commented Jul 2, 2025

This PR significantly updates Merbench, the evaluation system for benchmarking LLM agents on Mermaid diagram tasks. It refines the evaluation scripts, enhances the Streamlit dashboard, expands model cost tracking, and improves the setup and usage documentation.

Problem:

  • The Merbench evaluation scripts, dashboard, and documentation were outdated or incomplete.
  • Model cost tracking and configuration needed to be expanded for new models.
  • The evaluation workflow and results export process lacked clarity and robustness.
  • The documentation did not comprehensively guide users through setup, evaluation, and analysis.

Solution:

  • Rewrote and expanded the main evaluation README to clearly explain system architecture, usage, metrics, error handling, and troubleshooting.
  • Refactored and enhanced the Streamlit dashboard (merbench_ui.py) with improved filtering, provider/model breakdowns, and more robust data handling.
  • Updated model cost tracking (costs.json) to include new models and correct/expand pricing structures.
  • Improved evaluation runner scripts, setting gemini-2.5-flash as the new default and updating supported models.
  • Added a standalone CSV-to-JSON export script for public dashboard integration.
  • Made the evaluation code more robust and maintainable, including better error handling, retry logic, and configurability.

Unlocks:

  • Easier and more accurate benchmarking of LLM agents on diagram correction tasks.
  • Detailed cost analysis, supporting more models and providers.
  • Streamlined export to public leaderboard or external dashboards.
  • Simpler onboarding and troubleshooting for new users.

Detailed breakdown of changes:

  • Documentation:

    • Completely rewrote README.md for the Mermaid evaluation system.
    • Added architecture diagrams, usage examples, troubleshooting, and contributing guidelines.
  • Dashboard (merbench_ui.py):

    • Added provider and model filters, improved sidebar controls, and dynamic filter status.
    • Enhanced grouping and advanced filter UX.
    • Improved handling of empty data and error cases.
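
    A minimal sketch of the sidebar filtering pattern described above, assuming a results DataFrame with `provider` and `model` columns (the real merbench_ui.py schema and widget layout may differ):

    ```python
    import pandas as pd
    import streamlit as st

    def apply_sidebar_filters(df: pd.DataFrame) -> pd.DataFrame:
        """Filter the results DataFrame by provider and model via sidebar controls."""
        providers = sorted(df["provider"].dropna().unique())
        selected_providers = st.sidebar.multiselect("Providers", providers, default=providers)

        models = sorted(df.loc[df["provider"].isin(selected_providers), "model"].dropna().unique())
        selected_models = st.sidebar.multiselect("Models", models, default=models)

        filtered = df[df["provider"].isin(selected_providers) & df["model"].isin(selected_models)]

        # Dynamic filter status plus graceful handling of an empty selection.
        st.sidebar.caption(f"Showing {len(filtered)} of {len(df)} rows")
        if filtered.empty:
            st.warning("No results match the current filters.")
        return filtered
    ```
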
  • Model Cost Tracking (costs.json):

    • Added and updated model definitions, including Google Gemini variants, OpenAI GPT-4.1, and the Anthropic Claude series.
    • Updated pricing structures and token thresholds.
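
    To illustrate how tiered pricing with token thresholds can be applied, the snippet below assumes a hypothetical costs.json entry shape with illustrative prices; the actual schema and numbers in this PR may differ:

    ```python
    import json

    # Hypothetical costs.json entry shape with illustrative (not real) prices:
    # "gemini-2.5-flash": {
    #     "input":  [{"max_tokens": 128000, "per_million": 0.10},
    #                {"max_tokens": null,   "per_million": 0.20}],
    #     "output": [{"max_tokens": null,   "per_million": 0.40}]
    # }

    def estimate_cost(costs: dict, model: str, input_tokens: int, output_tokens: int) -> float:
        """Price a run by picking the tier whose threshold covers the token count."""
        def rate(tiers: list, tokens: int) -> float:
            for tier in tiers:
                if tier["max_tokens"] is None or tokens <= tier["max_tokens"]:
                    return tier["per_million"]
            return tiers[-1]["per_million"]

        entry = costs[model]
        return (input_tokens / 1e6) * rate(entry["input"], input_tokens) \
             + (output_tokens / 1e6) * rate(entry["output"], output_tokens)

    with open("costs.json") as f:
        costs = json.load(f)
    print(estimate_cost(costs, "gemini-2.5-flash", input_tokens=50_000, output_tokens=2_000))
    ```
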
  • Evaluation Scripts:

    • Set gemini-2.5-flash as the new default model for benchmarking.
    • Updated the multi-eval runner to include only currently supported models.
    • Improved script configurability and code comments.
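
    A rough sketch of how the new default could surface in the runner CLI; the flag names and model list are assumptions for illustration, not the actual script:

    ```python
    import argparse

    # Assumed model list for illustration only; see the runner scripts for the real set.
    SUPPORTED_MODELS = [
        "gemini-2.5-flash",
        "gemini-2.5-pro",
        "gpt-4.1",
    ]

    def parse_args() -> argparse.Namespace:
        parser = argparse.ArgumentParser(description="Run Merbench evaluations")
        parser.add_argument(
            "--model",
            default="gemini-2.5-flash",  # new default model for benchmarking
            choices=SUPPORTED_MODELS,
            help="Model to benchmark (default: gemini-2.5-flash)",
        )
        parser.add_argument("--output-dir", default="results", help="Directory for evaluation outputs")
        return parser.parse_args()
    ```
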
  • CSV/JSON Exporter:

    • Added preprocess_merbench_data.py for converting evaluation CSVs to JSON for public leaderboards or static dashboards.
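
    A minimal sketch of the CSV-to-JSON export step; the column names, aggregation, and file paths below are assumptions rather than the actual preprocess_merbench_data.py behaviour:

    ```python
    import json
    from pathlib import Path

    import pandas as pd

    def export_results(csv_path: str, json_path: str) -> None:
        """Aggregate per-run evaluation rows into a leaderboard-style JSON file."""
        df = pd.read_csv(csv_path)
        leaderboard = (
            df.groupby(["provider", "model"], as_index=False)
              .agg(mean_score=("score", "mean"),
                   total_cost=("cost_usd", "sum"),
                   runs=("score", "size"))
              .sort_values("mean_score", ascending=False)
        )
        Path(json_path).write_text(json.dumps(leaderboard.to_dict(orient="records"), indent=2))

    if __name__ == "__main__":
        export_results("results/merbench_results.csv", "merbench_data.json")  # hypothetical paths
    ```
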
  • General:

    • Improved error handling, retry logic, and output schema in evaluation code.
    • Refined output directory and file naming conventions for results.
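
The retry behaviour mentioned above could follow a standard exponential-backoff pattern like the sketch below; the real code's exception types and delays are not reproduced here:

```python
import time

def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:  # the actual code likely catches narrower errors
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```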

@andrewginns merged commit 4183572 into main Jul 2, 2025
2 checks passed