Update Merbench Evaluation Docs and Scripts #11
Merged
This PR significantly updates the Merbench evaluation system for LLM agents working with Mermaid diagrams. It refines evaluation scripts, enhances dashboard functionality, updates cost tracking for models, and improves documentation for setup and usage.
Problem:
What is the problem this PR solves?
Solution:
- Refactored the dashboard (`merbench_ui.py`) with improved filtering, provider/model breakdowns, and more robust data handling.
- Updated model cost tracking (`costs.json`) to include new models and correct/expand pricing structures.
- Updated the evaluation scripts, setting `gemini-2.5-flash` as the new default and updating the supported models.

Unlocks:
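The PR diff doesn't show the `costs.json` schema, but the idea of per-model pricing can be sketched as follows. Everything here is illustrative: the key names (`input_per_1m`, `output_per_1m`), the helper `run_cost`, and the dollar figures are placeholders, not the actual schema or real Gemini pricing.

```python
import json

# Hypothetical costs.json content -- field names and prices are
# placeholders, not the structure or rates used by this PR.
COSTS_JSON = """
{
  "gemini-2.5-flash": {"input_per_1m": 0.30, "output_per_1m": 2.50}
}
"""

def run_cost(model: str, input_tokens: int, output_tokens: int, costs: dict) -> float:
    """Dollar cost of one evaluation run from per-million-token prices."""
    pricing = costs[model]
    return (input_tokens * pricing["input_per_1m"]
            + output_tokens * pricing["output_per_1m"]) / 1_000_000

costs = json.loads(COSTS_JSON)
print(run_cost("gemini-2.5-flash", 10_000, 2_000, costs))  # -> 0.008
```

Keeping pricing in a standalone JSON file, as this PR does, means new models can be added without touching the evaluation code.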
Detailed breakdown of changes:
- Documentation: Updated `README.md` for the Mermaid evaluation system.
- Dashboard (`merbench_ui.py`): Improved filtering, provider/model breakdowns, and more robust data handling.
- Model Cost Tracking (`costs.json`): Added new models and corrected/expanded pricing structures.
- Evaluation Scripts: Set `gemini-2.5-flash` as the new default model for benchmarking.
- CSV/JSON Exporter: Added `preprocess_merbench_data.py` for converting evaluation CSVs to JSON for public leaderboards or static dashboards.
- General:
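The internals of `preprocess_merbench_data.py` aren't shown in this PR; a minimal sketch of the CSV-to-JSON step it describes might look like the following. The column names (`model`, `score`) and the sort-by-score behavior are assumptions for illustration only.

```python
import csv
import io
import json

def csv_to_leaderboard(csv_text: str) -> str:
    """Convert evaluation CSV rows to a JSON leaderboard, best score first.

    Assumes "model" and "score" columns -- the real exporter's columns
    are not shown in this PR.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row["score"]), reverse=True)
    return json.dumps(rows, indent=2)

sample = "model,score\ngemini-2.5-flash,0.91\nother-model,0.74\n"
print(csv_to_leaderboard(sample))
```

Emitting plain JSON like this is what lets a static dashboard or public leaderboard consume the results without any server-side processing.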