Tweak scripts that run Merbench #13
Merged
Tweaks the scripts related to running Merbench, streamlining the workflow and improving documentation and script flexibility.
Problem:

Previously, the scripts and documentation for running Merbench evaluations were rigid, with hard-coded file paths and unclear instructions. The dashboard configuration contained verbose and outdated explanatory text, and support for additional model providers was not explicit. An obsolete results CSV also remained in the repository.
Solution:

- Refactored the `preprocess_merbench_data.py` script to use `argparse` for flexible input/output file specification and to set sensible defaults based on the project structure and the current month.
- Removed the `Jun_gemini_results.csv` results file to reduce repository clutter.

Unlocks:
These changes make it easier to run new Merbench evaluations, add support for more model providers, and reduce confusion for users and contributors. The improved documentation and script flexibility should also help with onboarding and reproducibility.
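The `argparse` refactor described above might look roughly like the following sketch. The flag names, default filenames, and the `get_project_root` helper are assumptions for illustration; the actual script's interface may differ:

```python
import argparse
import datetime
from pathlib import Path


def get_project_root() -> Path:
    """Hypothetical stand-in for the repo's project-root utility."""
    return Path(__file__).resolve().parents[1]


def parse_args(argv=None) -> argparse.Namespace:
    root = get_project_root()
    # Month-based default naming, e.g. "Jun_gemini_results.csv" in June.
    month = datetime.date.today().strftime("%b")
    parser = argparse.ArgumentParser(
        description="Preprocess Merbench evaluation results."
    )
    parser.add_argument(
        "--input",
        type=Path,
        default=root / "mermaid_eval_results" / "combined_results.csv",
        help="Raw results CSV to preprocess.",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=root / "mermaid_eval_results" / f"{month}_gemini_results.csv",
        help="Destination CSV (defaults to a name keyed to the current month).",
    )
    return parser.parse_args(argv)
```

With defaults like these, `python preprocess_merbench_data.py` works with no arguments, while explicit `--input`/`--output` paths override them for one-off runs.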
Detailed breakdown of changes:

- `agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py`: Refactored to use `argparse`, added automatic path handling and month-based naming, improved provider detection, and made use of the project root utility.
- `agents_mcp_usage/evaluations/mermaid_evals/dashboard_config.py`: Reworded and clarified the dashboard description, streamlined explanations and metric definitions.
- `agents_mcp_usage/evaluations/mermaid_evals/README.md`: Updated the example command for preprocessing results to match the script changes.
- `agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py`: Added commented-out Amazon Bedrock model options for future evaluations.
- `mermaid_eval_results/Jun_gemini_results.csv`: Removed the obsolete results file from version control.
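The improved provider detection mentioned for `preprocess_merbench_data.py` is not spelled out in this description. A minimal sketch of one common approach, inferring the provider from the model identifier's prefix, is below; the specific prefixes and provider labels are assumptions, not the script's actual rules:

```python
def detect_provider(model_name: str) -> str:
    """Guess the model provider from a model identifier string.

    The prefix-to-provider mapping here is illustrative only.
    """
    name = model_name.lower()
    if name.startswith(("gemini", "models/gemini")):
        return "google"
    if name.startswith(("claude", "anthropic.")):
        return "anthropic"
    if name.startswith(("gpt-", "o1", "o3")):
        return "openai"
    if name.startswith("amazon.") or "bedrock" in name:
        return "bedrock"
    return "unknown"
```

Centralising this mapping in one function makes adding a new provider (as the Bedrock comments in `run_multi_evals.py` anticipate) a one-line change.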