Tweak scripts that run Merbench #13
Merged
Tweaks the scripts related to running Merbench, streamlining the workflow and improving documentation and script flexibility.
Problem:

Previously, the scripts and documentation for running Merbench evaluations were rigid, with hard-coded file paths and unclear instructions. The dashboard configuration contained verbose and outdated explanatory text, and support for additional model providers was not explicit. An obsolete results CSV also remained in the repository.
Solution:

- Refactored the `preprocess_merbench_data.py` script to use `argparse` for flexible input/output file specification and to set sensible defaults based on the project structure and the current month.
- Removed the `Jun_gemini_results.csv` results file to reduce repository clutter.

Unlocks:
These changes make it easier to run new Merbench evaluations, add support for more model providers, and reduce confusion for users and contributors. The improved documentation and script flexibility should also help with onboarding and reproducibility.
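The `argparse` refactor described above might look roughly like the following sketch. The flag names, default filenames, and the `get_project_root` helper are assumptions for illustration; the actual script's interface may differ:

```python
import argparse
import datetime
from pathlib import Path


def get_project_root() -> Path:
    """Hypothetical stand-in for the repo's project-root utility."""
    return Path(__file__).resolve().parents[1]


def parse_args(argv=None) -> argparse.Namespace:
    root = get_project_root()
    # Month-based default naming, e.g. "Jun_gemini_results.csv" in June.
    month = datetime.date.today().strftime("%b")
    parser = argparse.ArgumentParser(
        description="Preprocess Merbench evaluation results."
    )
    parser.add_argument(
        "--input",
        type=Path,
        default=root / "mermaid_eval_results" / "combined_results.csv",
        help="Raw results CSV to preprocess.",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=root / "mermaid_eval_results" / f"{month}_gemini_results.csv",
        help="Destination CSV (defaults to a name keyed to the current month).",
    )
    return parser.parse_args(argv)
```

With defaults like these, `python preprocess_merbench_data.py` works with no arguments, while explicit `--input`/`--output` paths override them for one-off runs.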
Detailed breakdown of changes:

- `agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py`: Refactored to use `argparse`, added automatic path handling and month-based naming, improved provider detection, and made use of the project root utility.
- `agents_mcp_usage/evaluations/mermaid_evals/dashboard_config.py`: Reworded and clarified the dashboard description, streamlined explanations and metric definitions.
- `agents_mcp_usage/evaluations/mermaid_evals/README.md`: Updated the example command for preprocessing results to match the script changes.
- `agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py`: Added commented-out Amazon Bedrock model options for future evaluations.
- `mermaid_eval_results/Jun_gemini_results.csv`: Removed the obsolete results file from version control.
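The improved provider detection mentioned for `preprocess_merbench_data.py` is not spelled out in this description. A minimal sketch of one common approach, inferring the provider from the model identifier's prefix, is below; the specific prefixes and provider labels are assumptions, not the script's actual rules:

```python
def detect_provider(model_name: str) -> str:
    """Guess the model provider from a model identifier string.

    The prefix-to-provider mapping here is illustrative only.
    """
    name = model_name.lower()
    if name.startswith(("gemini", "models/gemini")):
        return "google"
    if name.startswith(("claude", "anthropic.")):
        return "anthropic"
    if name.startswith(("gpt-", "o1", "o3")):
        return "openai"
    if name.startswith("amazon.") or "bedrock" in name:
        return "bedrock"
    return "unknown"
```

Centralising this mapping in one function makes adding a new provider (as the Bedrock comments in `run_multi_evals.py` anticipate) a one-line change.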