@andrewginns andrewginns commented Jul 4, 2025

Tweaks the scripts related to running Merbench, streamlining the workflow and improving documentation and script flexibility.

Problem:

Previously, the scripts and documentation for running Merbench evaluations were rigid: file paths were hard-coded and the usage instructions were unclear. The dashboard configuration contained verbose and outdated explanatory text, support for additional model providers was not explicit, and an old results CSV was still present in the repository.

Solution:

  • Refactored the preprocess_merbench_data.py script to use argparse for flexible input/output file specification and to set sensible defaults based on project structure and the current month.
  • Improved provider detection logic in the script, adding explicit support for Amazon models.
  • Updated the dashboard configuration with clearer, more concise text emphasising the evaluation process and metrics.
  • Improved the README documentation so script usage is easier to follow.
  • Removed the outdated Jun_gemini_results.csv results file to reduce repository clutter.
  • Added more explicit (commented) support for Amazon Bedrock models in evaluation scripts.
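The argparse refactor described above might look roughly like this. This is a minimal sketch: the actual flag names, default paths, and the project-root utility in `preprocess_merbench_data.py` may differ, and the month-based filenames here are illustrative.

```python
import argparse
from datetime import datetime
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of the argparse-based CLI for preprocess_merbench_data.py."""
    # Placeholder for the project-root utility mentioned in the PR.
    project_root = Path(__file__).resolve().parent
    # Month-based default naming, e.g. "Jul_results.csv" in July.
    month = datetime.now().strftime("%b")
    parser = argparse.ArgumentParser(
        description="Preprocess Merbench evaluation results"
    )
    parser.add_argument(
        "--input",
        type=Path,
        default=project_root / "mermaid_eval_results" / f"{month}_results.csv",
        help="Raw results CSV to preprocess",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=project_root / "mermaid_eval_results" / f"{month}_processed.csv",
        help="Where to write the processed results",
    )
    return parser
```

With defaults like these, running the script with no arguments picks up the current month's results file, while `--input`/`--output` (hypothetical flag names) override the paths for ad-hoc runs.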

Unlocks:

These changes make it easier to run new Merbench evaluations, add support for more model providers, and reduce confusion for users and contributors. The improved documentation and script flexibility should also help with onboarding and reproducibility.

Detailed breakdown of changes:

  • agents_mcp_usage/evaluations/mermaid_evals/scripts/preprocess_merbench_data.py: Refactored to use argparse, added automatic path handling and month-based naming, improved provider detection, and made use of project root utility.
  • agents_mcp_usage/evaluations/mermaid_evals/dashboard_config.py: Reworded and clarified dashboard description, streamlined explanations and metric definitions.
  • agents_mcp_usage/evaluations/mermaid_evals/README.md: Updated example command for preprocessing results to match script changes.
  • agents_mcp_usage/evaluations/mermaid_evals/run_multi_evals.py: Commented additional Amazon Bedrock model options for future evaluations.
  • mermaid_eval_results/Jun_gemini_results.csv: Removed obsolete results file from version control.
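The improved provider detection with explicit Amazon support could be sketched as follows. The function name, prefixes, and return values here are assumptions for illustration; the real mapping in the script may differ.

```python
def detect_provider(model_name: str) -> str:
    """Map a model identifier to its provider (hypothetical sketch).

    Bedrock-style identifiers such as "amazon.nova-pro-v1:0" are matched
    explicitly, alongside the previously supported providers.
    """
    name = model_name.lower()
    prefixes = {
        "amazon": "amazon",   # explicit Amazon/Bedrock support
        "bedrock": "amazon",
        "gemini": "google",
        "gpt": "openai",
        "claude": "anthropic",
    }
    for prefix, provider in prefixes.items():
        if name.startswith(prefix):
            return provider
    return "unknown"
```

Keeping the mapping in one table makes adding a new provider a one-line change, which matches the PR's goal of making additional providers easier to support.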

@andrewginns andrewginns merged commit 6891bff into main Jul 4, 2025
@andrewginns andrewginns deleted the refactor-evals branch July 6, 2025 10:33