Conversation

@neubig (Contributor) commented Feb 9, 2026

Summary

This PR adds eval_visualization_page as an optional string field to the ScoreEntry schema for scores.json files.

Changes

  • Added an eval_visualization_page: Optional[str] field to the ScoreEntry class in scripts/validate_schema.py
  • The field is optional (defaults to None) and is described as "URL to the evaluation visualization page" (see the sketch after this list)
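For orientation, here is a minimal sketch of what the updated model might look like, assuming ScoreEntry is a Pydantic model; the surrounding field names are inferred from the example entry below, and only the eval_visualization_page field reflects this PR's change:

from typing import Optional

from pydantic import BaseModel, Field


class ScoreEntry(BaseModel):
    benchmark: str
    score: float
    metric: str
    cost_per_instance: float
    average_runtime: float
    full_archive: str
    tags: list[str]
    agent_version: str
    submission_time: str
    # New in this PR: optional link to the evaluation visualization page.
    # Defaults to None so existing scores.json files remain valid.
    eval_visualization_page: Optional[str] = Field(
        default=None,
        description="URL to the evaluation visualization page",
    )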

Example usage

After this change, score entries can optionally include:

{
  "benchmark": "swe-bench",
  "score": 74.2,
  "metric": "accuracy",
  "cost_per_instance": 1.19,
  "average_runtime": 534.0,
  "full_archive": "https://results.eval.all-hands.dev/...",
  "tags": ["swe-bench"],
  "agent_version": "v1.8.3",
  "submission_time": "2026-01-26T16:02:48.428351+00:00",
  "eval_visualization_page": "https://laminar.sh/shared/evals/..."
}

Validation

  • ✅ All 66 existing tests pass
  • ✅ Schema validation passes for all 28 existing result files
  • ✅ Backwards compatible (existing files without the field remain valid; a sketch of this check follows)
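As a concrete illustration of the backwards-compatibility claim, the following hypothetical check validates a pre-existing entry against the ScoreEntry sketch above (the legacy_entry dict and variable names are illustrative, not from the repository):

legacy_entry = {
    "benchmark": "swe-bench",
    "score": 74.2,
    "metric": "accuracy",
    "cost_per_instance": 1.19,
    "average_runtime": 534.0,
    "full_archive": "https://results.eval.all-hands.dev/...",
    "tags": ["swe-bench"],
    "agent_version": "v1.8.3",
    "submission_time": "2026-01-26T16:02:48.428351+00:00",
}

# Validation succeeds without the new field, which falls back to None.
entry = ScoreEntry(**legacy_entry)
assert entry.eval_visualization_page is None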


@github-actions bot commented Feb 9, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  11 models × 5 benchmarks = 55 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ 100.0%
  Complete: 55 / 55 pairs
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 28
  Passed: 28
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

@neubig marked this pull request as ready for review on February 9, 2026 at 14:25.
@neubig merged commit a8b914b into main on Feb 9, 2026 (1 check passed).
@neubig deleted the add-eval-visualization-page branch on Feb 9, 2026 at 15:01.