Conversation

@neubig (Contributor) commented Feb 9, 2026

Summary

This PR adds eval_visualization_page as an optional string field to the ScoreEntry schema for scores.json files.

Changes

  • Added an eval_visualization_page: Optional[str] field to the ScoreEntry class in scripts/validate_schema.py
  • The field is optional (defaults to None) and is described as "URL to the evaluation visualization page" (see the sketch after this list)
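For orientation, here is a minimal sketch of what the updated model might look like, assuming ScoreEntry is a Pydantic model; the surrounding field names are inferred from the example entry below, and only the eval_visualization_page field reflects this PR's change:

from typing import Optional

from pydantic import BaseModel, Field


class ScoreEntry(BaseModel):
    benchmark: str
    score: float
    metric: str
    cost_per_instance: float
    average_runtime: float
    full_archive: str
    tags: list[str]
    agent_version: str
    submission_time: str
    # New in this PR: optional link to the evaluation visualization page.
    # Defaults to None so existing scores.json files remain valid.
    eval_visualization_page: Optional[str] = Field(
        default=None,
        description="URL to the evaluation visualization page",
    )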

Example usage

After this change, score entries can optionally include:

{
  "benchmark": "swe-bench",
  "score": 74.2,
  "metric": "accuracy",
  "cost_per_instance": 1.19,
  "average_runtime": 534.0,
  "full_archive": "https://results.eval.all-hands.dev/...",
  "tags": ["swe-bench"],
  "agent_version": "v1.8.3",
  "submission_time": "2026-01-26T16:02:48.428351+00:00",
  "eval_visualization_page": "https://laminar.sh/shared/evals/..."
}

Validation

  • ✅ All 66 existing tests pass
  • ✅ Schema validation passes for all 28 existing result files
  • ✅ Backwards compatible (existing files without the field remain valid; a sketch of this check follows)
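As a concrete illustration of the backwards-compatibility claim, the following hypothetical check validates a pre-existing entry against the ScoreEntry sketch above (the legacy_entry dict and variable names are illustrative, not from the repository):

legacy_entry = {
    "benchmark": "swe-bench",
    "score": 74.2,
    "metric": "accuracy",
    "cost_per_instance": 1.19,
    "average_runtime": 534.0,
    "full_archive": "https://results.eval.all-hands.dev/...",
    "tags": ["swe-bench"],
    "agent_version": "v1.8.3",
    "submission_time": "2026-01-26T16:02:48.428351+00:00",
}

# Validation succeeds without the new field, which falls back to None.
entry = ScoreEntry(**legacy_entry)
assert entry.eval_visualization_page is None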


@github-actions bot commented Feb 9, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  11 models × 5 benchmarks = 55 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ 100.0%
  Complete: 55 / 55 pairs
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 28
  Passed: 28
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

@neubig marked this pull request as ready for review on February 9, 2026 at 14:25.
@neubig merged commit a8b914b into main on Feb 9, 2026 (1 check passed).
@neubig deleted the add-eval-visualization-page branch on Feb 9, 2026 at 15:01.