
Conversation

@neubig neubig commented Jan 17, 2026

Summary

This PR updates the directory naming convention from the date-based YYYYMM_{model_name} format to a version-based {version}_{model_name} format (e.g., v1.8.3_claude-4.5-sonnet).

Fixes #209

Changes

Directory Renaming

  • Renamed all result directories from the date-based format (e.g., 202511_claude-4.5-sonnet) to the version-based format (e.g., v1.8.3_claude-4.5-sonnet); a migration sketch follows this list
  • Updated directory_name field in all metadata.json files to match the new directory names
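
As a rough illustration, the bulk rename could be done with a short script like the sketch below. This is not the actual migration code; the results/ layout (one metadata.json per result directory) and the "v" prefix on agent_version are assumptions inferred from the examples in this PR.

import json
import os

RESULTS_DIR = "results"  # assumed layout: results/<directory>/metadata.json

for old_name in os.listdir(RESULTS_DIR):
    meta_path = os.path.join(RESULTS_DIR, old_name, "metadata.json")
    if not os.path.isfile(meta_path):
        continue  # skip stray files that are not result directories
    with open(meta_path) as f:
        metadata = json.load(f)
    # Build the new name from the metadata fields; assumes agent_version
    # already carries the "v" prefix (e.g., "v1.8.3").
    new_name = f"{metadata['agent_version']}_{metadata['model']}"
    if new_name != old_name:
        metadata["directory_name"] = new_name
        with open(meta_path, "w") as f:
            json.dump(metadata, f, indent=2)
        os.rename(os.path.join(RESULTS_DIR, old_name),
                  os.path.join(RESULTS_DIR, new_name))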

Documentation

  • Updated README.md with the new naming convention:
    • Changed directory structure examples
    • Updated metadata.json examples
    • Updated "Adding New Results" section with new format

Validation

  • Added validation for the directory_name format in validate_schema.py (a sketch of the check follows this list):
    • Validates that directory_name follows the {version}_{model_name} pattern
    • Validates that directory_name matches the agent_version and model fields
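
For illustration, the check might look roughly like the sketch below. The helper name validate_directory_name, the regex, and the exact metadata field handling are assumptions based on this PR's description, not the actual validate_schema.py code.

import re

# Matches e.g. "v1.8.3_claude-4.5-sonnet": a semantic version with a "v"
# prefix, an underscore, then the model name.
DIRECTORY_NAME_RE = re.compile(r"^(v\d+\.\d+\.\d+)_(.+)$")

def validate_directory_name(directory_name, metadata):
    """Return a list of validation errors (empty if the name is valid)."""
    errors = []
    match = DIRECTORY_NAME_RE.match(directory_name)
    if match is None:
        errors.append(
            f"{directory_name!r} does not follow the "
            "{version}_{model_name} pattern"
        )
        return errors
    version, model = match.groups()
    # The name must agree with the metadata; assumes agent_version stores
    # the "v"-prefixed version string.
    if version != metadata.get("agent_version"):
        errors.append(f"version {version!r} does not match agent_version")
    if model != metadata.get("model"):
        errors.append(f"model {model!r} does not match the model field")
    return errors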

Tests

  • Updated existing tests to use the new naming convention
  • Added new tests for directory naming validation (sketched after this list):
    • test_valid_directory_name_format: Validates correct format passes
    • test_invalid_directory_name_date_format: Validates old date format fails
    • test_invalid_directory_name_mismatch: Validates mismatched version/model fails
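
Under the same assumptions as the validation sketch above (the hypothetical validate_directory_name helper and import path), the three tests might look like:

# Plain pytest-style tests; the import path below is hypothetical.
from validate_schema import validate_directory_name

def test_valid_directory_name_format():
    metadata = {"agent_version": "v1.8.3", "model": "claude-4.5-sonnet"}
    assert validate_directory_name("v1.8.3_claude-4.5-sonnet", metadata) == []

def test_invalid_directory_name_date_format():
    metadata = {"agent_version": "v1.8.3", "model": "claude-4.5-sonnet"}
    # The old date-based name must be rejected.
    assert validate_directory_name("202511_claude-4.5-sonnet", metadata)

def test_invalid_directory_name_mismatch():
    metadata = {"agent_version": "v1.9.0", "model": "claude-4.5-sonnet"}
    # The version in the name disagrees with agent_version.
    assert validate_directory_name("v1.8.3_claude-4.5-sonnet", metadata)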

Testing

All 36 tests pass:

pytest tests/ -v
========== 36 passed in 0.14s ==========

Schema validation passes for all result directories:

python scripts/validate_schema.py
============================================================
Schema Validation Report
============================================================
Files validated: 28
  Passed: 28
  Failed: 0
============================================================
VALIDATION PASSED
============================================================

This change updates the directory naming convention from YYYYMM_model_name
to {version}_{model_name} format (e.g., v1.8.3_claude-4.5-sonnet).

Changes:
- Renamed all result directories to use version numbers instead of dates
- Updated directory_name field in all metadata.json files
- Updated README.md documentation with new naming convention
- Added validation for directory_name format in validate_schema.py
- Updated tests to use new naming convention
- Added tests for directory naming validation

Fixes #209

Co-authored-by: openhands <[email protected]>
@github-actions

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: 3D array of benchmarks × models × metrics
  6 benchmarks × 10 models × 3 metrics = 180 cells

Model Coverage: 90.0% (9/10)
  Expected: ['claude-4.5-opus', 'claude-4.5-sonnet', 'gemini-3-pro', 'gemini-3-flash', 'gpt-5.2-high-reasoning', 'gpt-5.2', 'kimi-k2-thinking', 'minimax-m2', 'deepseek-v3.2-reasoner', 'qwen-3-coder']
  Found: ['claude-4.5-opus', 'claude-4.5-sonnet', 'deepseek-v3.2-reasoner', 'gemini-3-flash', 'gemini-3-pro', 'gpt-5.2', 'gpt-5.2-high-reasoning', 'kimi-k2-thinking', 'qwen-3-coder']
  Covered: ['claude-4.5-opus', 'claude-4.5-sonnet', 'deepseek-v3.2-reasoner', 'gemini-3-flash', 'gemini-3-pro', 'gpt-5.2', 'gpt-5.2-high-reasoning', 'kimi-k2-thinking', 'qwen-3-coder']

Benchmark Coverage: 66.67% (4/6)
  Expected: ['swe-bench', 'swe-bench-multimodal', 'multi-swe-bench', 'swt-bench', 'commit0', 'gaia']
  Found: ['commit0', 'gaia', 'swe-bench', 'swe-bench-multimodal']
  Covered: ['commit0', 'gaia', 'swe-bench', 'swe-bench-multimodal']

Metric Coverage: 100.0% (3/3)
  Expected: ['accuracy', 'cost_per_instance', 'average_runtime']
  Found: ['accuracy', 'average_runtime', 'cost_per_instance']
  Covered: ['accuracy', 'average_runtime', 'cost_per_instance']

3D Array Coverage: 35.0%
  Filled cells: 63 / 180

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬜⬜⬜⬜⬜⬜⬜ 35.0%
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 28
  Passed: 28
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

This workflow runs on:
- Push to main (when results/ directory changes)
- Pull requests to main (when results/ directory changes)

It validates:
- Directory names follow {version}_{model_name} format
- metadata.json contains a valid semantic version (a sketch of this check follows below)
- directory_name matches agent_version and model fields

Co-authored-by: openhands <[email protected]>
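
As a rough sketch of the semantic-version check described above (the agent_version field name, the "v" prefix, and the regex are assumptions, not the actual workflow code):

import json
import re

SEMVER_RE = re.compile(r"^v\d+\.\d+\.\d+$")  # assumes a "v" prefix

def metadata_has_valid_version(path):
    # Load metadata.json and confirm agent_version is a semantic version.
    with open(path) as f:
        metadata = json.load(f)
    return bool(SEMVER_RE.match(metadata.get("agent_version", "")))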

@neubig neubig merged commit a491ae5 into main Jan 17, 2026
1 check passed

Development

Successfully merging this pull request may close this issue:

In the directory names holding the results, replace the dates with version numbers
