
Conversation

@neubig neubig commented Jan 17, 2026

Summary

This PR updates the directory naming convention from the date-based YYYYMM_{model_name} format to a version-based {version}_{model_name} format (e.g., v1.8.3_claude-4.5-sonnet).

Fixes #209

Changes

Directory Renaming

  • Renamed all result directories from the date-based format (e.g., 202511_claude-4.5-sonnet) to the version-based format (e.g., v1.8.3_claude-4.5-sonnet); a migration sketch follows this list
  • Updated directory_name field in all metadata.json files to match the new directory names
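
As a rough illustration, the bulk rename could be done with a short script like the sketch below. This is not the actual migration code; the results/ layout (one metadata.json per result directory) and the "v" prefix on agent_version are assumptions inferred from the examples in this PR.

import json
import os

RESULTS_DIR = "results"  # assumed layout: results/<directory>/metadata.json

for old_name in os.listdir(RESULTS_DIR):
    meta_path = os.path.join(RESULTS_DIR, old_name, "metadata.json")
    if not os.path.isfile(meta_path):
        continue  # skip stray files that are not result directories
    with open(meta_path) as f:
        metadata = json.load(f)
    # Build the new name from the metadata fields; assumes agent_version
    # already carries the "v" prefix (e.g., "v1.8.3").
    new_name = f"{metadata['agent_version']}_{metadata['model']}"
    if new_name != old_name:
        metadata["directory_name"] = new_name
        with open(meta_path, "w") as f:
            json.dump(metadata, f, indent=2)
        os.rename(os.path.join(RESULTS_DIR, old_name),
                  os.path.join(RESULTS_DIR, new_name))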

Documentation

  • Updated README.md with the new naming convention:
    • Changed directory structure examples
    • Updated metadata.json examples
    • Updated "Adding New Results" section with new format

Validation

  • Added validation for the directory_name format in validate_schema.py (a sketch of the check follows this list):
    • Validates that directory_name follows the {version}_{model_name} pattern
    • Validates that directory_name matches the agent_version and model fields
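
For illustration, the check might look roughly like the sketch below. The helper name validate_directory_name, the regex, and the exact metadata field handling are assumptions based on this PR's description, not the actual validate_schema.py code.

import re

# Matches e.g. "v1.8.3_claude-4.5-sonnet": a semantic version with a "v"
# prefix, an underscore, then the model name.
DIRECTORY_NAME_RE = re.compile(r"^(v\d+\.\d+\.\d+)_(.+)$")

def validate_directory_name(directory_name, metadata):
    """Return a list of validation errors (empty if the name is valid)."""
    errors = []
    match = DIRECTORY_NAME_RE.match(directory_name)
    if match is None:
        errors.append(
            f"{directory_name!r} does not follow the "
            "{version}_{model_name} pattern"
        )
        return errors
    version, model = match.groups()
    # The name must agree with the metadata; assumes agent_version stores
    # the "v"-prefixed version string.
    if version != metadata.get("agent_version"):
        errors.append(f"version {version!r} does not match agent_version")
    if model != metadata.get("model"):
        errors.append(f"model {model!r} does not match the model field")
    return errors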

Tests

  • Updated existing tests to use the new naming convention
  • Added new tests for directory naming validation (sketched after this list):
    • test_valid_directory_name_format: Validates correct format passes
    • test_invalid_directory_name_date_format: Validates old date format fails
    • test_invalid_directory_name_mismatch: Validates mismatched version/model fails
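
Under the same assumptions as the validation sketch above (the hypothetical validate_directory_name helper and import path), the three tests might look like:

# Plain pytest-style tests; the import path below is hypothetical.
from validate_schema import validate_directory_name

def test_valid_directory_name_format():
    metadata = {"agent_version": "v1.8.3", "model": "claude-4.5-sonnet"}
    assert validate_directory_name("v1.8.3_claude-4.5-sonnet", metadata) == []

def test_invalid_directory_name_date_format():
    metadata = {"agent_version": "v1.8.3", "model": "claude-4.5-sonnet"}
    # The old date-based name must be rejected.
    assert validate_directory_name("202511_claude-4.5-sonnet", metadata)

def test_invalid_directory_name_mismatch():
    metadata = {"agent_version": "v1.9.0", "model": "claude-4.5-sonnet"}
    # The version in the name disagrees with agent_version.
    assert validate_directory_name("v1.8.3_claude-4.5-sonnet", metadata)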

Testing

All 36 tests pass:

pytest tests/ -v
========== 36 passed in 0.14s ==========

Schema validation passes for all result directories:

python scripts/validate_schema.py
============================================================
Schema Validation Report
============================================================
Files validated: 28
  Passed: 28
  Failed: 0
============================================================
VALIDATION PASSED
============================================================

This change updates the directory naming convention from YYYYMM_model_name
to {version}_{model_name} format (e.g., v1.8.3_claude-4.5-sonnet).

Changes:
- Renamed all result directories to use version numbers instead of dates
- Updated directory_name field in all metadata.json files
- Updated README.md documentation with new naming convention
- Added validation for directory_name format in validate_schema.py
- Updated tests to use new naming convention
- Added tests for directory naming validation

Fixes #209

Co-authored-by: openhands <[email protected]>
@github-actions

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: 3D array of benchmarks × models × metrics
  6 benchmarks × 10 models × 3 metrics = 180 cells

Model Coverage: 90.0% (9/10)
  Expected: ['claude-4.5-opus', 'claude-4.5-sonnet', 'gemini-3-pro', 'gemini-3-flash', 'gpt-5.2-high-reasoning', 'gpt-5.2', 'kimi-k2-thinking', 'minimax-m2', 'deepseek-v3.2-reasoner', 'qwen-3-coder']
  Found: ['claude-4.5-opus', 'claude-4.5-sonnet', 'deepseek-v3.2-reasoner', 'gemini-3-flash', 'gemini-3-pro', 'gpt-5.2', 'gpt-5.2-high-reasoning', 'kimi-k2-thinking', 'qwen-3-coder']
  Covered: ['claude-4.5-opus', 'claude-4.5-sonnet', 'deepseek-v3.2-reasoner', 'gemini-3-flash', 'gemini-3-pro', 'gpt-5.2', 'gpt-5.2-high-reasoning', 'kimi-k2-thinking', 'qwen-3-coder']

Benchmark Coverage: 66.67% (4/6)
  Expected: ['swe-bench', 'swe-bench-multimodal', 'multi-swe-bench', 'swt-bench', 'commit0', 'gaia']
  Found: ['commit0', 'gaia', 'swe-bench', 'swe-bench-multimodal']
  Covered: ['commit0', 'gaia', 'swe-bench', 'swe-bench-multimodal']

Metric Coverage: 100.0% (3/3)
  Expected: ['accuracy', 'cost_per_instance', 'average_runtime']
  Found: ['accuracy', 'average_runtime', 'cost_per_instance']
  Covered: ['accuracy', 'average_runtime', 'cost_per_instance']

3D Array Coverage: 35.0%
  Filled cells: 63 / 180

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬜⬜⬜⬜⬜⬜⬜ 35.0%
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 28
  Passed: 28
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

This workflow runs on:
- Push to main (when results/ directory changes)
- Pull requests to main (when results/ directory changes)

It validates:
- Directory names follow {version}_{model_name} format
- metadata.json contains a valid semantic version (a sketch of this check follows below)
- directory_name matches agent_version and model fields

Co-authored-by: openhands <[email protected]>
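
As a rough sketch of the semantic-version check described above (the agent_version field name, the "v" prefix, and the regex are assumptions, not the actual workflow code):

import json
import re

SEMVER_RE = re.compile(r"^v\d+\.\d+\.\d+$")  # assumes a "v" prefix

def metadata_has_valid_version(path):
    # Load metadata.json and confirm agent_version is a semantic version.
    with open(path) as f:
        metadata = json.load(f)
    return bool(SEMVER_RE.match(metadata.get("agent_version", "")))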

@neubig neubig merged commit a491ae5 into main Jan 17, 2026
1 check passed

Development

Successfully merging this pull request may close this issue:

In the directory names holding the results, replace the dates with version numbers
