This guide covers conventions and patterns for adding new benchmarks or modifying existing ones.
Each benchmark lives in its own folder under `benchmarks/`:

```
benchmarks/<benchmark_name>/
├── README.md         # Benchmark documentation
├── __init__.py
├── config.py         # Default configurations (INFER_DEFAULTS, EVAL_DEFAULTS)
├── run_infer.py      # Inference entrypoint
├── eval_infer.py     # Evaluation entrypoint
├── build_images.py   # Docker image building (if needed)
└── prompts/          # Prompt templates (optional; not all benchmarks use this)
```
One benchmark per folder. For similar benchmarks (e.g., SWE-bench and SWE-bench MultiModal), it is preferable to duplicate code rather than merge them.
The `benchmark_name` should be lowercase only, with no dashes or underscores; for example, SWE-bench MultiModal becomes `swebenchmultimodal`.
When adding a new benchmark, include these files as appropriate for your benchmark:
- `run_infer.py`:
  - Implements an `Evaluation` subclass with:
    - `prepare_instances()` → returns list of `EvalInstance`
    - `prepare_workspace(instance)` → returns `RemoteWorkspace`
    - `evaluate_instance(instance, workspace)` → returns `EvalOutput`
  - Has a `main()` function as entrypoint (a minimal sketch follows this list)
  - Uses `get_parser()` from `benchmarks.utils.args_parser`
  - Uses the `EvalMetadata` model for configuration
  - Handles both `docker` and `remote` workspace types
- `eval_infer.py`:
  - Converts inference output to the benchmark's evaluation format
  - Runs the evaluation harness
  - Generates a cost report
- `build_images.py`:
  - Builds Docker images for evaluation (only needed if using Docker for evaluation)
  - Supports a `--push` flag to push images to the registry
  - Handles parallel builds with `--max-workers`
- `README.md`:
  - Brief description of the benchmark
  - Setup instructions
  - Usage examples with command-line invocation
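The skeleton below shows how these pieces typically fit together in `run_infer.py`. It is a minimal sketch: the `get_parser()` import matches this guide, but the module paths for `Evaluation`, `EvalInstance`, `EvalOutput`, `EvalMetadata`, and `RemoteWorkspace`, the `MyBenchmarkEvaluation` name, and the final `run()` call are assumptions — copy the real imports and runner invocation from an existing benchmark such as `benchmarks/swebench`.

```python
# run_infer.py — illustrative skeleton only, not a drop-in implementation.
from benchmarks.utils.args_parser import get_parser  # stated in this guide

# NOTE: assumed module paths; check an existing benchmark for the real ones.
from benchmarks.utils.evaluation import Evaluation
from benchmarks.utils.models import EvalInstance, EvalMetadata, EvalOutput
from benchmarks.utils.workspace import RemoteWorkspace


class MyBenchmarkEvaluation(Evaluation):
    """Hypothetical Evaluation subclass for a benchmark named `mybenchmark`."""

    def prepare_instances(self) -> list[EvalInstance]:
        # Load the dataset and convert each row into an EvalInstance.
        ...

    def prepare_workspace(self, instance: EvalInstance) -> RemoteWorkspace:
        # Start a docker or remote workspace for this instance.
        ...

    def evaluate_instance(
        self, instance: EvalInstance, workspace: RemoteWorkspace
    ) -> EvalOutput:
        # Run the agent on the instance inside the workspace and collect output.
        ...


def main() -> None:
    args = get_parser().parse_args()
    metadata = EvalMetadata(**vars(args))   # assumed: CLI args map onto EvalMetadata fields
    MyBenchmarkEvaluation(metadata).run()   # assumed constructor and runner method


if __name__ == "__main__":
    main()
```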
Register entrypoints in `pyproject.toml` under `[project.scripts]`:

```toml
[project.scripts]
<benchmark>-infer = "benchmarks.<benchmark>.run_infer:main"
<benchmark>-eval = "benchmarks.<benchmark>.eval_infer:main"
```

Use the pattern `<benchmark>-<command>` (e.g., `swebench-infer`, `multiswebench-infer`).
LLM configs are stored in `.llm_config/` as JSON files matching the `LLM` class schema:

```json
{
  "model": "litellm_proxy/anthropic/claude-sonnet-4-20250514",
  "base_url": "https://llm-proxy.eval.all-hands.dev",
  "api_key": "YOUR_API_KEY"
}
```

Validate with: `uv run validate-cfg .llm_config/your-config.json`
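As a purely illustrative sketch, the JSON fields are intended to map directly onto the `LLM` class's fields, so a config can be read with the standard library and passed along as keyword arguments; the commented-out constructor call below is an assumption about the SDK, and `validate-cfg` remains the authoritative check.

```python
# Illustrative only: read an LLM config; constructing the actual LLM object
# depends on the SDK, so that line is left commented out as an assumption.
import json
from pathlib import Path

config = json.loads(Path(".llm_config/your-config.json").read_text())
assert {"model", "base_url", "api_key"} <= config.keys()  # fields from the example above
# llm = LLM(**config)  # assumed constructor; import LLM from wherever the SDK defines it
```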
Naming conventions:
- Benchmark names: lowercase only, with no dashes or underscores (e.g., `swebench`, `multiswebench`, `swebenchmultimodal`)
- Benchmark Python package names: follow the benchmark name and stay lowercase only (e.g., `benchmarks.swebench`, `benchmarks.multiswebench`)
- Classes: PascalCase (e.g., `SWEbenchEvaluation`)
- Functions/methods: snake_case (e.g., `prepare_instances`)
- CLI arguments: kebab-case (e.g., `--n-limit`)
- Environment variables: UPPER_SNAKE_CASE
Error handling:
- Fail fast on unrecoverable errors: Raise exceptions rather than logging warnings when the error prevents evaluation.
- Be lenient with recoverable errors: A recoverable error (e.g., a single instance failing) should be logged but not crash the entire evaluation run.
- Example pattern:
```python
for instance in instances:
    try:
        result = evaluate_instance(instance)
        results.append(result)
    except RecoverableError as e:
        logger.warning(f"Instance {instance.id} failed: {e}")
        continue  # Skip this instance, continue with others
    except UnrecoverableError:
        raise  # Crash the run
```

When adding a new benchmark, add tests to `tests/` following the pattern `test_<benchmark>_<feature>.py`:
```
tests/
├── test_<benchmark>_run_infer.py       # Tests for run_infer logic
├── test_<benchmark>_eval_infer.py      # Tests for eval_infer logic
└── test_<benchmark>_build_images.py    # Tests for image building (if applicable)
```
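As a rough illustration of this naming pattern (the imported module, class, and constructor below are hypothetical placeholders, not the repository's actual API), a `run_infer` test might look like:

```python
# tests/test_mybenchmark_run_infer.py — hypothetical example; swap in the real
# module, class, and constructor from your benchmark.
from benchmarks.mybenchmark.run_infer import MyBenchmarkEvaluation


def test_prepare_instances_returns_nonempty_list():
    evaluation = MyBenchmarkEvaluation.from_defaults()  # assumed helper constructor
    instances = evaluation.prepare_instances()

    assert instances, "expected at least one EvalInstance"
    assert all(instance.id for instance in instances)  # `instance.id` as used above
```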
When submitting changes:
- Minimal changes: Modify `utils/` only if your change benefits multiple existing benchmarks. For single-benchmark utilities, keep them in the benchmark's own directory.
- Describe all changes: List every file changed and why.
- Test locally: Run `uv run pytest` before submitting.
- Update documentation: Update the benchmark's README.md if adding new features.
- Run `uv run pre-commit run --files <changed_files>` before committing
- Follow existing patterns in the codebase
- Use type hints for function parameters and return values