This guide covers conventions and patterns for adding new benchmarks or modifying existing ones.
Each benchmark lives in its own folder under `benchmarks/`:

```
benchmarks/<benchmark_name>/
├── README.md         # Benchmark documentation
├── __init__.py
├── config.py         # Default configurations (INFER_DEFAULTS, EVAL_DEFAULTS)
├── run_infer.py      # Inference entrypoint
├── eval_infer.py     # Evaluation entrypoint
├── build_images.py   # Docker image building (if needed)
└── prompts/          # Prompt templates (optional; not all benchmarks use this)
```
One benchmark per folder. For similar benchmarks (e.g., SWE-bench and SWE-bench MultiModal), it is preferable to duplicate code rather than merge them.
The `benchmark_name` should be lowercase only, with no dashes or underscores; for example, SWE-bench MultiModal becomes `swebenchmultimodal`.
When adding a new benchmark, include these files as appropriate for your benchmark:
- `run_infer.py`:
  - Implements an `Evaluation` subclass with:
    - `prepare_instances()` → returns list of `EvalInstance`
    - `prepare_workspace(instance)` → returns `RemoteWorkspace`
    - `evaluate_instance(instance, workspace)` → returns `EvalOutput`
  - Has a `main()` function as entrypoint (a minimal sketch follows this list)
  - Uses `get_parser()` from `benchmarks.utils.args_parser`
  - Uses the `EvalMetadata` model for configuration
  - Handles both `docker` and `remote` workspace types
- `eval_infer.py`:
  - Converts inference output to the benchmark's evaluation format
  - Runs the evaluation harness
  - Generates a cost report
- `build_images.py`:
  - Builds Docker images for evaluation (only needed if using Docker for evaluation)
  - Supports a `--push` flag to push images to the registry
  - Handles parallel builds with `--max-workers`
- `README.md`:
  - Brief description of the benchmark
  - Setup instructions
  - Usage examples with command-line invocation
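The skeleton below shows how these pieces typically fit together in `run_infer.py`. It is a minimal sketch: the `get_parser()` import matches this guide, but the module paths for `Evaluation`, `EvalInstance`, `EvalOutput`, `EvalMetadata`, and `RemoteWorkspace`, the `MyBenchmarkEvaluation` name, and the final `run()` call are assumptions — copy the real imports and runner invocation from an existing benchmark such as `benchmarks/swebench`.

```python
# run_infer.py — illustrative skeleton only, not a drop-in implementation.
from benchmarks.utils.args_parser import get_parser  # stated in this guide

# NOTE: assumed module paths; check an existing benchmark for the real ones.
from benchmarks.utils.evaluation import Evaluation
from benchmarks.utils.models import EvalInstance, EvalMetadata, EvalOutput
from benchmarks.utils.workspace import RemoteWorkspace


class MyBenchmarkEvaluation(Evaluation):
    """Hypothetical Evaluation subclass for a benchmark named `mybenchmark`."""

    def prepare_instances(self) -> list[EvalInstance]:
        # Load the dataset and convert each row into an EvalInstance.
        ...

    def prepare_workspace(self, instance: EvalInstance) -> RemoteWorkspace:
        # Start a docker or remote workspace for this instance.
        ...

    def evaluate_instance(
        self, instance: EvalInstance, workspace: RemoteWorkspace
    ) -> EvalOutput:
        # Run the agent on the instance inside the workspace and collect output.
        ...


def main() -> None:
    args = get_parser().parse_args()
    metadata = EvalMetadata(**vars(args))   # assumed: CLI args map onto EvalMetadata fields
    MyBenchmarkEvaluation(metadata).run()   # assumed constructor and runner method


if __name__ == "__main__":
    main()
```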
Register entrypoints in `pyproject.toml` under `[project.scripts]`:

```toml
[project.scripts]
<benchmark>-infer = "benchmarks.<benchmark>.run_infer:main"
<benchmark>-eval = "benchmarks.<benchmark>.eval_infer:main"
```

Use the pattern `<benchmark>-<command>` (e.g., `swebench-infer`, `multiswebench-infer`).
LLM configs are stored in `.llm_config/` as JSON files matching the `LLM` class schema:

```json
{
  "model": "litellm_proxy/anthropic/claude-sonnet-4-20250514",
  "base_url": "https://llm-proxy.eval.all-hands.dev",
  "api_key": "YOUR_API_KEY"
}
```

Validate with: `uv run validate-cfg .llm_config/your-config.json`
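As a purely illustrative sketch, the JSON fields are intended to map directly onto the `LLM` class's fields, so a config can be read with the standard library and passed along as keyword arguments; the commented-out constructor call below is an assumption about the SDK, and `validate-cfg` remains the authoritative check.

```python
# Illustrative only: read an LLM config; constructing the actual LLM object
# depends on the SDK, so that line is left commented out as an assumption.
import json
from pathlib import Path

config = json.loads(Path(".llm_config/your-config.json").read_text())
assert {"model", "base_url", "api_key"} <= config.keys()  # fields from the example above
# llm = LLM(**config)  # assumed constructor; import LLM from wherever the SDK defines it
```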
Naming conventions:
- Benchmark names: lowercase only, with no dashes or underscores (e.g., `swebench`, `multiswebench`, `swebenchmultimodal`)
- Benchmark Python package names: follow the benchmark name and stay lowercase only (e.g., `benchmarks.swebench`, `benchmarks.multiswebench`)
- Classes: PascalCase (e.g., `SWEbenchEvaluation`)
- Functions/methods: snake_case (e.g., `prepare_instances`)
- CLI arguments: kebab-case (e.g., `--n-limit`)
- Environment variables: UPPER_SNAKE_CASE
Error handling:
- Fail fast on unrecoverable errors: Raise exceptions rather than logging warnings when the error prevents evaluation.
- Be lenient with recoverable errors: A recoverable error (e.g., a single instance failing) should be logged but not crash the entire evaluation run.
- Example pattern:
```python
for instance in instances:
    try:
        result = evaluate_instance(instance)
        results.append(result)
    except RecoverableError as e:
        logger.warning(f"Instance {instance.id} failed: {e}")
        continue  # Skip this instance, continue with others
    except UnrecoverableError:
        raise  # Crash the run
```

When adding a new benchmark, add tests to `tests/` following the pattern `test_<benchmark>_<feature>.py`:
```
tests/
├── test_<benchmark>_run_infer.py       # Tests for run_infer logic
├── test_<benchmark>_eval_infer.py      # Tests for eval_infer logic
└── test_<benchmark>_build_images.py    # Tests for image building (if applicable)
```
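As a rough illustration of this naming pattern (the imported module, class, and constructor below are hypothetical placeholders, not the repository's actual API), a `run_infer` test might look like:

```python
# tests/test_mybenchmark_run_infer.py — hypothetical example; swap in the real
# module, class, and constructor from your benchmark.
from benchmarks.mybenchmark.run_infer import MyBenchmarkEvaluation


def test_prepare_instances_returns_nonempty_list():
    evaluation = MyBenchmarkEvaluation.from_defaults()  # assumed helper constructor
    instances = evaluation.prepare_instances()

    assert instances, "expected at least one EvalInstance"
    assert all(instance.id for instance in instances)  # `instance.id` as used above
```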
When submitting changes:
- Minimal changes: Modify `utils/` only if your change benefits multiple existing benchmarks. For single-benchmark utilities, keep them in the benchmark's own directory.
- Describe all changes: List every file changed and why.
- Test locally: Run `uv run pytest` before submitting.
- Update documentation: Update the benchmark's README.md if adding new features.
- Run `uv run pre-commit run --files <changed_files>` before committing
- Follow existing patterns in the codebase
- Use type hints for function parameters and return values