This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
- Always source the virtual environment before running Python commands: `source .venv/bin/activate`
- This project uses UV as the package manager, not pip
- Dependency management: When adding dependencies with UV, use `>=` constraints (e.g., `uv add "package>=1.2.3"`)
  - Exception: `inspect-ai` must remain pinned to a specific version for stability
  - Use the latest stable version as the minimum to keep dependencies healthy and secure
  - Check latest versions with `uv pip list --outdated` or on PyPI
```bash
# Initial setup
uv venv && uv sync --dev
source .venv/bin/activate
```

```bash
# Run all unit tests
pytest
# Run integration tests (requires API keys)
pytest -m integration
# Run specific test file
pytest tests/test_registry.py
# Run with coverage
pytest --cov=openbench
```

```bash
# Format code
ruff format .
# Lint code
ruff check .
# Type checking
mypy .
# Run all pre-commit hooks
pre-commit run --all-files
```

```bash
# List available benchmarks
bench list
# Describe a specific benchmark
bench describe mmlu
# Run evaluation
bench eval mmlu --model groq/llama-3.1-70b --limit 10
# View previous results
bench view
```

```bash
# Install core dependencies only (runs most benchmarks)
uv sync
# Install with specific benchmark dependencies
uv sync --group scicode          # For SciCode benchmark
uv sync --group jsonschemabench  # For JSONSchemaBench
# Install with development tools
uv sync --dev
# Install everything
uv sync --all-groups
```

```bash
# Build the package
uv build
# Publish to PyPI (requires PyPI API token)
uv publish
```

- `src/openbench/` - Main package directory
  - `_cli/` - CLI commands implementation
  - `datasets/` - Dataset loaders (MMLU, GPQA, HumanEval, SimpleQA)
  - `evals/` - Benchmark implementations built on Inspect AI
  - `metrics/` - Custom scoring metrics
  - `scorers/` - Scoring functions used across benchmarks
  - `utils/` - Shared utilities
  - `_registry.py` - Dynamic task loading system
  - `config.py` - Benchmark metadata and configuration
- Registry-based loading: Benchmarks are dynamically loaded via `_registry.py`
- Inspect AI framework: All evaluations extend Inspect AI's task/solver pattern
- Provider-agnostic: Uses Inspect AI's model abstraction for 15+ providers
- Shared components: Common scorers and utilities reduce code duplication
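The registry pattern above can be sketched in plain Python. This is a hypothetical illustration of how a name-to-task mapping might work; the real `_registry.py` may differ in naming and details:

```python
# Hypothetical sketch of a registry-based task loader.
# Names here are invented; openbench's actual _registry.py may differ.
from typing import Callable, Dict

# Maps a benchmark name to a zero-argument factory that builds its task.
_TASK_REGISTRY: Dict[str, Callable[[], dict]] = {}


def register_task(name: str):
    """Decorator that records a task factory under a benchmark name."""
    def decorator(factory: Callable[[], dict]):
        _TASK_REGISTRY[name] = factory
        return factory
    return decorator


def load_task(name: str) -> dict:
    """Look up and instantiate a registered task by name."""
    try:
        return _TASK_REGISTRY[name]()
    except KeyError:
        raise ValueError(f"Unknown benchmark: {name!r}") from None


@register_task("mmlu")
def mmlu_task() -> dict:
    # Placeholder payload standing in for a real Inspect AI Task.
    return {"name": "mmlu", "solver": "multiple_choice"}
```

With this shape, `bench eval mmlu` only needs to call `load_task("mmlu")`, and adding a benchmark is a matter of registering one more factory.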
openbench uses a tightly coupled architecture where benchmarks share common infrastructure:
```
┌─────────────┐      ┌──────────────┐      ┌──────────────┐
│   EVALS     │ ---> │   DATASETS   │ <--> │   SCORERS    │
│ (19 files)  │      │  (shared)    │      │  (shared)    │
└─────────────┘      └──────────────┘      └──────────────┘
```
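To illustrate the sharing shown in the diagram, two evals can score samples through one shared function. The eval and scorer names below are invented for the sketch, not openbench's actual implementations:

```python
# Illustrative only: two hypothetical evals reusing one shared scorer,
# as in the EVALS -> SCORERS relationship above.
def exact_match_scorer(prediction: str, target: str) -> float:
    """Shared scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == target.strip().lower() else 0.0


def score_mmlu_sample(prediction: str, target: str) -> float:
    # MMLU-style eval delegating to the shared scorer.
    return exact_match_scorer(prediction, target)


def score_gpqa_sample(prediction: str, target: str) -> float:
    # A second eval reusing the same scorer, avoiding duplication.
    return exact_match_scorer(prediction, target)
```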
Core Dependencies (in main dependencies):
- `inspect-ai`: Required by all benchmarks
- `scipy`, `numpy`: Used by multiple scorers (DROP, MMLU, HealthBench)
- `datasets`, `typer`, `groq`, `openai`: Core infrastructure
Optional Dependencies (in dependency-groups):
- `scicode`: Only needed for the SciCode benchmark
- `jsonschemabench`: Only needed for JSONSchemaBench
Most benchmarks (17/19) can run with just core dependencies. Only specialized benchmarks require their specific dependency group.
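This split corresponds to `[dependency-groups]` tables in `pyproject.toml`. A hedged sketch of the shape, where the package names and version numbers are illustrative placeholders, not the project's actual pins:

```toml
[project]
dependencies = [
    "inspect-ai==0.3.0",   # illustrative pin; see the pinning exception above
    "scipy>=1.11.0",       # illustrative versions
    "numpy>=1.26.0",
]

[dependency-groups]
scicode = ["some-scicode-dep>=1.0.0"]     # illustrative
jsonschemabench = ["jsonschema>=4.0.0"]   # illustrative
dev = ["pytest>=8.0.0", "ruff>=0.4.0", "mypy>=1.10.0"]
```

`uv sync` installs only `[project]` dependencies; `uv sync --group scicode` adds that group on top.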
- Create evaluation file in `src/openbench/evals/`
- Add metadata to `config.py`
- Follow existing patterns for dataset loading and scoring
- Use type hints for all functions
- Follow conventional commits (feat, fix, docs, etc.)
- Line length: 88 characters (configured in ruff)
- Test new features with both unit and integration tests
- IMPORTANT: All pre-commit hooks MUST pass before committing
- Run `pre-commit run --all-files` to check all hooks
- The configured hooks are:
  - `ruff check --fix` - Linting with auto-fix
  - `ruff format` - Code formatting
  - `mypy .` - Type checking
- If any hook fails, fix the issues and re-run until all pass
- Run