Refactor the Docker image build system from being SWE-bench specific to supporting any dataset that implements the ImageBuildableDataset protocol.

Key changes:
- Add ImageBuildableDataset protocol to dataset registry with get_image_build_specs() classmethod
- Add DerivedImageSpec and ImageBuildSpec union type to image_spec module
- Support tuple entry point format (factory_fn, dataset_class) in registry_base for protocol discovery
- Refactor build_images.py to use registry-based dataset discovery instead of importing SWE-bench directly
- Add get_image_build_specs() classmethod to SwebenchDataset
- Update dataset entry points to use tuple format
- Add SeederFn support in BuildImageSpec for pre-build setup
- Add DerivedImageSpec handling in ImageCache

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
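As a rough illustration of the protocol-based opt-in described above, the sketch below shows how a dataset class might advertise image build specs. Only the names `ImageBuildableDataset`, `get_image_build_specs()`, `BuildImageSpec`, `DerivedImageSpec`, and `ImageBuildSpec` come from this PR; the fields, the `config` parameter type, the example dataset classes, and the `image_buildable()` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Protocol, Union


@dataclass(frozen=True)
class BuildImageSpec:
    """Hypothetical stand-in: an image built from a build context."""
    tag: str
    context_dir: str


@dataclass(frozen=True)
class DerivedImageSpec:
    """Hypothetical stand-in: an image layered on an already-built base."""
    tag: str
    base_image_tag: str


# Union of the spec kinds a builder accepts (the real union may have more members).
ImageBuildSpec = Union[BuildImageSpec, DerivedImageSpec]


class ImageBuildableDataset(Protocol):
    """Datasets opt in by providing this classmethod (signature assumed)."""

    @classmethod
    def get_image_build_specs(cls, config: object) -> list[ImageBuildSpec]: ...


class ExampleImageDataset:
    """Hypothetical dataset that opts into image building."""

    @classmethod
    def get_image_build_specs(cls, config: object) -> list[ImageBuildSpec]:
        base = BuildImageSpec(tag="example-base:latest", context_dir="./base")
        pair = DerivedImageSpec(tag="example-pair:latest", base_image_tag=base.tag)
        return [base, pair]


class ExamplePlainDataset:
    """Hypothetical dataset with no image build support."""


def image_buildable(classes: list[type]) -> list[type]:
    # Structural check: keep classes that expose a callable get_image_build_specs.
    return [c for c in classes if callable(getattr(c, "get_image_build_specs", None))]


buildable = image_buildable([ExampleImageDataset, ExamplePlainDataset])
specs = ExampleImageDataset.get_image_build_specs(config=None)
```

Because the check is structural, a dataset does not need to inherit from anything to be picked up by the builder; it only needs the classmethod.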
- Remove dead config_overrides parameter from run_build()
- Remove dead except (KeyboardInterrupt, SystemExit, asyncio.CancelledError) re-raise blocks (these inherit from BaseException, not Exception)
- Fix double _should_build check in build_modified_image by calling _do_build directly
- Store per-entry-point errors in registry instead of silently swallowing
- Collapse _load_image_buildable_classes and _ensure_image_buildable_classes_loaded into one function
- Fix BaseRegistry docstring to use generic terminology
- Remove agentdojo_entry alias, register factory directly in pyproject.toml
- Remove unused has_image_spec_factory() from exports and tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
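The reasoning behind removing the re-raise blocks can be checked directly: on Python 3.8 and later, all three exception types derive from `BaseException` rather than `Exception`, so a plain `except Exception` never intercepts them. The helper name `caught_by_except_exception` below is made up for this illustration.

```python
import asyncio


def caught_by_except_exception(exc: BaseException) -> bool:
    """Return True if a plain `except Exception` handler would catch `exc`."""
    try:
        try:
            raise exc
        except Exception:
            return True
    except BaseException:
        # Reached only when the inner handler did not match.
        return False


# None of these are subclasses of Exception, so the inner handler is skipped.
control_flow_types = (KeyboardInterrupt, SystemExit, asyncio.CancelledError)
results = {t.__name__: caught_by_except_exception(t()) for t in control_flow_types}

# An ordinary error, for contrast, is caught.
results["ValueError"] = caught_by_except_exception(ValueError("boom"))
```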
Eliminate the parallel entry-point loading system in datasets/registry.py by extending BaseRegistry to store component classes from tuple entries. This removes 4 module-level globals and ~35 lines of duplicated loading logic.

Also replaces print() with proper logging, fixes a bare except that swallowed errors, removes dead code, and updates stale docstrings.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
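A minimal sketch of the idea, assuming a much simpler registry than the real `BaseRegistry`: one loading path stores the factory, keeps the component class when the entry is a tuple, and records load failures per entry point instead of swallowing them. `MiniRegistry`, its method names, and the loader callables are hypothetical; only the attribute name `failed_entry_points` appears in this PR.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


class MiniRegistry:
    """Hypothetical, simplified stand-in for a registry base class."""

    def __init__(self) -> None:
        self._factories: dict[str, Callable[..., Any]] = {}
        self._component_classes: dict[str, type] = {}
        self.failed_entry_points: dict[str, Exception] = {}

    def load(self, name: str, loader: Callable[[], Any]) -> None:
        """Load one entry point; `loader` plays the role of EntryPoint.load()."""
        try:
            entry = loader()
        except Exception as e:
            # Keep the error so callers can report it later.
            self.failed_entry_points[name] = e
            logger.warning("Failed to load entry point %r", name, exc_info=e)
            return
        if isinstance(entry, tuple):
            # Assumed layout: factory first, component class last.
            factory, component_class = entry[0], entry[-1]
            self._component_classes[name] = component_class
        else:
            factory = entry
        self._factories[name] = factory

    def get_component_class(self, name: str) -> Optional[type]:
        return self._component_classes.get(name)


def _broken_loader() -> Any:
    raise ImportError("simulated import failure")


registry = MiniRegistry()
registry.load("with_class", lambda: (dict, int))   # tuple entry
registry.load("plain", lambda: dict)               # plain callable entry
registry.load("broken", _broken_loader)            # failing entry
```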
…component_class)

All dataset entry points now use a consistent 3-tuple format instead of a mix of plain callables and 2-tuples. This makes the config class explicit in the tuple rather than requiring signature inspection for tuple entries, and eliminates the inconsistency between agentdojo (plain factory) and swebench (2-tuple).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…Spec, fix error handling

- Add ComponentEntryPoint NamedTuple for structured entry point definitions and update dataset entry points to use it (plain tuples still supported)
- Fix docstring claiming 2-tuple when code uses 3-tuple
- Re-raise in outer except block instead of silently swallowing errors
- Add StringConstraints(min_length=1) and tag != base_image_tag validator to DerivedImageSpec
- Generalize DerivedImageSpec docstring to remove SWE-bench-specific language
- Split conflated error messages in get_image_build_specs
- Add exc_info=e to error logging in build_images.py
- Rename _should_build to _prepare_for_build with side-effect documentation
- Remove redundant comments in image_cache.py
- Add ImageBuildableDataset protocol docstring note about BaseModel parameter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
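The structured entry point could look roughly like the sketch below. The name `ComponentEntryPoint` is from this commit; the field names, their order, and the `normalize_entry()` helper are assumptions based on the 3-tuple description, not the actual definition.

```python
from typing import Any, Callable, NamedTuple


class ComponentEntryPoint(NamedTuple):
    """Assumed field layout: factory, config class, component class."""
    factory_fn: Callable[..., Any]
    config_class: type
    component_class: type


def normalize_entry(entry: Any) -> ComponentEntryPoint:
    """Accept a ComponentEntryPoint or a plain 3-tuple (hypothetical helper)."""
    if isinstance(entry, ComponentEntryPoint):
        return entry
    if isinstance(entry, tuple) and len(entry) == 3:
        return ComponentEntryPoint(*entry)
    raise TypeError(f"expected a 3-tuple entry point, got {entry!r}")


# Placeholder factory and classes, purely for illustration.
def make_dataset(config: dict) -> list:
    return list(config)


named = ComponentEntryPoint(make_dataset, dict, list)
from_plain = normalize_entry((make_dataset, dict, list))

# A NamedTuple still unpacks like a plain tuple, so existing code keeps working.
factory_fn, config_class, component_class = named
```

Since a `NamedTuple` is itself a tuple, code that indexes or unpacks entries positionally does not need to change when entry points migrate to the named form.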
…alidators and logging

Split build and push try blocks in build_all_specs so push failures no longer add tags to failed_base_tags, which was causing derived images to be skipped with misleading "base image failed to build" messages.

Add model validator to MultiStageBuildImageSpec enforcing final_tag matches last stage tag and stages is non-empty. Expose failed_entry_points on BaseRegistry and log warnings in get_datasets_with_image_specs for actionable --all-datasets output. Fix stale _should_build reference in _do_build docstring.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
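The build/push split can be sketched as follows. This is a simplified stand-in rather than the real `build_all_specs`: the function name, its parameters, the tag values, and the fake `build`/`push` callables are all hypothetical. The point is that only a build failure marks a base tag as failed, so a push failure no longer causes derived images to be skipped.

```python
import logging

logger = logging.getLogger(__name__)


def build_all_specs_sketch(base_tags, derived_specs, build, push):
    """derived_specs is a list of (tag, base_tag) pairs; build/push take a tag."""
    failed_base_tags = set()
    failed_push_tags = set()
    for tag in base_tags:
        try:
            build(tag)
        except Exception as e:
            logger.error("Build failed for %s", tag, exc_info=e)
            failed_base_tags.add(tag)
            continue  # nothing to push if the build failed
        try:
            push(tag)
        except Exception as e:
            # The image exists locally; record the push failure separately.
            logger.error("Push failed for %s", tag, exc_info=e)
            failed_push_tags.add(tag)

    built, skipped = [], []
    for tag, base_tag in derived_specs:
        if base_tag in failed_base_tags:
            logger.warning("Skipping %s: base image %s failed to build", tag, base_tag)
            skipped.append(tag)
            continue
        build(tag)
        built.append(tag)
    return {
        "failed_base_tags": failed_base_tags,
        "failed_push_tags": failed_push_tags,
        "built_derived": built,
        "skipped_derived": skipped,
    }


def fake_build(tag):
    if tag == "bad-base":
        raise RuntimeError("simulated build failure")


def fake_push(tag):
    if tag == "good-base":
        raise RuntimeError("simulated registry outage")


result = build_all_specs_sketch(
    base_tags=["good-base", "bad-base"],
    derived_specs=[("derived-a", "good-base"), ("derived-b", "bad-base")],
    build=fake_build,
    push=fake_push,
)
```

In this run `good-base` builds but fails to push, so `derived-a` is still built; only `derived-b`, whose base actually failed to build, is skipped.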
Summary
- Refactor `build_images.py` from a SWE-bench-specific script into a generic image builder that works with any dataset implementing the new `ImageBuildableDataset` protocol
- Add the `ImageBuildableDataset` protocol to the dataset registry; datasets opt into image building by implementing a `get_image_build_specs()` classmethod
- Support the tuple entry point format `(factory_fn, dataset_class)` in `registry_base.py` so the registry can discover both the factory and the class at load time
- Add `DerivedImageSpec` for images built on top of a base (e.g., SWE-bench pair images), the `ImageBuildSpec` union type, and `SeederFn` support for pre-build setup on `BuildImageSpec` (needed for the upcoming browser dataset)
- Add a `get_image_build_specs()` classmethod to `SwebenchDataset` and expose `swebench_entry` / `agentdojo_entry` tuples
- Update `ImageCache` to handle `DerivedImageSpec`
- The CLI takes `--dataset <name>` / `--all-datasets` instead of being hardcoded to SWE-bench

Test plan
- `uv run pytest -vx -m "not docker_integration"` passes
- `uv run ruff check` and `uv run ruff format --check` pass
- `prompt-siren-build-images --dataset swebench` still builds SWE-bench images correctly
- `prompt-siren-build-images --all-datasets` discovers swebench as the only image-buildable dataset

We cannot run the last two yet while we figure out what registry to use to push images.
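For readers unfamiliar with the two selection modes, here is a minimal sketch of how `--dataset <name>` and `--all-datasets` could be wired up. It assumes an `argparse`-based CLI in which the two flags are mutually exclusive and one is required; the PR does not state either point, and `make_parser()` and `select_datasets()` are hypothetical helpers.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="prompt-siren-build-images")
    # Assumption: exactly one of the two selection flags must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--dataset", metavar="NAME", help="build images for one dataset")
    group.add_argument("--all-datasets", action="store_true",
                       help="build images for every image-buildable dataset")
    return parser


def select_datasets(args: argparse.Namespace, image_buildable: set[str]) -> list[str]:
    """Resolve the flags to dataset names (hypothetical helper)."""
    if args.all_datasets:
        return sorted(image_buildable)
    if args.dataset not in image_buildable:
        raise ValueError(f"dataset {args.dataset!r} does not provide image build specs")
    return [args.dataset]


# Stand-in for what registry discovery would report.
discovered = {"swebench"}

single = select_datasets(make_parser().parse_args(["--dataset", "swebench"]), discovered)
everything = select_datasets(make_parser().parse_args(["--all-datasets"]), discovered)
```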