feat: split base image from SDK layer for faster SWE-bench rebuilds #550
simonrosenberg wants to merge 4 commits into main
Conversation
Add two-phase build support for SWE-bench images:
Phase 1: Build pre-built base images (base-image-minimal stage)
- New script: benchmarks/swebench/build_base_images.py
- Tags as ghcr.io/openhands/eval-base:{custom_tag}
- SDK-independent, only rebuilds when upstream SWE-bench image changes
Phase 2: Build agent images using pre-built bases
- New --use-prebuilt-bases flag in build_images.py
- Passes prebuilt_base through build_utils → SDK BuildOptions
- SDK Dockerfile's SOURCE_MINIMAL_BASE ARG skips base-image-minimal
Expected impact: new SDK with same instances goes from ~5-6h to ~30-45min.
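As a sketch of the naming contract the two phases share: the helper names `extract_custom_tag` and `base_image_tag` appear in the diff, but the bodies below are assumptions, not the PR's actual implementation.

```python
# Hypothetical bodies for the tagging helpers the two phases share;
# only the names come from the PR, the logic below is illustrative.
def extract_custom_tag(base_image: str) -> str:
    """Turn an upstream SWE-bench image ref into a registry-safe tag."""
    return base_image.rsplit(":", 1)[0].replace("/", "_")


def base_image_tag(custom_tag: str, registry: str = "ghcr.io/openhands/eval-base") -> str:
    """Full ref that Phase 1 pushes and Phase 2 resolves."""
    return f"{registry}:{custom_tag}"
```

Whatever the real bodies look like, both phases must apply the same mapping, since Phase 2 reconstructs the tag from the instance image rather than receiving it from Phase 1.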
Depends on: OpenHands/software-agent-sdk#2542
Closes #538
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
all-hands-bot left a comment
🟡 Acceptable - Pragmatic solution to a real problem (5-6 hour builds → 30-45 min), but show me the receipts.
def _resolve(base_image: str) -> str | None:
    custom_tag = extract_custom_tag(base_image)
    tag = base_image_tag(custom_tag, base_image_registry)
🟡 Suggestion: You're doing N remote registry checks here (one per base image). Could batch-fetch all available base tags once instead of checking existence one-by-one.
BUT - this is premature optimization. The current approach is simple, debuggable, and the registry checks are fast enough. Don't over-engineer until you measure this as a bottleneck. Good taste is knowing when to stop optimizing.
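For reference, the batched variant the comment suggests could look roughly like this. It is a sketch only: `tag_for` stands in for the PR's `extract_custom_tag`/`base_image_tag` pair, and `available_tags` would come from a single registry tag-list call rather than N existence checks.

```python
# Sketch of the batched alternative: fetch every available base tag
# once, then resolve each image locally against that set.
def resolve_all(base_images, available_tags, tag_for):
    available = set(available_tags)
    return {
        img: (tag_for(img) if tag_for(img) in available else None)
        for img in base_images
    }
```

The trade-off the reviewer names still holds: this saves round-trips but adds a second code path to keep consistent with the per-image check.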
- name: "Phase 1: Build and push base images (pre-built base mode)"
  if: ${{ env.USE_PREBUILT_BASES == 'true' }}
  run: |
🟢 Acceptable: Phase 1 failures don't block Phase 2. If some base images fail to build here, Phase 2's _resolve() function returns None and falls back to full builds.
This is pragmatic fault-tolerance - the workflow doesn't become fragile if a few base images hit transient errors. Good engineering.
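The fall-back behavior described here can be sketched as follows; the function and field names are illustrative, not the PR's actual API.

```python
# Sketch of the fault tolerance described above: if Phase 1 never
# produced a base for this instance, Phase 2 does a full build instead.
def choose_build(instance_image, resolve):
    prebuilt = resolve(instance_image)
    if prebuilt is not None:
        return {"prebuilt_base": prebuilt, "full_build": False}
    # _resolve returned None: base missing or Phase 1 failed transiently.
    return {"prebuilt_base": None, "full_build": True}
```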
#!/usr/bin/env python3
"""
Build pre-built base images for SWE-Bench evaluation.

Base images contain everything from the SWE-bench upstream image through
apt-get/npm setup (the Dockerfile ``base-image-minimal`` stage). They are
SDK-independent and only need rebuilding when the upstream SWE-bench image
or the Dockerfile's base layers change.

Once base images exist in the registry, the agent-server build
(build_images.py --use-prebuilt-bases) can skip the base-image-minimal
stage entirely, reducing per-image build time from ~154s to ~5-10s.

Example:
    uv run benchmarks/swebench/build_base_images.py \
        --dataset princeton-nlp/SWE-bench_Verified --split test --push
"""
🔴 Critical - Evidence Required: Your test plan has empty checkboxes. Where's the proof this works?
I need to see:
- Actual command output from running Phase 1 base image builds
- Actual command output from running Phase 2 with --use-prebuilt-bases
- Build time comparison showing the claimed 5-6 hours → 30-45 min improvement

Add an Evidence section to the PR description with terminal output proving you actually ran this end-to-end. A "Test plan" with unchecked boxes isn't evidence - it's a TODO list.
Example format:
## Evidence
### Phase 1: Base image build
$ uv run benchmarks/swebench/build_base_images.py --dataset princeton-nlp/SWE-bench_Verified --n-limit 5 --push
[paste actual output]
### Phase 2: Agent build with pre-built bases
$ uv run benchmarks/swebench/build_images.py --dataset princeton-nlp/SWE-bench_Verified --n-limit 5 --use-prebuilt-bases --push
[paste actual output]
### Build time comparison
Before: X minutes
After: Y minutes
# Check registry first
if remote_image_exists(tag):
    logger.info("Base image %s already exists. Skipping.", tag)
🟢 Acceptable: This remote_image_exists check serves a different purpose than the one in build_images.py. Here it's "skip if already built" (idempotency). In Phase 2 it's "check if pre-built base exists before using it" (validation).
Not redundant - correctly handling two different scenarios.
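The two roles of the existence check can be put side by side in a sketch; `exists` stands in for the real `remote_image_exists` call and the function names are hypothetical.

```python
# Same registry check, two different questions.
def phase1_should_build(tag, exists):
    # Idempotency: don't rebuild a base that is already pushed.
    return not exists(tag)


def phase2_base_or_none(tag, exists):
    # Validation: only hand the SDK a pre-built base that really exists.
    return tag if exists(tag) else None
```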
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Telemetry from the first run showed each image spending 67-109s on cache_export and 72-168s on image_export. With pre-built bases the builder stage is already cached from the first image, so writing cache back to the registry is pure overhead. Set OPENHANDS_BUILDKIT_CACHE_MODE=off when USE_PREBUILT_BASES=true.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
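A minimal sketch of that toggle: the environment-variable names come from the commit, but the surrounding logic and returned flag strings are assumptions.

```python
import os


def buildkit_cache_flags(env=None):
    """When building on pre-built bases, registry cache export is pure
    overhead, so disable it; otherwise keep max-mode cache export.
    Flag strings here are illustrative, not the PR's actual builder args."""
    env = os.environ if env is None else env
    if env.get("USE_PREBUILT_BASES") == "true":
        env["OPENHANDS_BUILDKIT_CACHE_MODE"] = "off"
        return []  # no --cache-to at all
    return ["--cache-to", "type=registry,mode=max"]
```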
Summary
- build_base_images.py script builds base images (base-image-minimal stage) independently
- --use-prebuilt-bases flag in build_images.py uses pre-built bases to skip the base stage (SOURCE_MINIMAL_BASE)

Motivation
When the SDK changes but SWE-bench instances stay the same (the common case), all ~433 images currently rebuild from scratch (~5-6 hours). With pre-built bases, only the SDK layer needs rebuilding:
How it works
1. build_base_images.py builds ghcr.io/openhands/eval-base:{custom_tag} for each SWE-bench instance using --target base-image-minimal from the SDK Dockerfile
2. build_images.py --use-prebuilt-bases passes pre-built base refs through BuildOptions.prebuilt_base → SDK Dockerfile's SOURCE_MINIMAL_BASE ARG, skipping the base stage entirely

Dependencies
- OpenHands/software-agent-sdk#2542
Test plan
- [ ] Run workflow with use-prebuilt-bases=true and n-limit=100 to validate two-phase build

Closes #538
🤖 Generated with Claude Code