
feat: split base image from SDK layer for faster SWE-bench rebuilds #550

Open

simonrosenberg wants to merge 4 commits into main from feat/prebuilt-base-images

Conversation

@simonrosenberg
Collaborator

Summary

  • New build_base_images.py script builds base images (base-image-minimal stage) independently
  • New --use-prebuilt-bases flag in build_images.py uses pre-built bases to skip base stage
  • Modified GHA workflow adds optional Phase 1 (base images) before Phase 2 (agent images)
  • SDK submodule updated to include Dockerfile ARG support (SOURCE_MINIMAL_BASE)

Motivation

When the SDK changes but SWE-bench instances stay the same (the common case), all ~433 images currently rebuild from scratch (~5-6 hours). With pre-built bases, only the SDK layer needs rebuilding:

| Scenario | Current (with ARG fix) | With base split |
|---|---|---|
| Same SDK (all cached in GHCR) | 3-12 min | 3-12 min |
| New SDK, same SWE-bench instances | ~5-6 hours | ~30-45 min |
| New SDK + new SWE-bench base | ~5-6 hours | ~5-6 hours |

How it works

  1. Phase 1: build_base_images.py builds ghcr.io/openhands/eval-base:{custom_tag} for each SWE-bench instance using --target base-image-minimal from the SDK Dockerfile
  2. Phase 2: build_images.py --use-prebuilt-bases passes pre-built base refs through BuildOptions.prebuilt_base → SDK Dockerfile's SOURCE_MINIMAL_BASE ARG, skipping the base stage entirely
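The ARG indirection in step 2 follows a standard BuildKit pattern, sketched below. This is illustrative only: the base image name and the stage layout here are assumptions, not the actual SDK Dockerfile. What matters is that BuildKit only evaluates stages reachable from the build target, so overriding `SOURCE_MINIMAL_BASE` with a registry ref makes it pull that image and never build the local stage.

```dockerfile
# Sketch of the SOURCE_MINIMAL_BASE pattern (layout is assumed, not the real SDK Dockerfile)
ARG BASE_IMAGE=ubuntu:22.04
# Default points at the local stage name, so the stage is built in-place.
ARG SOURCE_MINIMAL_BASE=base-image-minimal

FROM ${BASE_IMAGE} AS base-image-minimal
# SDK-independent setup (apt-get/npm layers) lives here.
RUN apt-get update && apt-get install -y --no-install-recommends curl

# When SOURCE_MINIMAL_BASE is overridden with a registry ref
# (e.g. ghcr.io/openhands/eval-base:<custom_tag>), BuildKit pulls that
# image and skips the base-image-minimal stage above entirely.
FROM ${SOURCE_MINIMAL_BASE} AS agent-server
```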

Dependencies

Test plan

  • Trigger workflow with use-prebuilt-bases=true and n-limit=100 to validate two-phase build
  • Compare build time against equivalent single-phase build
  • Verify built images work correctly (run a few SWE-bench evaluations)

Closes #538

🤖 Generated with Claude Code

Add two-phase build support for SWE-bench images:

Phase 1: Build pre-built base images (base-image-minimal stage)
- New script: benchmarks/swebench/build_base_images.py
- Tags as ghcr.io/openhands/eval-base:{custom_tag}
- SDK-independent, only rebuilds when upstream SWE-bench image changes

Phase 2: Build agent images using pre-built bases
- New --use-prebuilt-bases flag in build_images.py
- Passes prebuilt_base through build_utils → SDK BuildOptions
- SDK Dockerfile's SOURCE_MINIMAL_BASE ARG skips base-image-minimal

Expected impact: new SDK with same instances goes from ~5-6h to ~30-45min.

Depends on: OpenHands/software-agent-sdk#2542

Closes #538

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@all-hands-bot left a comment


🟡 Acceptable - Pragmatic solution to a real problem (5-6 hour builds → 30-45 min), but show me the receipts.


def _resolve(base_image: str) -> str | None:
    custom_tag = extract_custom_tag(base_image)
    tag = base_image_tag(custom_tag, base_image_registry)

🟡 Suggestion: You're doing N remote registry checks here (one per base image). Could batch-fetch all available base tags once instead of checking existence one-by-one.

BUT - this is premature optimization. The current approach is simple, debuggable, and the registry checks are fast enough. Don't over-engineer until you measure this as a bottleneck. Good taste is knowing when to stop optimizing.
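If the per-image checks ever do become a measured bottleneck, the batch variant the reviewer suggests could look roughly like this. All names here are hypothetical: `available_tags` stands in for the result of a single tag-list request against the registry, and `make_tag` for the existing `extract_custom_tag`/`base_image_tag` composition.

```python
def resolve_prebuilt_bases(base_images, available_tags, make_tag):
    """Map each base image to its pre-built tag, or None if absent.

    `available_tags` is fetched once (one tag-list call against the
    registry) instead of issuing one existence check per image.
    """
    available = set(available_tags)
    resolved = {}
    for image in base_images:
        tag = make_tag(image)
        # None means: no pre-built base, fall back to a full build.
        resolved[image] = tag if tag in available else None
    return resolved
```

Same `None`-fallback contract as the current `_resolve`, just N existence checks collapsed into one fetch.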


- name: "Phase 1: Build and push base images (pre-built base mode)"
  if: ${{ env.USE_PREBUILT_BASES == 'true' }}
  run: |

🟢 Acceptable: Phase 1 failures don't block Phase 2. If some base images fail to build here, Phase 2's _resolve() function returns None and falls back to full builds.

This is pragmatic fault-tolerance - the workflow doesn't become fragile if a few base images hit transient errors. Good engineering.
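The fault-tolerance shape being praised here can be sketched as follows (function and parameter names are assumptions for illustration, not the actual script's API): per-instance failures are logged and collected, never re-raised, so one transient error cannot fail the whole Phase 1 job.

```python
import logging

def build_base_images(instances, build_one, logger=logging.getLogger(__name__)):
    """Phase 1 sketch: a failed base build is logged, not fatal.

    Phase 2 later finds no tag for a failed instance, its resolver
    returns None, and that instance falls back to a full build.
    """
    failed = []
    for inst in instances:
        try:
            build_one(inst)
        except Exception as exc:  # e.g. transient registry/build errors
            logger.warning("base build failed for %s: %s", inst, exc)
            failed.append(inst)
    return failed
```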

Comment on lines +1 to +17
#!/usr/bin/env python3
"""
Build pre-built base images for SWE-Bench evaluation.

Base images contain everything from the SWE-bench upstream image through
apt-get/npm setup (the Dockerfile ``base-image-minimal`` stage). They are
SDK-independent and only need rebuilding when the upstream SWE-bench image
or the Dockerfile's base layers change.

Once base images exist in the registry, the agent-server build
(build_images.py --use-prebuilt-bases) can skip the base-image-minimal
stage entirely, reducing per-image build time from ~154s to ~5-10s.

Example:
    uv run benchmarks/swebench/build_base_images.py \
        --dataset princeton-nlp/SWE-bench_Verified --split test --push
"""

🔴 Critical - Evidence Required: Your test plan has empty checkboxes. Where's the proof this works?

I need to see:

  1. Actual command output from running Phase 1 base image builds
  2. Actual command output from running Phase 2 with --use-prebuilt-bases
  3. Build time comparison showing the claimed 5-6 hours → 30-45 min improvement

Add an Evidence section to the PR description with terminal output proving you actually ran this end-to-end. "Test plan" with unchecked boxes isn't evidence - it's a TODO list.

Example format:

## Evidence

### Phase 1: Base image build
$ uv run benchmarks/swebench/build_base_images.py --dataset princeton-nlp/SWE-bench_Verified --n-limit 5 --push
[paste actual output]

### Phase 2: Agent build with pre-built bases  
$ uv run benchmarks/swebench/build_images.py --dataset princeton-nlp/SWE-bench_Verified --n-limit 5 --use-prebuilt-bases --push
[paste actual output]

### Build time comparison
Before: X minutes
After: Y minutes


# Check registry first
if remote_image_exists(tag):
    logger.info("Base image %s already exists. Skipping.", tag)

🟢 Acceptable: This remote_image_exists check serves a different purpose than the one in build_images.py. Here it's "skip if already built" (idempotency). In Phase 2 it's "check if pre-built base exists before using it" (validation).

Not redundant - correctly handling two different scenarios.

Debug Agent and others added 3 commits March 20, 2026 18:05
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Telemetry from first run showed each image spending 67-109s on
cache_export and 72-168s on image_export. With pre-built bases the
builder stage is already cached from the first image, so writing
cache back to the registry is pure overhead.

Set OPENHANDS_BUILDKIT_CACHE_MODE=off when USE_PREBUILT_BASES=true.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
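The cache-mode change in that commit amounts to roughly the following (a sketch: the helper name and the non-"off" default value are assumptions; only the flag names come from the commit message).

```python
import os

def buildkit_cache_mode(env=None):
    """With pre-built bases the builder stage is cached after the first
    image, so exporting cache back to the registry is pure overhead."""
    env = os.environ if env is None else env
    if env.get("USE_PREBUILT_BASES") == "true":
        return "off"
    # "registry" default is assumed, not taken from the actual workflow.
    return env.get("OPENHANDS_BUILDKIT_CACHE_MODE", "registry")
```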


Development

Successfully merging this pull request may close these issues.

Split base image from SDK layer to avoid full 9h+ rebuilds on SDK changes
