feat: split base image from SDK layer for faster SWE-bench rebuilds #550
simonrosenberg wants to merge 4 commits into main
Conversation
Add two-phase build support for SWE-bench images:
Phase 1: Build pre-built base images (base-image-minimal stage)
- New script: benchmarks/swebench/build_base_images.py
- Tags as ghcr.io/openhands/eval-base:{custom_tag}
- SDK-independent, only rebuilds when upstream SWE-bench image changes
Phase 2: Build agent images using pre-built bases
- New --use-prebuilt-bases flag in build_images.py
- Passes prebuilt_base through build_utils → SDK BuildOptions
- SDK Dockerfile's SOURCE_MINIMAL_BASE ARG skips base-image-minimal
Expected impact: new SDK with same instances goes from ~5-6h to ~30-45min.
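As a sketch of the naming contract the two phases share: the helper names `extract_custom_tag` and `base_image_tag` appear in the diff, but the bodies below are assumptions, not the PR's actual implementation.

```python
# Hypothetical bodies for the tagging helpers the two phases share;
# only the names come from the PR, the logic below is illustrative.
def extract_custom_tag(base_image: str) -> str:
    """Turn an upstream SWE-bench image ref into a registry-safe tag."""
    return base_image.rsplit(":", 1)[0].replace("/", "_")


def base_image_tag(custom_tag: str, registry: str = "ghcr.io/openhands/eval-base") -> str:
    """Full ref that Phase 1 pushes and Phase 2 resolves."""
    return f"{registry}:{custom_tag}"
```

Whatever the real bodies look like, both phases must apply the same mapping, since Phase 2 reconstructs the tag from the instance image rather than receiving it from Phase 1.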
Depends on: OpenHands/software-agent-sdk#2542
Closes #538
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
all-hands-bot left a comment
🟡 Acceptable - Pragmatic solution to a real problem (5-6 hour builds → 30-45 min), but show me the receipts.
def _resolve(base_image: str) -> str | None:
    custom_tag = extract_custom_tag(base_image)
    tag = base_image_tag(custom_tag, base_image_registry)
🟡 Suggestion: You're doing N remote registry checks here (one per base image). Could batch-fetch all available base tags once instead of checking existence one-by-one.
BUT - this is premature optimization. The current approach is simple, debuggable, and the registry checks are fast enough. Don't over-engineer until you measure this as a bottleneck. Good taste is knowing when to stop optimizing.
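For reference, the batched variant the comment suggests could look roughly like this. It is a sketch only: `tag_for` stands in for the PR's `extract_custom_tag`/`base_image_tag` pair, and `available_tags` would come from a single registry tag-list call rather than N existence checks.

```python
# Sketch of the batched alternative: fetch every available base tag
# once, then resolve each image locally against that set.
def resolve_all(base_images, available_tags, tag_for):
    available = set(available_tags)
    return {
        img: (tag_for(img) if tag_for(img) in available else None)
        for img in base_images
    }
```

The trade-off the reviewer names still holds: this saves round-trips but adds a second code path to keep consistent with the per-image check.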
- name: "Phase 1: Build and push base images (pre-built base mode)"
  if: ${{ env.USE_PREBUILT_BASES == 'true' }}
  run: |
🟢 Acceptable: Phase 1 failures don't block Phase 2. If some base images fail to build here, Phase 2's _resolve() function returns None and falls back to full builds.
This is pragmatic fault-tolerance - the workflow doesn't become fragile if a few base images hit transient errors. Good engineering.
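The fall-back behavior described here can be sketched as follows; the function and field names are illustrative, not the PR's actual API.

```python
# Sketch of the fault tolerance described above: if Phase 1 never
# produced a base for this instance, Phase 2 does a full build instead.
def choose_build(instance_image, resolve):
    prebuilt = resolve(instance_image)
    if prebuilt is not None:
        return {"prebuilt_base": prebuilt, "full_build": False}
    # _resolve returned None: base missing or Phase 1 failed transiently.
    return {"prebuilt_base": None, "full_build": True}
```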
#!/usr/bin/env python3
"""
Build pre-built base images for SWE-Bench evaluation.

Base images contain everything from the SWE-bench upstream image through
apt-get/npm setup (the Dockerfile ``base-image-minimal`` stage). They are
SDK-independent and only need rebuilding when the upstream SWE-bench image
or the Dockerfile's base layers change.

Once base images exist in the registry, the agent-server build
(build_images.py --use-prebuilt-bases) can skip the base-image-minimal
stage entirely, reducing per-image build time from ~154s to ~5-10s.

Example:
    uv run benchmarks/swebench/build_base_images.py \
        --dataset princeton-nlp/SWE-bench_Verified --split test --push
"""
🔴 Critical - Evidence Required: Your test plan has empty checkboxes. Where's the proof this works?
I need to see:
- Actual command output from running Phase 1 base image builds
- Actual command output from running Phase 2 with --use-prebuilt-bases
- Build time comparison showing the claimed 5-6 hours → 30-45 min improvement

Add an Evidence section to the PR description with terminal output proving you actually ran this end-to-end. A "Test plan" with unchecked boxes isn't evidence - it's a TODO list.
Example format:
## Evidence
### Phase 1: Base image build
$ uv run benchmarks/swebench/build_base_images.py --dataset princeton-nlp/SWE-bench_Verified --n-limit 5 --push
[paste actual output]
### Phase 2: Agent build with pre-built bases
$ uv run benchmarks/swebench/build_images.py --dataset princeton-nlp/SWE-bench_Verified --n-limit 5 --use-prebuilt-bases --push
[paste actual output]
### Build time comparison
Before: X minutes
After: Y minutes
# Check registry first
if remote_image_exists(tag):
    logger.info("Base image %s already exists. Skipping.", tag)
🟢 Acceptable: This remote_image_exists check serves a different purpose than the one in build_images.py. Here it's "skip if already built" (idempotency). In Phase 2 it's "check if pre-built base exists before using it" (validation).
Not redundant - correctly handling two different scenarios.
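The two roles of the existence check can be put side by side in a sketch; `exists` stands in for the real `remote_image_exists` call and the function names are hypothetical.

```python
# Same registry check, two different questions.
def phase1_should_build(tag, exists):
    # Idempotency: don't rebuild a base that is already pushed.
    return not exists(tag)


def phase2_base_or_none(tag, exists):
    # Validation: only hand the SDK a pre-built base that really exists.
    return tag if exists(tag) else None
```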
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Telemetry from the first run showed each image spending 67-109s on cache_export and 72-168s on image_export. With pre-built bases the builder stage is already cached from the first image, so writing cache back to the registry is pure overhead. Set OPENHANDS_BUILDKIT_CACHE_MODE=off when USE_PREBUILT_BASES=true.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
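A minimal sketch of that toggle: the environment-variable names come from the commit, but the surrounding logic and returned flag strings are assumptions.

```python
import os


def buildkit_cache_flags(env=None):
    """When building on pre-built bases, registry cache export is pure
    overhead, so disable it; otherwise keep max-mode cache export.
    Flag strings here are illustrative, not the PR's actual builder args."""
    env = os.environ if env is None else env
    if env.get("USE_PREBUILT_BASES") == "true":
        env["OPENHANDS_BUILDKIT_CACHE_MODE"] = "off"
        return []  # no --cache-to at all
    return ["--cache-to", "type=registry,mode=max"]
```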
Summary
- build_base_images.py script builds base images (base-image-minimal stage) independently
- --use-prebuilt-bases flag in build_images.py uses pre-built bases to skip the base stage (SOURCE_MINIMAL_BASE)

Motivation
When the SDK changes but SWE-bench instances stay the same (the common case), all ~433 images currently rebuild from scratch (~5-6 hours). With pre-built bases, only the SDK layer needs rebuilding:
How it works
1. build_base_images.py builds ghcr.io/openhands/eval-base:{custom_tag} for each SWE-bench instance using --target base-image-minimal from the SDK Dockerfile
2. build_images.py --use-prebuilt-bases passes pre-built base refs through BuildOptions.prebuilt_base → SDK Dockerfile's SOURCE_MINIMAL_BASE ARG, skipping the base stage entirely

Dependencies
- OpenHands/software-agent-sdk#2542
Test plan
- [ ] Run workflow with use-prebuilt-bases=true and n-limit=100 to validate two-phase build

Closes #538
🤖 Generated with Claude Code