
feat: build lightweight benchmark images by default #549

Merged
simonrosenberg merged 4 commits into main from
feat/lightweight-benchmark-images-v2
Mar 20, 2026

Conversation


@simonrosenberg simonrosenberg commented Mar 20, 2026

Summary

  • Add --agent-type CLI flag to SWE-bench and SWT-bench image build scripts
  • Thread extra_build_args through the full call chain: build_all_images → _build_with_logging → build_image → SDK BuildOptions
  • Both workflows default agent-type to "default", which skips ACP and boto3 installation
  • Can be toggled via workflow dispatch input (agent-type: acp-claude or acp-codex) when ACP images are needed

Changes

  • benchmarks/utils/build_utils.py: LIGHTWEIGHT_BUILD_ARGS / ACP_BUILD_ARGS constants, --agent-type CLI flag, build_args_for_agent_type() helper, extra_build_args parameter threaded through all build functions
  • benchmarks/swebench/build_images.py: Pass lightweight build args to build_all_images
  • benchmarks/swtbench/build_images.py: Same
  • .github/workflows/build-swebench-images.yml: agent-type input (defaults to "default"), passed as the --agent-type flag
  • .github/workflows/build-swtbench-images.yml: Same
  • vendor/software-agent-sdk: Updated to SDK main (62c2e7cf) which includes extra_build_args support (#2541), optional ACP (#2535), and optional boto3 (#2536)
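
As a rough illustration of the constants and helper listed above, the mapping from agent type to build args might look like the sketch below. The build-arg keys (`INSTALL_ACP`, `INSTALL_BOTO3`) and their values are assumptions made for this example; the PR text names only the two constants, the helper, and the `acp-claude` / `acp-codex` agent types, and the review mentions a `startswith()` check.

```python
# Hypothetical build-arg keys: the real names are defined by the SDK, not shown here.
LIGHTWEIGHT_BUILD_ARGS = {"INSTALL_ACP": "false", "INSTALL_BOTO3": "false"}
ACP_BUILD_ARGS = {"INSTALL_ACP": "true", "INSTALL_BOTO3": "true"}


def build_args_for_agent_type(agent_type: str) -> dict[str, str]:
    """Return a copy of the build args that match the given agent type."""
    # Any "acp-*" agent type (acp-claude, acp-codex, ...) keeps ACP installed.
    if agent_type.startswith("acp-"):
        return dict(ACP_BUILD_ARGS)
    # Everything else, including "default", gets the lightweight image.
    return dict(LIGHTWEIGHT_BUILD_ARGS)
```

Returning a copy keeps callers from accidentally editing the module-level constants.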

Expected impact (from analysis of 433-image build logs, #537)

| Dependency skipped | Per-image saving | Cumulative (433 imgs) |
|---|---|---|
| npm ACP + nodejs | ~32s install + ~4s export/push | 4.5h |
| boto3/botocore | ~3s export/push | 0.4h |
| Total | ~36s/image | ~4.9h cumulative |

Wall-clock improvement: ~1.4–2.5h (at 3.5× effective parallelism), plus non-linear savings from reduced disk pressure.
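
The lower end of that wall-clock range follows from dividing the cumulative saving by the effective parallelism. A minimal check of the arithmetic, using only the figures quoted above (the upper end of the range is not derived here):

```python
# Cumulative savings from the figures above, in hours.
acp_hours = 4.5
boto3_hours = 0.4
cumulative_hours = acp_hours + boto3_hours

# Effective parallelism quoted in the PR description.
effective_parallelism = 3.5

wall_clock_hours = cumulative_hours / effective_parallelism
print(f"{wall_clock_hours:.1f}h")  # prints 1.4h
```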

SDK dependencies (all merged to main)

Relation to #548

This is a slimmed-down version of #548 that works with SDK main today. It excludes INSTALL_BROWSER (browser-use optional support hasn't been merged to the SDK yet). Once browser-use lands, a follow-up can add INSTALL_BROWSER: "false" to LIGHTWEIGHT_BUILD_ARGS.

Test plan

  • Run SWT-bench build with --agent-type default on 4–10 images to verify builds succeed
  • Verify image can run a basic SWT-bench evaluation
  • Run with --agent-type acp-claude to verify ACP images still build correctly
  • Full 433-image build to measure actual wall-clock improvement

Closes #537

🤖 Generated with Claude Code

Add --agent-type CLI flag to SWE-bench and SWT-bench image build scripts.
Thread extra_build_args through the full call chain: build_all_images ->
_build_with_logging -> build_image -> SDK BuildOptions.

Both workflows default agent-type to "default", which skips ACP and boto3
installation. Use "acp-claude" or "acp-codex" to keep ACP installed.

Expected per-image saving: ~36s (ACP ~32s + boto3 ~3s), translating to
~4.9h cumulative across 433 images.

Depends on: OpenHands/software-agent-sdk#2541

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@all-hands-bot all-hands-bot left a comment


🟡 Taste Rating: Acceptable - Pragmatic solution to a real problem (4.9h build savings), but parameter defaulting needs cleanup.

Verdict: ✅ Core logic is sound. The config threading is straightforward. Fix the inconsistent defaults and this is good to merge.

Debug Agent and others added 3 commits March 20, 2026 17:06
Update vendor/software-agent-sdk to SDK main now that #2541
(extra_build_args) has been merged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use a consistent default across build_image, _build_with_logging, and
build_all_images instead of defaulting to None and checking later.
Removes the ternary fallback in _build_with_logging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
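
A simplified sketch of what consistent defaulting through the three functions could look like. The signatures, the return values, and the `INSTALL_ACP` key are assumptions for illustration; only the function names and the `extra_build_args` parameter come from this PR.

```python
LIGHTWEIGHT_BUILD_ARGS = {"INSTALL_ACP": "false"}  # hypothetical key


def build_image(base_image, extra_build_args=LIGHTWEIGHT_BUILD_ARGS):
    # Stand-in for handing the args to the SDK's BuildOptions.
    return {"base_image": base_image, "build_args": dict(extra_build_args)}


def _build_with_logging(base_image, extra_build_args=LIGHTWEIGHT_BUILD_ARGS):
    # No "x if x is not None else default" fallback: the default is the same at every level.
    return build_image(base_image, extra_build_args=extra_build_args)


def build_all_images(base_images, extra_build_args=LIGHTWEIGHT_BUILD_ARGS):
    return [
        _build_with_logging(image, extra_build_args=extra_build_args)
        for image in base_images
    ]
```

Because every level shares the same default, calling any of the three functions without `extra_build_args` yields the lightweight configuration.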
The base parser (get_parser) already defines --agent-type. Adding it
again in get_build_parser caused an argparse conflict error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
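
The conflict is easy to reproduce in isolation. In the sketch below, `get_parser` and `get_build_parser_with_duplicate` are simplified stand-ins written for this example, not the repository's actual functions; the point is only that `argparse` rejects a second registration of the same option string.

```python
import argparse


def get_parser():
    # Stand-in for the base parser, which already defines --agent-type.
    parser = argparse.ArgumentParser()
    parser.add_argument("--agent-type", default="default")
    return parser


def get_build_parser_with_duplicate():
    parser = get_parser()
    # Registering the same option string again raises argparse.ArgumentError.
    parser.add_argument("--agent-type", default="default")
    return parser


conflict_message = None
try:
    get_build_parser_with_duplicate()
except argparse.ArgumentError as exc:
    conflict_message = str(exc)

print(conflict_message)
```

Dropping the duplicate `add_argument` call and relying on the base parser's definition avoids the error.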
@simonrosenberg simonrosenberg self-assigned this Mar 20, 2026

@all-hands-bot all-hands-bot left a comment


🟢 Good taste - Pragmatic solution to a real problem (4.9h build savings) with clean execution.

Previous review issues resolved:

  • Consistent defaulting throughout: All functions now default extra_build_args to LIGHTWEIGHT_BUILD_ARGS instead of None
  • Eliminated unnecessary conditionals

Implementation is solid:

  • Simple, straightforward parameter threading
  • Clean constants with clear documentation
  • build_args_for_agent_type() uses startswith() pattern that allows future extensions
  • No special cases, no unnecessary complexity

Minor observation (not blocking):
Mutable dict as default argument is technically a Python code smell, but it's pragmatic here since:

  • Functions don't mutate, just pass through to SDK
  • Constants use SCREAMING_SNAKE_CASE signaling immutability
  • Making them immutable would add complexity without solving a real problem
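
For reference, if immutability were ever wanted, one standard-library option is a read-only mapping view. This is a sketch of that alternative and not something this PR does; the `INSTALL_ACP` key and the simplified `build_image` signature are hypothetical.

```python
from types import MappingProxyType

# Read-only view over the underlying dict; the key name is hypothetical.
LIGHTWEIGHT_BUILD_ARGS = MappingProxyType({"INSTALL_ACP": "false"})


def build_image(extra_build_args=LIGHTWEIGHT_BUILD_ARGS):
    # Copy into a plain dict before passing the args downstream.
    return dict(extra_build_args)


mutation_blocked = False
try:
    LIGHTWEIGHT_BUILD_ARGS["INSTALL_ACP"] = "true"
except TypeError:
    # mappingproxy objects do not support item assignment.
    mutation_blocked = True
```

The trade-off is the one the review names: callers that expect a plain `dict` need a copy, which adds a little ceremony for a risk that does not arise while the functions only pass the args through.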

Verdict: Ship it

@simonrosenberg simonrosenberg merged commit 96d72a0 into main Mar 20, 2026
3 checks passed
@simonrosenberg simonrosenberg deleted the feat/lightweight-benchmark-images-v2 branch March 20, 2026 20:31


Development

Successfully merging this pull request may close these issues.

Proposal: lightweight benchmark images via optional dependency flags

2 participants