Restore self-contained /agent-server/.venv via managed-Python + PT_GNU_STACK PF_X sanitization #2761

@simonrosenberg

Problem

The openhands-agent-server Dockerfile builder stage has used uv venv --python-preference only-system since v1.15.0 (#2567, commits 29b76b66 and 3192b084). Under this mode, /agent-server/.venv/bin/python is a symlink to the builder image's /usr/local/bin/python3, so the .venv is not self-contained: it is a thin wrapper around the base image's system CPython.

Two concrete problems follow from that:

1. Non-portable .venv. When a downstream consumer COPYs /agent-server from the SDK builder onto a different base image, the symlink points at a path that may not exist in the target. Verified breakage: commit0's raw Ubuntu bases (e.g. docker.io/wentingzhao/tinydb:v0) have no /usr/local/bin/python3, so runtime pods fail with:

exec: "/agent-server/.venv/bin/python": stat /usr/local/bin/python3: no such file or directory

Downstream has worked around this in OpenHands/benchmarks#614 by introducing benchmarks/utils/Dockerfile.agent-layer-commit0 (and hardening benchmarks/utils/Dockerfile.agent-layer for SWE-bench). Both manually COPY Debian's Python 3.13 runtime from the builder stage into the final image. This works, but it is a workaround destined for removal: it duplicates SDK internals into a downstream repo and depends on an undocumented SDK Dockerfile layout.

2. Degraded local/dev experience. The .venv inside any SDK-built image no longer contains its own Python. Developers inspecting a running container see a venv that depends on base-image state rather than being a self-contained environment. The "python contained in the .venv" property that existed pre-v1.15.0 is gone.

Why only-system was chosen in the first place

python-build-standalone (what uv python install downloads) ships libpython3.13.so.1.0 with PF_X set on its PT_GNU_STACK program header, i.e. the library requests an executable stack. Under Debian's glibc 2.41-12+deb13u2 (Trixie) NX enforcement and under Docker-in-Docker with seccomp restrictions (GitHub Actions, sysbox-runc), the dynamic linker rejects such .so files with:

cannot enable executable stack as shared object requires: Invalid argument

All binary-target GAIA evals were failing 100% with PYI-37 errors before v1.15.0. The --python-preference only-system switch dodged this by using Debian's CPython (which doesn't have the PF_X flag set), at the cost of the self-contained .venv property.

Proposed fix: managed-Python + PT_GNU_STACK PF_X sanitization

Address the execstack problem at its actual layer (ELF program headers) instead of dodging python-build-standalone, and restore managed-Python so .venv is self-contained by construction.

Builder stage change:

ENV UV_PROJECT_ENVIRONMENT=/agent-server/.venv
ENV UV_PYTHON_INSTALL_DIR=/agent-server/uv-managed-python

RUN uv python install 3.13 && \
    python3 /build/clear_execstack.py /agent-server/uv-managed-python && \
    uv venv --python 3.13 .venv && \
    uv sync --frozen --no-editable --managed-python --extra boto3

Helper: clear_execstack.py

Walks a directory tree, finds every .so* file, parses ELF program headers, and clears the PF_X bit on any PT_GNU_STACK entry that has it. Idempotent. No-op on ELFs that don't have PT_GNU_STACK or don't have PF_X set. Supports 32/64-bit, amd64/arm64, stripped and unstripped binaries.
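
A minimal sketch of what such a helper could look like, based only on the description above. The function names clear_execstack and sanitize_tree are illustrative assumptions, not the actual contents of clear_execstack.py, and the sketch assumes the files are writable by the build user. It reads the ELF header fields directly with struct, so it does not depend on readelf or any third-party ELF library.

```python
import os
import struct

PT_GNU_STACK = 0x6474E551  # p_type value for the GNU stack-permissions header
PF_X = 0x1                 # "executable" bit in p_flags


def clear_execstack(path):
    """Clear PF_X on PT_GNU_STACK in one file; return True if it was patched."""
    with open(path, "r+b") as f:
        ident = f.read(16)
        if len(ident) < 16 or ident[:4] != b"\x7fELF" or ident[4] not in (1, 2):
            return False  # not an ELF file we understand
        endian = "<" if ident[5] == 1 else ">"  # EI_DATA: 1 = little-endian
        if ident[4] == 2:  # EI_CLASS 2 = 64-bit: e_phoff@32, e_phentsize@54
            phoff_at, phoff_fmt, phent_at, flags_at = 32, "Q", 54, 4
        else:              # 32-bit: e_phoff@28, e_phentsize@42
            phoff_at, phoff_fmt, phent_at, flags_at = 28, "I", 42, 24
        try:
            f.seek(phoff_at)
            size = struct.calcsize(endian + phoff_fmt)
            (phoff,) = struct.unpack(endian + phoff_fmt, f.read(size))
            f.seek(phent_at)
            phentsize, phnum = struct.unpack(endian + "HH", f.read(4))
            patched = False
            for i in range(phnum):
                entry = phoff + i * phentsize
                f.seek(entry)
                (p_type,) = struct.unpack(endian + "I", f.read(4))
                if p_type != PT_GNU_STACK:
                    continue
                f.seek(entry + flags_at)
                (flags,) = struct.unpack(endian + "I", f.read(4))
                if flags & PF_X:
                    # Rewrite only the 4-byte p_flags field, in place.
                    f.seek(entry + flags_at)
                    f.write(struct.pack(endian + "I", flags & ~PF_X))
                    patched = True
            return patched
        except struct.error:
            return False  # truncated or malformed header; leave untouched


def sanitize_tree(root):
    """Walk root and patch every regular .so* file; return the patched paths."""
    patched = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not (name.endswith(".so") or ".so." in name):
                continue
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # the link target is visited under its own name
            if clear_execstack(path):
                patched.append(path)
    return patched
```

Because only the p_flags word is rewritten, file size and all offsets stay the same. A second call to sanitize_tree on the same directory returns an empty list, which is the idempotence property described above.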

PyInstaller spec change (resurrects #2574):

Apply the same helper as a post-Analysis hook so that bundled .so files in the binary/binary-minimal targets are sanitized too. PR #2574 proved this approach end-to-end: a GAIA eval_limit=1 run on sysbox-runc with image 28a56ab-gaia-binary produced 0 PYI-37 errors and resolved 1/1 instances, versus 4 PYI-37 errors in the control (image 4907d99-gaia-binary) in the same window.
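
One way the post-Analysis hook could be structured is sketched below. It is an assumption about shape, not what #2574 actually did. It relies on PyInstaller's Analysis exposing its collected binaries as (dest_name, src_path, typecode) tuples; stage_and_sanitize, staging_dir, and the sanitize callback are hypothetical names. Shared objects are copied to a staging directory before patching so that the originals in site-packages are left untouched.

```python
import os
import shutil


def stage_and_sanitize(binaries, staging_dir, sanitize):
    """Copy each bundled .so* to staging_dir, patch the copy, and return a
    new entry list pointing at the patched copies.

    binaries: iterable of (dest_name, src_path, typecode) tuples.
    sanitize: callable taking one file path (e.g. the execstack helper).
    """
    result = []
    for dest_name, src_path, typecode in binaries:
        base = os.path.basename(src_path)
        is_shared_object = base.endswith(".so") or ".so." in base
        if typecode in ("BINARY", "EXTENSION") and is_shared_object:
            staged = os.path.join(staging_dir, dest_name)
            os.makedirs(os.path.dirname(staged), exist_ok=True)
            shutil.copy2(src_path, staged)  # follows symlinks, copies content
            sanitize(staged)
            src_path = staged
        result.append((dest_name, src_path, typecode))
    return result
```

In a spec file this would presumably be called right after Analysis(...) and its result assigned back to the Analysis object's binaries before the EXE/COLLECT steps run; the exact wiring depends on the PyInstaller version in use.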

Net result:

  • .venv/bin/python → ../uv-managed-python/cpython-3.13-.../bin/python3.13 (fully inside /agent-server, no base-image dependency)
  • Bundled Python and PyInstaller binaries are both PT_GNU_STACK RW — DinD/sysbox/seccomp safe
  • Same helper, two call sites — no duplicated logic
  • /agent-server becomes genuinely portable as a side effect (python-build-standalone is designed for relocation; Debian CPython is not)
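
The "fully inside /agent-server" property listed above can be checked mechanically. The following is a small sketch under the assumption that self-containment means no symlink under the prefix resolves to a path outside it; find_escaping_symlinks is a hypothetical name, not part of the SDK.

```python
import os


def find_escaping_symlinks(prefix):
    """Return (link, resolved_target) pairs for symlinks under prefix whose
    fully resolved target lies outside prefix."""
    root = os.path.realpath(prefix)
    escaping = []
    # os.walk does not descend into symlinked directories by default, but it
    # still lists them in dirs, so both dirs and files are inspected.
    for dirpath, dirs, files in os.walk(root):
        for name in dirs + files:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            if os.path.commonpath([root, target]) != root:
                escaping.append((path, target))
    return escaping
```

Run against /agent-server in an only-system image, a check like this would be expected to flag .venv/bin/python, since it resolves to /usr/local/bin/python3. With managed Python installed under /agent-server/uv-managed-python, the interpreter link should no longer be reported.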

Why this is better than the alternatives considered

  • #2692, "revert(docker): restore managed Python instead of --python-preference only-system" (naive revert to --managed-python). Closed. It reintroduces the PYI-37 execstack crashes in DinD/sysbox, the exact problem only-system was added to fix, and does nothing about the underlying execstack issue.
  • #2676, "fix(docker): bundle Python runtime for portable /agent-server" (bundle Debian Python into /agent-server/.python/). Tactical patch. Works on glibc-based downstream bases but is not a universal artifact boundary: musl/Alpine, mismatched glibc ABI, and arch mismatch all still break. Per Python's own docs, venvs are not generally portable when the interpreter is relocated. Adds ~60 lines of Dockerfile shell to rewrite venv symlinks, patch pyvenv.cfg, and bundle libraries. Does not satisfy the "python contained in the .venv" property, because the venv still wraps a CPython that was never designed to be relocated. To be closed in favor of this plan.
  • Wheel migration (pip install openhands-agent-server). Long-term direction but not shippable today. Blocked on two packaging gaps: (a) openhands-agent-server's pyproject.toml does not declare openhands-tools as a runtime dep despite importing openhands.tools.*, so pip install openhands-agent-server==1.16.1 && python -m openhands.agent_server crashes with ModuleNotFoundError: No module named 'openhands.tools'; and (b) there is no supported multi-package same-SHA install pattern for consumers who pin unreleased SDK commits via a vendor submodule (a git+…#subdirectory=openhands-agent-server install resolves siblings from PyPI, not from the same commit). Captured as separate tracking work.

The managed-Python + ELF-sanitize approach is a strictly smaller and more principled diff than any of the above, and is the only option that simultaneously satisfies: self-contained .venv, DinD/sysbox-safe, PyInstaller-safe, no downstream Dockerfile coupling, no packaging prerequisites.

Validation plan

The execstack problem is silent at build time and only manifests at runtime, so validation has to include a real downstream eval run, not just CI import checks.

1. Unit tests for clear_execstack.py

Fixture-driven: ELFs with PT_GNU_STACK RWX, PT_GNU_STACK RW, no PT_GNU_STACK, 32/64-bit, amd64/arm64, stripped and unstripped. Idempotence (running the script twice produces no change on the second pass). Invariant: after the helper walks a directory, readelf -l on every .so* shows GNU_STACK without the E flag.
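
The fixtures do not need a compiler: a header-only ELF with a single program header is enough to exercise the parsing paths. The sketch below shows one way to generate such fixtures and to express the post-walk invariant in Python rather than by parsing readelf -l output. make_elf, gnu_stack_flags, and files_with_execstack are hypothetical names, and the fixtures are little-endian only.

```python
import os
import struct

PT_LOAD = 1
PT_GNU_STACK = 0x6474E551
PF_X = 0x1


def make_elf(bits, stack_flags):
    """Build a header-only little-endian ELF with one program header.

    stack_flags=None emits a PT_LOAD entry instead of PT_GNU_STACK.
    """
    p_type = PT_LOAD if stack_flags is None else PT_GNU_STACK
    p_flags = 0x4 if stack_flags is None else stack_flags
    ident = b"\x7fELF" + bytes([2 if bits == 64 else 1, 1, 1, 0]) + bytes(8)
    if bits == 64:  # 64-byte header followed by one 56-byte program header
        ehdr = struct.pack("<HHIQQQIHHHHHH", 3, 62, 1, 0, 64, 0, 0, 64, 56, 1, 0, 0, 0)
        phdr = struct.pack("<IIQQQQQQ", p_type, p_flags, 0, 0, 0, 0, 0, 16)
    else:           # 52-byte header followed by one 32-byte program header
        ehdr = struct.pack("<HHIIIIIHHHHHH", 3, 3, 1, 0, 52, 0, 0, 52, 32, 1, 0, 0, 0)
        phdr = struct.pack("<IIIIIIII", p_type, 0, 0, 0, 0, 0, p_flags, 16)
    return ident + ehdr + phdr


def gnu_stack_flags(data):
    """Return p_flags of the PT_GNU_STACK entry, or None if there is none."""
    if data[:4] != b"\x7fELF":
        return None
    endian = "<" if data[5] == 1 else ">"
    if data[4] == 2:
        (phoff,) = struct.unpack_from(endian + "Q", data, 32)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 54)
        flags_at = 4
    else:
        (phoff,) = struct.unpack_from(endian + "I", data, 28)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 42)
        flags_at = 24
    for i in range(phnum):
        entry = phoff + i * phentsize
        (p_type,) = struct.unpack_from(endian + "I", data, entry)
        if p_type == PT_GNU_STACK:
            return struct.unpack_from(endian + "I", data, entry + flags_at)[0]
    return None


def files_with_execstack(root):
    """Invariant check: .so* files under root whose GNU_STACK still has PF_X."""
    offenders = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".so") or ".so." in name:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    flags = gnu_stack_flags(f.read())
                if flags is not None and flags & PF_X:
                    offenders.append(path)
    return offenders
```

A test would write make_elf(64, 0x7) and similar blobs into a temporary directory, run the helper over it, and assert that files_with_execstack returns an empty list. Fixtures for stripped versus unstripped binaries and for arm64 would still need real compiled objects, since this generator only varies the class and the stack flags.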

2. Small benchmark eval runs on the feature branches

Open the SDK PR as draft. SDK CI will build and push multi-arch agent-server images tagged with the PR SHA (ghcr.io/openhands/agent-server:<sha>-python). On a matching benchmarks feature branch, bump vendor/software-agent-sdk to the SDK PR head SHA and run all three downstream targets at eval_limit=5:

  • GAIA: binary target
  • commit0: source-minimal target, raw Ubuntu base
  • SWE-bench: source-minimal target

All three must complete without PYI-37 errors or cannot enable executable stack failures, and instance resolution on the 5-instance sample should match historical baselines. Only then mark the SDK PR ready for review.
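
The pass criterion above can be turned into a simple log gate. This is a sketch that assumes both failure modes appear as plain substrings in the eval logs; the two signatures are taken from the error text quoted in this issue, and count_failures is a hypothetical name.

```python
SIGNATURES = ("PYI-37", "cannot enable executable stack")


def count_failures(lines):
    """Count log lines containing each known execstack failure signature."""
    counts = {sig: 0 for sig in SIGNATURES}
    for line in lines:
        for sig in SIGNATURES:
            if sig in line:
                counts[sig] += 1
    return counts
```

A run would pass this gate only if every count is zero; instance resolution still has to be compared against the historical baseline separately.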

3. Rollout

  • Merge SDK PR, cut release.
  • Release notes: "agent-server builder switched from --python-preference only-system to --managed-python with PF_X sanitization on bundled Python and PyInstaller binaries. Restores self-contained .venv contract."
  • Bump vendor/software-agent-sdk in benchmarks to the released SDK tag and rebuild per-benchmark images.

4. Downstream cleanup (separate benchmarks PR, after rollout is stable)

Strip the now-dead Python-runtime COPY workarounds from benchmarks/utils/Dockerfile.agent-layer and benchmarks/utils/Dockerfile.agent-layer-commit0:

  • Drop COPY --from=builder /usr/local/bin/python3.13, /usr/local/lib/python3.13, libpython3.13.so*, /usr/local/bin/python3
  • Drop ENV LD_LIBRARY_PATH=/usr/local/lib
  • Drop ENV UV_PYTHON_INSTALL_DIR=/agent-server/uv-managed-python — the SDK builder stage now sets this itself

Keep everything else: user creation, system packages (for commit0's raw upstream bases), uv binary COPY, cache dir ENV vars. Those are commit0's legitimate concerns, not workarounds for an SDK contract.

Scope

In scope for this issue:

  • Add clear_execstack.py helper with unit tests
  • Switch builder stage to --managed-python with UV_PYTHON_INSTALL_DIR=/agent-server/uv-managed-python
  • Apply helper to $UV_PYTHON_INSTALL_DIR in the builder stage, after uv python install 3.13
  • Apply helper to PyInstaller-bundled .so files via the spec file's Analysis hook
  • Close #2676 ("fix(docker): bundle Python runtime for portable /agent-server") as superseded
  • Matching downstream benchmarks cleanup PR after rollout

Out of scope (tracked separately):

Known downsides

  • Image size. binary and binary-minimal targets grow ~80MB (python-build-standalone is larger than a Debian CPython install). Acceptable — smaller than the Chromium/VSCode layers that already dominate the binary target.
  • The helper is a workaround for an upstream bug. Ideally python-build-standalone would build without PF_X set. The helper becomes a no-op when that lands upstream and can be removed.
  • Does not address #2585 ("Decouple eval image assembly from SDK Dockerfile internals") or #2687 ("agent-server source images rely on a non-portable Python runtime contract"). The "COPY-from-builder as supported contract" architectural concern stands. This plan makes the contract stable and portable; the broader question of whether to have that contract at all is deferred.
