Skip to content

Conversation

imajes
Copy link

@imajes imajes commented Oct 6, 2025

Why

  • The previous approach treated many non‑zero exits as sandbox denials, making normal command failures look like sandbox errors.
  • SIGSYS mapping was inconsistent (hardcoded to 12 while Linux uses 31), and magic integers made maintenance hard.

Pytest exit codes (context for "Why")

  • 0 — All tests collected and passed.
  • 1 — Tests ran but some failed (normal failing test signal).
  • 2 — Execution interrupted by user (Ctrl‑C).
  • 3Internal pytest error while executing tests.
  • 4Command‑line usage/config error (bad CLI/options).
  • 5No tests were collected. Intentionally non‑zero so CI fails when your test discovery/config is broken (e.g., wrong -k/-m filters, path patterns, or missing files).

Implication for Codex classification: these are typical app/tool outcomes, not sandbox denials. Treat 0/2/4/5 as LikelyNotSandbox; 1 and generic >0 without hints as Unknown (not sandbox). Only upgrade to LikelySandbox on clear evidence (e.g., SIGSYS, explicit sandbox denial text).

What

  • Introduced a conservative SandboxVerdict classifier with three outcomes:
    • LikelySandbox · LikelyNotSandbox · Unknown
  • Rewired process_exec_tool_call to:
    • Preserve timeouts as Timeout and non‑SIGSYS signals as Signal.
    • Defer SIGSYS classification until stderr is available.
    • Return Denied only on strong sandbox evidence.
  • Removed legacy is_likely_sandbox_denied (no remaining callers).
  • Replaced magic numbers with named constants for clarity and correctness.
  • Improved debug logging context.

Details

  • New classifier

    fn sandbox_verdict(
        sandbox_type: SandboxType,
        status: ExitStatus,
        stderr: &str,
    ) -> SandboxVerdict
  • Signals

    • SIGSYSLikelySandbox (via status.signal() or exit code 128 + SIGSYS).
    • Common app/user signals → LikelyNotSandbox:
      • SIGINT_CODE = 2, SIGABRT_CODE = 6, SIGBUS_CODE = 7, SIGFPE_CODE = 8, SIGKILL_CODE = 9, SIGSEGV_CODE = 11, SIGPIPE_CODE = 13, SIGTERM_CODE = 15.
  • Exit codes

    • LikelyNotSandbox: 0, 2, 124 (timeout wrapper), 127 (not found), 64..=78 (BSD sysexits), and 128 + {signals above}.
    • EXIT_NOT_EXECUTABLE_CODE = 126Unknown unless stderr has explicit sandbox hints.
    • Otherwise Unknown unless hints indicate sandbox.
    • Stderr hints (case‑insensitive; conservative):
      • "sandbox: deny", "not permitted by sandbox", "bad system call", "seccomp".
  • Platform constants

    • SIGSYS_CODE: macOS = 12, Linux = 31 (no libc deps).
  • Exec flow changes

    • Defer SIGSYS to inspect stderr; otherwise return Signal immediately.
    • After building ExecToolCallOutput, compute verdict:
      • On LikelySandboxDenied { output }.
      • Else if deferred signal → Signal.
      • Else return Ok(exec_output).
  • Logging

    tracing::debug!(
        target: "codex_core::exec",
        exit_code, ?pending_signal, verdict=?verdict,
        "exec SandboxClassification",
    );

Removed

  • is_likely_sandbox_denied shim removed; note left explaining how to synthesize an ExitStatus if ever needed again.

Refactors / Clippy

  • sandbox_verdict now takes ExitStatus by value (small Copy type) to satisfy clippy::trivially-copy-pass-by-ref.
  • Replaced raw lists of integers with named constants and a computed set for 128 + signal codes.

Tests

  • Unit tests in core/src/exec.rs:
    • not_sandbox_obvious_codes: 127 / 124 / 2LikelyNotSandbox.
    • sandbox_sigsys_codepath: 128 + SIGSYSLikelySandbox.
    • sandbox_stderr_hint: 126 + "Sandbox: deny …"LikelySandbox.
    • unknown_generic_failure: 1Unknown.
  • Existing integration/suite tests already cover timeout/signal/Denied and OK paths.

Manual Test Plan

  • macOS seatbelt

    • Run a command that exits 126 and prints "Sandbox: deny …" to stderrDenied.
    • Run a command that exits 127 or 2not Denied.
  • Linux seccomp

    • Simulate bad syscall exit 128 + 31Denied.
    • 130 / 141 / 143 (SIGINT / SIGPIPE / SIGTERM) → not Denied; surfaced as normal failure or Signal.
  • Confirm debug line appears with target codex_core::exec and message exec SandboxClassification.

Behavioral Changes

  • Normal non‑zero exits (e.g., exit 1) are no longer misclassified as sandbox errors.
  • Timeouts and non‑SIGSYS signals retain previous semantics.
  • Denials are now attributed only on strong evidence.

Risk / Rollout

  • Scope limited to core/src/exec.rs; no new dependencies.
  • macOS and Linux only; no *BSD branches.
  • Low risk: adds conservative classification and clearer logging; removes unused shim.

Follow‑ups (optional)

  • Add a macOS‑specific stderr hint "sandbox-exec:" if we want to treat seatbelt launcher failures as LikelySandbox.
  • Expand tests to include a couple more 128 + signal codes to lock in "not sandbox" classification.
  • Run just fix -p codex-core and optionally a full workspace test pass once reviewers approve.

External (non‑OpenAI) Pull Request Requirements

Before opening this Pull Request, please read the dedicated "Contributing" markdown file or your PR may be closed: https://github.com/openai/codex/blob/main/docs/contributing.md

Note: ✅ I didn't create the issue first as I was solving the bug I was experiencing in real time. Apologies for that.

Copy link
Contributor

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

@imajes
Copy link
Author

imajes commented Oct 6, 2025

@pakrym-oai, @bolinfest -- this fixes issues introduced by #3576. The approach afforded by is_likely_sandbox_denied isn't quite enough, as it is trapping pretty much any command run in the sandbox that doesn't return 0, essentially.

This is true for things like pytest, which makes use of varied exit codes, but i would imagine anything run like e.g. bash -lc ... as that'll have varying exit codes also.

It's not easy to know whether an exit code is because of the sandbox or the process it invoked, but i think there are some heuristics we can leverage. This PR provides a heuristic classifier, which essentially looks for the class of outcomes that most likely indicates a sandbox error, and then pretty much passes through the rest -- because they're not likely to be sandbox specific.

This came about because in the last few days, i've noticed a significant number of failed in sandbox type error states where it didn't have output. When i added some tracing for the exact output from the command run, i saw that it was valid, but the exit code was non zero.

Happy to talk more, edit, etc -- but i've been running this all afternoon with great success. Obviously it'll be good to get wider feedback, but it is working so far!

@imajes imajes force-pushed the fix/resolve-issues-with-sandbox-denials branch from 6e0c459 to 9dee00f Compare October 7, 2025 00:16
@imajes
Copy link
Author

imajes commented Oct 7, 2025

@codex review

Copy link
Contributor

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Comment on lines +101 to +113
fn stderr_hints_sandbox(stderr: &str) -> bool {
// Conservative: require explicit sandbox/seccomp phrasing likely to come from
// a kernel/seatbelt/seccomp denial; generic perms errors are too noisy.
let s = stderr.to_ascii_lowercase();
s.contains("sandbox: deny")
|| s.contains("not permitted by sandbox")
|| s.contains("bad system call")
|| s.contains("seccomp")
// Less explicit, but commonly emitted by shells and tools when a seatbelt
// or seccomp policy blocks a write/open. Tests rely on these.
|| s.contains("operation not permitted")
|| s.contains("permission denied")
|| s.contains("read-only file system")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid treating generic “permission denied” output as sandbox denial

The new stderr_hints_sandbox function classifies any stderr containing phrases like "permission denied" or "operation not permitted" as LikelySandbox, and process_exec_tool_call turns that verdict into a SandboxErr::Denied. Many common commands emit those exact strings for reasons unrelated to the sandbox (e.g., git clone failing with Permission denied (publickey) or an SSH server refusing credentials). When such a failure happens under seatbelt/seccomp, the code will now misreport it as a sandbox denial even though the sandbox played no role. This defeats the goal of distinguishing real sandbox denials from normal command failures and will send users down the wrong debugging path. The heuristic should be narrowed or combined with other evidence before upgrading to Denied.

Useful? React with 👍 / 👎.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's a first step. I agree that there's no such thing as a perfect heuristic; i wonder if we should adopt a weighted approach which assigns each of these "signals" with a weight, and then calculate the probability that it's a sandbox error.

That sounds good, but it becomes super complex. Either way, a permission denied (publickey) is likely a sandbox fail as it'll require an escalation to the user to fix.

imajes added 3 commits October 8, 2025 10:03
…fier; correct SIGSYS; remove misattribution

  Why

  - The previous approach treated many non‑zero exits as sandbox denials, making normal command failures look like sandbox errors.
  - SIGSYS mapping was inconsistent (hardcoded to 12 while Linux uses 31), and magic integers made maintenance hard.

  What

  - Introduced a conservative SandboxVerdict classifier with three outcomes:
      - LikelySandbox | LikelyNotSandbox | Unknown
  - Rewired process_exec_tool_call to:
      - Preserve timeouts as Timeout and non‑SIGSYS signals as Signal
      - Defer SIGSYS classification until stderr is available
      - Only return Denied on strong sandbox evidence
  - Removed legacy is_likely_sandbox_denied (no remaining callers).
  - Replaced magic numbers with named constants for clarity and correctness.
  - Improved debug logging context.

  Details

  - New classifier: sandbox_verdict(sandbox_type: SandboxType, status: ExitStatus, stderr: &str) -> SandboxVerdict
      - Signals:
          - SIGSYS → LikelySandbox (via status.signal() or exit code 128+SIGSYS)
          - Common app/user signals → LikelyNotSandbox
              - SIGINT_CODE=2, SIGABRT_CODE=6, SIGBUS_CODE=7, SIGFPE_CODE=8, SIGKILL_CODE=9, SIGSEGV_CODE=11, SIGPIPE_CODE=13, SIGTERM_CODE=15
      - Exit codes:
          - LikelyNotSandbox: 0, 2, 124 (timeout wrapper), 127 (not found), 64..=78 (BSD sysexits), and 128+{the signals above}
          - EXIT_NOT_EXECUTABLE_CODE=126 → Unknown unless stderr has explicit sandbox hints
          - Otherwise Unknown unless hints indicate sandbox
      - Stderr hints (case-insensitive; conservative):
          - “sandbox: deny”, “not permitted by sandbox”, “bad system call”, “seccomp”
  - Platform constants:
      - SIGSYS_CODE: macOS=12, Linux=31 (no libc deps).
  - Exec flow changes:
      - Defer SIGSYS to inspect stderr; otherwise return Signal immediately.
      - After building ExecToolCallOutput, compute verdict; on LikelySandbox → Denied { output }; else if deferred signal → Signal; else return Ok(exec_output).
  - Logging:
      - tracing::debug!(target: "codex_core::exec", exit_code, ?pending_signal, verdict=?verdict, "exec SandboxClassification")

  Removed

  - is_likely_sandbox_denied shim removed; note left explaining how to synthesize an ExitStatus if ever needed again.

  Refactors / Clippy

  - sandbox_verdict now takes ExitStatus by value (small Copy type) to satisfy clippy::trivially-copy-pass-by-ref.
  - Replaced raw lists of integers with named constants and a computed array for 128+signal codes.

  Tests

  - Unit tests in core/src/exec.rs:
      - not_sandbox_obvious_codes: 127/124/2 → LikelyNotSandbox
      - sandbox_sigsys_codepath: 128+SIGSYS → LikelySandbox
      - sandbox_stderr_hint: 126 + “Sandbox: deny …” → LikelySandbox
      - unknown_generic_failure: 1 → Unknown
  - Existing integration/suite tests already cover timeout/signal/Denied and OK paths.

  Manual Test Plan

  - macOS seatbelt:
      - Run a command that exits 126 and prints “Sandbox: deny …” to stderr → Denied.
      - Run a command that exits 127 or 2 → not Denied.
  - Linux seccomp:
      - Simulate bad syscall exit 128+31 → Denied.
      - 130/141/143 (SIGINT/SIGPIPE/SIGTERM) → not Denied; surfaced as normal failure or Signal.
  - Confirm debug line appears with target codex_core::exec and message “exec SandboxClassification”.

  Behavioral Changes

  - Normal non‑zero exits (e.g., exit 1) are no longer misclassified as sandbox errors.
  - Timeouts and non‑SIGSYS signals retain previous semantics.
  - Denials are now attributed only on strong evidence.

  Risk / Rollout

  - Scope limited to core/src/exec.rs; no new dependencies.
  - macOS and Linux only; no *BSD branches.
  - Low risk: adds conservative classification and clearer logging; removes unused shim.

  Follow‑ups (optional)

  - Add a macOS‑specific stderr hint “sandbox-exec:” if we want to treat seatbelt launcher failures as LikelySandbox.
  - Expand tests to include a couple more 128+signal codes to lock in “not sandbox” classification.
  - Run just fix -p codex-core and optionally a full workspace test pass once reviewers approve.
…S gating for Windows/*BSD

  - Replace hardcoded signal numbers with libc::* on Unix (SIGINT/SIGABRT/SIGBUS/SIGFPE/SIGKILL/SIGSEGV/SIGPIPE/SIGTERM/SIGSYS).
  - Group Unix signal constants in a unix_sig module and re-export; removes repeated per-line cfgs.
  - Define SIGSYS_CODE via libc on all Unix and gate all SIGSYS uses with cfg(target_family="unix").
  - Keep non-Unix fallback SIGKILL_CODE=9 to preserve synthesized 128+9 behavior.
  - sandbox_verdict/process_exec_tool_call:
      - Check SIGSYS only on Unix and classify as LikelySandbox.
      - Defer SIGSYS only when a sandbox was requested; otherwise, non-timeout signals error immediately.
      - Build “128 + signal” exit-code checks only on Unix.
  - Tests: gate sandbox_sigsys_codepath behind cfg(target_family="unix").

  This restores cross-platform builds (Windows and non-Linux/macOS Unix) while simplifying the platform gating.
…YS gating; fix sandbox detection; correct macOS test

  - Replace hardcoded signal numbers with libc on Unix:
      - SIGINT/SIGABRT/SIGBUS/SIGFPE/SIGKILL/SIGSEGV/SIGPIPE/SIGTERM/SIGSYS now come from libc.
      - Grouped under a unix_sig module and re-exported; eliminates repeated per-line cfgs.
      - Keep non‑Unix fallback SIGKILL_CODE=9 for synthesized 128+9 behavior.
  - Gate platform logic cleanly:
      - Use cfg(target_family = "unix") around Unix-only paths.
      - All SIGSYS uses now compile on every Unix, avoiding previous linux/macos-only symbol gaps.
      - Tests updated to gate the SIGSYS codepath on target_family = "unix".
  - Improve sandbox classification:
      - Prefer stderr hints first: treat “sandbox: deny”, “seccomp”, and common OS errors (“operation not permitted”, “permission denied”, “read-only file system”) as LikelySandbox.
      - Defer SIGSYS handling only when a sandbox was requested; otherwise treat non-timeout signals as immediate errors.
      - Shell-style SIGSYS exit (128 + SIGSYS) recognized on all Unix targets.
  - Formatting/truncation remains centralized:
      - No changes to output shaping other than classifying sandbox denials earlier so user-facing messages are consistent (“failed in sandbox: …”).
  - Tests:
      - Correct an especially egregious path typo in macOS seatbelt exec test:
          - core/tests/suite/exec.rs: "/user/bin/touch" → "/usr/bin/touch".
          - This was likely a typo; if it was intentional for a specific scenario, we can reintroduce it behind an explicit comment/cfg. As‑is it prevented the test from exercising the intended read‑only denial
            and instead looked like a missing-binary edge case.
  - Misc:
      - Ran just fmt and just fix -p codex-core; clippy autofixed exec.rs (9 fixes).
      - Verified codex-core builds; project tests run, with unrelated existing failures unchanged.

  Rationale

  - Avoids Windows/*BSD build breaks from SIGSYS references while keeping behavior correct on Linux/macOS.
  - libc keeps signal numbers portable; grouping reduces cfg noise and maintenance risk.
  - Earlier stderr-based sandbox detection makes the shell tool denial UX match expectations even when the sandbox blocks before the child produces long output.
@imajes imajes force-pushed the fix/resolve-issues-with-sandbox-denials branch from 3785974 to 5b3632e Compare October 8, 2025 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant