feat(core[exec]): Stronger SandboxVerdict Classifier & Exec Flow #4852

imajes · 2025-10-06T23:25:38Z

Why

The previous approach treated many non‑zero exits as sandbox denials, making normal command failures look like sandbox errors.
SIGSYS mapping was inconsistent (hardcoded to 12 while Linux uses 31), and magic integers made maintenance hard.

Pytest exit codes (context for "Why")

0 — All tests collected and passed.
1 — Tests ran but some failed (normal failing test signal).
2 — Execution interrupted by user (Ctrl‑C).
3 — Internal pytest error while executing tests.
4 — Command‑line usage/config error (bad CLI/options).
5 — No tests were collected. Intentionally non‑zero so CI fails when your test discovery/config is broken (e.g., wrong -k/-m filters, path patterns, or missing files).

Implication for Codex classification: these are typical app/tool outcomes, not sandbox denials. Treat 0/2/4/5 as LikelyNotSandbox; 1 and generic >0 without hints as Unknown (not sandbox). Only upgrade to LikelySandbox on clear evidence (e.g., SIGSYS, explicit sandbox denial text).
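As a minimal sketch of the rule above, assuming stderr carries no sandbox hint, the pytest codes could map to verdicts like this. The helper name `verdict_for_pytest_exit` is hypothetical and does not appear in the PR; only the `SandboxVerdict` variant names come from the description.

```rust
/// The three outcomes named in the PR description.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SandboxVerdict {
    LikelySandbox,
    LikelyNotSandbox,
    Unknown,
}

/// Hypothetical helper: verdict for a pytest exit code when stderr has no
/// sandbox hint. No pytest code on its own yields LikelySandbox.
fn verdict_for_pytest_exit(code: i32) -> SandboxVerdict {
    match code {
        // Pass, user interrupt, usage error, nothing collected.
        0 | 2 | 4 | 5 => SandboxVerdict::LikelyNotSandbox,
        // Failing tests (1) and any other code stay Unknown, not sandbox.
        _ => SandboxVerdict::Unknown,
    }
}
```

The point of the sketch is that an upgrade to LikelySandbox has to come from separate evidence such as SIGSYS or explicit denial text, never from the exit code alone.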

What

Introduced a conservative SandboxVerdict classifier with three outcomes:
- LikelySandbox · LikelyNotSandbox · Unknown
Rewired process_exec_tool_call to:
- Preserve timeouts as Timeout and non‑SIGSYS signals as Signal.
- Defer SIGSYS classification until stderr is available.
- Return Denied only on strong sandbox evidence.
Removed legacy is_likely_sandbox_denied (no remaining callers).
Replaced magic numbers with named constants for clarity and correctness.
Improved debug logging context.

Details

New classifier

fn sandbox_verdict(
    sandbox_type: SandboxType,
    status: ExitStatus,
    stderr: &str,
) -> SandboxVerdict

Signals
- SIGSYS → LikelySandbox (via status.signal() or exit code 128 + SIGSYS).
- Common app/user signals → LikelyNotSandbox:
  - SIGINT_CODE = 2, SIGABRT_CODE = 6, SIGBUS_CODE = 7, SIGFPE_CODE = 8, SIGKILL_CODE = 9, SIGSEGV_CODE = 11, SIGPIPE_CODE = 13, SIGTERM_CODE = 15.
Exit codes
- LikelyNotSandbox: 0, 2, 124 (timeout wrapper), 127 (not found), 64..=78 (BSD sysexits), and 128 + {signals above}.
- EXIT_NOT_EXECUTABLE_CODE = 126 → Unknown unless stderr has explicit sandbox hints.
- Otherwise Unknown unless hints indicate sandbox.
- Stderr hints (case‑insensitive; conservative):
  - "sandbox: deny", "not permitted by sandbox", "bad system call", "seccomp".
Platform constants
- SIGSYS_CODE: macOS = 12, Linux = 31 (no libc deps).
Exec flow changes
- Defer SIGSYS to inspect stderr; otherwise return Signal immediately.
- After building ExecToolCallOutput, compute verdict:
  - On LikelySandbox → Denied { output }.
  - Else if deferred signal → Signal.
  - Else return Ok(exec_output).
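
The classification rules listed above can be sketched roughly as follows. This is not the code from the PR: `sandbox_verdict_sketch` is a hypothetical stand-in that takes the exit code and signal directly instead of `SandboxType` and `ExitStatus`, and the order of the checks (known app codes before stderr hints) is an assumption.

```rust
/// The three outcomes named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SandboxVerdict {
    LikelySandbox,
    LikelyNotSandbox,
    Unknown,
}

// SIGSYS value per the PR description: Linux = 31, macOS = 12.
#[cfg(target_os = "linux")]
const SIGSYS_CODE: i32 = 31;
#[cfg(not(target_os = "linux"))]
const SIGSYS_CODE: i32 = 12;

// SIGINT, SIGABRT, SIGBUS, SIGFPE, SIGKILL, SIGSEGV, SIGPIPE, SIGTERM.
const APP_SIGNALS: [i32; 8] = [2, 6, 7, 8, 9, 11, 13, 15];

/// Case-insensitive match against the four conservative stderr hints.
fn stderr_hints_sandbox(stderr: &str) -> bool {
    let lowered = stderr.to_ascii_lowercase();
    ["sandbox: deny", "not permitted by sandbox", "bad system call", "seccomp"]
        .iter()
        .any(|hint| lowered.contains(*hint))
}

/// Simplified stand-in for sandbox_verdict (signature is an assumption).
fn sandbox_verdict_sketch(
    exit_code: Option<i32>,
    signal: Option<i32>,
    stderr: &str,
) -> SandboxVerdict {
    // SIGSYS, delivered directly or reported shell-style as 128 + SIGSYS.
    if signal == Some(SIGSYS_CODE) || exit_code == Some(128 + SIGSYS_CODE) {
        return SandboxVerdict::LikelySandbox;
    }
    // Common app/user signals are not sandbox evidence.
    if signal.map_or(false, |sig| APP_SIGNALS.contains(&sig)) {
        return SandboxVerdict::LikelyNotSandbox;
    }
    if let Some(code) = exit_code {
        let plain = matches!(code, 0 | 2 | 124 | 127 | 64..=78);
        let shell_signal = APP_SIGNALS.iter().any(|sig| code == 128 + *sig);
        if plain || shell_signal {
            return SandboxVerdict::LikelyNotSandbox;
        }
    }
    // Everything else (including 126) is Unknown unless stderr says otherwise.
    if stderr_hints_sandbox(stderr) {
        SandboxVerdict::LikelySandbox
    } else {
        SandboxVerdict::Unknown
    }
}
```

For example, `sandbox_verdict_sketch(Some(126), None, "Sandbox: deny file-write-create")` yields `LikelySandbox`, while the same exit code with empty stderr stays `Unknown`.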

Logging

tracing::debug!(
    target: "codex_core::exec",
    exit_code, ?pending_signal, verdict=?verdict,
    "exec SandboxClassification",
);

Removed

is_likely_sandbox_denied shim removed; note left explaining how to synthesize an ExitStatus if ever needed again.
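
For reference, one way to synthesize an ExitStatus on Unix is `ExitStatusExt::from_raw`, which takes a raw wait status. The sketch below is not the note left in the code; the helper names are hypothetical, and the bit layout assumed is the usual Linux/macOS one (exit code in bits 8..16, terminating signal in the low 7 bits).

```rust
/// Builds an ExitStatus for a normal exit with the given code (Unix only).
#[cfg(unix)]
fn exit_status_from_code(code: i32) -> std::process::ExitStatus {
    use std::os::unix::process::ExitStatusExt;
    // Assumed wait-status layout: exit code lives in bits 8..16.
    std::process::ExitStatus::from_raw((code & 0xff) << 8)
}

/// Builds an ExitStatus for a process killed by the given signal (Unix only).
#[cfg(unix)]
fn exit_status_from_signal(signal: i32) -> std::process::ExitStatus {
    use std::os::unix::process::ExitStatusExt;
    // Assumed wait-status layout: terminating signal lives in the low 7 bits.
    std::process::ExitStatus::from_raw(signal & 0x7f)
}

/// Reads back the terminating signal, if any (Unix only).
#[cfg(unix)]
fn terminating_signal(status: std::process::ExitStatus) -> Option<i32> {
    use std::os::unix::process::ExitStatusExt;
    status.signal()
}
```

A status built with `exit_status_from_code(126)` reports `code() == Some(126)` and no signal; one built with `exit_status_from_signal(31)` reports no exit code and signal 31.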

Refactors / Clippy

sandbox_verdict now takes ExitStatus by value (small Copy type) to satisfy clippy::trivially_copy_pass_by_ref.
Replaced raw lists of integers with named constants and a computed set for 128 + signal codes.

Tests

Unit tests in core/src/exec.rs:
- not_sandbox_obvious_codes: 127 / 124 / 2 → LikelyNotSandbox.
- sandbox_sigsys_codepath: 128 + SIGSYS → LikelySandbox.
- sandbox_stderr_hint: 126 + "Sandbox: deny …" → LikelySandbox.
- unknown_generic_failure: 1 → Unknown.
Existing integration/suite tests already cover timeout/signal/Denied and OK paths.

Manual Test Plan

macOS seatbelt
- Run a command that exits 126 and prints "Sandbox: deny …" to stderr → Denied.
- Run a command that exits 127 or 2 → not Denied.
Linux seccomp
- Simulate bad syscall exit 128 + 31 → Denied.
- 130 / 141 / 143 (SIGINT / SIGPIPE / SIGTERM) → not Denied; surfaced as normal failure or Signal.
Confirm debug line appears with target codex_core::exec and message exec SandboxClassification.
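
The 130 / 141 / 143 values in the plan follow the shell convention of reporting death-by-signal as 128 + N. A small sketch of that arithmetic, using a hypothetical helper name that is not part of the PR:

```rust
/// Hypothetical helper: recovers the signal number from a shell-style
/// exit code (128 + N). Returns None for ordinary exit codes.
fn signal_from_shell_exit(exit_code: i32) -> Option<i32> {
    if exit_code > 128 && exit_code <= 255 {
        Some(exit_code - 128)
    } else {
        None
    }
}
```

So 130 decodes to 2 (SIGINT), 141 to 13 (SIGPIPE), 143 to 15 (SIGTERM), and 159 to 31 (SIGSYS on Linux), while 127 is an ordinary "not found" exit and decodes to nothing.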

Behavioral Changes

Normal non‑zero exits (e.g., exit 1) are no longer misclassified as sandbox errors.
Timeouts and non‑SIGSYS signals retain previous semantics.
Denials are now attributed only on strong evidence.

Risk / Rollout

Scope limited to core/src/exec.rs; no new dependencies.
macOS and Linux only; no *BSD branches.
Low risk: adds conservative classification and clearer logging; removes unused shim.

Follow‑ups (optional)

Add a macOS‑specific stderr hint "sandbox-exec:" if we want to treat seatbelt launcher failures as LikelySandbox.
Expand tests to include a couple more 128 + signal codes to lock in "not sandbox" classification.
Run just fix -p codex-core and optionally a full workspace test pass once reviewers approve.

External (non‑OpenAI) Pull Request Requirements

Before opening this Pull Request, please read the dedicated "Contributing" markdown file or your PR may be closed: https://github.com/openai/codex/blob/main/docs/contributing.md

Note: ✅ I didn't create the issue first as I was solving the bug I was experiencing in real time. Apologies for that.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

codex-rs/core/src/exec.rs

imajes · 2025-10-06T23:33:39Z

@pakrym-oai, @bolinfest -- this fixes issues introduced by #3576. The approach afforded by is_likely_sandbox_denied isn't quite enough, as it is trapping pretty much any command run in the sandbox that doesn't return 0, essentially.

This is true for things like pytest, which makes use of varied exit codes, but i would imagine anything run like e.g. bash -lc ... as that'll have varying exit codes also.

It's not easy to know whether an exit code is because of the sandbox or the process it invoked, but i think there are some heuristics we can leverage. This PR provides a heuristic classifier, which essentially looks for the class of outcomes that most likely indicates a sandbox error, and then pretty much passes through the rest -- because they're not likely to be sandbox specific.

This came about because in the last few days, i've noticed a significant number of failed in sandbox type error states where it didn't have output. When i added some tracing for the exact output from the command run, i saw that it was valid, but the exit code was non zero.

Happy to talk more, edit, etc -- but i've been running this all afternoon with great success. Obviously it'll be good to get wider feedback, but it is working so far!

imajes · 2025-10-07T00:46:16Z

@codex review

chatgpt-codex-connector · 2025-10-07T00:50:30Z

codex-rs/core/src/exec.rs

+fn stderr_hints_sandbox(stderr: &str) -> bool {
+    // Conservative: require explicit sandbox/seccomp phrasing likely to come from
+    // a kernel/seatbelt/seccomp denial; generic perms errors are too noisy.
+    let s = stderr.to_ascii_lowercase();
+    s.contains("sandbox: deny")
+        || s.contains("not permitted by sandbox")
+        || s.contains("bad system call")
+        || s.contains("seccomp")
+        // Less explicit, but commonly emitted by shells and tools when a seatbelt
+        // or seccomp policy blocks a write/open. Tests rely on these.
+        || s.contains("operation not permitted")
+        || s.contains("permission denied")
+        || s.contains("read-only file system")


Avoid treating generic “permission denied” output as sandbox denial

The new stderr_hints_sandbox function classifies any stderr containing phrases like "permission denied" or "operation not permitted" as LikelySandbox, and process_exec_tool_call turns that verdict into a SandboxErr::Denied. Many common commands emit those exact strings for reasons unrelated to the sandbox (e.g., git clone failing with Permission denied (publickey) or an SSH server refusing credentials). When such a failure happens under seatbelt/seccomp, the code will now misreport it as a sandbox denial even though the sandbox played no role. This defeats the goal of distinguishing real sandbox denials from normal command failures and will send users down the wrong debugging path. The heuristic should be narrowed or combined with other evidence before upgrading to Denied.
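
To illustrate the concern, the sketch below splits the phrases from the diff into two lists and checks a typical SSH failure against each. The list names, the helper `matches_any`, and the sample stderr strings are illustrative assumptions, not code or output from the PR.

```rust
// Explicit sandbox/seccomp phrasing from the diff.
const NARROW_HINTS: [&str; 4] = [
    "sandbox: deny",
    "not permitted by sandbox",
    "bad system call",
    "seccomp",
];

// Generic OS error phrasing from the diff.
const BROAD_HINTS: [&str; 3] = [
    "operation not permitted",
    "permission denied",
    "read-only file system",
];

/// Case-insensitive substring match against a list of hints.
fn matches_any(stderr: &str, hints: &[&str]) -> bool {
    let lowered = stderr.to_ascii_lowercase();
    hints.iter().any(|hint| lowered.contains(*hint))
}
```

With these lists, a stderr line such as `git@github.com: Permission denied (publickey).` matches the broad list but not the narrow one, whereas a line beginning `Sandbox: deny` matches the narrow list.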

Well, it's a first step. I agree that there's no such thing as a perfect heuristic; i wonder if we should adopt a weighted approach which assigns each of these "signals" with a weight, and then calculate the probability that it's a sandbox error.

That sounds good, but it becomes super complex. Either way, a permission denied (publickey) is likely a sandbox fail as it'll require an escalation to the user to fix.
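
A rough sketch of what the weighted idea might look like, purely for discussion. Nothing here is in the PR: the `Evidence` fields, the weights, and the threshold are arbitrary placeholders, and the result is a clamped score rather than a calibrated probability.

```rust
/// Hypothetical evidence flags; none of these names come from the PR.
#[derive(Default)]
struct Evidence {
    sigsys: bool,
    explicit_sandbox_text: bool,
    generic_permission_text: bool,
    known_app_exit_code: bool,
}

/// Sums illustrative weights into a 0..=100 score (placeholder values).
fn sandbox_score(e: &Evidence) -> i32 {
    let mut score: i32 = 0;
    if e.sigsys {
        score += 60;
    }
    if e.explicit_sandbox_text {
        score += 50;
    }
    if e.generic_permission_text {
        score += 20;
    }
    if e.known_app_exit_code {
        score -= 30;
    }
    score.clamp(0, 100)
}

/// Treats a score at or above the placeholder threshold as a sandbox denial.
fn is_likely_sandbox(e: &Evidence) -> bool {
    sandbox_score(e) >= 50
}
```

Under these placeholder weights, generic permission text alone scores 20 and stays below the threshold, while SIGSYS alone scores 60 and crosses it.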

…fier; correct SIGSYS; remove misattribution

…S gating for Windows/*BSD

- Replace hardcoded signal numbers with libc::* on Unix (SIGINT/SIGABRT/SIGBUS/SIGFPE/SIGKILL/SIGSEGV/SIGPIPE/SIGTERM/SIGSYS).
- Group Unix signal constants in a unix_sig module and re-export; removes repeated per-line cfgs.
- Define SIGSYS_CODE via libc on all Unix and gate all SIGSYS uses with cfg(target_family="unix").
- Keep non-Unix fallback SIGKILL_CODE=9 to preserve synthesized 128+9 behavior.
- sandbox_verdict/process_exec_tool_call:
  - Check SIGSYS only on Unix and classify as LikelySandbox.
  - Defer SIGSYS only when a sandbox was requested; otherwise, non-timeout signals error immediately.
  - Build “128 + signal” exit-code checks only on Unix.
- Tests: gate sandbox_sigsys_codepath behind cfg(target_family="unix").

This restores cross-platform builds (Windows and non-Linux/macOS Unix) while simplifying the platform gating.
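
The gating pattern described in this commit message might look roughly like the sketch below. It is an approximation under stated assumptions: literal signal numbers stand in for the libc::* constants so the example has no dependencies, the non-Linux SIGSYS value of 12 is assumed to cover macOS and the BSDs, and `is_sigsys_exit` is a hypothetical helper.

```rust
// Unix-only signal constants grouped in one module, so the cfg appears once.
#[cfg(target_family = "unix")]
mod unix_sig {
    // The commit takes these from libc; literals are used here as a stand-in.
    pub const SIGKILL_CODE: i32 = 9;
    #[cfg(target_os = "linux")]
    pub const SIGSYS_CODE: i32 = 31;
    #[cfg(not(target_os = "linux"))]
    pub const SIGSYS_CODE: i32 = 12;
}

#[cfg(target_family = "unix")]
const SIGKILL_CODE: i32 = unix_sig::SIGKILL_CODE;
// Non-Unix fallback keeps the synthesized 128 + 9 behavior.
#[cfg(not(target_family = "unix"))]
const SIGKILL_CODE: i32 = 9;

/// Shell-style SIGSYS exit (128 + SIGSYS), checked only on Unix.
#[cfg(target_family = "unix")]
fn is_sigsys_exit(exit_code: i32) -> bool {
    exit_code == 128 + unix_sig::SIGSYS_CODE
}

/// On non-Unix targets there is no SIGSYS to detect.
#[cfg(not(target_family = "unix"))]
fn is_sigsys_exit(_exit_code: i32) -> bool {
    false
}
```

Because SIGSYS_CODE is defined for every Unix target rather than only Linux and macOS, `is_sigsys_exit` compiles everywhere, which is the build fix the message describes.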

…YS gating; fix sandbox detection; correct macOS test

- Replace hardcoded signal numbers with libc on Unix:
  - SIGINT/SIGABRT/SIGBUS/SIGFPE/SIGKILL/SIGSEGV/SIGPIPE/SIGTERM/SIGSYS now come from libc.
  - Grouped under a unix_sig module and re-exported; eliminates repeated per-line cfgs.
  - Keep non‑Unix fallback SIGKILL_CODE=9 for synthesized 128+9 behavior.
- Gate platform logic cleanly:
  - Use cfg(target_family = "unix") around Unix-only paths.
  - All SIGSYS uses now compile on every Unix, avoiding previous linux/macos-only symbol gaps.
  - Tests updated to gate the SIGSYS codepath on target_family = "unix".
- Improve sandbox classification:
  - Prefer stderr hints first: treat “sandbox: deny”, “seccomp”, and common OS errors (“operation not permitted”, “permission denied”, “read-only file system”) as LikelySandbox.
  - Defer SIGSYS handling only when a sandbox was requested; otherwise treat non-timeout signals as immediate errors.
  - Shell-style SIGSYS exit (128 + SIGSYS) recognized on all Unix targets.
- Formatting/truncation remains centralized:
  - No changes to output shaping other than classifying sandbox denials earlier so user-facing messages are consistent (“failed in sandbox: …”).
- Tests:
  - Correct an especially egregious path typo in macOS seatbelt exec test:
    - core/tests/suite/exec.rs: "/user/bin/touch" → "/usr/bin/touch".
    - This was likely a typo; if it was intentional for a specific scenario, we can reintroduce it behind an explicit comment/cfg. As‑is it prevented the test from exercising the intended read‑only denial and instead looked like a missing-binary edge case.
- Misc:
  - Ran just fmt and just fix -p codex-core; clippy autofixed exec.rs (9 fixes).
  - Verified codex-core builds; project tests run, with unrelated existing failures unchanged.

Rationale

- Avoids Windows/*BSD build breaks from SIGSYS references while keeping behavior correct on Linux/macOS.
- libc keeps signal numbers portable; grouping reduces cfg noise and maintenance risk.
- Earlier stderr-based sandbox detection makes the shell tool denial UX match expectations even when the sandbox blocks before the child produces long output.

chatgpt-codex-connector bot reviewed Oct 6, 2025

imajes force-pushed the fix/resolve-issues-with-sandbox-denials branch from 6e0c459 to 9dee00f Compare October 7, 2025 00:16

chatgpt-codex-connector bot reviewed Oct 7, 2025

imajes added 3 commits October 8, 2025 10:03

imajes force-pushed the fix/resolve-issues-with-sandbox-denials branch from 3785974 to 5b3632e Compare October 8, 2025 15:04
