-
Notifications
You must be signed in to change notification settings - Fork 5.6k
feat(core[exec]): Stronger SandboxVerdict Classifier & Exec Flow #4852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(core[exec]): Stronger SandboxVerdict Classifier & Exec Flow #4852
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
@pakrym-oai, @bolinfest -- this fixes issues introduced by #3576. The approach afforded by This is true for things like pytest, which makes use of varied exit codes, but i would imagine anything run like e.g. It's not easy to know whether an exit code is because of the sandbox or the process it invoked, but i think there are some heuristics we can leverage. This PR provides a heuristic classifier, which essentially looks for the class of outcomes that most likely indicates a sandbox error, and then pretty much passes through the rest -- because they're not likely to be sandbox specific. This came about because in the last few days, i've noticed a significant number of Happy to talk more, edit, etc -- but i've been running this all afternoon with great success. Obviously it'll be good to get wider feedback, but it is working so far! |
6e0c459
to
9dee00f
Compare
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
fn stderr_hints_sandbox(stderr: &str) -> bool { | ||
// Conservative: require explicit sandbox/seccomp phrasing likely to come from | ||
// a kernel/seatbelt/seccomp denial; generic perms errors are too noisy. | ||
let s = stderr.to_ascii_lowercase(); | ||
s.contains("sandbox: deny") | ||
|| s.contains("not permitted by sandbox") | ||
|| s.contains("bad system call") | ||
|| s.contains("seccomp") | ||
// Less explicit, but commonly emitted by shells and tools when a seatbelt | ||
// or seccomp policy blocks a write/open. Tests rely on these. | ||
|| s.contains("operation not permitted") | ||
|| s.contains("permission denied") | ||
|| s.contains("read-only file system") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid treating generic “permission denied” output as sandbox denial
The new stderr_hints_sandbox
function classifies any stderr containing phrases like "permission denied"
or "operation not permitted"
as LikelySandbox
, and process_exec_tool_call
turns that verdict into a SandboxErr::Denied
. Many common commands emit those exact strings for reasons unrelated to the sandbox (e.g., git clone
failing with Permission denied (publickey)
or an SSH server refusing credentials). When such a failure happens under seatbelt/seccomp, the code will now misreport it as a sandbox denial even though the sandbox played no role. This defeats the goal of distinguishing real sandbox denials from normal command failures and will send users down the wrong debugging path. The heuristic should be narrowed or combined with other evidence before upgrading to Denied
.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's a first step. I agree that there's no such thing as a perfect heuristic; i wonder if we should adopt a weighted approach which assigns each of these "signals" with a weight, and then calculate the probability that it's a sandbox error.
That sounds good, but it becomes super complex. Either way, a permission denied (publickey)
is likely a sandbox fail as it'll require an escalation to the user to fix.
…fier; correct SIGSYS; remove misattribution Why - The previous approach treated many non‑zero exits as sandbox denials, making normal command failures look like sandbox errors. - SIGSYS mapping was inconsistent (hardcoded to 12 while Linux uses 31), and magic integers made maintenance hard. What - Introduced a conservative SandboxVerdict classifier with three outcomes: - LikelySandbox | LikelyNotSandbox | Unknown - Rewired process_exec_tool_call to: - Preserve timeouts as Timeout and non‑SIGSYS signals as Signal - Defer SIGSYS classification until stderr is available - Only return Denied on strong sandbox evidence - Removed legacy is_likely_sandbox_denied (no remaining callers). - Replaced magic numbers with named constants for clarity and correctness. - Improved debug logging context. Details - New classifier: sandbox_verdict(sandbox_type: SandboxType, status: ExitStatus, stderr: &str) -> SandboxVerdict - Signals: - SIGSYS → LikelySandbox (via status.signal() or exit code 128+SIGSYS) - Common app/user signals → LikelyNotSandbox - SIGINT_CODE=2, SIGABRT_CODE=6, SIGBUS_CODE=7, SIGFPE_CODE=8, SIGKILL_CODE=9, SIGSEGV_CODE=11, SIGPIPE_CODE=13, SIGTERM_CODE=15 - Exit codes: - LikelyNotSandbox: 0, 2, 124 (timeout wrapper), 127 (not found), 64..=78 (BSD sysexits), and 128+{the signals above} - EXIT_NOT_EXECUTABLE_CODE=126 → Unknown unless stderr has explicit sandbox hints - Otherwise Unknown unless hints indicate sandbox - Stderr hints (case-insensitive; conservative): - “sandbox: deny”, “not permitted by sandbox”, “bad system call”, “seccomp” - Platform constants: - SIGSYS_CODE: macOS=12, Linux=31 (no libc deps). - Exec flow changes: - Defer SIGSYS to inspect stderr; otherwise return Signal immediately. - After building ExecToolCallOutput, compute verdict; on LikelySandbox → Denied { output }; else if deferred signal → Signal; else return Ok(exec_output). - Logging: - tracing::debug!(target: "codex_core::exec", exit_code, ?pending_signal, verdict=?verdict, "exec SandboxClassification") Removed - is_likely_sandbox_denied shim removed; note left explaining how to synthesize an ExitStatus if ever needed again. Refactors / Clippy - sandbox_verdict now takes ExitStatus by value (small Copy type) to satisfy clippy::trivially-copy-pass-by-ref. - Replaced raw lists of integers with named constants and a computed array for 128+signal codes. Tests - Unit tests in core/src/exec.rs: - not_sandbox_obvious_codes: 127/124/2 → LikelyNotSandbox - sandbox_sigsys_codepath: 128+SIGSYS → LikelySandbox - sandbox_stderr_hint: 126 + “Sandbox: deny …” → LikelySandbox - unknown_generic_failure: 1 → Unknown - Existing integration/suite tests already cover timeout/signal/Denied and OK paths. Manual Test Plan - macOS seatbelt: - Run a command that exits 126 and prints “Sandbox: deny …” to stderr → Denied. - Run a command that exits 127 or 2 → not Denied. - Linux seccomp: - Simulate bad syscall exit 128+31 → Denied. - 130/141/143 (SIGINT/SIGPIPE/SIGTERM) → not Denied; surfaced as normal failure or Signal. - Confirm debug line appears with target codex_core::exec and message “exec SandboxClassification”. Behavioral Changes - Normal non‑zero exits (e.g., exit 1) are no longer misclassified as sandbox errors. - Timeouts and non‑SIGSYS signals retain previous semantics. - Denials are now attributed only on strong evidence. Risk / Rollout - Scope limited to core/src/exec.rs; no new dependencies. - macOS and Linux only; no *BSD branches. - Low risk: adds conservative classification and clearer logging; removes unused shim. Follow‑ups (optional) - Add a macOS‑specific stderr hint “sandbox-exec:” if we want to treat seatbelt launcher failures as LikelySandbox. - Expand tests to include a couple more 128+signal codes to lock in “not sandbox” classification. - Run just fix -p codex-core and optionally a full workspace test pass once reviewers approve.
…S gating for Windows/*BSD - Replace hardcoded signal numbers with libc::* on Unix (SIGINT/SIGABRT/SIGBUS/SIGFPE/SIGKILL/SIGSEGV/SIGPIPE/SIGTERM/SIGSYS). - Group Unix signal constants in a unix_sig module and re-export; removes repeated per-line cfgs. - Define SIGSYS_CODE via libc on all Unix and gate all SIGSYS uses with cfg(target_family="unix"). - Keep non-Unix fallback SIGKILL_CODE=9 to preserve synthesized 128+9 behavior. - sandbox_verdict/process_exec_tool_call: - Check SIGSYS only on Unix and classify as LikelySandbox. - Defer SIGSYS only when a sandbox was requested; otherwise, non-timeout signals error immediately. - Build “128 + signal” exit-code checks only on Unix. - Tests: gate sandbox_sigsys_codepath behind cfg(target_family="unix"). This restores cross-platform builds (Windows and non-Linux/macOS Unix) while simplifying the platform gating.
…YS gating; fix sandbox detection; correct macOS test - Replace hardcoded signal numbers with libc on Unix: - SIGINT/SIGABRT/SIGBUS/SIGFPE/SIGKILL/SIGSEGV/SIGPIPE/SIGTERM/SIGSYS now come from libc. - Grouped under a unix_sig module and re-exported; eliminates repeated per-line cfgs. - Keep non‑Unix fallback SIGKILL_CODE=9 for synthesized 128+9 behavior. - Gate platform logic cleanly: - Use cfg(target_family = "unix") around Unix-only paths. - All SIGSYS uses now compile on every Unix, avoiding previous linux/macos-only symbol gaps. - Tests updated to gate the SIGSYS codepath on target_family = "unix". - Improve sandbox classification: - Prefer stderr hints first: treat “sandbox: deny”, “seccomp”, and common OS errors (“operation not permitted”, “permission denied”, “read-only file system”) as LikelySandbox. - Defer SIGSYS handling only when a sandbox was requested; otherwise treat non-timeout signals as immediate errors. - Shell-style SIGSYS exit (128 + SIGSYS) recognized on all Unix targets. - Formatting/truncation remains centralized: - No changes to output shaping other than classifying sandbox denials earlier so user-facing messages are consistent (“failed in sandbox: …”). - Tests: - Correct an especially egregious path typo in macOS seatbelt exec test: - core/tests/suite/exec.rs: "/user/bin/touch" → "/usr/bin/touch". - This was likely a typo; if it was intentional for a specific scenario, we can reintroduce it behind an explicit comment/cfg. As‑is it prevented the test from exercising the intended read‑only denial and instead looked like a missing-binary edge case. - Misc: - Ran just fmt and just fix -p codex-core; clippy autofixed exec.rs (9 fixes). - Verified codex-core builds; project tests run, with unrelated existing failures unchanged. Rationale - Avoids Windows/*BSD build breaks from SIGSYS references while keeping behavior correct on Linux/macOS. - libc keeps signal numbers portable; grouping reduces cfg noise and maintenance risk. - Earlier stderr-based sandbox detection makes the shell tool denial UX match expectations even when the sandbox blocks before the child produces long output.
3785974
to
5b3632e
Compare
Why
12
while Linux uses31
), and magic integers made maintenance hard.Pytest exit codes (context for "Why")
0
— All tests collected and passed.1
— Tests ran but some failed (normal failing test signal).2
— Execution interrupted by user (Ctrl‑C).3
— Internal pytest error while executing tests.4
— Command‑line usage/config error (bad CLI/options).5
— No tests were collected. Intentionally non‑zero so CI fails when your test discovery/config is broken (e.g., wrong-k
/-m
filters, path patterns, or missing files).Implication for Codex classification: these are typical app/tool outcomes, not sandbox denials. Treat
0/2/4/5
asLikelyNotSandbox
;1
and generic>0
without hints asUnknown
(not sandbox). Only upgrade toLikelySandbox
on clear evidence (e.g.,SIGSYS
, explicit sandbox denial text).What
SandboxVerdict
classifier with three outcomes:LikelySandbox
·LikelyNotSandbox
·Unknown
process_exec_tool_call
to:Timeout
and non‑SIGSYS signals asSignal
.stderr
is available.Denied
only on strong sandbox evidence.is_likely_sandbox_denied
(no remaining callers).Details
New classifier
Signals
SIGSYS
→LikelySandbox
(viastatus.signal()
or exit code128 + SIGSYS
).LikelyNotSandbox
:SIGINT_CODE = 2
,SIGABRT_CODE = 6
,SIGBUS_CODE = 7
,SIGFPE_CODE = 8
,SIGKILL_CODE = 9
,SIGSEGV_CODE = 11
,SIGPIPE_CODE = 13
,SIGTERM_CODE = 15
.Exit codes
LikelyNotSandbox
:0
,2
,124
(timeout wrapper),127
(not found),64..=78
(BSDsysexits
), and128 + {signals above}
.EXIT_NOT_EXECUTABLE_CODE = 126
→Unknown
unlessstderr
has explicit sandbox hints.Unknown
unless hints indicate sandbox."sandbox: deny"
,"not permitted by sandbox"
,"bad system call"
,"seccomp"
.Platform constants
SIGSYS_CODE
: macOS =12
, Linux =31
(nolibc
deps).Exec flow changes
SIGSYS
to inspectstderr
; otherwise returnSignal
immediately.ExecToolCallOutput
, computeverdict
:LikelySandbox
→Denied { output }
.Signal
.Ok(exec_output)
.Logging
Removed
is_likely_sandbox_denied
shim removed; note left explaining how to synthesize anExitStatus
if ever needed again.Refactors / Clippy
sandbox_verdict
now takesExitStatus
by value (smallCopy
type) to satisfyclippy::trivially-copy-pass-by-ref
.128 + signal
codes.Tests
core/src/exec.rs
:127
/124
/2
→LikelyNotSandbox
.128 + SIGSYS
→LikelySandbox
.126
+"Sandbox: deny …"
→LikelySandbox
.1
→Unknown
.Manual Test Plan
macOS seatbelt
126
and prints"Sandbox: deny …"
tostderr
→Denied
.127
or2
→ notDenied
.Linux seccomp
128 + 31
→Denied
.130
/141
/143
(SIGINT
/SIGPIPE
/SIGTERM
) → notDenied
; surfaced as normal failure orSignal
.Confirm debug line appears with target
codex_core::exec
and messageexec SandboxClassification
.Behavioral Changes
exit 1
) are no longer misclassified as sandbox errors.Risk / Rollout
core/src/exec.rs
; no new dependencies.Follow‑ups (optional)
"sandbox-exec:"
if we want to treat seatbelt launcher failures asLikelySandbox
.128 + signal
codes to lock in "not sandbox" classification.just fix -p codex-core
and optionally a full workspace test pass once reviewers approve.External (non‑OpenAI) Pull Request Requirements
Note: ✅ I didn't create the issue first as I was solving the bug I was experiencing in real time. Apologies for that.