design: Recorder Exit Code Policy

tzanko-matev · tzanko-matev · commit 6e688478f009 · 2025-10-28T13:18:19.000+02:00
design-docs/adr/0017-recorder-exit-code-policy.md: 
design-docs/recorder-exit-code-policy-implementation-plan.md: 

Signed-off-by: Tzanko Matev <tsanko@metacraft-labs.com>
diff --git a/design-docs/adr/0017-recorder-exit-code-policy.md b/design-docs/adr/0017-recorder-exit-code-policy.md
@@ -0,0 +1,45 @@
+# ADR 0017 – Recorder Exit Code Policy
+
+- **Status:** Proposed
+- **Date:** 2025-10-28
+- **Stakeholders:** Desktop CLI team, Runtime tracer maintainers, Release engineering
+- **Related Decisions:** ADR 0005 (Python Recorder DB Backend Integration), ADR 0015 (Balanced Toplevel Lifecycle and Trace Gating)
+
+## Context
+
+`ct record` invokes the Rust-backed `codetracer_python_recorder` CLI when capturing Python traces. The CLI currently returns the traced script's process exit code (`codetracer_python_recorder/cli.py:165`). When the target program exits with a non-zero status—whether via `SystemExit`, a failed assertion, or an explicit `sys.exit()`—the recorder propagates that status. The desktop CLI treats any non-zero exit as a fatal recording failure, so trace uploads and follow-on automation abort even though the trace artefacts are valid and the recorder itself completed successfully.
+
+Our recorder already captures the script's exit status in session metadata (`runtime/tracer/lifecycle.rs:143`) and exposes it through trace viewers. Downstream consumers that need to assert on the original program outcome can read that field. However, other integrations (CI pipelines, `ct record` automations, scripted data collection) rely on the CLI process exit code to decide whether to continue, and they expect Codetracer to return `0` when recording succeeded.
+
+We must let callers control whether the recorder propagates the script's exit status or reports recorder success independently. The default should favour Codetracer success (exit `0`) to preserve `ct record` expectations, while still allowing advanced users and direct CLI invocations to opt back into passthrough semantics.
+
+## Decision
+
+Introduce a recorder exit-code policy with the following behaviour:
+
+1. **Default:** When tracing completes without recorder errors (start, flush, stop, and write phases succeed and `require_trace` did not trigger), the CLI exits with status `0` regardless of the traced script's exit code. The recorder still records the script's status in trace metadata.
+2. **Opt-in passthrough:** Expose a CLI flag `--propagate-script-exit` and environment override `CODETRACER_PROPAGATE_SCRIPT_EXIT`. When enabled, the CLI mirrors the traced script's exit code (the current behaviour). Both configuration surfaces resolve through the recorder policy layer so other entry points (e.g., embedded integrations) can opt in.
+3. **User feedback:** If passthrough is disabled and the script exits non-zero, emit a one-line warning on stderr indicating the script's exit status and how to re-enable propagation.
+4. **Recorder failure precedence:** Recorder failures (startup errors, policy violations such as `--require-trace`, flush/stop exceptions) continue to exit non-zero irrespective of the propagation setting to ensure automation can detect recorder malfunction.
+
+This policy applies uniformly to `python -m codetracer_python_recorder`, `ct record`, and any embedding that drives the same CLI module.
+
+## Consequences
+
+**Positive**
+
+- `ct record` can treat successful recordings as success even when the target script fails, unblocking chained workflows and uploads.
+- The script's exit status remains available in trace metadata, preserving observability without overloading process exit handling.
+- Configuration is explicit and discoverable via CLI help, environment variables, and policy APIs.
+
+**Negative / Risks**
+
+- Direct CLI users may miss that their script failed if they rely solely on the process exit code. The stderr warning mitigates this at the cost of extra output.
+- Changing the default may surprise users accustomed to passthrough semantics. Documentation and release notes must highlight the new default and the opt-in flag.
+- Additional configuration surface increases policy complexity; we must ensure conflicting overrides (CLI vs. env) resolve predictably.
+
+## Rollout Notes
+
+- Update CLI help text, README, and desktop `ct record` documentation to describe the new default and override flag.
+- Add regression tests covering both default and passthrough modes (CLI invocation, environment override, policy API).
+- Communicate the change in the recorder CHANGELOG and release notes so downstream automation owners can adjust expectations.
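
The four rules in the ADR's Decision section can be sketched as one pure function. This is a minimal illustration, not the recorder's implementation: `resolve_exit_code`, `RECORDER_FAILURE_EXIT`, and the warning wording are hypothetical, and the ADR does not state which non-zero code a recorder failure uses.

```python
from typing import Optional, Tuple

# Assumption: the ADR only says recorder failures exit non-zero; 1 is a placeholder.
RECORDER_FAILURE_EXIT = 1


def resolve_exit_code(
    script_exit: int, recorder_ok: bool, propagate: bool
) -> Tuple[int, Optional[str]]:
    """Return (process exit code, optional stderr warning)."""
    # Rule 4: recorder failures take precedence over the propagation setting.
    if not recorder_ok:
        return RECORDER_FAILURE_EXIT, None
    # Rule 2: opt-in passthrough mirrors the traced script's exit code.
    if propagate:
        return script_exit, None
    # Rule 3: warn when a non-zero script status is being masked.
    warning = None
    if script_exit != 0:
        warning = (
            f"warning: script exited with status {script_exit}; "
            "pass --propagate-script-exit to propagate it"
        )
    # Rule 1: default to recorder success.
    return 0, warning
```

Keeping the decision free of side effects means the caller prints the warning to stderr and exits, which makes each combination in the regression tests a single assertion.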
diff --git a/design-docs/recorder-exit-code-policy-implementation-plan.md b/design-docs/recorder-exit-code-policy-implementation-plan.md
@@ -0,0 +1,61 @@
+# Recorder Exit Code Policy – Implementation Plan
+
+Plan owners: codetracer Python recorder maintainers  
+Related ADR: 0017 – Recorder Exit Code Policy  
+Target release: codetracer-python-recorder 0.x (next minor)
+
+## Goals
+- Default the recorder CLI to exit with status `0` when tracing succeeds, even if the target script exits non-zero.
+- Preserve the script exit status in trace metadata and surface it through logs so users stay informed.
+- Provide consistent configuration knobs (CLI flag, environment variable, policy API) to re-enable exit-code passthrough when desired.
+- Ensure recorder failures (errors in `start`, `flush`, or `stop`, and `require_trace` violations) still produce non-zero exit codes.
+
+## Non-Goals
+- Changing how `ct record` parses or surfaces recorder output beyond the new default.
+- Altering metadata schemas storing script exit information.
+- Introducing scripting hooks for arbitrary exit-code transforms outside passthrough vs. success modes.
+
+## Current Gaps
+- `codetracer_python_recorder.cli.main` (`codetracer_python_recorder/cli.py:165`) always returns the traced script's exit code; there is no concept of recorder success vs. script result.
+- `RecorderPolicy` (`src/policy/model.rs`) and associated FFI lack an exit-code behaviour flag, so env policy and embedding APIs cannot control the outcome.
+- No CLI argument or environment variable communicates the user's preference, and the help text/docs imply passthrough semantics.
+- There are no regression tests asserting CLI exit behaviour; `ct record` integration relies on current passthrough behaviour implicitly.
+
+## Workstreams
+
+### WS1 – Policy & Configuration Plumbing
+**Scope:** Extend recorder policy models and configuration surfaces with an exit-code behaviour flag.
+- Add `propagate_script_exit_code: bool` to `RecorderPolicy` (default `false`) plus a matching field in `PolicyUpdate`; update the `apply_update`/`Default` implementations.
+- Extend PyO3 bindings `configure_policy_py`/`policy_snapshot` to accept and expose `propagate_script_exit_code`.
+- Add environment variable `CODETRACER_PROPAGATE_SCRIPT_EXIT` in `policy/env.rs`, parsing booleans via existing helpers.
+- Update Python session helpers (`session.py`) to pass through a `propagate_script_exit` policy key, including `_coerce_policy_kwargs`.
+- Unit tests:
+  - Rust: policy default and update round-trips (`policy::model` & `policy::ffi` tests).
+  - Rust: env configuration toggles new flag and rejects invalid values.
+  - Python: `configure_policy` keyword argument path accepts the new key.
+
+### WS2 – CLI Behaviour & Warning Surface
+**Scope:** Teach the CLI to honour the new policy and compute the final exit status.
+- Introduce `--propagate-script-exit` boolean flag (default `False`) wired into CLI help; set `policy["propagate_script_exit"] = True` when provided.
+- After `start(...)`, cache the effective propagation flag by inspecting CLI config and, when unspecified, consulting `policy_snapshot()` to honour env defaults.
+- Rework `main`'s shutdown path:
+  - Track recorder success across `start`, script execution, `flush`, and `stop`.
+  - Decide the final process exit: if the recorder failed, keep the existing distinct error code; otherwise return `script_exit` when propagation is enabled and `0` when it is not.
+  - On non-zero script exit with propagation disabled, emit a concise stderr warning mentioning the exit status and `--propagate-script-exit`.
+- Ensure the script exit status continues to flow into `stop(exit_code=...)` and metadata serialisation unchanged.
+- Add CLI unit/integration tests (pytest) covering combinations: default non-propagating success/failure, `--propagate-script-exit`, and recorder failure paths (e.g., missing script).
+
+### WS3 – Documentation, Tooling, and Release Notes
+**Scope:** Update user-facing materials and automation checks.
+- Refresh CLI `--help`, README, and docs (`docs/book/src/...` if applicable) to describe default exit behaviour and configuration options.
+- Document `CODETRACER_PROPAGATE_SCRIPT_EXIT` and Python policy key in API guides.
+- Add CHANGELOG entry summarising behaviour change and migration guidance for users relying on passthrough.
+- Extend CI/test harness:
+  - Add a regression test, run via `just test`, that asserts CLI exit codes (likely in the Python test suite under `tests/`).
+  - Update any existing `ct record` integration smoke tests to pin the expected default (0) where relevant.
+- Coordinate with desktop CLI maintainers to flip their expectations once the recorder release lands.
+
+## Timeline & Dependencies
+- WS1 should land first to provide configuration plumbing for CLI work.
+- WS2 depends on WS1's policy flag; both should merge within the same feature branch to avoid transient inconsistent behaviour.
+- WS3 can progress in parallel once WS2 stabilises, but final doc updates should wait for CLI flag names to settle.
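
One way to read WS2's "effective propagation flag" step is that an explicit CLI flag wins and the environment variable is consulted only when the flag is absent. The sketch below simplifies the plan: it reads the environment directly instead of going through the Rust policy layer and `policy_snapshot()`. `parse_bool`, `effective_propagation`, and the accepted boolean spellings are assumptions; the plan defers parsing to existing helpers it does not show.

```python
import argparse
from typing import Mapping, Sequence

ENV_VAR = "CODETRACER_PROPAGATE_SCRIPT_EXIT"

# Assumption: the recorder's real boolean helper may accept different spellings.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def parse_bool(raw: str) -> bool:
    """Parse an environment boolean, rejecting unrecognised values."""
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"invalid boolean for {ENV_VAR}: {raw!r}")


def effective_propagation(argv: Sequence[str], env: Mapping[str, str]) -> bool:
    """Resolve the propagation setting: CLI flag first, then env, then False."""
    parser = argparse.ArgumentParser(prog="recorder-sketch")
    # default=None lets us tell "flag not given" apart from an explicit choice.
    parser.add_argument("--propagate-script-exit", action="store_true", default=None)
    args, _rest = parser.parse_known_args(list(argv))
    if args.propagate_script_exit is not None:
        return True
    raw = env.get(ENV_VAR)
    return parse_bool(raw) if raw is not None else False
```

Passing `argv` and `env` as parameters rather than reading `sys.argv` and `os.environ` keeps the precedence rule testable without spawning a subprocess.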