|
| 1 | +# ADR 0004: Error Handling Policy for codetracer-python-recorder |
| 2 | + |
| 3 | +- **Status:** Proposed |
| 4 | +- **Date:** 2025-10-02 |
| 5 | +- **Deciders:** Runtime Tracing Maintainers |
| 6 | +- **Consulted:** Python Tooling WG, Observability WG |
| 7 | +- **Informed:** Developer Experience WG, Release Engineering |
| 8 | + |
| 9 | +## Context |
| 10 | + |
| 11 | +The Rust-backed recorder currently propagates errors piecemeal: |
| 12 | +- PyO3 entry points bubble up plain `PyRuntimeError` instances with free-form strings (e.g., `src/session.rs:21-52`, `src/runtime/mod.rs:77-126`). |
| 13 | +- Runtime helpers panic on invariant violations, which will abort the host interpreter because we do not fence panics at the FFI boundary (`src/runtime/mod.rs:107-120`, `src/runtime/activation.rs:24-33`, `src/runtime/value_encoder.rs:61-78`). |
| 14 | +- Monitoring callbacks rely on `GLOBAL.lock().unwrap()` so poisoned mutexes or lock errors terminate the process (`src/monitoring/tracer.rs:268` and subsequent callback shims). |
| 15 | +- Python helpers expose bare `RuntimeError`/`ValueError` without linking to a shared policy, and auto-start simply re-raises whatever the Rust layer emits (`codetracer_python_recorder/session.py:27-63`, `codetracer_python_recorder/auto_start.py:24-36`). |
| 16 | +- Exit codes, log destinations, and trace-writer fallback behaviour are implicit; a disk-full failure today yields a generic exception and can leave partially written outputs. |
| 17 | + |
| 18 | +The lack of a central error façade makes it hard to enforce user-facing guarantees, reason about detaching vs aborting behaviour, or meet the operational goals we have been given: stable error codes, structured logs, optional JSON diagnostics, policy switches, and atomic trace outputs. |
| 19 | + |
| 20 | +## Decision |
| 21 | + |
| 22 | +We will introduce a recorder-wide error handling policy centred on a dedicated `recorder-errors` crate and a Python exception hierarchy. The policy follows fifteen guiding principles supplied by operations and is designed so the “right way” is the only easy way for contributors. |
| 23 | + |
| 24 | +### 1. Single Error Façade |
| 25 | +- Create a new workspace crate `recorder-errors` exporting `RecorderError`, a structural error type with fields `{ kind: ErrorKind, code: ErrorCode, message: Cow<'static, str>, context: ContextMap, source: RecorderErrorSource }`. |
| 26 | +- Provide `RecorderResult<T> = Result<T, RecorderError>` and convenience macros (`usage!`, `enverr!`, `target!`, `bug!`, `ensure_usage!`, `ensure_env!`, etc.) so Rust modules can author classified failures with one line. |
| 27 | +- Require every other crate (including the PyO3 module) to depend on `recorder-errors`; direct construction of `PyErr`/`io::Error` is disallowed outside the façade. |
| 28 | +- Maintain `ErrorCode` as a small, grep-able enum (e.g., `ERR_TRACE_DIR_NOT_DIR`, `ERR_FORMAT_UNSUPPORTED`), with documentation in the crate so codes stay stable across releases. |
| 29 | + |
| 30 | +### 2. Clear Classification & Exit Codes |
| 31 | +- Define four top-level `ErrorKind` variants: |
| 32 | + - `Usage` (caller mistakes, bad flags, conflicting sessions). |
| 33 | + - `Environment` (IO, permissions, resource exhaustion). |
| 34 | + - `Target` (user code raised or misbehaved while being traced). |
| 35 | + - `Internal` (bugs, invariants, unexpected panics). |
| 36 | +- Map kinds to fixed process exit codes (`Usage=2`, `Environment=10`, `Target=20`, `Internal=70`). These are surfaced by CLI utilities and exported via the Python module for embedding tooling. |
| 37 | +- Document canonical examples for each kind in the ADR appendix and in crate docs. |
| 38 | + |
| 39 | +### 3. FFI Safety & Python Exceptions |
| 40 | +- Add an `ffi` module that wraps every `#[pyfunction]` with `catch_unwind`, converts `RecorderError` into a custom Python exception hierarchy (`RecorderError` base, subclasses `UsageError`, `EnvironmentError`, `TargetError`, `InternalError`), and logs panic payloads before mapping them to `InternalError`. |
| 41 | +- PyO3 callbacks (`install_tracer`, monitoring trampolines) will run through `ffi::dispatch`, ensuring we never leak panics across the boundary. |
| 42 | + |
| 43 | +### 4. Output Channels & Diagnostics |
| 44 | +- Forbid `println!`/`eprintln!` outside the logging module; diagnostic output goes to stderr via `tracing`/`log` infrastructure. |
| 45 | +- Introduce a structured logging wrapper that attaches `{ run_id, trace_id, error_code }` fields to every error record. Provide `--log-level`, `--log-file`, and `--json-errors` switches that route structured diagnostics either to stderr or a configured file. |
| 46 | + |
| 47 | +### 5. Policy Switches |
| 48 | +- Introduce a runtime policy singleton (`RecorderPolicy` stored in `OnceCell`) configured via CLI flags or environment variables: `--on-recorder-error=abort|disable`, `--require-trace`, `--keep-partial-trace`. |
| 49 | +- Define semantics: `abort` -> propagate error and non-zero exit; `disable` -> detach tracer, emit structured warning, continue host process. Document exit codes for each combination in module docs. |
| 50 | + |
| 51 | +### 6. Atomic, Truthful Outputs |
| 52 | +- Wrap trace writes behind an IO façade that stages files in a temp directory and performs atomic rename on success. |
| 53 | +- When `--keep-partial-trace` is enabled, mark outputs with a `partial=true`, `reason=<ErrorCode>` trailer. Otherwise ensure no trace files are left behind on failure. |
| 54 | + |
| 55 | +### 7. Assertions with Containment |
| 56 | +- Replace `expect`/`unwrap` (e.g., `src/runtime/mod.rs:114`, `src/runtime/activation.rs:26`, `src/runtime/value_encoder.rs:70`) with classified `bug!` assertions that convert to `RecorderError` while still triggering `debug_assert!` in dev builds. |
| 57 | +- Document invariants in the new crate and ensure fuzzing/tests observe the diagnostics. |
| 58 | + |
| 59 | +### 8. Preflight Checks |
| 60 | +- Centralise version/compatibility checks in a `preflight` module called from `start_tracing`. Validate Python major.minor, ABI compatibility, trace schema version, and feature flags before installing monitoring callbacks. |
| 61 | +- Embed recorder version, schema version, and policy hash into every trace metadata file via `TraceWriter` extensions. |
| 62 | + |
| 63 | +### 9. Observability & Metrics |
| 64 | +- Emit structured counters for key error pathways (dropped events, detach reasons, panics caught). Provide a `RecorderMetrics` sink with a no-op default and an optional exporter trait. |
| 65 | +- When `--json-errors` is set, append a single-line JSON trailer to stderr containing `{ "error_code": .., "kind": .., "message": .., "context": .. }`. |
| 66 | + |
| 67 | +### 10. Failure-Path Testing |
| 68 | +- Add exhaustive unit tests in `recorder-errors` for every `ErrorCode` and conversion path. |
| 69 | +- Extend Rust integration tests to simulate disk-full (`ENOSPC`), permission denied, target exceptions, callback panics, SIGINT during detach, and partial trace recovery. |
| 70 | +- Add Python tests asserting the custom exception hierarchy and policy toggles behave as documented. |
| 71 | + |
| 72 | +### 11. Performance-Aware Defences |
| 73 | +- Reserve heavyweight diagnostics (stack captures, large context maps) for error paths. Hot callbacks use cheap checks (`debug_assert!` in release builds). Provide sampled validation hooks if additional runtime checks become necessary. |
| 74 | + |
| 75 | +### 12. Tooling Enforcement |
| 76 | +- Add workspace lints (`deny(panic_in_result_fn)`, Clippy config) and a `just lint-errors` task that fails if `panic!`, `unwrap`, or `expect` appear outside `recorder-errors`. |
| 77 | +- Disallow `anyhow`/`eyre` except inside the error façade with documented justification. |
| 78 | + |
| 79 | +### 13. Developer Ergonomics |
| 80 | +- Export prelude modules (`use recorder_errors::prelude::*;`) so contributors get macros and types with a single import. |
| 81 | +- Provide cookbook examples in the crate documentation and link the ADR so developers know how to map new errors to codes quickly. |
| 82 | + |
| 83 | +### 14. Documented Guarantees |
| 84 | +- Document, in README + crate docs, the three promises: no stdout writes, trace outputs are atomic (or explicitly partial), and error codes stay stable within a minor version line. |
| 85 | + |
| 86 | +### 15. Scope & Non-Goals |
| 87 | +- The recorder never aborts the host process; even internal bugs downgrade to `InternalError` surfaced through policy switches. |
| 88 | +- Business-specific retention, shipping logs, or analytics integrations remain out of scope for this ADR. |
| 89 | + |
| 90 | +## Consequences |
| 91 | + |
| 92 | +- **Positive:** Structured errors enable user tooling, stable exit codes improve scripting, and panics are contained so we remain embedder-friendly. Central macros reduce boilerplate and make reviewers enforce policy easily. |
| 93 | +- **Negative / Risks:** Introducing a new crate and policy layer adds upfront work and requires retrofitting existing call sites. Atomic IO staging may increase disk usage for large traces. Contributors must learn the new taxonomy and update tests accordingly. |
| 94 | + |
| 95 | +## Rollout & Status Tracking |
| 96 | + |
| 97 | +- Implementation proceeds under a dedicated plan (see "Error Handling Implementation Plan"). The ADR moves to **Accepted** once the façade crate, FFI wrappers, and policy switches are merged, and the legacy ad-hoc errors are removed. |
| 98 | +- Future adjustments (e.g., new error codes) must update `recorder-errors` documentation and ensure backward compatibility for exit codes. |
| 99 | + |
| 100 | +## Alternatives Considered |
| 101 | + |
| 102 | +- **Use `anyhow` throughout and convert at the boundary.** Rejected because it obscures error provenance, offers no stable codes, and encourages stringly-typed errors. |
| 103 | +- **Catch panics lazily within individual callbacks.** Rejected; a central wrapper keeps the policy uniform and ensures we do not miss newer entry points. |
| 104 | +- **Rely on existing logging without policy switches.** Rejected because operational requirements demand scriptable behaviour on failure. |
| 105 | + |
0 commit comments