errors: Designing error policy

tzanko-matev · tzanko-matev · commit 954d3e30da74 · 2025-10-02T17:26:52.000+03:00
diff --git a/README.md b/README.md
@@ -45,6 +45,15 @@ Basic workflow:
   - from codetracer_python_recorder import hello
   - hello()
 
+#### Testing & Coverage
+
+- Run the full split test suite (Rust nextest + Python pytest): `just test`
+- Run only Rust integration/unit tests: `just cargo-test`
+- Run only Python tests (including the pure-Python recorder to guard regressions): `just py-test`
+- Collect coverage artefacts locally (LCOV + Cobertura/JSON): `just coverage`
+
+The CI workflow mirrors these commands. Pull requests get an automated comment with the latest Rust/Python coverage tables and downloadable artefacts (`lcov.info`, `coverage.xml`, `coverage.json`).
+
 ### Future directions
 
 The current Python support is an unfinished prototype. We can finish it. In the future, it may be expanded to function in a way to similar to the more complete implementations, e.g. [Noir](https://github.com/blocksense-network/noir/tree/blocksense/tooling/tracer).
diff --git a/design-docs/adr/0004-error-handling-policy.md b/design-docs/adr/0004-error-handling-policy.md
@@ -0,0 +1,105 @@
+# ADR 0004: Error Handling Policy for codetracer-python-recorder
+
+- **Status:** Proposed
+- **Date:** 2025-10-02
+- **Deciders:** Runtime Tracing Maintainers
+- **Consulted:** Python Tooling WG, Observability WG
+- **Informed:** Developer Experience WG, Release Engineering
+
+## Context
+
+The Rust-backed recorder currently propagates errors piecemeal:
+- PyO3 entry points bubble up plain `PyRuntimeError` instances with free-form strings (e.g., `src/session.rs:21-52`, `src/runtime/mod.rs:77-126`).
+- Runtime helpers panic on invariant violations, which will abort the host interpreter because we do not fence panics at the FFI boundary (`src/runtime/mod.rs:107-120`, `src/runtime/activation.rs:24-33`, `src/runtime/value_encoder.rs:61-78`).
+- Monitoring callbacks rely on `GLOBAL.lock().unwrap()` so poisoned mutexes or lock errors terminate the process (`src/monitoring/tracer.rs:268` and subsequent callback shims).
+- Python helpers expose bare `RuntimeError`/`ValueError` without linking to a shared policy, and auto-start simply re-raises whatever the Rust layer emits (`codetracer_python_recorder/session.py:27-63`, `codetracer_python_recorder/auto_start.py:24-36`).
+- Exit codes, log destinations, and trace-writer fallback behaviour are implicit; a disk-full failure today yields a generic exception and can leave partially written outputs.
+
+The lack of a central error façade makes it hard to enforce user-facing guarantees, reason about detaching vs aborting behaviour, or meet the operational goals we have been given: stable error codes, structured logs, optional JSON diagnostics, policy switches, and atomic trace outputs.
+
+## Decision
+
+We will introduce a recorder-wide error handling policy centred on a dedicated `recorder-errors` crate and a Python exception hierarchy. The policy follows fifteen guiding principles supplied by operations and is designed so the “right way” is the only easy way for contributors.
+
+### 1. Single Error Façade
+- Create a new workspace crate `recorder-errors` exporting `RecorderError`, a structured error type with fields `{ kind: ErrorKind, code: ErrorCode, message: Cow<'static, str>, context: ContextMap, source: RecorderErrorSource }`.
+- Provide `RecorderResult<T> = Result<T, RecorderError>` and convenience macros (`usage!`, `enverr!`, `target!`, `bug!`, `ensure_usage!`, `ensure_env!`, etc.) so Rust modules can author classified failures with one line.
+- Require every other crate (including the PyO3 module) to depend on `recorder-errors`; direct construction of `PyErr`/`io::Error` is disallowed outside the façade.
+- Maintain `ErrorCode` as a small, grep-able enum (e.g., `ERR_TRACE_DIR_NOT_DIR`, `ERR_FORMAT_UNSUPPORTED`), with documentation in the crate so codes stay stable across releases.
+
+### 2. Clear Classification & Exit Codes
+- Define four top-level `ErrorKind` variants:
+  - `Usage` (caller mistakes, bad flags, conflicting sessions).
+  - `Environment` (IO, permissions, resource exhaustion).
+  - `Target` (user code raised or misbehaved while being traced).
+  - `Internal` (bugs, invariants, unexpected panics).
+- Map kinds to fixed process exit codes (`Usage=2`, `Environment=10`, `Target=20`, `Internal=70`). These are surfaced by CLI utilities and exported via the Python module for embedding tooling.
+- Document canonical examples for each kind in the ADR appendix and in crate docs.
+
+### 3. FFI Safety & Python Exceptions
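+- Add an `ffi` module that wraps every `#[pyfunction]` with `catch_unwind` and converts `RecorderError` into a custom Python exception hierarchy (`RecorderError` base, subclasses `UsageError`, `EnvironmentError`, `TargetError`, `InternalError`). Caught panics have their payloads logged and are then mapped to `InternalError`. Note that `EnvironmentError` is also a Python built-in alias for `OSError`, so callers must import the recorder's class explicitly to avoid confusing the two.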
+- PyO3 callbacks (`install_tracer`, monitoring trampolines) will run through `ffi::dispatch`, ensuring we never leak panics across the boundary.
+
+### 4. Output Channels & Diagnostics
+- Forbid `println!`/`eprintln!` outside the logging module; diagnostic output goes to stderr via `tracing`/`log` infrastructure.
+- Introduce a structured logging wrapper that attaches `{ run_id, trace_id, error_code }` fields to every error record. Provide `--log-level`, `--log-file`, and `--json-errors` switches that route structured diagnostics either to stderr or a configured file.
+
+### 5. Policy Switches
+- Introduce a runtime policy singleton (`RecorderPolicy` stored in `OnceCell`) configured via CLI flags or environment variables: `--on-recorder-error=abort|disable`, `--require-trace`, `--keep-partial-trace`.
+- Define semantics: `abort` -> propagate error and non-zero exit; `disable` -> detach tracer, emit structured warning, continue host process. Document exit codes for each combination in module docs.
+
+### 6. Atomic, Truthful Outputs
+- Wrap trace writes behind an IO façade that stages files in a temp directory and performs atomic rename on success.
+- When `--keep-partial-trace` is enabled, mark outputs with a `partial=true`, `reason=<ErrorCode>` trailer. Otherwise ensure no trace files are left behind on failure.
+
+### 7. Assertions with Containment
+- Replace `expect`/`unwrap` (e.g., `src/runtime/mod.rs:114`, `src/runtime/activation.rs:26`, `src/runtime/value_encoder.rs:70`) with classified `bug!` assertions that convert to `RecorderError` while still triggering `debug_assert!` in dev builds.
+- Document invariants in the new crate and ensure fuzzing/tests observe the diagnostics.
+
+### 8. Preflight Checks
+- Centralise version/compatibility checks in a `preflight` module called from `start_tracing`. Validate Python major.minor, ABI compatibility, trace schema version, and feature flags before installing monitoring callbacks.
+- Embed recorder version, schema version, and policy hash into every trace metadata file via `TraceWriter` extensions.
+
+### 9. Observability & Metrics
+- Emit structured counters for key error pathways (dropped events, detach reasons, panics caught). Provide a `RecorderMetrics` sink with a no-op default and an optional exporter trait.
+- When `--json-errors` is set, append a single-line JSON trailer to stderr containing `{ "error_code": .., "kind": .., "message": .., "context": .. }`.
+
+### 10. Failure-Path Testing
+- Add exhaustive unit tests in `recorder-errors` for every `ErrorCode` and conversion path.
+- Extend Rust integration tests to simulate disk-full (`ENOSPC`), permission denied, target exceptions, callback panics, SIGINT during detach, and partial trace recovery.
+- Add Python tests asserting the custom exception hierarchy and policy toggles behave as documented.
+
+### 11. Performance-Aware Defences
+- Reserve heavyweight diagnostics (stack captures, large context maps) for error paths. Hot callbacks use only cheap checks (`debug_assert!`, which compiles to nothing in release builds by default). Provide sampled validation hooks if additional runtime checks become necessary.
+
+### 12. Tooling Enforcement
+- Add workspace lints (`deny(clippy::panic_in_result_fn)`, Clippy config) and a `just lint-errors` task that fails if `panic!`, `unwrap`, or `expect` appear outside `recorder-errors`.
+- Disallow `anyhow`/`eyre` except inside the error façade with documented justification.
+
+### 13. Developer Ergonomics
+- Export prelude modules (`use recorder_errors::prelude::*;`) so contributors get macros and types with a single import.
+- Provide cookbook examples in the crate documentation and link the ADR so developers know how to map new errors to codes quickly.
+
+### 14. Documented Guarantees
+- Document, in README + crate docs, the three promises: no stdout writes, trace outputs are atomic (or explicitly partial), and error codes stay stable within a minor version line.
+
+### 15. Scope & Non-Goals
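+- The recorder never crashes the host process: no panic or process abort crosses the FFI boundary, and even internal bugs downgrade to `InternalError` surfaced through the policy switches. Under the `abort` policy the error still propagates as an exception with a non-zero exit code, as defined in section 5.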
+- Business-specific retention, shipping logs, or analytics integrations remain out of scope for this ADR.
+
+## Consequences
+
+- **Positive:** Structured errors enable user tooling, stable exit codes improve scripting, and panics are contained so we remain embedder-friendly. Central macros reduce boilerplate and make it easy for reviewers to enforce the policy.
+- **Negative / Risks:** Introducing a new crate and policy layer adds upfront work and requires retrofitting existing call sites. Atomic IO staging may increase disk usage for large traces. Contributors must learn the new taxonomy and update tests accordingly.
+
+## Rollout & Status Tracking
+
+- Implementation proceeds under a dedicated plan (see "Error Handling Implementation Plan"). The ADR moves to **Accepted** once the façade crate, FFI wrappers, and policy switches are merged, and the legacy ad-hoc errors are removed.
+- Future adjustments (e.g., new error codes) must update `recorder-errors` documentation and ensure backward compatibility for exit codes.
+
+## Alternatives Considered
+
+- **Use `anyhow` throughout and convert at the boundary.** Rejected because it obscures error provenance, offers no stable codes, and encourages stringly-typed errors.
+- **Catch panics lazily within individual callbacks.** Rejected; a central wrapper keeps the policy uniform and ensures we do not miss newer entry points.
+- **Rely on existing logging without policy switches.** Rejected because operational requirements demand scriptable behaviour on failure.
+
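
A minimal, std-only sketch of how sections 1 to 3 of ADR 0004 could fit together: a classified error type, the fixed exit-code mapping, a one-line `usage!` macro, and a panic fence. It is an illustration under stated assumptions, not the crate's actual code. The `context` and `source` fields are omitted, `ERR_PANIC_CAUGHT` and `contain` are hypothetical names, and the real wrapper would convert into Python exceptions through PyO3 rather than return a plain `Result`.

```rust
use std::borrow::Cow;
use std::fmt;
use std::panic::{catch_unwind, UnwindSafe};

/// The four top-level classes from section 2 of the ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    Usage,
    Environment,
    Target,
    Internal,
}

impl ErrorKind {
    /// Fixed process exit codes as listed in the ADR.
    pub fn exit_code(self) -> i32 {
        match self {
            ErrorKind::Usage => 2,
            ErrorKind::Environment => 10,
            ErrorKind::Target => 20,
            ErrorKind::Internal => 70,
        }
    }
}

/// Two codes named in the ADR plus one hypothetical code for caught panics.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[allow(non_camel_case_types)]
pub enum ErrorCode {
    ERR_TRACE_DIR_NOT_DIR,
    ERR_FORMAT_UNSUPPORTED,
    ERR_PANIC_CAUGHT, // hypothetical, not named in the ADR
}

/// Reduced form of the ADR's error type (no `context` or `source`).
#[derive(Debug)]
pub struct RecorderError {
    pub kind: ErrorKind,
    pub code: ErrorCode,
    pub message: Cow<'static, str>,
}

impl fmt::Display for RecorderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "[{:?}] {}", self.code, self.message)
    }
}

impl std::error::Error for RecorderError {}

pub type RecorderResult<T> = Result<T, RecorderError>;

/// Simplified stand-in for the proposed `usage!` macro.
macro_rules! usage {
    ($code:expr, $($arg:tt)+) => {
        RecorderError {
            kind: ErrorKind::Usage,
            code: $code,
            message: std::borrow::Cow::Owned(format!($($arg)+)),
        }
    };
}

/// Hypothetical panic fence: runs `f` and turns any panic into an
/// `Internal` error instead of letting it unwind across the FFI boundary.
pub fn contain<T>(f: impl FnOnce() -> RecorderResult<T> + UnwindSafe) -> RecorderResult<T> {
    match catch_unwind(f) {
        Ok(result) => result,
        Err(payload) => {
            // Panic payloads are usually `&str` or `String`.
            let message = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "non-string panic payload".to_string());
            Err(RecorderError {
                kind: ErrorKind::Internal,
                code: ErrorCode::ERR_PANIC_CAUGHT,
                message: Cow::Owned(message),
            })
        }
    }
}
```

A caller would write something like `return Err(usage!(ErrorCode::ERR_TRACE_DIR_NOT_DIR, "{} is not a directory", path))`, and a CLI entry point could pass `err.kind.exit_code()` to `std::process::exit`. Note that `catch_unwind` only helps when the build uses the default `panic = "unwind"` strategy.
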
diff --git a/design-docs/error-handling-implementation-plan.md b/design-docs/error-handling-implementation-plan.md
@@ -0,0 +1,92 @@
+# codetracer-python-recorder Error Handling Implementation Plan
+
+## Goals
+- Deliver the policy defined in ADR 0004: every error flows through `RecorderError`, surfaces a stable code/kind, and maps to the Python exception hierarchy.
+- Contain all panics within the FFI boundary and offer deterministic behaviour for `abort` versus `disable` policies.
+- Ensure trace outputs remain atomic (or explicitly marked partial) and diagnostics never leak to stdout.
+- Provide developers with ergonomic macros, tooling guardrails, and comprehensive tests covering failure paths.
+
+## Current Gaps
+- Ad-hoc `PyRuntimeError` strings in `src/session.rs:21-76` and `src/runtime/mod.rs:77-190` prevent stable categorisation and user scripting.
+- FFI trampolines in `src/monitoring/tracer.rs:268-706` and activation helpers in `src/runtime/activation.rs:24-83` still use `unwrap`/`expect`, so poisoned locks or filesystem errors abort the interpreter.
+- Python facade functions (`codetracer_python_recorder/session.py:27-63`) raise built-in exceptions and provide no context or exit codes.
+- There is no support for JSON diagnostics, policy switches, or atomic output staging; disk failures can leave half-written traces, and log output is split between stdout and stderr.
+
+## Workstreams
+
+### WS1 – Foundations & Inventory
+- Add a `just errors-audit` command that runs `rg` to list `PyRuntimeError`, `unwrap`, `expect`, and direct `panic!` usage in the recorder crate.
+- Create issue tracker entries grouping call sites by module (`session`, `runtime`, `monitoring`, Python facade) to guide refactors.
+- Exit criteria: checklist of legacy error sites recorded with owners.
+
+### WS2 – `recorder-errors` Crate
+- Scaffold `recorder-errors` under the workspace with `RecorderError`, `RecorderResult`, `ErrorKind`, `ErrorCode`, context map type, and conversion traits from `io::Error`, `PyErr`, etc.
+- Implement ergonomic macros (`usage!`, `enverr!`, `target!`, `bug!`, `ensure_*`) plus unit tests covering formatting, context propagation, and downcasting.
+- Publish crate docs explaining mapping rules and promises; link ADR 0004.
+- Exit criteria: `cargo test -p recorder-errors` covers all codes; workspace builds with the new crate.
+
+### WS3 – Retrofit Rust Modules
+- Replace direct `PyRuntimeError` construction in `src/session/bootstrap.rs`, `src/session.rs`, `src/runtime/mod.rs`, `src/runtime/output_paths.rs`, and helpers with `RecorderResult` + macros.
+- Update `RuntimeTracer` to propagate structured errors instead of strings; remove `expect`/`unwrap` in hot paths by returning classified `bug!` or `enverr!` failures.
+- Introduce a small adapter in `src/runtime/mod.rs` that stages IO writes and applies the atomic/partial policy described in ADR 0004.
+- Exit criteria: All recorder crate modules compile without `pyo3::exceptions::PyRuntimeError::new_err` usage.
+
+### WS4 – FFI Wrapper & Python Exception Hierarchy
+- Implement `ffi::wrap_pyfunction`, which catches panics (`std::panic::catch_unwind`) and maps `RecorderError` to a new `PyRecorderError` base type plus subclasses (`PyUsageError`, `PyEnvironmentError`, etc.).
+- Update `#[pymodule]` and every `#[pyfunction]` to use the wrapper; ensure monitoring callbacks also go through the dispatcher.
+- Expose the exception types in `codetracer_python_recorder/__init__.py` for Python callers.
+- Exit criteria: Rust panics surface as `PyInternalError`, and Python tests can assert exception class + code.
+
+### WS5 – Policy Switches & Runtime Configuration
+- Add `RecorderPolicy` backed by `OnceCell` with setters for CLI flags/env vars: `--on-recorder-error`, `--require-trace`, `--keep-partial-trace`, `--log-level`, `--log-file`, `--json-errors`.
+- Update the CLI/embedding entry points (auto-start, `TraceSession`) to fill the policy before starting tracing.
+- Implement detach vs abort semantics in `RuntimeTracer::finish` / session stop paths, honouring policy decisions and exit codes.
+- Exit criteria: Integration tests demonstrate both `abort` and `disable` flows, including partial trace handling.
+
+### WS6 – Logging, Metrics, and Diagnostics
+- Replace `env_logger` initialisation with a `tracing` subscriber or structured `log` formatter that includes `run_id`, `trace_id`, and `ErrorCode` fields.
+- Emit counters for dropped events, detach reasons, and caught panics via a `RecorderMetrics` sink (default no-op, pluggable in future).
+- Implement `--json-errors` to emit a single-line JSON trailer on stderr whenever an error is returned to Python.
+- Exit criteria: Structured log output verified in tests; stdout usage gated by lint.
+
+### WS7 – Test Coverage & Tooling Enforcement
+- Add unit tests for the new error crate, IO façade, policy switches, and FFI wrappers (panic capture, exception mapping).
+- Extend Python tests to cover the new exception hierarchy, JSON diagnostics, and policy flags.
+- Introduce CI lints (`cargo clippy -- --deny clippy::panic`, custom script rejecting `unwrap` outside allowed modules) and integrate with `just lint`.
+- Exit criteria: CI blocks regressions; failure-path tests cover disk full, permission denied, target exceptions, partial trace recovery, and SIGINT during detach.
+
+### WS8 – Documentation & Rollout
+- Update README, API docs, and onboarding material to describe guarantees, exit codes, example snippets, and migration guidance for downstream tools.
+- Add a change log entry summarising the policy and how to consume structured errors from Python.
+- Track adoption status in `design-docs/error-handling-implementation-plan.status.md`, mirroring existing planning artifacts.
+- Exit criteria: Documentation merged, status file created, ADR 0004 promoted to **Accepted** once WS2–WS7 land.
+
+## Milestones & Sequencing
+1. **Milestone A – Foundations:** Complete WS1 and WS2 (error crate scaffold) in parallel to unblock later work.
+2. **Milestone B – Core Refactor:** Deliver WS3 and WS4 together so Rust modules emit structured errors and Python sees the new exceptions.
+3. **Milestone C – Policy & IO Guarantees:** Finish WS5 and WS6 to stabilise runtime behaviour and diagnostics.
+4. **Milestone D – Hardening:** Execute WS7 (tests, tooling) and WS8 (documentation). Promote ADR 0004 to Accepted.
+
+## Verification Strategy
+- Add a `just test-errors` recipe running targeted failure tests (disk-full, detach, panic capture) plus Python unit tests for error classes.
+- Use `cargo nextest run -p codetracer-python-recorder --features failure-fixtures` to execute synthetic failure cases.
+- Enable `pytest tests/python/error_handling -q` for Python-specific coverage.
+- Capture structured stderr in integration tests to assert JSON trailers and exit codes.
+
+## Dependencies & Coordination
+- Requires consensus with the Observability WG on log format fields and exit-code mapping.
+- Policy flag wiring depends on any CLI/front-end work planned for Q4; coordinate with developer experience owners.
+- If `runtime_tracing` needs extensions for metadata trailers, align timelines with that team.
+
+## Risks & Mitigations
+- **Wide-scope refactor:** Stage work behind feature branches and land per-module PRs to avoid blocking releases.
+- **Performance regressions:** Benchmark hot callbacks before/after WS3 using existing microbenchmarks; keep additional allocations off hot paths.
+- **API churn for users:** Provide compatibility shims that map old exceptions to new ones for at least one minor release, and document upgrade notes.
+- **Partial trace semantics confusion:** Default to `--on-recorder-error=abort` with no partial outputs left behind; keep partial traces only when `--keep-partial-trace` is passed explicitly, and emit a warning when users opt in.
+
+## Done Definition
+- Legacy `PyRuntimeError::new_err` usage is removed or isolated to compat shims.
+- All panics are caught before crossing into Python; fuzz tests confirm no UB.
+- `just test` (and targeted error suites) pass on Linux/macOS CI, with new structured logs and metrics visible.
+- Documentation reflects guarantees, and downstream teams acknowledge new exit codes.
+
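
The staging adapter mentioned in WS3 (and in section 6 of ADR 0004) could look roughly like the sketch below. It is one possible approach, not the planned implementation: the name `write_atomically` and the `.partial` suffix are hypothetical, and it stages the file next to the destination rather than in a separate temp directory, on the assumption that a rename is only atomic within a single filesystem.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `bytes` to `dest` by staging them in a sibling file and renaming
/// it into place, so readers never observe a half-written trace.
/// On any failure the staged file is removed and `dest` is left untouched.
pub fn write_atomically(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let file_name = dest.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "destination has no file name")
    })?;

    // Hypothetical staging name: "<name>.partial" in the same directory.
    let mut staged_name = file_name.to_os_string();
    staged_name.push(".partial");
    let staged = dest.with_file_name(staged_name);

    let result = (|| -> io::Result<()> {
        let mut file = File::create(&staged)?;
        file.write_all(bytes)?;
        // Flush data to disk before the rename makes it visible.
        file.sync_all()?;
        fs::rename(&staged, dest)
    })();

    if result.is_err() {
        // Best-effort cleanup; the original error is what the caller sees.
        let _ = fs::remove_file(&staged);
    }
    result
}
```

In the policy described by the ADR, the `io::Error` returned here would be converted into an `Environment`-kind `RecorderError`, and the cleanup step would be skipped when `--keep-partial-trace` is set.
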