Commit 954d3e3

committed
errors: Designing error policy
1 parent 58c7bf7 commit 954d3e3

3 files changed

+206
-0
lines changed

README.md

Lines changed: 9 additions & 0 deletions
@@ -45,6 +45,15 @@ Basic workflow:
4545
- from codetracer_python_recorder import hello
4646
- hello()
4747

48+
#### Testing & Coverage
49+
50+
- Run the full split test suite (Rust nextest + Python pytest): `just test`
51+
- Run only Rust integration/unit tests: `just cargo-test`
52+
- Run only Python tests (including the pure-Python recorder to guard regressions): `just py-test`
53+
- Collect coverage artefacts locally (LCOV + Cobertura/JSON): `just coverage`
54+
55+
The CI workflow mirrors these commands. Pull requests get an automated comment with the latest Rust/Python coverage tables and downloadable artefacts (`lcov.info`, `coverage.xml`, `coverage.json`).
56+
4857
### Future directions
4958

5059
The current Python support is an unfinished prototype. We can finish it. In the future, it may be expanded to function in a way similar to the more complete implementations, e.g. [Noir](https://github.com/blocksense-network/noir/tree/blocksense/tooling/tracer).
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
1+
# ADR 0004: Error Handling Policy for codetracer-python-recorder
2+
3+
- **Status:** Proposed
4+
- **Date:** 2025-10-02
5+
- **Deciders:** Runtime Tracing Maintainers
6+
- **Consulted:** Python Tooling WG, Observability WG
7+
- **Informed:** Developer Experience WG, Release Engineering
8+
9+
## Context
10+
11+
The Rust-backed recorder currently propagates errors piecemeal:
12+
- PyO3 entry points bubble up plain `PyRuntimeError` instances with free-form strings (e.g., `src/session.rs:21-52`, `src/runtime/mod.rs:77-126`).
13+
- Runtime helpers panic on invariant violations, which will abort the host interpreter because we do not fence panics at the FFI boundary (`src/runtime/mod.rs:107-120`, `src/runtime/activation.rs:24-33`, `src/runtime/value_encoder.rs:61-78`).
14+
- Monitoring callbacks rely on `GLOBAL.lock().unwrap()` so poisoned mutexes or lock errors terminate the process (`src/monitoring/tracer.rs:268` and subsequent callback shims).
15+
- Python helpers expose bare `RuntimeError`/`ValueError` without linking to a shared policy, and auto-start simply re-raises whatever the Rust layer emits (`codetracer_python_recorder/session.py:27-63`, `codetracer_python_recorder/auto_start.py:24-36`).
16+
- Exit codes, log destinations, and trace-writer fallback behaviour are implicit; a disk-full failure today yields a generic exception and can leave partially written outputs.
17+
18+
The lack of a central error façade makes it hard to enforce user-facing guarantees, reason about detaching vs aborting behaviour, or meet the operational goals we have been given: stable error codes, structured logs, optional JSON diagnostics, policy switches, and atomic trace outputs.
19+
20+
## Decision
21+
22+
We will introduce a recorder-wide error handling policy centred on a dedicated `recorder-errors` crate and a Python exception hierarchy. The policy follows fifteen guiding principles supplied by operations and is designed so the “right way” is the only easy way for contributors.
23+
24+
### 1. Single Error Façade
25+
- Create a new workspace crate `recorder-errors` exporting `RecorderError`, a structured error type with fields `{ kind: ErrorKind, code: ErrorCode, message: Cow<'static, str>, context: ContextMap, source: RecorderErrorSource }`.
26+
- Provide `RecorderResult<T> = Result<T, RecorderError>` and convenience macros (`usage!`, `enverr!`, `target!`, `bug!`, `ensure_usage!`, `ensure_env!`, etc.) so Rust modules can author classified failures with one line.
27+
- Require every other crate (including the PyO3 module) to depend on `recorder-errors`; direct construction of `PyErr`/`io::Error` is disallowed outside the façade.
28+
- Maintain `ErrorCode` as a small, grep-able enum (e.g., `ERR_TRACE_DIR_NOT_DIR`, `ERR_FORMAT_UNSUPPORTED`), with documentation in the crate so codes stay stable across releases.
29+
30+
### 2. Clear Classification & Exit Codes
31+
- Define four top-level `ErrorKind` variants:
32+
- `Usage` (caller mistakes, bad flags, conflicting sessions).
33+
- `Environment` (IO, permissions, resource exhaustion).
34+
- `Target` (user code raised or misbehaved while being traced).
35+
- `Internal` (bugs, invariants, unexpected panics).
36+
- Map kinds to fixed process exit codes (`Usage=2`, `Environment=10`, `Target=20`, `Internal=70`). These are surfaced by CLI utilities and exported via the Python module for embedding tooling.
37+
- Document canonical examples for each kind in the ADR appendix and in crate docs.
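
The kind-to-exit-code mapping above is small enough to show in full. The following is a minimal Python sketch of how the embedding side might mirror it; the ADR defines `ErrorKind` in Rust, so the enum layout and the helper name `exit_code_for` here are illustrative assumptions, not the module's actual API.

```python
import enum


class ErrorKind(enum.Enum):
    """The four top-level kinds named in this ADR (Python mirror, assumed layout)."""

    USAGE = "Usage"
    ENVIRONMENT = "Environment"
    TARGET = "Target"
    INTERNAL = "Internal"


# Fixed kind -> process exit code mapping, copied from the list above.
EXIT_CODES = {
    ErrorKind.USAGE: 2,
    ErrorKind.ENVIRONMENT: 10,
    ErrorKind.TARGET: 20,
    ErrorKind.INTERNAL: 70,
}


def exit_code_for(kind: ErrorKind) -> int:
    """Return the process exit code for a classified failure (hypothetical helper)."""
    return EXIT_CODES[kind]
```

A CLI wrapper could then end with `sys.exit(exit_code_for(kind))`, which keeps the numbers in one place and makes them easy to grep for.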
38+
39+
### 3. FFI Safety & Python Exceptions
40+
- Add an `ffi` module that wraps every `#[pyfunction]` with `catch_unwind`, converts `RecorderError` into a custom Python exception hierarchy (`RecorderError` base, subclasses `UsageError`, `EnvironmentError`, `TargetError`, `InternalError`), and logs panic payloads before mapping them to `InternalError`.
41+
- PyO3 callbacks (`install_tracer`, monitoring trampolines) will run through `ffi::dispatch`, ensuring we never leak panics across the boundary.
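
To make the hierarchy concrete, here is a pure-Python sketch of what Python callers might see. In the ADR the exception types are created from the Rust `ffi` module, so the constructor signature and the `code`/`context` attributes below are assumptions for illustration. One point worth noting: in Python 3 the name `EnvironmentError` is already a builtin alias of `OSError`, so a recorder subclass with that name shadows the builtin inside whichever module defines or imports it.

```python
class RecorderError(Exception):
    """Base class; carries a stable error code and optional context (assumed attributes)."""

    def __init__(self, code, message, context=None):
        super().__init__(message)
        self.code = code
        self.message = message
        self.context = dict(context or {})


class UsageError(RecorderError):
    """Caller mistakes, bad flags, conflicting sessions."""


class EnvironmentError(RecorderError):  # shadows the builtin OSError alias in this module
    """IO, permissions, resource exhaustion."""


class TargetError(RecorderError):
    """User code raised or misbehaved while being traced."""


class InternalError(RecorderError):
    """Bugs, invariant violations, caught panics."""
```

Callers can then catch the base class and branch on the code. The pairing of `ERR_TRACE_DIR_NOT_DIR` with `UsageError` below is only an example, since the ADR does not say which kind that code belongs to:

```python
try:
    raise UsageError("ERR_TRACE_DIR_NOT_DIR", "trace path is not a directory")
except RecorderError as err:
    print(err.code, type(err).__name__)
```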
42+
43+
### 4. Output Channels & Diagnostics
44+
- Forbid `println!`/`eprintln!` outside the logging module; diagnostic output goes to stderr via `tracing`/`log` infrastructure.
45+
- Introduce a structured logging wrapper that attaches `{ run_id, trace_id, error_code }` fields to every error record. Provide `--log-level`, `--log-file`, and `--json-errors` switches that route structured diagnostics either to stderr or a configured file.
46+
47+
### 5. Policy Switches
48+
- Introduce a runtime policy singleton (`RecorderPolicy` stored in `OnceCell`) configured via CLI flags or environment variables: `--on-recorder-error=abort|disable`, `--require-trace`, `--keep-partial-trace`.
49+
- Define semantics: `abort` -> propagate error and non-zero exit; `disable` -> detach tracer, emit structured warning, continue host process. Document exit codes for each combination in module docs.
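
The `abort` versus `disable` semantics can be sketched in a few lines. This is a Python illustration only: the ADR stores `RecorderPolicy` in a Rust `OnceCell`, and the function name `apply_policy`, the `detach_tracer` callback, and the logger name are hypothetical. The other switches (`--require-trace`, `--keep-partial-trace`) are left out because their exact semantics are not spelled out here.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("recorder.policy")  # hypothetical logger name


@dataclass(frozen=True)
class RecorderPolicy:
    """Immutable policy value; only the on-error switch is modelled in this sketch."""

    on_recorder_error: str = "abort"

    def __post_init__(self):
        if self.on_recorder_error not in ("abort", "disable"):
            raise ValueError(f"unknown policy: {self.on_recorder_error!r}")


def apply_policy(policy, error_code, exit_code, detach_tracer):
    """Return the exit code to use, or None when the host process should continue."""
    if policy.on_recorder_error == "abort":
        # abort: propagate the error and exit non-zero.
        return exit_code
    # disable: detach the tracer, warn, and let the host keep running.
    detach_tracer()
    log.warning("recorder disabled after error", extra={"error_code": error_code})
    return None
```

Returning `None` for the `disable` path keeps the decision about whether to exit in one place, so callers do not re-interpret the policy themselves.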
50+
51+
### 6. Atomic, Truthful Outputs
52+
- Wrap trace writes behind an IO façade that stages files in a temp directory and performs atomic rename on success.
53+
- When `--keep-partial-trace` is enabled, mark outputs with a `partial=true`, `reason=<ErrorCode>` trailer. Otherwise ensure no trace files are left behind on failure.
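
The stage-then-rename technique looks roughly like the sketch below. It is written in Python for illustration; the ADR places the real IO façade in Rust, and `write_trace_atomically` is a hypothetical name. The sketch stages the file in the destination directory rather than a separate temp directory, because a rename is only atomic when source and destination are on the same filesystem. The partial-trace trailer is not shown, since its on-disk format is not specified here.

```python
import os
import tempfile


def write_trace_atomically(final_path, payload):
    """Stage payload (bytes) next to final_path, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".trace-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as staged:
            staged.write(payload)
            staged.flush()
            os.fsync(staged.fileno())  # make sure the bytes reach the disk first
        # os.replace overwrites the destination in a single rename step.
        os.replace(tmp_path, final_path)
    except BaseException:
        # On any failure, remove the staged file so no partial output is left behind.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Readers of `final_path` therefore see either the previous complete file or the new complete file, never a half-written one.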
54+
55+
### 7. Assertions with Containment
56+
- Replace `expect`/`unwrap` (e.g., `src/runtime/mod.rs:114`, `src/runtime/activation.rs:26`, `src/runtime/value_encoder.rs:70`) with classified `bug!` assertions that convert to `RecorderError` while still triggering `debug_assert!` in dev builds.
57+
- Document invariants in the new crate and ensure fuzzing/tests observe the diagnostics.
58+
59+
### 8. Preflight Checks
60+
- Centralise version/compatibility checks in a `preflight` module called from `start_tracing`. Validate Python major.minor, ABI compatibility, trace schema version, and feature flags before installing monitoring callbacks.
61+
- Embed recorder version, schema version, and policy hash into every trace metadata file via `TraceWriter` extensions.
62+
63+
### 9. Observability & Metrics
64+
- Emit structured counters for key error pathways (dropped events, detach reasons, panics caught). Provide a `RecorderMetrics` sink with a no-op default and an optional exporter trait.
65+
- When `--json-errors` is set, append a single-line JSON trailer to stderr containing `{ "error_code": .., "kind": .., "message": .., "context": .. }`.
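
A single-line JSON trailer with the four fields listed above could be produced as in this Python sketch. The function names are hypothetical, and the real emitter would live behind the Rust logging module; the sketch only shows the shape of the record and why it stays on one line.

```python
import json
import sys


def format_json_error(error_code, kind, message, context):
    """Build the trailer as one line of JSON with the four fields named in the ADR."""
    record = {
        "error_code": error_code,
        "kind": kind,
        "message": message,
        "context": context,
    }
    # With the default indent=None, json.dumps emits no raw newlines;
    # newlines inside string values are escaped, so the record stays on one line.
    return json.dumps(record, sort_keys=True)


def emit_json_error(error_code, kind, message, context, stream=None):
    """Write the trailer to stderr (or a supplied stream), never to stdout."""
    stream = sys.stderr if stream is None else stream
    stream.write(format_json_error(error_code, kind, message, context) + "\n")
```

Because the trailer is the last line on stderr, a script can recover it with `json.loads(stderr.splitlines()[-1])` without parsing the rest of the log.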
66+
67+
### 10. Failure-Path Testing
68+
- Add exhaustive unit tests in `recorder-errors` for every `ErrorCode` and conversion path.
69+
- Extend Rust integration tests to simulate disk-full (`ENOSPC`), permission denied, target exceptions, callback panics, SIGINT during detach, and partial trace recovery.
70+
- Add Python tests asserting the custom exception hierarchy and policy toggles behave as documented.
71+
72+
### 11. Performance-Aware Defences
73+
- Reserve heavyweight diagnostics (stack captures, large context maps) for error paths. Hot callbacks use cheap checks (`debug_assert!`, which is compiled out of release builds by default). Provide sampled validation hooks if additional runtime checks become necessary.
74+
75+
### 12. Tooling Enforcement
76+
- Add workspace lints (`deny(clippy::panic_in_result_fn)`, Clippy config) and a `just lint-errors` task that fails if `panic!`, `unwrap`, or `expect` appear outside `recorder-errors`.
77+
- Disallow `anyhow`/`eyre` except inside the error façade with documented justification.
78+
79+
### 13. Developer Ergonomics
80+
- Export prelude modules (`use recorder_errors::prelude::*;`) so contributors get macros and types with a single import.
81+
- Provide cookbook examples in the crate documentation and link the ADR so developers know how to map new errors to codes quickly.
82+
83+
### 14. Documented Guarantees
84+
- Document, in README + crate docs, the three promises: no stdout writes, trace outputs are atomic (or explicitly partial), and error codes stay stable within a minor version line.
85+
86+
### 15. Scope & Non-Goals
87+
- The recorder never aborts the host process; even internal bugs downgrade to `InternalError` surfaced through policy switches.
88+
- Business-specific retention, shipping logs, or analytics integrations remain out of scope for this ADR.
89+
90+
## Consequences
91+
92+
- **Positive:** Structured errors enable user tooling, stable exit codes improve scripting, and panics are contained so we remain embedder-friendly. Central macros reduce boilerplate and make it easy for reviewers to enforce the policy.
93+
- **Negative / Risks:** Introducing a new crate and policy layer adds upfront work and requires retrofitting existing call sites. Atomic IO staging may increase disk usage for large traces. Contributors must learn the new taxonomy and update tests accordingly.
94+
95+
## Rollout & Status Tracking
96+
97+
- Implementation proceeds under a dedicated plan (see "Error Handling Implementation Plan"). The ADR moves to **Accepted** once the façade crate, FFI wrappers, and policy switches are merged, and the legacy ad-hoc errors are removed.
98+
- Future adjustments (e.g., new error codes) must update `recorder-errors` documentation and ensure backward compatibility for exit codes.
99+
100+
## Alternatives Considered
101+
102+
- **Use `anyhow` throughout and convert at the boundary.** Rejected because it obscures error provenance, offers no stable codes, and encourages stringly-typed errors.
103+
- **Catch panics lazily within individual callbacks.** Rejected; a central wrapper keeps the policy uniform and ensures we do not miss newer entry points.
104+
- **Rely on existing logging without policy switches.** Rejected because operational requirements demand scriptable behaviour on failure.
105+
Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
1+
# codetracer-python-recorder Error Handling Implementation Plan
2+
3+
## Goals
4+
- Deliver the policy defined in ADR 0004: every error flows through `RecorderError`, surfaces a stable code/kind, and maps to the Python exception hierarchy.
5+
- Contain all panics within the FFI boundary and offer deterministic behaviour for `abort` versus `disable` policies.
6+
- Ensure trace outputs remain atomic (or explicitly marked partial) and diagnostics never leak to stdout.
7+
- Provide developers with ergonomic macros, tooling guardrails, and comprehensive tests covering failure paths.
8+
9+
## Current Gaps
10+
- Ad-hoc `PyRuntimeError` strings in `src/session.rs:21-76` and `src/runtime/mod.rs:77-190` prevent stable categorisation and user scripting.
11+
- FFI trampolines in `src/monitoring/tracer.rs:268-706` and activation helpers in `src/runtime/activation.rs:24-83` still use `unwrap`/`expect`, so poisoned locks or filesystem errors abort the interpreter.
12+
- Python facade functions (`codetracer_python_recorder/session.py:27-63`) return built-in exceptions and provide no context or exit codes.
13+
- No support for JSON diagnostics, policy switches, or atomic output staging; disk failures can leave half-written traces and logs mix stdout/stderr.
14+
15+
## Workstreams
16+
17+
### WS1 – Foundations & Inventory
18+
- Add a `just errors-audit` command that runs `rg` to list `PyRuntimeError`, `unwrap`, `expect`, and direct `panic!` usage in the recorder crate.
19+
- Create issue tracker entries grouping call sites by module (`session`, `runtime`, `monitoring`, Python facade) to guide refactors.
20+
- Exit criteria: checklist of legacy error sites recorded with owners.
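
The plan calls for `rg` behind a `just` recipe; as an illustration of what the audit looks for, the same inventory can be sketched with Python's `re` module instead. The pattern set and the `audit_source` helper are assumptions made for this example, and simple regexes like these will also match occurrences inside comments or strings.

```python
import re

# Legacy error-handling idioms the audit is meant to list (patterns are assumptions).
LEGACY_PATTERNS = {
    "PyRuntimeError": re.compile(r"\bPyRuntimeError\b"),
    "unwrap": re.compile(r"\.unwrap\(\)"),
    "expect": re.compile(r"\.expect\("),
    "panic!": re.compile(r"\bpanic!"),
}


def audit_source(text):
    """Return (line_number, pattern_name) pairs for every legacy idiom found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in LEGACY_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

Grouping the resulting pairs by file and module would give the per-owner checklist that the exit criteria ask for.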
21+
22+
### WS2 – `recorder-errors` Crate
23+
- Scaffold `recorder-errors` under the workspace with `RecorderError`, `RecorderResult`, `ErrorKind`, `ErrorCode`, context map type, and conversion traits from `io::Error`, `PyErr`, etc.
24+
- Implement ergonomic macros (`usage!`, `enverr!`, `target!`, `bug!`, `ensure_*`) plus unit tests covering formatting, context propagation, and downcasting.
25+
- Publish crate docs explaining mapping rules and promises; link ADR 0004.
26+
- Exit criteria: `cargo test -p recorder-errors` covers all codes; workspace builds with the new crate.
27+
28+
### WS3 – Retrofit Rust Modules
29+
- Replace direct `PyRuntimeError` construction in `src/session/bootstrap.rs`, `src/session.rs`, `src/runtime/mod.rs`, `src/runtime/output_paths.rs`, and helpers with `RecorderResult` + macros.
30+
- Update `RuntimeTracer` to propagate structured errors instead of strings; remove `expect`/`unwrap` in hot paths by returning classified `bug!` or `enverr!` failures.
31+
- Introduce a small adapter in `src/runtime/mod.rs` that stages IO writes and applies the atomic/partial policy described in ADR 0004.
32+
- Exit criteria: All recorder crate modules compile without `pyo3::exceptions::PyRuntimeError::new_err` usage.
33+
34+
### WS4 – FFI Wrapper & Python Exception Hierarchy
35+
- Implement `ffi::wrap_pyfunction` that catches panics (`std::panic::catch_unwind`), maps `RecorderError` to a new `PyRecorderError` base type plus subclasses (`PyUsageError`, `PyEnvironmentError`, etc.).
36+
- Update `#[pymodule]` and every `#[pyfunction]` to use the wrapper; ensure monitoring callbacks also go through the dispatcher.
37+
- Expose the exception types in `codetracer_python_recorder/__init__.py` for Python callers.
38+
- Exit criteria: Rust panics surface as `PyInternalError`, and Python tests can assert exception class + code.
39+
40+
### WS5 – Policy Switches & Runtime Configuration
41+
- Add `RecorderPolicy` backed by `OnceCell` with setters for CLI flags/env vars: `--on-recorder-error`, `--require-trace`, `--keep-partial-trace`, `--log-level`, `--log-file`, `--json-errors`.
42+
- Update the CLI/embedding entry points (auto-start, `TraceSession`) to fill the policy before starting tracing.
43+
- Implement detach vs abort semantics in `RuntimeTracer::finish` / session stop paths, honoring policy decisions and exit codes.
44+
- Exit criteria: Integration tests demonstrate both `abort` and `disable` flows, including partial trace handling.
45+
46+
### WS6 – Logging, Metrics, and Diagnostics
47+
- Replace `env_logger` initialisation with a `tracing` subscriber or structured `log` formatter that includes `run_id`, `trace_id`, and `ErrorCode` fields.
48+
- Emit counters for dropped events, detach reasons, and caught panics via a `RecorderMetrics` sink (default no-op, pluggable in future).
49+
- Implement `--json-errors` to emit a single-line JSON trailer on stderr whenever an error is returned to Python.
50+
- Exit criteria: Structured log output verified in tests; stdout usage gated by lint.
51+
52+
### WS7 – Test Coverage & Tooling Enforcement
53+
- Add unit tests for the new error crate, IO façade, policy switches, and FFI wrappers (panic capture, exception mapping).
54+
- Extend Python tests to cover the new exception hierarchy, JSON diagnostics, and policy flags.
55+
- Introduce CI lints (`cargo clippy -- --deny clippy::panic`, custom script rejecting `unwrap` outside allowed modules) and integrate with `just lint`.
56+
- Exit criteria: CI blocks regressions; failure-path tests cover disk full, permission denied, target exceptions, partial trace recovery, and SIGINT during detach.
57+
58+
### WS8 – Documentation & Rollout
59+
- Update README, API docs, and onboarding material to describe guarantees, exit codes, example snippets, and migration guidance for downstream tools.
60+
- Add a change log entry summarising the policy and how to consume structured errors from Python.
61+
- Track adoption status in `design-docs/error-handling-implementation-plan.status.md` (mirror existing planning artifacts).
62+
- Exit criteria: Documentation merged, status file created, ADR 0004 promoted to **Accepted** once WS2–WS7 land.
63+
64+
## Milestones & Sequencing
65+
1. **Milestone A – Foundations:** Complete WS1 and WS2 (error crate scaffold) in parallel; unblock later work.
66+
2. **Milestone B – Core Refactor:** Deliver WS3 and WS4 together so Rust modules emit structured errors and Python sees the new exceptions.
67+
3. **Milestone C – Policy & IO Guarantees:** Finish WS5 and WS6 to stabilise runtime behaviour and diagnostics.
68+
4. **Milestone D – Hardening:** Execute WS7 (tests, tooling) and WS8 (documentation). Promote ADR 0004 to Accepted.
69+
70+
## Verification Strategy
71+
- Add a `just test-errors` recipe running targeted failure tests (disk-full, detach, panic capture) plus Python unit tests for error classes.
72+
- Use `cargo nextest run -p codetracer-python-recorder --features failure-fixtures` to execute synthetic failure cases.
73+
- Enable `pytest tests/python/error_handling -q` for Python-specific coverage.
74+
- Capture structured stderr in integration tests to assert JSON trailers and exit codes.
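
An integration test that captures stderr and asserts on both the JSON trailer and the exit code might look like the following sketch. The child script is a stand-in that fakes a failing recorder run, and its field values (including pairing `ERR_TRACE_DIR_NOT_DIR` with the `Environment` kind and exit code 10) are assumptions for the example; `run_and_parse` is a hypothetical helper.

```python
import json
import subprocess
import sys

# Stand-in for a failing recorder invocation: logs a line, emits a JSON trailer, exits 10.
CHILD = r'''
import json, sys
sys.stderr.write("ordinary log line\n")
sys.stderr.write(json.dumps({
    "error_code": "ERR_TRACE_DIR_NOT_DIR",
    "kind": "Environment",
    "message": "trace path is not a directory",
    "context": {},
}) + "\n")
sys.exit(10)
'''


def run_and_parse(argv):
    """Run a command, then return its exit code and the parsed last stderr line."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    last_line = proc.stderr.strip().splitlines()[-1]
    return proc.returncode, json.loads(last_line)


if __name__ == "__main__":
    code, trailer = run_and_parse([sys.executable, "-c", CHILD])
    assert code == 10
    assert trailer["kind"] == "Environment"
```

In a real suite the `argv` would point at the recorder entry point with `--json-errors` set, and the assertions would compare against the documented exit-code table.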
75+
76+
## Dependencies & Coordination
77+
- Requires consensus with the Observability WG on log format fields and exit-code mapping.
78+
- Policy flag wiring depends on any CLI/front-end work planned for Q4; coordinate with developer experience owners.
79+
- If `runtime_tracing` needs extensions for metadata trailers, align timelines with that team.
80+
81+
## Risks & Mitigations
82+
- **Wide-scope refactor:** Stage work behind feature branches and land per-module PRs to avoid blocking releases.
83+
- **Performance regressions:** Benchmark hot callbacks before/after WS3 using existing microbenchmarks; keep additional allocations off hot paths.
84+
- **API churn for users:** Provide compatibility shims that map old exceptions to new ones for at least one minor release, and document upgrade notes.
85+
- **Partial trace semantics confusion:** Default to `abort` (no partial outputs) unless `--keep-partial-trace` is explicit; emit warnings when users opt in.
86+
87+
## Done Definition
88+
- Legacy `PyRuntimeError::new_err` usage is removed or isolated to compat shims.
89+
- All panics are caught before crossing into Python; fuzz tests confirm no UB.
90+
- `just test` (and targeted error suites) pass on Linux/macOS CI, with new structured logs and metrics visible.
91+
- Documentation reflects guarantees, and downstream teams acknowledge new exit codes.
92+
