design: Toplevel exit and trace gating

tzanko-matev · tzanko-matev · commit c2999af51725 · 2025-10-27T14:46:22.000+02:00
design-docs/adr/0015-balanced-toplevel-lifecycle-and-trace-gating.md: 
design-docs/toplevel-exit-and-trace-gating-implementation-plan.md: 

Signed-off-by: Tzanko Matev <tsanko@metacraft-labs.com>
diff --git a/design-docs/adr/0015-balanced-toplevel-lifecycle-and-trace-gating.md b/design-docs/adr/0015-balanced-toplevel-lifecycle-and-trace-gating.md
@@ -0,0 +1,65 @@
+# ADR 0015: Balanced Toplevel Lifecycle and Unified Trace Gating
+
+- **Status:** Proposed
+- **Date:** 2025-03-21
+- **Deciders:** codetracer recorder maintainers
+- **Consulted:** DX tooling stakeholders, runtime tracing SMEs
+- **Informed:** Support engineering, product analytics, replay consumers
+
+## Context
+- The recorder seeds every trace with a synthetic `<toplevel>` call when `TraceWriter::start` is invoked from `TraceOutputPaths::configure_writer` (`codetracer-python-recorder/src/runtime/output_paths.rs`). That call models the Python process entrypoint, but the runtime never emits the matching return edge.
+- CLI and API entrypoints (`codetracer_python_recorder/cli.py`, `codetracer_python_recorder/session.py`) already capture the script's exit status, yet the Rust runtime is oblivious to it, so the trace file looks like the script is still running when recording ends.
+- Runtime gating currently combines two orthogonal systems: the legacy activation controller (`codetracer-python-recorder/src/runtime/activation.rs`) that defers tracing until a configured file executes, and the newer `TraceFilterEngine` (`codetracer-python-recorder/src/runtime/tracer/filtering.rs`) that offers scope-level allow/deny decisions. Both mechanisms decide whether an event should be written, but they execute independently and cache their own state.
+- Because activation and filtering have separate caches and lifecycle hooks, downstream policies (value capture, IO flushes) see inconsistent state: a filter-suppressed frame still triggers activation bookkeeping, and activation suspensions do not propagate into the filter cache. The split also makes it hard to reason about which events will be recorded in the presence of chained filters.
+
+## Problem
+- **Unbalanced traces:** Without a `<toplevel>` return, consumers reconstructing the call stack see a dangling activation at depth 0. This breaks invariants in `TraceWriter::finish_writing_trace_events`, forces replay tools to special-case the synthetic frame, and hides the script's exit code even though callers already have it.
+- **Duplicated gating logic:** Activation and filter decisions contradict one another in edge cases. For example:
+  - When activation gates tracing until a file runs, the filter still caches a `TraceDecision::Trace` for the same code object, so subsequent resumes bypass activation because the filter short-circuits to "disable location".
+  - When filters skip a frame, activation has no way to learn that the frame completed; its suspended/completed bookkeeping only triggers on return events that never fire for filter-disabled frames.
+- The divergent implementations increase bug surface area (e.g., dangling activations, stale filter caches) and make it challenging to add new recorder policies that need a consistent view of "is this event observable?".
+
+## Decision
+1. **Emit a `<toplevel>` return carrying process exit status.**
+   - Extend the PyO3 surface so `stop_tracing` accepts an optional integer exit code (default `None`). The Python helpers (`session.stop`, CLI `main`) will pass the script's final status.
+   - Add a `TraceSessionGuard` helper on the Rust side that stores the provided exit status until `RuntimeTracer::finish` runs. When `finish` executes, it must:
+     - Flush pending IO.
+     - Record the exit status via `TraceWriter::register_return`, tagging the payload as `<exit>` when the code is unknown (e.g., interpreter crash) and serialising the integer otherwise.
+     - Only then call the existing `finalise`/`cleanup` routines.
+   - If tracing aborts early (`OnRecorderError::Disable` or fatal errors), emit a `<toplevel>` return with a synthetic reason (`"<disabled>"` / captured exception) so the stack always balances.
+2. **Unify activation and filter decisions behind a shared gate.**
+   - Introduce a `TraceGate` service managed by `RuntimeTracer`. The gate combines activation state and filter results into a single `GateDecision { process_event, disable_location, activation_event }`.
+   - `FilterCoordinator` becomes responsible for caching scope resolutions only when `TraceGate` reports that the frame was actually processed. When the gate denies an event, both activation and filter caches are notified so they can mark the code id as "ignored" in lockstep.
+   - `ActivationController` exposes explicit transitions (`on_enter`, `on_suspend`, `on_exit`) rather than letting callbacks poke its internal flags. `TraceGate` translates filter outcomes into the appropriate activation transition (e.g., a filter skip counts as `on_exit` so suspended activations resume correctly).
+   - All tracer callbacks (`on_py_start`, `on_py_return`, `on_py_yield`, etc.) ask the gate for a decision before doing work. They honour `disable_location` uniformly, so CPython stops invoking us for code objects that either filter or activation wants to suppress.
+   - Document the merged semantics: activation remains a coarse gate (enabling/disabling the root frame), filters apply fine-grained scope policies, and both share a single cache lifetime tied to tracer flush/reset.
+3. **Update metadata and tooling expectations.**
+   - Persist the recorded exit status into trace metadata (`trace_metadata.json`) so downstream tools can rely on it without scanning events.
+   - Update docs and integration tests to assert that traces end at stack depth zero, even when activation suspends/resumes or filters drop frames.
+
+## Consequences
+- **Benefits**
+  - Trace consumers no longer see dangling `<toplevel>` activations, and they can surface the script exit status directly from the trace file.
+  - Activation, filtering, value capture, and IO policies share a single gating decision, reducing state divergence and simplifying future features (e.g., per-filter activation windows).
+  - Error paths become easier to reason about because every exit funnels through the same `<toplevel>` return emission.
+- **Costs**
+  - API changes propagate through the Python bindings (`stop_tracing`, `TraceSession.stop`, CLI), requiring coordination with users embedding the recorder programmatically.
+  - The gate abstraction adds code churn in `runtime/tracer/events.rs` and related helpers as callbacks adopt the new decision API.
+  - Metadata writers must update to include the exit status.
+- **Risks**
+  - Forgetting to pass the exit status from bespoke integrations (custom Python entrypoints) would regress behaviour back to "unknown exit". We mitigate this with a backwards-compatible default (`None` translates to `<unknown>` exit) and clear release notes.
+  - A buggy gate implementation could over-disable callbacks, suppressing legitimate trace data. We will add regression tests covering activation+filter combinations (activation path inside a skipped scope, resumed generators, etc.) before rollout.
+  - The PyO3 signature change may break ABI expectations if not versioned carefully. We will bump the crate minor version and document the new keyword argument.
+
+## Alternatives
+- **Emit the `<toplevel>` return entirely on the Python side.** Rejected because it would duplicate writer logic in Python, bypass IO flush/value capture, and fail when users call the PyO3 API directly from Rust.
+- **Keep activation and filter gating separate but document the quirks.** Rejected: we already hit real bugs (unbalanced traces, stale caches), and layering more documentation will not solve the underlying inconsistency.
+- **Deprecate activation now that filters exist.** Rejected because activation provides a simple UX for "start tracing when my script begins", which filters alone cannot replace without writing bespoke configs.
+
+## References
+- `codetracer-python-recorder/src/runtime/output_paths.rs`
+- `codetracer-python-recorder/src/runtime/tracer/events.rs`
+- `codetracer-python-recorder/src/runtime/activation.rs`
+- `codetracer-python-recorder/src/runtime/tracer/filtering.rs`
+- `codetracer_python_recorder/session.py`
+- `codetracer_python_recorder/cli.py`
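
The shared gate from Decision item 2 of ADR 0015 can be pictured as a small decision table. What follows is a minimal sketch written as a Python model for illustration only; the ADR places the real gate in the Rust runtime. The enum names, the `evaluate` signature, and every rule in the table are assumptions made for this sketch, not the recorder's actual behaviour. Only the three `GateDecision` fields and the "filter skip counts as `on_exit`" rule come from the ADR text.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ActivationState(Enum):
    INACTIVE = auto()   # activation file has not started executing yet
    ACTIVE = auto()     # inside the activation window
    COMPLETED = auto()  # activation frame has already returned


class FilterVerdict(Enum):
    TRACE = auto()
    SKIP = auto()


class EventKind(Enum):
    START = auto()
    RETURN = auto()


class ActivationEvent(Enum):
    ON_ENTER = auto()
    ON_EXIT = auto()


@dataclass(frozen=True)
class GateDecision:
    process_event: bool
    disable_location: bool
    activation_event: Optional[ActivationEvent]


def evaluate(state, verdict, kind, is_activation_frame):
    """Combine activation state and filter verdict into one decision (hypothetical rules)."""
    if verdict is FilterVerdict.SKIP:
        # Per the ADR, a filter skip on the activation frame counts as on_exit.
        event = ActivationEvent.ON_EXIT if is_activation_frame else None
        return GateDecision(False, True, event)
    if state is ActivationState.INACTIVE:
        if is_activation_frame and kind is EventKind.START:
            return GateDecision(True, False, ActivationEvent.ON_ENTER)
        # Assumption: do not disable yet, the code may run again once tracing is active.
        return GateDecision(False, False, None)
    if state is ActivationState.COMPLETED:
        return GateDecision(False, True, None)
    # ACTIVE: record the event; the activation frame's return closes the window.
    if is_activation_frame and kind is EventKind.RETURN:
        return GateDecision(True, False, ActivationEvent.ON_EXIT)
    return GateDecision(True, False, None)
```

In this model every callback would call `evaluate` once and act only on the returned fields, which keeps the activation and filter caches from drifting apart.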
diff --git a/design-docs/toplevel-exit-and-trace-gating-implementation-plan.md b/design-docs/toplevel-exit-and-trace-gating-implementation-plan.md
@@ -0,0 +1,93 @@
+# Toplevel Exit & Trace Gating – Implementation Plan
+
+Plan owners: codetracer recorder maintainers  
+Target ADR: 0015 – Balanced Toplevel Lifecycle and Unified Trace Gating  
+Impacted components:  
+- `codetracer-python-recorder/src/session.rs` and `codetracer_python_recorder/session.py`  
+- `codetracer-python-recorder/src/runtime/tracer` (events, lifecycle, filtering)  
+- `codetracer-python-recorder/src/runtime/activation.rs`  
+- `codetracer-python-recorder/src/runtime/output_paths.rs` and metadata helpers  
+- `codetracer-pure-python-recorder` parity shims (optional but strongly recommended)
+
+## Goals
+- Always emit a `<toplevel>` return event whose payload reflects the process exit status (or a descriptive placeholder when unavailable).
+- Plumb exit codes from Python entrypoints through the PyO3 API into the Rust runtime without breaking existing integrations.
+- Replace the ad-hoc combination of activation and filter decisions with a single gate so callbacks make consistent trace/skip/disable choices.
+- Keep lifecycle bookkeeping (IO flush, value capture, activation teardown) in sync with the unified gate and the new exit record.
+- Extend metadata (`trace_metadata.json`) with the recorded exit status for downstream tooling.
+
+## Non-Goals
+- No changes to the on-disk trace schema beyond the new return record payload; we keep the existing call/return/line structure.
+- No removal of activation support; the work only refactors it to cooperate with filters.
+- No immediate addition of exit-status reporting to the CLI JSON trailers (this can be a follow-up).
+- No attempt to refit the pure-Python recorder in the same PR; it may gain parity later but does not block landing the Rust changes.
+
+## Current Gaps
+- `stop_tracing` (PyO3) accepts no arguments, so the runtime never learns the script exit status captured by `codetracer_python_recorder/cli.py`.
+- `RuntimeTracer::finish` only finalises writers; it does not record any return edge for the synthetic `<toplevel>` call emitted in `TraceWriter::start`.
+- Activation and filtering are checked independently inside each callback (`on_py_start`, `on_py_return`, etc.), leading to divergent cache state (`ActivationController::suspended` vs `FilterCoordinator::ignored_code_ids`).
+- Filter-driven `CallbackOutcome::DisableLocation` does not inform the activation controller, so the activation window can remain "active" after CPython stops issuing callbacks.
+- Metadata writers do not persist exit status, and tests assume partial stacks are acceptable.
+
+## Workstreams
+
+### WS1 – Public API & Session Plumbing
+**Scope:** Carry exit status from Python to Rust with backwards-compatible defaults.
+- Update `codetracer-python-recorder/src/session.rs::stop_tracing` to expose an optional `exit_code: Option<i32>` parameter (new keyword-only arg in Python).
+- Adjust `codetracer_python_recorder/session.py` so `TraceSession.stop` and the module-level `stop()` accept an optional exit code and forward it.
+- Modify `codetracer_python_recorder/cli.py::main` to pass the captured `exit_code` when stopping the session; preserve legacy behaviour (`None`) for callers that do not provide a code.
+- Add unit tests in `codetracer_python_recorder/tests` ensuring the new keyword argument is optional and that `stop(exit_code=123)` calls into the backend with the expected value (mocking PyO3 layer).
+
+### WS2 – Runtime Exit State & `<toplevel>` Return Emission
+**Scope:** Store the exit status and emit the balancing return event.
+- Introduce a small struct (e.g., `SessionTermination`) inside `RuntimeTracer` to hold `exit_code: Option<i32>` plus a `reason: ExitReason`.
+- Extend the `Tracer` trait implementation with a new method (e.g., `set_exit_status`) or reuse `notify_failure` paths to capture both normal exit and disable scenarios.
+- In `RuntimeTracer::finish`, before `finalise`:
+  - Call `record_return_value` with the exit payload.
+  - Invoke `TraceWriter::register_return` / `mark_event`.
+  - Ensure activation receives an `ActivationExitKind::Completed` for the toplevel code id.
+- Emit synthetic reasons (`"<disabled>"`, `"<panic>"`, etc.) when tracing stops due to errors. Reuse existing error-path metadata in `notify_failure` to populate the reason.
+- Update integration tests (Rust + Python) to assert that the final event sequence includes a `<toplevel>` call + return pair and that the return payload matches the script exit code.
+
+### WS3 – Unified Trace Gate Abstraction
+**Scope:** Merge activation and filter decision paths.
+- Create `TraceGate` and `GateDecision` types under `runtime/tracer`.
+  - API: `evaluate(py, code, event_kind) -> GateDecision`.
+  - Decision carries: `process_event` (bool), `disable_location` (bool), `activation_transition` (enum).
+- Refactor `FilterCoordinator` so it exposes `decide_for_gate(py, code)` returning an enriched result (trace/skip plus cached scope resolution). The coordinator only updates caches when the gate confirms the event was processed.
+- Update `ActivationController` with methods `on_enter`, `on_suspend`, `on_exit`, and `reset`. Remove direct field mutations from callbacks.
+- Rewrite tracer callbacks (`on_py_start`, `on_py_return`, `on_py_resume`, etc.) to:
+  1. Ask the gate for a decision.
+  2. Exit early when `process_event == false` and return the proper `CallbackOutcome`.
+  3. After recording the event, invoke any activation transition included in the decision.
+- Ensure `CallbackOutcome::DisableLocation` is only returned once per code id and that both activation and filter caches mark the frame as ignored thereafter.
+- Unit test the gate with synthetic `CodeObjectWrapper` fixtures covering combinations: inactive activation + allow filter, active activation + skip filter, suspended activation resumed, etc.
+
+### WS4 – Lifecycle & Metadata Updates
+**Scope:** Keep writer lifecycle and metadata consistent with the new behaviour.
+- Update `LifecycleController::finalise` (or adjacent helper) to write the exit status into `trace_metadata.json` under a new field (e.g., `"process_exit_code"`). Ensure this runs only once per session.
+- Confirm `cleanup_partial_outputs` still executes when tracing disables early and that the exit record is only written for successful sessions.
+- Add a regression test around `TraceOutputPaths::configure_writer` + `RuntimeTracer::finish` verifying the events buffer contains both call and return entries for `<toplevel>`.
+- Update documentation (`design-docs/design-001.md`, user docs) to describe the exit-status metadata and the unified gating semantics.
+
+### WS5 – Validation & Parity Follow-Up
+**Scope:** Prove end-to-end correctness and plan the optional pure-Python update.
+- Extend Python integration tests (`tests/python`) with scenarios:
+  - CLI run that exits with a non-zero status; assert the trace contains a `<toplevel>` return whose payload carries that status.
+  - Activation path configured alongside a filter that skips the same file; ensure tracing starts/stops exactly once and stack depth ends at zero.
+  - Generator/coroutine workloads guarded by activation; confirm gate decisions do not regress existing balanced-call tests.
+- Run `just test` after the refactors to exercise the full test suite.
+- Document a follow-up issue to mirror `<toplevel>` return emission in `codetracer-pure-python-recorder`, keeping trace semantics aligned across products.
+
+## Testing & Rollout Checklist
+- [ ] `just test`
+- [ ] Python integration tests covering exit-code propagation and activation+filter combinations
+- [ ] Manual smoke test: run CLI against a script returning exit code 3, inspect `trace.json` for `<toplevel>` return payload `3`
+- [ ] Update changelog / release notes highlighting the new API parameter and exit-status metadata
+- [ ] Notify downstream data pipeline owners that exit status is now available
+
+## Risks & Mitigations
+- **Breaking API changes:** Ensure `stop_tracing` still works without arguments by providing a Python default (`exit_code: int | None = None`) and by releasing under a minor version bump.
+- **Gate regressions:** Add exhaustive unit tests plus targeted integration tests so we catch scenarios where activation or filters no longer fire.
+- **Performance impact:** Benchmark tracing hot paths after the refactor; the gate should add minimal overhead. Profile with `just bench` / existing benchmarks, and roll back micro-optimisations if regressions exceed 5%.
+- **Incomplete error coverage:** Make sure disable/error paths still flush IO and write metadata. Write explicit tests that trigger `OnRecorderError::Disable` to observe the synthetic `<toplevel>` return reason.
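
The WS1 plumbing and its unit test (an optional `exit_code` forwarded to a mocked PyO3 layer) might look roughly like the sketch below. This is a sketch under stated assumptions rather than the package's real code: the `TraceSession` here is a hypothetical stand-in, injecting `backend_stop` through the constructor is done purely so the example is testable, and making `exit_code` keyword-only on `stop` mirrors the plan's wording for `stop_tracing` but is itself an assumption.

```python
from typing import Callable, Optional
from unittest import mock


class TraceSession:
    """Hypothetical stand-in for the session wrapper in session.py."""

    def __init__(self, backend_stop: Callable[..., None]) -> None:
        # backend_stop plays the role of the PyO3 `stop_tracing` function.
        self._backend_stop = backend_stop

    def stop(self, *, exit_code: Optional[int] = None) -> None:
        # Forward the exit status unchanged; None keeps the legacy behaviour.
        self._backend_stop(exit_code=exit_code)


# A mock replaces the PyO3 layer, as the WS1 test bullet suggests.
backend = mock.Mock()

TraceSession(backend).stop()             # legacy caller, no exit code
TraceSession(backend).stop(exit_code=3)  # CLI-style caller passing a status

forwarded = [call.kwargs["exit_code"] for call in backend.call_args_list]
```

Keeping the default at `None` means existing callers of `stop()` need no changes, which is the backwards-compatibility property listed under Risks & Mitigations.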