| 1 | +# ADR 0012: Balanced sys.monitoring Call Stack Events |
| 2 | + |
| 3 | +- **Status:** Proposed |
| 4 | +- **Date:** 2025-10-26 |
| 5 | +- **Deciders:** codetracer recorder maintainers |
| 6 | +- **Consulted:** Runtime tracing stakeholders, Replay consumers |
| 7 | +- **Informed:** Support engineering, DX tooling crew |
| 8 | + |
| 9 | +## Context |
| 10 | +- The Rust-backed recorder currently subscribes to `PY_START`, `PY_RETURN`, and `LINE` events from `sys.monitoring`. |
| 11 | +- `RuntimeTracer` only emits two structural trace records—`TraceWriter::register_call` and `TraceWriter::register_return`—because the trace file format has no explicit notion of yields, resumptions, or exception unwinding. |
| 12 | +- Generators, coroutines, and exception paths trigger additional `sys.monitoring` events (`PY_YIELD`, `PY_RESUME`, `PY_THROW`, `PY_UNWIND`). When we ignore them the call stack in the trace becomes unbalanced, causing downstream tooling to miscompute nesting depth, duration, and attribution. |
| 13 | +- CPython already exposes these events with complete callback metadata. We simply never hook them, so resumptions and unwinds silently skip our writer. |
| 14 | + |
| 15 | +## Problem |
| 16 | +- Trace consumers require balanced call/return pairs to reconstruct execution trees and propagate per-activation metadata (filters, IO capture, telemetry). |
| 17 | +- When a generator yields, we never emit a `register_return`, so the activation stays "open" across the suspension, and it stays open forever if the generator is never resumed.
| 18 | +- When the interpreter unwinds a frame because of an exception, we neither emit a `register_return` nor mark the activation inactive, so the lifecycle bookkeeping leaks and `TraceWriter::finish_writing_trace_events` ends with dangling activations. |
| 19 | +- Conversely, when a generator/coroutine resumes—either normally (`PY_RESUME`) or via `throw()` (`PY_THROW`)—we fail to emit the "call" edge that would push it back on the logical stack. |
| 20 | +- Without these edges, the runtime cannot guarantee `TraceWriter` invariants or present accurate trace metadata. Adding synthetic bookkeeping in consumers is not possible because the events are already lost. |
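To make the imbalance concrete, here is a minimal sketch in Python (not the recorder's Rust code). It replays hand-written event streams and counts how many activations stay open when only `PY_START` and `PY_RETURN` move the logical stack, as they do today. The `open_activations` helper and the string event names are hypothetical; they model the bookkeeping only and do not use real `sys.monitoring` callbacks.

```python
# Hypothetical model of today's subscription: only these two events touch the stack.
CALL_LIKE_TODAY = {"PY_START"}
RETURN_LIKE_TODAY = {"PY_RETURN"}


def open_activations(events, call_like, return_like):
    """Return how many activations remain open after replaying `events`."""
    depth = 0
    for name in events:
        if name in call_like:
            depth += 1
        elif name in return_like:
            depth -= 1
        # Any other event is ignored, which is exactly the problem described above.
    return depth


# A generator that yields once and is never resumed.
abandoned_generator = ["PY_START", "PY_YIELD"]
# A function that raises before it can return.
raising_function = ["PY_START", "PY_UNWIND"]

dangling_after_yield = open_activations(
    abandoned_generator, CALL_LIKE_TODAY, RETURN_LIKE_TODAY
)
dangling_after_unwind = open_activations(
    raising_function, CALL_LIKE_TODAY, RETURN_LIKE_TODAY
)
print(dangling_after_yield, dangling_after_unwind)
```

In both streams one activation is still open at the end, which is the "dangling activation" a consumer would see.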
| 21 | + |
| 22 | +## Decision |
| 23 | +1. **Treat additional monitoring events as structural aliases.** |
| 24 | + - Map `PY_YIELD` and `PY_UNWIND` callbacks to the same flow as `on_py_return`, ultimately calling `TraceWriter::register_return`. |
| 25 | + - Map `PY_RESUME` callbacks to the same flow as `on_py_start`, emitting a call edge with an empty argument vector because CPython does not provide the `send()` value (`https://docs.python.org/3/library/sys.monitoring.html#monitoring-event-PY_RESUME`). |
| 26 | + - Map `PY_THROW` callbacks to the call flow but propagate the exception object as the payload recorded for the resumed activation so downstream tools can correlate the injected error; encode it as a single argument named `exception` using the existing value encoder (`https://docs.python.org/3/library/sys.monitoring.html#monitoring-event-PY_THROW`). |
| 27 | +2. **Subscribe to the four additional events in `RuntimeTracer::interest`.** The tracer will request `{PY_START, PY_RETURN, PY_YIELD, PY_UNWIND, PY_RESUME, PY_THROW}` plus `LINE`, which it already subscribes to today.
| 28 | +3. **Unify lifecycle hooks.** Extend the activation manager so that yield/unwind events deactivate the frame and resumption events reactivate or spawn a continuation while preserving filter decisions, telemetry handles, and IO capture state. |
| 29 | +4. **Preserve file-format semantics.** We will not add new record types; instead we ensure every control-flow boundary ultimately produces the same call/return records the file already understands. |
| 30 | +5. **Defensive guards.** Log-and-disable behaviour stays unchanged: any callback failure still honours policy (`OnRecorderError`). The new events use the same `should_trace_code` and activation gates so filters can skip generators consistently. |
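The aliasing in decision 1 can be sketched as a small dispatcher. This is an illustrative Python model, not the Rust implementation: `ModelWriter` and `dispatch` are hypothetical names, values are encoded with `repr` as a stand-in for the existing value encoder, and filters, activation gates, and error policy are left out.

```python
from dataclasses import dataclass, field

RETURN_LIKE = {"PY_RETURN", "PY_YIELD", "PY_UNWIND"}


@dataclass
class ModelWriter:
    """Hypothetical stand-in for TraceWriter; it only records what it is told."""

    records: list = field(default_factory=list)
    depth: int = 0

    def register_call(self, function, args):
        self.depth += 1
        self.records.append(("call", function, args))

    def register_return(self, value):
        self.depth -= 1
        self.records.append(("return", value))


def dispatch(writer, event, function, payload=None, args=()):
    """Route one monitoring event to the call-like or return-like flow."""
    if event == "PY_START":
        writer.register_call(function, list(args))
    elif event == "PY_RESUME":
        # No send() value is available, so the argument vector stays empty.
        writer.register_call(function, [])
    elif event == "PY_THROW":
        # The injected exception becomes a single argument named "exception".
        writer.register_call(function, [("exception", repr(payload))])
    elif event in RETURN_LIKE:
        # Yield values, return values, and unwinding exceptions share one path.
        writer.register_return(repr(payload))
    else:
        raise ValueError(f"unhandled event: {event}")


# Replay a generator that yields, receives throw(), handles it, and returns.
writer = ModelWriter()
dispatch(writer, "PY_START", "worker")
dispatch(writer, "PY_YIELD", "worker", payload="ready")
dispatch(writer, "PY_THROW", "worker", payload=RuntimeError("boom"))
dispatch(writer, "PY_RETURN", "worker", payload="caught boom")
print(writer.depth, len(writer.records))
```

Every call-like event is paired with a return-like event, so the model's depth returns to zero without any new record kind.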
| 31 | + |
| 32 | +## Consequences |
| 33 | +- **Benefits:** |
| 34 | + - Balanced call stacks for generators, coroutines, and exception unwinds without touching the trace schema. |
| 35 | + - Replay and analysis tools stop seeing "dangling activation" warnings, improving trust in exported traces. |
| 36 | + - The recorder can later add richer semantics (e.g., value capture on resume) because the structural foundation is sound. |
| 37 | +- **Costs:** |
| 38 | + - Slightly higher callback volume, especially in generator-heavy workloads (two extra events per yield/resume pair). |
| 39 | + - Additional complexity inside `RuntimeTracer` to differentiate return-like vs call-like flows while sharing writer helpers. |
| 40 | +- **Risks:** |
| 41 | + - Incorrect mapping could double-emit calls or returns, corrupting the trace. We mitigate this with targeted tests covering yields, exceptions, and `throw()`-driven resumes. |
| 42 | + - Performance regressions if the new paths capture values unnecessarily; we will keep value capture opt-in via filter policies. |
| 43 | + |
| 44 | +## Alternatives |
| 45 | +- **Introduce new trace record kinds for each event.** Rejected because consumers, storage, and analytics would all need format upgrades, and the existing stack-only writer already conveys the necessary structure. |
| 46 | +- **Approximate via Python-side bookkeeping.** Rejected: the Python helper cannot observe generator unwinds once the Rust tracer suppresses the events. |
| 47 | +- **Ignore stack balancing and patch consumers.** Rejected because it hides the source of truth and still leaves us without activation lifecycle signals during recording (IO capture, telemetry). |
| 48 | + |
| 49 | +## Key Examples |
| 50 | + |
| 51 | +### 1. Ordinary Function Call |
| 52 | +```python |
| 53 | +def add(a, b): |
| 54 | + return a + b |
| 55 | + |
| 56 | +result = add(4, 5) |
| 57 | +``` |
| 58 | +- `PY_START` fires when `add` begins. We capture the two arguments via `capture_call_arguments` and call `TraceWriter::register_call(function_id=add, args=[("a", 4), ("b", 5)])`. |
| 59 | +- `PY_RETURN` fires just before the return. We record the value `9` through `record_return_value`, which invokes `TraceWriter::register_return(9)`. |
| 60 | +- The trace shows a single balanced call/return pair; no other structural events are emitted. |
| 61 | + |
| 62 | +### 2. Generator Yield + Resume |
| 63 | +```python |
| 64 | +def ticker(): |
| 65 | + yield "ready" |
| 66 | + yield "again" |
| 67 | + |
| 68 | +g = ticker() |
| 69 | +first = next(g) |
| 70 | +second = next(g) |
| 71 | +``` |
| 72 | +- First `next(g)`: |
| 73 | + - `PY_START` → `register_call(ticker, args=[])`. |
| 74 | + - `PY_YIELD` → `register_return("ready")`. The activation is now suspended but the trace stack is balanced. |
| 75 | +- Second `next(g)`: |
| 76 | + - `PY_RESUME` → `register_call(ticker, args=[])` (empty vector because CPython does not expose the send value). |
| 77 | + - `PY_YIELD` → `register_return("again")`. |
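| 78 | +- When the generator is driven to exhaustion (for example by a third `next(g)`, not shown above), CPython emits `PY_RESUME` followed by `PY_RETURN`, so we write `register_call(ticker, args=[])` and then `register_return(None)` (or whatever value was returned). Every suspension/resumption pair corresponds to alternating `register_return`/`register_call`, keeping the call stack consistent.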
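The event order for this example can be observed directly from Python with `sys.monitoring` (available from Python 3.12). The sketch below is not part of the recorder: `observe_ticker` is a hypothetical helper, tool id `4` is assumed to be unused in the process, and the generator is driven to exhaustion with `list()` so that the final `PY_RETURN` is visible too.

```python
import sys


def observe_ticker():
    """Return (event, value) pairs reported for `ticker`, or [] before Python 3.12."""
    if not hasattr(sys, "monitoring"):
        return []  # sys.monitoring was added in Python 3.12

    def ticker():
        yield "ready"
        yield "again"

    mon = sys.monitoring
    ev = mon.events
    tool = 4  # assumption: this tool id is not claimed by another tool
    seen = []

    def without_value(name):
        # PY_START and PY_RESUME callbacks receive (code, instruction_offset).
        def callback(code, offset):
            if code is ticker.__code__:
                seen.append((name, None))
        return callback

    def with_value(name):
        # PY_YIELD and PY_RETURN callbacks also receive the yielded/returned value.
        def callback(code, offset, value):
            if code is ticker.__code__:
                seen.append((name, value))
        return callback

    handlers = {
        ev.PY_START: without_value("PY_START"),
        ev.PY_RESUME: without_value("PY_RESUME"),
        ev.PY_YIELD: with_value("PY_YIELD"),
        ev.PY_RETURN: with_value("PY_RETURN"),
    }
    mon.use_tool_id(tool, "adr-0012-sketch")
    try:
        mask = ev.NO_EVENTS
        for event, handler in handlers.items():
            mon.register_callback(tool, event, handler)
            mask |= event
        mon.set_events(tool, mask)
        list(ticker())  # drive the generator until it is exhausted
    finally:
        mon.set_events(tool, ev.NO_EVENTS)
        for event in handlers:
            mon.register_callback(tool, event, None)
        mon.free_tool_id(tool)
    return seen


for event_name, value in observe_ticker():
    print(event_name, value)
```

Reading the call-like events (`PY_START`, `PY_RESUME`) as `register_call` and the return-like events (`PY_YIELD`, `PY_RETURN`) as `register_return` gives the alternating sequence described above.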
| 79 | + |
| 80 | +### 3. Generator Throw |
| 81 | +```python |
| 82 | +def worker(): |
| 83 | + try: |
| 84 | + yield "ready" |
| 85 | + except RuntimeError as err: |
| 86 | + return f"caught {err}" |
| 87 | + |
| 88 | +g = worker() |
| 89 | +next(g) |
| 90 | +g.throw(RuntimeError("boom")) |
| 91 | +``` |
| 92 | +- Initial `next(g)` behaves like Example 2. |
| 93 | +- `g.throw(...)` triggers: |
| 94 | + - `PY_THROW` with the exception object. We emit `register_call(worker, args=[("exception", RuntimeError("boom"))])`, encoding the exception with the existing value encoder so it appears in the trace payload. |
| 95 | + - If the generator handles the exception and returns, `PY_RETURN` follows and we write `register_return("caught boom")`. If it re-raises, `PY_UNWIND` fires instead and we encode the exception value in `register_return`. |
| 96 | + |
| 97 | +### 4. Exception Unwind Without Yield |
| 98 | +```python |
| 99 | +def explode(): |
| 100 | + raise ValueError("bad news") |
| 101 | + |
| 102 | +def run(): |
| 103 | + return explode() |
| 104 | + |
| 105 | +run() |
| 106 | +``` |
| 107 | +- `run()` starts and calls `explode()`, so two `PY_START` events fire: `register_call(run, args=[])`, then `register_call(explode, args=[])`.
| 108 | +- `explode` raises before returning, so CPython skips `PY_RETURN` and emits `PY_UNWIND` with the `ValueError`. Because `run` does not catch the exception, a second `PY_UNWIND` follows for `run`.
| 109 | +- We treat each `PY_UNWIND` like `PY_RETURN`: flush pending IO, encode the exception via `record_return_value`, and call `register_return(ValueError("bad news"))`. The activation controller marks the frame inactive, preventing dangling stack entries when tracing finishes.
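The unwind sequence can be checked from Python with `sys.monitoring` (Python 3.12 or later). This is an observation sketch, not recorder code: `observe_unwind` is a hypothetical helper and tool id `4` is assumed to be free. It also records `run`, which does not catch the exception and is therefore unwound as well.

```python
import sys


def observe_unwind():
    """Return (event, function, exception type) tuples, or [] before Python 3.12."""
    if not hasattr(sys, "monitoring"):
        return []  # sys.monitoring was added in Python 3.12

    def explode():
        raise ValueError("bad news")

    def run():
        return explode()

    mon = sys.monitoring
    ev = mon.events
    tool = 4  # assumption: this tool id is not claimed by another tool
    names = {explode.__code__: "explode", run.__code__: "run"}
    seen = []

    def on_start(code, offset):
        if code in names:
            seen.append(("PY_START", names[code], None))

    def on_return(code, offset, value):
        if code in names:
            seen.append(("PY_RETURN", names[code], None))

    def on_unwind(code, offset, exception):
        # PY_UNWIND passes the exception that is leaving the frame.
        if code in names:
            seen.append(("PY_UNWIND", names[code], type(exception).__name__))

    handlers = {
        ev.PY_START: on_start,
        ev.PY_RETURN: on_return,
        ev.PY_UNWIND: on_unwind,
    }
    mon.use_tool_id(tool, "adr-0012-unwind-sketch")
    try:
        mask = ev.NO_EVENTS
        for event, handler in handlers.items():
            mon.register_callback(tool, event, handler)
            mask |= event
        mon.set_events(tool, mask)
        try:
            run()
        except ValueError:
            pass  # the exception stops here, so this helper is not unwound
    finally:
        mon.set_events(tool, ev.NO_EVENTS)
        for event in handlers:
            mon.register_callback(tool, event, None)
        mon.free_tool_id(tool)
    return seen


for record in observe_unwind():
    print(record)
```

No `PY_RETURN` appears for either function, so without a `PY_UNWIND` handler neither activation would ever be closed.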
| 110 | + |
| 111 | +### 5. Coroutine Await / Resume |
| 112 | +```python |
| 113 | +import asyncio |
| 114 | + |
| 115 | +async def worker(): |
| 116 | + await asyncio.sleep(0) |
| 117 | + return "done" |
| 118 | + |
| 119 | +asyncio.run(worker()) |
| 120 | +``` |
| 121 | +- Entry: `PY_START` → `register_call(worker, args=[])`. |
| 122 | +- When the coroutine awaits `sleep(0)`, CPython emits `PY_YIELD`. The value attached to the event is whatever the awaitable passes up to the event loop (`None` for `sleep(0)`), not the eventual await result, which is delivered later. We encode that value via `register_return`.
| 123 | +- When the event loop resumes `worker`, `PY_RESUME` fires and we record another `register_call(worker, args=[])`. No payload is available because the resume value is implicit in the await machinery. |
| 124 | +- Final completion triggers `PY_RETURN` so we write `register_return("done")`. |
| 125 | +- The trace therefore shows multiple call/return pairs for the same coroutine activation, mirroring each suspend/resume cycle. |
| 126 | + |
| 127 | +## Rollout |
| 128 | +1. Update the design docs with this ADR and the implementation plan. |
| 129 | +2. Implement the runtime changes and land them through standard CI, with tests that prove stack balance for yields, unwinds, and resumes.
| 130 | +3. Notify downstream consumers that generator traces now appear balanced without requiring schema or API changes. |
| 131 | +4. Monitor regression dashboards for callback volume and latency after enabling the new events by default. |