design-docs/adr/0005-io-capture.md: Refactor old spec

tzanko-matev · tzanko-matev · commit f16a748ab406 · 2025-10-03T16:23:05.000+03:00
design-docs/capture-output-implementation-plan.md: 

Signed-off-by: Tzanko Matev <tsanko@metacraft-labs.com>
diff --git a/design-docs/adr/0005-io-capture.md b/design-docs/adr/0005-io-capture.md
@@ -0,0 +1,43 @@
+# ADR 0005: Input and Output Capture for Runtime Traces
+
+- **Status:** Proposed
+- **Date:** 2025-10-03
+- **Deciders:** Runtime recorder maintainers
+- **Consulted:** Python platform crew, Replay tooling crew
+- **Informed:** DX crew, Release crew
+
+## Context
+- The repo now splits session bootstrap, monitoring glue, and runtime logic into clear modules (`session`, `monitoring`, `runtime`).
+- `RuntimeTracer` owns the `NonStreamingTraceWriter` and activation rules, and already writes metadata, paths, and step events.
+- `recorder-errors` gives us uniform error codes and panic trapping. Every new subsystem must use it.
+- We still forward stdout, stderr, and stdin directly to the host console. No bytes reach the trace.
+- Replay and debugging teams need IO events beside call and line records so they can rebuild console sessions.
+
+## Problem
+- We need lossless IO capture without breaking the in-process `sys.monitoring` design or the new error policy.
+- The old pipe-based spec assumed the tracer lived inside `start()` and mutated global state freely. The refactor put lifecycle code behind `TraceSessionBootstrap`, `TraceOutputPaths`, and `RuntimeTracer::begin`.
+- We also added activation gating and stricter teardown rules. Any IO hooks must respect them and always restore the original file descriptors.
+
+## Decision
+1. Keep the Python CLI contract. `codetracer_python_recorder.start_tracing` keeps installing the tracer, but now also starts an IO capture controller right before `install_tracer` and shuts it down inside `stop_tracing`.
+2. Introduce `runtime::io_capture` with a single public type, `IoCapture`. It duplicates stdin/stdout/stderr, installs platform pipes, and spawns blocking reader threads. The module hides Unix vs Windows code paths behind a small trait (`IoEndpoint`).
+3. Expose an `IoEventSink` from `RuntimeTracer`. The sink wraps the writer in `Arc<Mutex<...>>` and exposes two safe methods: `record_output(chunk: IoChunk)` and `record_input(chunk: IoChunk)`. Reader threads call the sink only. All conversions to `TraceLowLevelEvent` live next to the writer so we reuse value encoders and error helpers.
+4. Extend `RuntimeTracer` with a light `ThreadSnapshotStore`. `on_line` updates the current `{ path_id, line, frame_id }` per Python thread. `IoEventSink` reads the latest snapshot when it serialises a chunk. When no snapshot exists we fall back to the last global step.
+5. Store stdout bytes as `EventLogKind::Write` and stderr bytes as `EventLogKind::WriteOther`. Store stdin bytes as `EventLogKind::Read`. Metadata includes the stream name, monotonic timestamps, thread tag, and the captured snapshot when present. Bytes stay base64 encoded by the runtime tracing crate.
+6. Keep console passthrough. The reader threads mirror each chunk back into the saved file descriptors so users still see live output.
+7. Wire capture teardown into existing error handling. `IoCapture::stop` drains the pipes, restores FDs, signals the threads, and logs failures through the `recorder-errors` macros. `RuntimeTracer::finish` waits for the IO channel before calling `TraceWriter::finish_*` to avoid races.
+8. Hide the feature behind `RecorderPolicy`. A new flag `policy.io_capture` defaults to off today. Tests and early adopters enable it. Once stable we flip the default.
+
+## Consequences
+- **Upsides:** We capture IO without a subprocess, reuse the refactored writer lifecycle, and keep activation gating intact. Replay tooling reads one stream for events and IO.
+- **Costs:** Writer calls now cross a mutex, so we must measure contention. The new module adds platform code that needs tight tests. We must watch out for deadlocks on interpreter shutdown.
+
+## Rollout
+- Ship behind an environment toggle `CODETRACER_CAPTURE_IO=1` wired into the policy layer. Emit a warning when the policy disables capture.
+- Document the behaviour in the recorder README and the user CLI help once we land the feature.
+- Graduate the ADR to **Accepted** after the implementation plan closes and the policy default flips to on for both Unix and Windows.
+
+## Alternatives
+- A subprocess wrapper was considered again and rejected. It would undo the refactor that keeps tracing in-process and would break existing embedding use cases.
+- `sys.stdout` monkey patching remains off the table. It misses native writes and user-assigned streams.
+- Writing IO into a separate JSON file is still unnecessary. The runtime tracing schema already handles IO events.
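
Decision items 3 and 4 of the ADR above (the `IoEventSink` and the `ThreadSnapshotStore` fallback rule) could look roughly like the sketch below. It is only an illustration under stated assumptions: a plain `Vec` stands in for the trace writer, `RecordedIo` stands in for `TraceLowLevelEvent`, and the field types, `update_snapshot`, and `recorded` are hypothetical names that do not come from the commit.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Last known source position for one Python thread (field types are assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Snapshot {
    pub path_id: usize,
    pub line: u32,
    pub frame_id: u64,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Stream {
    Stdin,
    Stdout,
    Stderr,
}

/// A captured chunk of bytes, tagged with the thread that produced it.
#[derive(Clone, Debug)]
pub struct IoChunk {
    pub stream: Stream,
    pub producer_thread: u64,
    pub bytes: Vec<u8>,
}

/// Hypothetical stand-in for the event the real writer would encode.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct RecordedIo {
    pub stream: Stream,
    pub bytes: Vec<u8>,
    pub snapshot: Option<Snapshot>,
}

#[derive(Default)]
pub struct ThreadSnapshotStore {
    per_thread: HashMap<u64, Snapshot>,
    last_global: Option<Snapshot>,
}

impl ThreadSnapshotStore {
    /// Remember the position for this thread and as the latest global step.
    pub fn update(&mut self, thread_id: u64, snap: Snapshot) {
        self.per_thread.insert(thread_id, snap);
        self.last_global = Some(snap);
    }

    /// Prefer the thread's own snapshot; fall back to the last global step.
    pub fn lookup(&self, thread_id: u64) -> Option<Snapshot> {
        self.per_thread.get(&thread_id).copied().or(self.last_global)
    }
}

/// Cloneable handle that reader threads would use to record chunks.
#[derive(Clone, Default)]
pub struct IoEventSink {
    writer: Arc<Mutex<Vec<RecordedIo>>>, // stand-in for the trace writer
    snapshots: Arc<Mutex<ThreadSnapshotStore>>,
}

impl IoEventSink {
    /// Stands in for the update the tracer's `on_line` callback would perform.
    pub fn update_snapshot(&self, thread_id: u64, snap: Snapshot) {
        self.snapshots.lock().unwrap().update(thread_id, snap);
    }

    pub fn record_output(&self, chunk: IoChunk) {
        self.record(chunk);
    }

    pub fn record_input(&self, chunk: IoChunk) {
        self.record(chunk);
    }

    fn record(&self, chunk: IoChunk) {
        // The snapshot lock is released at the end of this statement,
        // so the two mutexes are never held at the same time.
        let snapshot = self.snapshots.lock().unwrap().lookup(chunk.producer_thread);
        self.writer.lock().unwrap().push(RecordedIo {
            stream: chunk.stream,
            bytes: chunk.bytes,
            snapshot,
        });
    }

    /// Copy of everything recorded so far (helper for inspection only).
    pub fn recorded(&self) -> Vec<RecordedIo> {
        self.writer.lock().unwrap().clone()
    }
}
```

In this sketch a chunk from a thread with no snapshot of its own picks up the most recent snapshot from any thread, which is one reading of "fall back to the last global step". Taking the two locks one after the other, never nested, is a deliberate choice here to keep lock ordering trivial.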
diff --git a/design-docs/capture-output-implementation-plan.md b/design-docs/capture-output-implementation-plan.md
@@ -0,0 +1,65 @@
+# Capture Output Implementation Plan
+
+## Goal
+- Ship lossless stdout, stderr, and stdin capture in the Rust recorder without breaking the current CLI flow or error policy.
+
+## Guiding Notes
+- Follow ADR 0005.
+- Keep sentences short for readers; prefer bullets.
+- Run `just test` on every stage.
+
+## Stage 0 – Refactor for IO capture (must land first)
+- Split writer ownership out of `RuntimeTracer` into a helper (`TraceWriterHost`) that exposes a thread-safe event API.
+- Add a small `ThreadSnapshotStore` that records the latest `{path_id, line, frame_id}` per Python thread inside the runtime module.
+- Ensure `RuntimeTracer::finish` waits on background work hooks; add a stub `IoDrain` trait with a no-op implementation so later stages can slot in real drains.
+- Update `session::start_tracing` and `stop_tracing` to accept optional "extra lifecycle" handles so we can pair start/stop work without more globals.
+- Tests: extend existing runtime unit tests to cover the new snapshot store and confirm start/stop paths still finalise trace files.
+
+## Stage 1 – Build the IO capture core
+- Create `runtime::io_capture` with platform-specific back ends (`unix.rs`, `windows.rs`) hidden behind a common trait.
+- Implement descriptor/handle duplication, pipe install, and reader thread startup. Use blocking reads and thread-safe queues (`crossbeam-channel` if it is already in the workspace; add it if missing).
+- Ensure mirror writes go back to the saved descriptors so console output stays live.
+- Tests: add Rust unit tests that fake pipes (use `os_pipe` on Unix, `tempfile` handles on Windows via CI) to confirm duplication and restoration.
+
+## Stage 2 – Connect capture to the tracer
+- Add an `IoEventSink` struct that owns `Arc<Mutex<TraceWriterHost>>` plus a snapshot reader.
+- Reader threads push `IoChunk` structs (`stream`, `timestamp`, `bytes`, `producer_thread`) into the sink. The sink converts them into runtime tracing events and records them.
+- Use `recorder-errors` for all failures (`usage!` for bad config, `enverr!` for IO problems). Log through the existing logging module; never `println!`.
+- Update `RuntimeTracer::begin` to start the sink when policy allows. Store the `IoCapture` handle and drain it in `finish`.
+- Tests: add integration tests in `tests/` that run a sample script writing to stdout/stderr and reading from stdin, then assert trace files contain the matching events. Verify passthrough stays intact.
+
+## Stage 3 – Policy flag, CLI wiring, and guards
+- Extend `RecorderPolicy` with `io_capture_enabled` plus env var `CODETRACER_CAPTURE_IO`.
+- Make the Python CLI surface a `--capture-io` flag (defaults to policy). Document the flag in help text.
+- Emit a single log line when capture is disabled by policy so users understand why their trace lacks IO events.
+- Tests: Python integration test toggling the policy and checking presence/absence of IO records.
+
+## Stage 4 – Hardening and docs
+- Stress test with large outputs (beyond pipe buffer) and interleaved writes from multiple threads.
+- Run Windows CI to verify handle restore logic and CRLF behaviour.
+- Document the feature in README + design docs. Update ADR status once accepted.
+- Add metrics for dropped IO chunks using the existing logging counters.
+- Tests: extend stress tests plus regression tests for start/stop loops to ensure descriptors always restore.
+
+## Milestones
+1. Stage 0 merged and green CI. Serves as base branch for feature work.
+2. Stages 1–2 merged together behind a feature flag. Feature hidden by default.
+3. Stage 3 flips the flag for opted-in users. Gather feedback.
+4. Stage 4 finishes docs, flips default to on, and promotes ADR 0005 to Accepted.
+
+## Verification Checklist
+- `just test` passes after every stage.
+- New unit tests cover writer host, snapshot store, and IO capture workers.
+- Integration tests assert trace events and passthrough behaviour on Linux and Windows.
+- Manual smoke: run `python -m codetracer_python_recorder examples/stdout_script.py` and confirm console output plus IO trace entries.
+
+## Risks & Mitigations
+- **Deadlocks:** Keep reader threads simple, use bounded channels, and add shutdown timeouts tested in CI.
+- **Performance hit:** Benchmark before and after Stage 2 with large stdout workloads; document results.
+- **Platform drift:** Share the Unix/Windows API contract in a `README` inside the module and guard behaviour with tests.
+
+## Exit Criteria
+- IO events present in trace files when the policy flag is on.
+- Console output unchanged for users.
+- No file descriptor leaks (checked via stress tests and `lsof` in CI scripts).
+- Documentation published and linked from ADR 0005.
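
The deadlock mitigation in the plan above (blocking reader threads, bounded channels, shutdown timeouts) can be sketched with the standard library alone. This is an assumption-laden illustration, not the planned implementation: the plan names `crossbeam-channel`, whereas this uses `std::sync::mpsc::sync_channel`, and `spawn_reader` and `drain_with_timeout` are hypothetical names. Any `Read` value stands in for the read end of a capture pipe.

```rust
use std::io::Read;
use std::sync::mpsc::{sync_channel, Receiver, RecvTimeoutError};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Spawn a blocking reader that forwards chunks over a bounded channel.
/// `source` stands in for the read end of a capture pipe.
pub fn spawn_reader<R: Read + Send + 'static>(
    mut source: R,
    capacity: usize,
) -> (JoinHandle<()>, Receiver<Vec<u8>>) {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);
    let handle = thread::spawn(move || {
        let mut buf = [0u8; 4096];
        loop {
            match source.read(&mut buf) {
                // EOF or read error: stop. A fuller version would retry on
                // `ErrorKind::Interrupted` and report other errors.
                Ok(0) | Err(_) => break,
                Ok(n) => {
                    // `send` blocks while the channel is full (back-pressure).
                    if tx.send(buf[..n].to_vec()).is_err() {
                        break; // receiver was dropped
                    }
                }
            }
        }
        // `tx` is dropped here, which closes the channel for the drain loop.
    });
    (handle, rx)
}

/// Drain chunks until the channel closes or the timeout elapses.
/// Returns the collected bytes and `true` if the channel closed cleanly.
pub fn drain_with_timeout(rx: &Receiver<Vec<u8>>, timeout: Duration) -> (Vec<u8>, bool) {
    let deadline = Instant::now() + timeout;
    let mut out = Vec::new();
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(chunk) => out.extend_from_slice(&chunk),
            Err(RecvTimeoutError::Disconnected) => return (out, true),
            Err(RecvTimeoutError::Timeout) => return (out, false),
        }
    }
}
```

A caller would drain first and join the reader thread only when the drain reports a clean close. On a timeout the reader may still be blocked in `read`, so a real `stop` path would also need to close or restore the descriptor to unblock it, as the ADR's teardown step describes.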
diff --git a/design-docs/capture-output.md b/design-docs/capture-output.md