
Commit f16a748

design-docs/adr/0005-io-capture.md: Refactor old spec
design-docs/capture-output-implementation-plan.md:

Signed-off-by: Tzanko Matev <[email protected]>
1 parent 16e2f63 commit f16a748

3 files changed

+108
-204
lines changed

design-docs/adr/0005-io-capture.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
# ADR 0005: Input and Output Capture for Runtime Traces
2+
3+
- **Status:** Proposed
4+
- **Date:** 2025-10-03
5+
- **Deciders:** Runtime recorder maintainers
6+
- **Consulted:** Python platform crew, Replay tooling crew
7+
- **Informed:** DX crew, Release crew
8+
9+
## Context
10+
- The repo now splits session bootstrap, monitoring glue, and runtime logic into clear modules (`session`, `monitoring`, `runtime`).
11+
- `RuntimeTracer` owns the `NonStreamingTraceWriter` and activation rules, and already writes metadata, paths, and step events.
12+
- `recorder-errors` gives us uniform error codes and panic trapping. Every new subsystem must use it.
13+
- We still forward stdout, stderr, and stdin directly to the host console. No bytes reach the trace.
14+
- Replay and debugging teams need IO events beside call and line records so they can rebuild console sessions.
15+
16+
## Problem
17+
- We need lossless IO capture without breaking the in-process `sys.monitoring` design or the new error policy.
18+
- The old pipe-based spec assumed the tracer lived inside `start()` and mutated global state freely. The refactor put lifecycle code behind `TraceSessionBootstrap`, `TraceOutputPaths`, and `RuntimeTracer::begin`.
19+
- We also added activation gating and stricter teardown rules. Any IO hooks must respect them and always restore the original file descriptors.
20+
21+
## Decision
22+
1. Keep the Python CLI contract. `codetracer_python_recorder.start_tracing` keeps installing the tracer, but now also starts an IO capture controller right before `install_tracer` and shuts it down inside `stop_tracing`.
23+
2. Introduce `runtime::io_capture` with a single public type, `IoCapture`. It duplicates stdin/stdout/stderr, installs platform pipes, and spawns blocking reader threads. The module hides Unix vs Windows code paths behind a small trait (`IoEndpoint`).
24+
3. Expose an `IoEventSink` from `RuntimeTracer`. The sink wraps the writer in `Arc<Mutex<...>>` and exposes two safe methods: `record_output(chunk: IoChunk)` and `record_input(chunk: IoChunk)`. Reader threads call the sink only. All conversions to `TraceLowLevelEvent` live next to the writer so we reuse value encoders and error helpers.
25+
4. Extend `RuntimeTracer` with a light `ThreadSnapshotStore`. `on_line` updates the current `{ path_id, line, frame_id }` per Python thread. `IoEventSink` reads the latest snapshot when it serialises a chunk. When no snapshot exists we fall back to the last global step.
26+
5. Store stdout and stderr bytes as `EventLogKind::Write` and `WriteOther`. Store stdin bytes as `EventLogKind::Read`. Metadata includes the stream name, monotonic timestamps, thread tag, and the captured snapshot when present. Bytes stay base64 encoded by the runtime tracing crate.
27+
6. Keep console passthrough. The reader threads mirror each chunk back into the saved file descriptors so users still see live output.
28+
7. Wire capture teardown into existing error handling. `IoCapture::stop` drains the pipes, restores FDs, signals the threads, and logs failures through the `recorder-errors` macros. `RuntimeTracer::finish` waits for the IO channel before calling `TraceWriter::finish_*` to avoid races.
29+
8. Hide the feature behind `RecorderPolicy`. A new flag `policy.io_capture` defaults to off today. Tests and early adopters enable it. Once stable we flip the default.
30+
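Decision item 4 can be sketched as follows. This is a minimal illustration, not the recorder's code: the field types, the `u64` thread key, and the method names `update` and `lookup` are assumptions. It shows the per-thread lookup with the fall-back to the last global step.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Location of the most recent traced line on one Python thread.
/// Field types are assumptions; the ADR only names the fields.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Snapshot {
    pub path_id: usize,
    pub line: u32,
    pub frame_id: u64,
}

#[derive(Default)]
struct Inner {
    per_thread: HashMap<u64, Snapshot>,
    last_global: Option<Snapshot>,
}

/// Hypothetical store: per-thread snapshots plus the last step seen on any thread.
#[derive(Default)]
pub struct ThreadSnapshotStore {
    inner: Mutex<Inner>,
}

impl ThreadSnapshotStore {
    /// Would be called from `on_line`: remember where `thread_id` currently is.
    pub fn update(&self, thread_id: u64, snapshot: Snapshot) {
        // A poisoned lock panics here; real code would route this through the error policy.
        let mut inner = self.inner.lock().unwrap();
        inner.per_thread.insert(thread_id, snapshot);
        inner.last_global = Some(snapshot);
    }

    /// Would be called by the sink: prefer the producing thread's snapshot,
    /// then fall back to the last global step, then to `None`.
    pub fn lookup(&self, thread_id: u64) -> Option<Snapshot> {
        let inner = self.inner.lock().unwrap();
        inner.per_thread.get(&thread_id).copied().or(inner.last_global)
    }
}
```

A single mutex keeps the sketch short. Whether `on_line` can afford that lock on the hot path is one of the contention questions the Consequences section raises.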
31+
## Consequences
32+
- **Upsides:** We capture IO without a subprocess, reuse the refactored writer lifecycle, and keep activation gating intact. Replay tooling reads one stream for events and IO.
33+
- **Costs:** Writer calls now cross a mutex, so we must measure contention. The new module adds platform code that needs tight tests. We must watch out for deadlocks on interpreter shutdown.
34+
35+
## Rollout
36+
- Ship behind an environment toggle `CODETRACER_CAPTURE_IO=1` wired into the policy layer. Emit a warning when the policy disables capture.
37+
- Document the behaviour in the recorder README and the user CLI help once we land the feature.
38+
- Graduate the ADR to **Accepted** after the implementation plan closes and the policy default flips to on for both Unix and Windows.
39+
40+
## Alternatives
41+
- A subprocess wrapper was considered again and rejected. It would undo the refactor that keeps tracing in-process and would break existing embedding use cases.
42+
- `sys.stdout` monkey patching remains off the table. It misses native writes and user-assigned streams.
43+
- Writing IO into a separate JSON file is still unnecessary. The runtime tracing schema already handles IO events.
design-docs/capture-output-implementation-plan.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
1+
# Capture Output Implementation Plan
2+
3+
## Goal
4+
- Ship lossless stdout, stderr, and stdin capture in the Rust recorder without breaking the current CLI flow or error policy.
5+
6+
## Guiding Notes
7+
- Follow ADR 0005.
8+
- Keep sentences short for readers; prefer bullets.
9+
- Run `just test` on every stage.
10+
11+
## Stage 0 – Refactor for IO capture (must land first)
12+
- Split writer ownership out of `RuntimeTracer` into a helper (`TraceWriterHost`) that exposes a thread-safe event API.
13+
- Add a small `ThreadSnapshotStore` that records the latest `{path_id, line, frame_id}` per Python thread inside the runtime module.
14+
- Ensure `RuntimeTracer::finish` waits on background work hooks; add a stub `IoDrain` trait with a no-op implementation so later stages can slot in real drains.
15+
- Update `session::start_tracing` and `stop_tracing` to accept optional "extra lifecycle" handles so we can pair start/stop work without more globals.
16+
- Tests: extend existing runtime unit tests to cover the new snapshot store and confirm start/stop paths still finalise trace files.
17+
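The `IoDrain` stub from Stage 0 might look like the sketch below. The method signature, the `NoopDrain` type, and the `finish_with_drains` helper are assumptions for illustration; the plan only names the trait and asks for a no-op implementation.

```rust
use std::time::Duration;

/// Hypothetical hook that the finish path would call before finalising the writer.
pub trait IoDrain: Send {
    /// Block until pending background work is flushed, or until `timeout` elapses.
    /// Returns `true` when everything was drained.
    fn drain(&mut self, timeout: Duration) -> bool;
}

/// Stage 0 placeholder: nothing runs in the background yet.
#[derive(Default)]
pub struct NoopDrain;

impl IoDrain for NoopDrain {
    fn drain(&mut self, _timeout: Duration) -> bool {
        true
    }
}

/// Stand-in for the finish path: drain every registered hook, even if one fails.
pub fn finish_with_drains(drains: &mut [Box<dyn IoDrain>], timeout: Duration) -> bool {
    let mut all_drained = true;
    for drain in drains.iter_mut() {
        // No short-circuit: later hooks still get a chance to flush.
        all_drained &= drain.drain(timeout);
    }
    all_drained
}
```

Later stages would swap `NoopDrain` for a drain backed by the real capture handle without touching the finish path.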
18+
## Stage 1 – Build the IO capture core
19+
- Create `runtime::io_capture` with platform-specific back ends (`unix.rs`, `windows.rs`) hidden behind a common trait.
20+
- Implement descriptor/handle duplication, pipe install, and reader thread startup. Use blocking reads and thread-safe queues (use `crossbeam-channel` if it is already in the workspace; add it if missing).
21+
- Ensure mirror writes go back to the saved descriptors so console output stays live.
22+
- Tests: add Rust unit tests that fake pipes (use `os_pipe` on Unix, `tempfile` handles on Windows via CI) to confirm duplication and restoration.
23+
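The reader-thread loop from Stage 1 can be sketched with the standard library alone. This is an illustration under assumptions: the function name `pump` is hypothetical, `std::sync::mpsc::sync_channel` stands in for `crossbeam-channel`, and the generic `Read`/`Write` parameters stand in for the pipe read end and the saved descriptor.

```rust
use std::io::{self, Read, Write};
use std::sync::mpsc::SyncSender;

/// Read `source` until EOF, mirror every chunk to `mirror`, and forward a copy to `chunks`.
/// Returns the number of bytes forwarded.
pub fn pump<R: Read, W: Write>(
    mut source: R,
    mut mirror: W,
    chunks: SyncSender<Vec<u8>>,
) -> io::Result<u64> {
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = match source.read(&mut buf) {
            Ok(0) => break, // EOF: the write end was closed
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        // Passthrough first so console output stays live.
        mirror.write_all(&buf[..n])?;
        mirror.flush()?;
        // `send` blocks while the bounded channel is full, so bytes are never dropped.
        if chunks.send(buf[..n].to_vec()).is_err() {
            break; // receiver is gone: capture is shutting down
        }
        total += n as u64;
    }
    Ok(total)
}
```

Because the function is generic over `Read` and `Write`, unit tests can drive it with in-memory buffers before any platform pipe code exists.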
24+
## Stage 2 – Connect capture to the tracer
25+
- Add an `IoEventSink` struct that owns `Arc<Mutex<TraceWriterHost>>` plus a snapshot reader.
26+
- Reader threads push `IoChunk` structs (`stream`, `timestamp`, `bytes`, `producer_thread`) into the sink. The sink converts them into runtime tracing events and records them.
27+
- Use `recorder-errors` for all failures (`usage!` for bad config, `enverr!` for IO problems). Log through the existing logging module; never `println!`.
28+
- Update `RuntimeTracer::begin` to start the sink when policy allows. Store the `IoCapture` handle and drain it in `finish`.
29+
- Tests: add integration tests in `tests/` that run a sample script writing to stdout/stderr and reading from stdin, then assert trace files contain the matching events. Verify passthrough stays intact.
30+
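A rough sketch of the Stage 2 sink follows. `TraceWriterHost`, `EventLogKind`, and `Stream` are local stand-ins, not the real recorder or runtime tracing types. The stdout to `Write` and stderr to `WriteOther` mapping is an assumed reading of ADR 0005. Errors are plain strings here, where real code would use the `recorder-errors` macros.

```rust
use std::sync::{Arc, Mutex};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Stream {
    Stdin,
    Stdout,
    Stderr,
}

/// Chunk fields as listed in the plan; the concrete field types are assumptions.
#[derive(Clone, Debug)]
pub struct IoChunk {
    pub stream: Stream,
    pub timestamp_ns: u64,
    pub bytes: Vec<u8>,
    pub producer_thread: u64,
}

/// Local stand-in for the event kinds named in ADR 0005.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EventLogKind {
    Write,
    WriteOther,
    Read,
}

/// Local stand-in for the writer host; it only collects events in memory.
#[derive(Default)]
pub struct TraceWriterHost {
    pub events: Vec<(EventLogKind, IoChunk)>,
}

#[derive(Clone)]
pub struct IoEventSink {
    writer: Arc<Mutex<TraceWriterHost>>,
}

impl IoEventSink {
    pub fn new(writer: Arc<Mutex<TraceWriterHost>>) -> Self {
        Self { writer }
    }

    /// Record a stdout or stderr chunk; stdin chunks are rejected.
    pub fn record_output(&self, chunk: IoChunk) -> Result<(), String> {
        let kind = match chunk.stream {
            Stream::Stdout => EventLogKind::Write,      // assumed mapping
            Stream::Stderr => EventLogKind::WriteOther, // assumed mapping
            Stream::Stdin => return Err("stdin chunk passed to record_output".into()),
        };
        self.push(kind, chunk)
    }

    /// Record a stdin chunk; output chunks are rejected.
    pub fn record_input(&self, chunk: IoChunk) -> Result<(), String> {
        if chunk.stream != Stream::Stdin {
            return Err("non-stdin chunk passed to record_input".into());
        }
        self.push(EventLogKind::Read, chunk)
    }

    fn push(&self, kind: EventLogKind, chunk: IoChunk) -> Result<(), String> {
        // Surface a poisoned mutex as an error instead of panicking on a reader thread.
        let mut writer = self
            .writer
            .lock()
            .map_err(|_| "trace writer mutex poisoned".to_string())?;
        writer.events.push((kind, chunk));
        Ok(())
    }
}
```

Reader threads would each hold a clone of the sink, so the writer lock is the only shared state they touch.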
31+
## Stage 3 – Policy flag, CLI wiring, and guards
32+
- Extend `RecorderPolicy` with `io_capture_enabled` plus env var `CODETRACER_CAPTURE_IO`.
33+
- Make the Python CLI surface a `--capture-io` flag (defaults to policy). Document the flag in help text.
34+
- Emit a single log line when capture is disabled by policy so users understand why their trace lacks IO events.
35+
- Tests: Python integration test toggling the policy and checking presence/absence of IO records.
36+
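One way the Stage 3 policy resolution could work is sketched below. The plan does not specify which spellings the variable accepts or which source wins, so both are assumptions here: `1`/`true`/`on` and `0`/`false`/`off`, with the CLI flag taking precedence over the environment. The function names are hypothetical.

```rust
/// Hypothetical slice of the policy; only the IO-capture bit is modelled.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct RecorderPolicy {
    pub io_capture_enabled: bool,
}

impl Default for RecorderPolicy {
    fn default() -> Self {
        // Off by default until the feature is stable.
        Self { io_capture_enabled: false }
    }
}

/// Parse the raw value of `CODETRACER_CAPTURE_IO`.
/// `Ok(None)` means the variable is unset. The accepted spellings are an assumption.
pub fn parse_capture_io(raw: Option<&str>) -> Result<Option<bool>, String> {
    let Some(value) = raw else {
        return Ok(None);
    };
    let normalized = value.trim().to_ascii_lowercase();
    match normalized.as_str() {
        "1" | "true" | "on" => Ok(Some(true)),
        "0" | "false" | "off" | "" => Ok(Some(false)),
        other => Err(format!("invalid CODETRACER_CAPTURE_IO value: {other:?}")),
    }
}

/// Assumed precedence: explicit CLI flag, then environment, then the policy default.
pub fn resolve_policy(
    cli_flag: Option<bool>,
    env_value: Option<&str>,
) -> Result<RecorderPolicy, String> {
    let from_env = parse_capture_io(env_value)?;
    let enabled = cli_flag
        .or(from_env)
        .unwrap_or(RecorderPolicy::default().io_capture_enabled);
    Ok(RecorderPolicy { io_capture_enabled: enabled })
}
```

A caller would pass `std::env::var("CODETRACER_CAPTURE_IO").ok().as_deref()` as `env_value`. Taking the value as a parameter keeps the parser testable without mutating the process environment.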
37+
## Stage 4 – Hardening and docs
38+
- Stress test with large outputs (beyond pipe buffer) and interleaved writes from multiple threads.
39+
- Run Windows CI to verify handle restore logic and CRLF behaviour.
40+
- Document the feature in README + design docs. Update ADR status once accepted.
41+
- Add metrics for dropped IO chunks using the existing logging counters.
42+
- Tests: extend stress tests plus regression tests for start/stop loops to ensure descriptors always restore.
43+
44+
## Milestones
45+
1. Stage 0 merged and green CI. Serves as base branch for feature work.
46+
2. Stages 1–2 merged together behind a feature flag. Feature hidden by default.
47+
3. Stage 3 flips the flag for opted-in users. Gather feedback.
48+
4. Stage 4 finishes docs, flips default to on, and promotes ADR 0005 to Accepted.
49+
50+
## Verification Checklist
51+
- `just test` passes after every stage.
52+
- New unit tests cover writer host, snapshot store, and IO capture workers.
53+
- Integration tests assert trace events and passthrough behaviour on Linux and Windows.
54+
- Manual smoke: run `python -m codetracer_python_recorder examples/stdout_script.py` and confirm console output plus IO trace entries.
55+
56+
## Risks & Mitigations
57+
- **Deadlocks:** Keep reader threads simple, use bounded channels, and add shutdown timeouts tested in CI.
58+
- **Performance hit:** Benchmark before and after Stage 2 with large stdout workloads; document results.
59+
- **Platform drift:** Share the Unix/Windows API contract in a `README` inside the module and guard behaviour with tests.
60+
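The shutdown-timeout mitigation could be sketched like this. `Worker`, `spawn_worker`, and `stop` are hypothetical names, and the error strings stand in for whatever the `recorder-errors` macros would produce. The thread signals completion over a channel, and the stopping side waits with `recv_timeout` instead of an unbounded `join`.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical handle for one reader thread.
pub struct Worker {
    done: Receiver<()>,
    handle: JoinHandle<()>,
}

/// Run `body` on a new thread and signal completion over a channel.
pub fn spawn_worker<F>(body: F) -> Worker
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        body();
        // The receiver may already be gone after a timed-out stop; ignore that.
        let _ = tx.send(());
    });
    Worker { done: rx, handle }
}

impl Worker {
    /// Wait up to `timeout` for the thread to finish.
    /// On timeout the handle is dropped, which detaches the thread instead of blocking teardown.
    pub fn stop(self, timeout: Duration) -> Result<(), &'static str> {
        match self.done.recv_timeout(timeout) {
            // Finished, or panicked and dropped the sender: join returns promptly either way.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => self
                .handle
                .join()
                .map_err(|_| "reader thread panicked"),
            Err(RecvTimeoutError::Timeout) => Err("reader thread did not stop in time"),
        }
    }
}
```

A timed-out stop leaves a detached thread behind, so the caller still has to restore the original descriptors itself. The sketch only bounds how long teardown can block.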
61+
## Exit Criteria
62+
- IO events present in trace files when the policy flag is on.
63+
- Console output unchanged for users.
64+
- No file descriptor leaks (checked via stress tests and `lsof` in CI scripts).
65+
- Documentation published and linked from ADR 0005.
