# 0010 – Codetracer Python Recorder Benchmarking

## Status
Proposed – pending review and implementation sequencing (target: post-configurable-trace-filter release).

## Context
- The Rust-backed `codetracer-python-recorder` now exposes configurable trace filters (WS1–WS6) and baseline micro/perf smoke benchmarks, but these are developer-only workflows with no CI visibility or historical tracking.
- Performance regressions are difficult to detect: Criterion runs produce only local reports, the Python smoke benchmark is opt-in, and CI currently exercises only functional correctness.
- Product direction demands confidence that new features (filters, IO capture, PyO3 integration, policy changes) do not introduce unacceptable overhead or redaction slippage across representative workloads.
- We need an auditable, automated benchmarking strategy that integrates with existing tooling (`just`, `uv`, the Nix flake, GitHub Actions/Jenkins) and surfaces trends to the team without slowing the release cadence.

## Decision
We will build a first-class benchmarking suite for `codetracer-python-recorder` with three pillars:

1. **Deterministic harness coverage**
   - Preserve the existing Criterion microbench (`benches/trace_filter.rs`) and the Python smoke benchmark, expanding them into a common `bench` workspace with reusable fixtures and scenario definitions (baseline, glob, regex, IO-heavy, auto-start).
   - Introduce additional Rust benches for runtime hot paths (scope resolution, redaction policy application, telemetry writes) under `codetracer-python-recorder/benches/`.
   - Add Python benchmarks (pytest plugins plus `pytest-benchmark` or custom timers) for end-to-end CLI runs, session API usage, and cross-process start/stop costs; see the pytest-benchmark sketch after this list.

2. **Automated execution & artefacts**
   - Create a dedicated `just bench-all` recipe (or extend `just bench`) that orchestrates all benchmarks, produces structured JSON summaries (`target/perf/*.json`), and archives raw outputs (Criterion reports, flamegraphs when enabled).
   - Provide a stable JSON schema capturing metadata (git SHA, platform, interpreter versions), scenario descriptors, statistics (p50/p95/mean, variance), and regression thresholds; an illustrative shape appears after this list.
   - Ship a lightweight renderer (`scripts/render_bench_report.py`) that compares current results against the latest baseline stored in CI artefacts.

3. **CI integration & historical tracking**
   - Add a continuous benchmark job (nightly, with an optional pull-request trigger) that executes the suite inside the Nix shell (ensuring gnuplot, or Criterion's dependency-free plotting fallback, is available), uploads results as GitHub Actions artefacts for long-term storage, and posts summary comments on PRs.
   - Maintain baseline snapshots in-repo (`codetracer-python-recorder/benchmarks/baselines/*.json`), refreshed on release branches after runs on dedicated hardware.
   - Gate merges when regressions exceed configured tolerances (e.g., >5% slowdown on primary scenarios) unless explicitly approved; a minimal gate sketch appears after this list.
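
A minimal sketch of the kind of session-API benchmark pillar 1 calls for, using `pytest-benchmark`. The module name `codetracer_python_recorder` and the `start`/`stop` entry points are assumptions for illustration; substitute the real session API when wiring this up.

```python
"""Sketch of a session start/stop benchmark; the recorder API shown is assumed."""
import itertools
from pathlib import Path

import pytest

codetracer = pytest.importorskip("codetracer_python_recorder")
_run_ids = itertools.count()


def _record_small_workload(base: Path) -> None:
    # Fresh trace directory per iteration so repeated runs do not collide.
    trace_dir = base / f"run-{next(_run_ids)}"
    trace_dir.mkdir(parents=True)
    codetracer.start(trace_dir)  # hypothetical entry point
    sum(i * i for i in range(10_000))  # tiny representative workload
    codetracer.stop()  # hypothetical entry point


def test_session_start_stop_cost(benchmark, tmp_path: Path) -> None:
    # pytest-benchmark repeats the callable and reports mean/p50/p95 statistics.
    benchmark(_record_small_workload, tmp_path)
```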
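
One possible shape for the `target/perf/*.json` summaries described in pillar 2, sketched as Python dataclasses. The field names are illustrative, not a committed schema.

```python
"""Illustrative layout for the structured benchmark summary; field names are assumptions."""
from dataclasses import dataclass, field


@dataclass
class BenchMetadata:
    git_sha: str
    platform: str          # e.g. "x86_64-linux"
    python_version: str
    rustc_version: str


@dataclass
class ScenarioResult:
    scenario: str          # e.g. "filters-glob", "io-heavy"
    harness: str           # "criterion" or "pytest-benchmark"
    p50_ns: float
    p95_ns: float
    mean_ns: float
    variance_ns2: float
    threshold_pct: float   # allowed regression versus baseline, e.g. 5.0


@dataclass
class BenchReport:
    metadata: BenchMetadata
    results: list[ScenarioResult] = field(default_factory=list)
```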
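
And a minimal sketch of the regression gate from pillar 3, assuming the summary layout above; the 5% default mirrors the tolerance named in this ADR.

```python
"""Minimal regression gate comparing a current summary against a baseline."""
import json
import sys
from pathlib import Path


def check_regressions(baseline_path: Path, current_path: Path, default_tolerance_pct: float = 5.0) -> int:
    baseline = {r["scenario"]: r for r in json.loads(baseline_path.read_text())["results"]}
    failures = []
    for result in json.loads(current_path.read_text())["results"]:
        base = baseline.get(result["scenario"])
        if base is None:
            continue  # new scenario with no baseline yet: nothing to gate on
        tolerance = result.get("threshold_pct", default_tolerance_pct)
        slowdown_pct = (result["p50_ns"] - base["p50_ns"]) / base["p50_ns"] * 100
        if slowdown_pct > tolerance:
            failures.append(f"{result['scenario']}: +{slowdown_pct:.1f}% (limit {tolerance}%)")
    for line in failures:
        print(f"REGRESSION {line}", file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(check_regressions(Path(sys.argv[1]), Path(sys.argv[2])))
```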

Supporting practices:
- Store benchmark configuration alongside the code (`benchconfig.toml`) to keep scenarios versioned and reviewable.
- Keep the opt-in developer tooling (`just bench`) fast by allowing subset filters (e.g., `JUST_BENCH_SCENARIOS=filters,session`); a loader sketch follows this list.
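
A sketch of how the `just bench` wrapper could honour `benchconfig.toml` and the `JUST_BENCH_SCENARIOS` filter. The `[scenarios.*]` table layout is illustrative only; the actual keys would be settled during implementation.

```python
"""Sketch: select benchmark scenarios from benchconfig.toml, honouring JUST_BENCH_SCENARIOS."""
import os
import tomllib  # Python 3.11+
from pathlib import Path


def selected_scenarios(config_path: Path = Path("benchconfig.toml")) -> dict[str, dict]:
    with config_path.open("rb") as fh:
        config = tomllib.load(fh)
    scenarios = config.get("scenarios", {})  # e.g. [scenarios.filters], [scenarios.session]
    requested = os.environ.get("JUST_BENCH_SCENARIOS")
    if not requested:
        return scenarios  # no filter set: run everything
    wanted = {name.strip() for name in requested.split(",") if name.strip()}
    return {name: spec for name, spec in scenarios.items() if name in wanted}
```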

## Rationale
- **Consistency:** Centralising definitions and outputs ensures that local runs and CI share identical workflows, reducing “works on my machine” drift.
- **Observability:** Structured artefacts + historical storage let us graph trends, spot regressions early, and correlate with feature work.
- **Scalability:** By codifying thresholds and baselines, we can expand the suite without rethinking CI each time (e.g., adding memory benchmarks).
- **Maintainability:** Versioned configuration and scripts avoid ad-hoc shell pipelines and make it easy for contributors to extend benchmarks.

## Consequences
Positive:
- Faster detection of performance regressions and validation of expected improvements.
- Shared language for performance goals (scenarios, metrics, thresholds) across Rust and Python components.
- Developers gain confidence via `just bench` parity with CI, plus local comparison tooling.

Negative / Risks:
- Running the full suite may increase CI time; we mitigate by scheduling nightly runs and allowing PR opt-in toggles.
- Maintaining baselines requires disciplined updates whenever we intentionally change performance characteristics.
- Additional scripts and artefacts introduce upkeep; we must document workflows and automate cleanup.

Mitigations:
- Provide partial runs (`just bench --scenarios filters`, `pytest ... -k benchmark`) for quick iteration.
- Automate baseline updates via a `scripts/update_bench_baseline.py` helper with reviewable diffs; see the sketch after this list.
- Document the suite in `docs/onboarding/trace-filters.md` (updated) and a new benchmarking guide.
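
One way `scripts/update_bench_baseline.py` could keep baseline refreshes reviewable: copy the latest summary into the versioned baselines directory and let the resulting git diff carry the review. The per-platform file layout is an assumption.

```python
"""Sketch: promote a fresh benchmark summary to the in-repo baseline."""
import json
import shutil
import sys
from pathlib import Path

BASELINE_DIR = Path("codetracer-python-recorder/benchmarks/baselines")


def promote(summary_path: Path) -> Path:
    report = json.loads(summary_path.read_text())  # sanity-check that the summary parses
    platform = report.get("metadata", {}).get("platform", "unknown")
    target = BASELINE_DIR / f"{platform}.json"  # assumed: one baseline file per platform
    BASELINE_DIR.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(summary_path, target)
    return target


if __name__ == "__main__":
    print(f"updated baseline: {promote(Path(sys.argv[1]))}")
```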

## References
- `codetracer-python-recorder/benches/trace_filter.rs` (current microbench harness).
- `codetracer-python-recorder/tests/python/perf/test_trace_filter_perf.py` (Python smoke benchmark).
- `Justfile` (`bench` recipe) and `nix/flake.nix` (dev shell dependencies, now including gnuplot).
- Storage backend for historical data (settled: GitHub Actions artefacts).