diff --git a/Justfile b/Justfile index 49d5445..c04fe78 100644 --- a/Justfile +++ b/Justfile @@ -40,6 +40,24 @@ test: cargo-test py-test cargo-test: uv run cargo nextest run --manifest-path codetracer-python-recorder/Cargo.toml --workspace --no-default-features +bench: + just venv + ROOT="$(pwd)"; \ + PYTHON_BIN="$ROOT/.venv/bin/python"; \ + if [ ! -x "$PYTHON_BIN" ]; then \ + PYTHON_BIN="$ROOT/.venv/Scripts/python.exe"; \ + fi; \ + if [ ! -x "$PYTHON_BIN" ]; then \ + echo "Python interpreter not found. Run 'just venv ' first."; \ + exit 1; \ + fi; \ + PERF_DIR="$ROOT/codetracer-python-recorder/target/perf"; \ + mkdir -p "$PERF_DIR"; \ + PYO3_PYTHON="$PYTHON_BIN" uv run cargo bench --manifest-path codetracer-python-recorder/Cargo.toml --no-default-features --bench trace_filter && \ + CODETRACER_TRACE_FILTER_PERF=1 \ + CODETRACER_TRACE_FILTER_PERF_OUTPUT="$PERF_DIR/trace_filter_py.json" \ + uv run --group dev --group test pytest codetracer-python-recorder/tests/python/perf/test_trace_filter_perf.py -q + py-test: uv run --group dev --group test pytest codetracer-python-recorder/tests/python codetracer-pure-python-recorder diff --git a/README.md b/README.md index ba1967a..8983ebc 100644 --- a/README.md +++ b/README.md @@ -137,6 +137,7 @@ Basic workflow: - Run the full split test suite (Rust nextest + Python pytest): `just test` - Run only Rust integration/unit tests: `just cargo-test` - Run only Python tests (including the pure-Python recorder to guard regressions): `just py-test` +- Exercise the trace-filter benchmarks (Rust Criterion + Python smoke, JSON output under `codetracer-python-recorder/target/perf`): `just bench` - Collect coverage artefacts locally (LCOV + Cobertura/JSON): `just coverage` The CI workflow mirrors these commands. Pull requests get an automated comment with the latest Rust/Python coverage tables and downloadable artefacts (`lcov.info`, `coverage.xml`, `coverage.json`). diff --git a/codetracer-python-recorder/CHANGELOG.md b/codetracer-python-recorder/CHANGELOG.md index 421398e..a0de02d 100644 --- a/codetracer-python-recorder/CHANGELOG.md +++ b/codetracer-python-recorder/CHANGELOG.md @@ -5,7 +5,13 @@ All notable changes to `codetracer-python-recorder` will be documented in this f The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). ## [Unreleased] + +## [0.2.0] - 2025-10-17 ### Added +- Added configurable trace filters backed by layered TOML files with glob/regex/literal selectors for packages, files, objects, and value domains, strict schema validation via `TraceFilterConfig::from_paths`, and explicit `allow`/`redact`/`drop` value policies summarised with SHA-256 digests. +- Added `TraceFilterEngine` and runtime wiring that cache scope resolutions, gate tracing, substitute `` for filtered payloads, drop suppressed variables entirely, and emit per-kind redaction/drop counters alongside filter summaries in `trace_metadata.json`. +- Exposed configurable filters through the Python API, auto-start hook, CLI (`--trace-filter`), and `CODETRACER_TRACE_FILTER` environment variable while always prepending the built-in default filter that skips stdlib noise and redacts common secrets before layering project overrides. +- Added filter-focused documentation and benchmarking coverage, including onboarding and README guides plus Criterion + Python smoke benchmarks orchestrated by `just bench`. 
- Introduced a line-aware IO capture pipeline that records stdout/stderr chunks with `{path_id, line, frame_id}` attribution via the shared `LineSnapshotStore` and multi-threaded `IoEventSink`. - Added `LineAwareStdout`, `LineAwareStderr`, and `LineAwareStdin` proxies that forward to the original streams while batching writes on newline, explicit `flush()`, 5 ms idle gaps, and step boundaries. - Added policy, CLI, and environment toggles for IO capture (`--io-capture`, `configure_policy(io_capture_line_proxies=..., io_capture_fd_fallback=...)`, `CODETRACER_CAPTURE_IO`) alongside the `ScopedMuteIoCapture` guard that suppresses recursive recorder logging. @@ -22,5 +28,6 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) - Support for generating `trace_metadata.json` and `trace_paths.json` artefacts compatible with the Codetracer db-backend importer. - Cross-platform packaging targeting CPython 3.12 and 3.13 on Linux (manylinux2014 `x86_64`/`aarch64`), macOS universal2, and Windows `amd64`. -[Unreleased]: https://github.com/metacraft-labs/cpr-main/compare/recorder-v0.1.0...HEAD +[Unreleased]: https://github.com/metacraft-labs/cpr-main/compare/recorder-v0.2.0...HEAD +[0.2.0]: https://github.com/metacraft-labs/cpr-main/compare/recorder-v0.1.0...recorder-v0.2.0 [0.1.0]: https://github.com/metacraft-labs/cpr-main/releases/tag/recorder-v0.1.0 diff --git a/codetracer-python-recorder/Cargo.lock b/codetracer-python-recorder/Cargo.lock index c210728..44e5dd6 100644 --- a/codetracer-python-recorder/Cargo.lock +++ b/codetracer-python-recorder/Cargo.lock @@ -2,6 +2,27 @@ # It is not intended for manual editing. version = 4 +[[package]] +name = "aho-corasick" +version = "1.1.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8e60d3430d3a69478ad0993f19238d2df97c507009a52b3c10addcd7f6bcb916" +dependencies = [ + "memchr", +] + +[[package]] +name = "anes" +version = "0.1.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4b46cbb362ab8752921c97e041f5e366ee6297bd428a31275b9fcf1e380f7299" + +[[package]] +name = "anstyle" +version = "1.0.13" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5192cca8006f1fd4f7237516f40fa183bb07f8fbdfedaa0036de5ea9b0b45e78" + [[package]] name = "autocfg" version = "1.5.0" @@ -20,6 +41,25 @@ version = "2.9.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1b8e56985ec62d17e9c1001dc89c88ecd7dc08e47eba5ec7c29c7b5eeecde967" +[[package]] +name = "block-buffer" +version = "0.10.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3078c7629b62d3f0439517fa394996acacc5cbc91c5a20d8c658e77abd503a71" +dependencies = [ + "generic-array", +] + +[[package]] +name = "bstr" +version = "1.12.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "234113d19d0d7d613b40e86fb654acf958910802bcceab913a4f9e7cda03b1a4" +dependencies = [ + "memchr", + "serde", +] + [[package]] name = "bumpalo" version = "3.19.0" @@ -44,6 +84,12 @@ dependencies = [ "capnp", ] +[[package]] +name = "cast" +version = "0.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "37b2a672a2cb129a2e41c10b1224bb368f9f37a2b16b612598138befd7b37eb5" + [[package]] name = "cbor4ii" version = "1.0.0" @@ -70,24 +116,167 @@ version = "1.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9555578bc9e57714c812a1f84e4fc5b4d21fcb063490c624de019f7464c91268" +[[package]] +name = "ciborium" 
+version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "42e69ffd6f0917f5c029256a24d0161db17cea3997d185db0d35926308770f0e" +dependencies = [ + "ciborium-io", + "ciborium-ll", + "serde", +] + +[[package]] +name = "ciborium-io" +version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "05afea1e0a06c9be33d539b876f1ce3692f4afea2cb41f740e7743225ed1c757" + +[[package]] +name = "ciborium-ll" +version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "57663b653d948a338bfb3eeba9bb2fd5fcfaecb9e199e87e1eda4d9e8b240fd9" +dependencies = [ + "ciborium-io", + "half", +] + +[[package]] +name = "clap" +version = "4.5.49" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f4512b90fa68d3a9932cea5184017c5d200f5921df706d45e853537dea51508f" +dependencies = [ + "clap_builder", +] + +[[package]] +name = "clap_builder" +version = "4.5.49" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0025e98baa12e766c67ba13ff4695a887a1eba19569aad00a472546795bd6730" +dependencies = [ + "anstyle", + "clap_lex", +] + +[[package]] +name = "clap_lex" +version = "0.7.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a1d728cc89cf3aee9ff92b05e62b19ee65a02b5702cff7d5a377e32c6ae29d8d" + [[package]] name = "codetracer-python-recorder" version = "0.1.0" dependencies = [ "bitflags", + "criterion", "dashmap", + "globset", "libc", "log", "once_cell", "pyo3", "recorder-errors", + "regex", "runtime_tracing", "serde", "serde_json", + "sha2", "tempfile", + "toml", "uuid", ] +[[package]] +name = "cpufeatures" +version = "0.2.17" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "59ed5838eebb26a2bb2e58f6d5b5316989ae9d08bab10e0e6d103e656d1b0280" +dependencies = [ + "libc", +] + +[[package]] +name = "criterion" +version = "0.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f2b12d017a929603d80db1831cd3a24082f8137ce19c69e6447f54f5fc8d692f" +dependencies = [ + "anes", + "cast", + "ciborium", + "clap", + "criterion-plot", + "is-terminal", + "itertools", + "num-traits", + "once_cell", + "oorandom", + "plotters", + "rayon", + "regex", + "serde", + "serde_derive", + "serde_json", + "tinytemplate", + "walkdir", +] + +[[package]] +name = "criterion-plot" +version = "0.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6b50826342786a51a89e2da3a28f1c32b06e387201bc2d19791f622c673706b1" +dependencies = [ + "cast", + "itertools", +] + +[[package]] +name = "crossbeam-deque" +version = "0.8.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9dd111b7b7f7d55b72c0a6ae361660ee5853c9af73f70c3c2ef6858b950e2e51" +dependencies = [ + "crossbeam-epoch", + "crossbeam-utils", +] + +[[package]] +name = "crossbeam-epoch" +version = "0.9.18" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5b82ac4a3c2ca9c3460964f020e1402edd5753411d7737aa39c3714ad1b5420e" +dependencies = [ + "crossbeam-utils", +] + +[[package]] +name = "crossbeam-utils" +version = "0.8.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d0a5c400df2834b80a4c3327b3aad3a4c4cd4de0629063962b03235697506a28" + +[[package]] +name = "crunchy" +version = "0.2.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "460fbee9c2c2f33933d720630a6a0bac33ba7053db5344fac858d4b8952d77d5" + +[[package]] +name = 
"crypto-common" +version = "0.1.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1bfb12502f3fc46cca1bb51ac28df9d618d813cdc3d2f25b9fe775a34af26bb3" +dependencies = [ + "generic-array", + "typenum", +] + [[package]] name = "dashmap" version = "5.5.3" @@ -95,18 +284,40 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "978747c1d849a7d2ee5e8adc0159961c48fb7e5db2f06af6723b80123bb53856" dependencies = [ "cfg-if", - "hashbrown", + "hashbrown 0.14.5", "lock_api", "once_cell", "parking_lot_core", ] +[[package]] +name = "digest" +version = "0.10.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9ed9a281f7bc9b7576e61468ba615a66a5c8cfdff42420a70aa82701a3b1e292" +dependencies = [ + "block-buffer", + "crypto-common", +] + +[[package]] +name = "either" +version = "1.15.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "48c757948c5ede0e46177b7add2e67155f70e33c07fea8284df6576da70b3719" + [[package]] name = "embedded-io" version = "0.6.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "edd0f118536f44f5ccd48bcb8b111bdc3de888b58c74639dfb034a357d0f206d" +[[package]] +name = "equivalent" +version = "1.0.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "877a4ace8713b0bcf2a4e7eec82529c029f1d0619886d18145fea96c3ffe5c0f" + [[package]] name = "errno" version = "0.3.14" @@ -114,7 +325,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb" dependencies = [ "libc", - "windows-sys", + "windows-sys 0.60.2", ] [[package]] @@ -132,6 +343,16 @@ dependencies = [ "log", ] +[[package]] +name = "generic-array" +version = "0.14.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4bb6743198531e02858aeaea5398fcc883e71851fcbcb5a2f773e2fb6cb1edf2" +dependencies = [ + "typenum", + "version_check", +] + [[package]] name = "getrandom" version = "0.3.3" @@ -144,24 +365,90 @@ dependencies = [ "wasi", ] +[[package]] +name = "globset" +version = "0.4.17" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "eab69130804d941f8075cfd713bf8848a2c3b3f201a9457a11e6f87e1ab62305" +dependencies = [ + "aho-corasick", + "bstr", + "log", + "regex-automata", + "regex-syntax", +] + +[[package]] +name = "half" +version = "2.7.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6ea2d84b969582b4b1864a92dc5d27cd2b77b622a8d79306834f1be5ba20d84b" +dependencies = [ + "cfg-if", + "crunchy", + "zerocopy", +] + [[package]] name = "hashbrown" version = "0.14.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e5274423e17b7c9fc20b6e7e208532f9b19825d82dfd615708b70edd83df41f1" +[[package]] +name = "hashbrown" +version = "0.16.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5419bdc4f6a9207fbeba6d11b604d481addf78ecd10c11ad51e76c2f6482748d" + [[package]] name = "heck" version = "0.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea" +[[package]] +name = "hermit-abi" +version = "0.5.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fc0fef456e4baa96da950455cd02c081ca953b141298e41db3fc7e36b1da849c" + +[[package]] +name = "indexmap" +version = "2.11.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"4b0f83760fb341a774ed326568e19f5a863af4a952def8c39f9ab92fd95b88e5" +dependencies = [ + "equivalent", + "hashbrown 0.16.0", +] + [[package]] name = "indoc" version = "2.0.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f4c7245a08504955605670dbf141fceab975f15ca21570696aebe9d2e71576bd" +[[package]] +name = "is-terminal" +version = "0.4.16" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e04d7f318608d35d4b61ddd75cbdaee86b023ebe2bd5a66ee0915f0bf93095a9" +dependencies = [ + "hermit-abi", + "libc", + "windows-sys 0.59.0", +] + +[[package]] +name = "itertools" +version = "0.10.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b0fd2260e829bddf4cb6ea802289de2f86d6a7a690192fbe91b3f46e0f2c8473" +dependencies = [ + "either", +] + [[package]] name = "itoa" version = "1.0.15" @@ -257,6 +544,12 @@ version = "1.21.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "42f5e15c9953c5e4ccceeb2e7382a716482c34515315f7b03532b8b4e8393d2d" +[[package]] +name = "oorandom" +version = "11.1.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d6790f58c7ff633d8771f42965289203411a5e5c68388703c06e14f24770b41e" + [[package]] name = "parking_lot_core" version = "0.9.11" @@ -276,6 +569,34 @@ version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7edddbd0b52d732b21ad9a5fab5c704c14cd949e5e9a1ec5929a24fded1b904c" +[[package]] +name = "plotters" +version = "0.3.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5aeb6f403d7a4911efb1e33402027fc44f29b5bf6def3effcc22d7bb75f2b747" +dependencies = [ + "num-traits", + "plotters-backend", + "plotters-svg", + "wasm-bindgen", + "web-sys", +] + +[[package]] +name = "plotters-backend" +version = "0.3.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "df42e13c12958a16b3f7f4386b9ab1f3e7933914ecea48da7139435263a4172a" + +[[package]] +name = "plotters-svg" +version = "0.3.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "51bae2ac328883f7acdfea3d66a7c35751187f870bc81f94563733a154d7a670" +dependencies = [ + "plotters-backend", +] + [[package]] name = "portable-atomic" version = "1.11.1" @@ -368,6 +689,26 @@ version = "5.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "69cdb34c158ceb288df11e18b4bd39de994f6657d83847bdffdbd7f346754b0f" +[[package]] +name = "rayon" +version = "1.11.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "368f01d005bf8fd9b1206fb6fa653e6c4a81ceb1466406b81792d87c5677a58f" +dependencies = [ + "either", + "rayon-core", +] + +[[package]] +name = "rayon-core" +version = "1.13.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "22e18b0f0062d30d4230b2e85ff77fdfe4326feb054b9783a3460d8435c8ab91" +dependencies = [ + "crossbeam-deque", + "crossbeam-utils", +] + [[package]] name = "recorder-errors" version = "0.1.0" @@ -384,6 +725,35 @@ dependencies = [ "bitflags", ] +[[package]] +name = "regex" +version = "1.12.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "843bc0191f75f3e22651ae5f1e72939ab2f72a4bc30fa80a066bd66edefc24d4" +dependencies = [ + "aho-corasick", + "memchr", + "regex-automata", + "regex-syntax", +] + +[[package]] +name = "regex-automata" +version = "0.4.13" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"5276caf25ac86c8d810222b3dbb938e512c55c6831a10f3e6ed1c93b84041f1c" +dependencies = [ + "aho-corasick", + "memchr", + "regex-syntax", +] + +[[package]] +name = "regex-syntax" +version = "0.8.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7a2d987857b319362043e95f5353c0535c1f58eec5336fdfcf626430af7def58" + [[package]] name = "runtime_tracing" version = "0.14.0" @@ -413,7 +783,7 @@ dependencies = [ "errno", "libc", "linux-raw-sys", - "windows-sys", + "windows-sys 0.60.2", ] [[package]] @@ -428,6 +798,15 @@ version = "1.0.20" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "28d3b2b1366ec20994f1fd18c3c594f05c5dd4bc44d8bb0c1c632c8d6829481f" +[[package]] +name = "same-file" +version = "1.0.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "93fc1dc3aaa9bfed95e02e6eadabb4baf7e3078b0bd1b4d7b6b0b68378900502" +dependencies = [ + "winapi-util", +] + [[package]] name = "scopeguard" version = "1.2.0" @@ -477,6 +856,26 @@ dependencies = [ "syn", ] +[[package]] +name = "serde_spanned" +version = "0.6.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bf41e0cfaf7226dca15e8197172c295a782857fcb97fad1808a166870dee75a3" +dependencies = [ + "serde", +] + +[[package]] +name = "sha2" +version = "0.10.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a7507d819769d01a365ab707794a4084392c824f54a7a6a7862f8c3d0892b283" +dependencies = [ + "cfg-if", + "cpufeatures", + "digest", +] + [[package]] name = "shlex" version = "1.3.0" @@ -516,9 +915,66 @@ dependencies = [ "getrandom", "once_cell", "rustix", - "windows-sys", + "windows-sys 0.60.2", +] + +[[package]] +name = "tinytemplate" +version = "1.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "be4d6b5f19ff7664e8c98d03e2139cb510db9b0a60b55f8e8709b689d939b6bc" +dependencies = [ + "serde", + "serde_json", +] + +[[package]] +name = "toml" +version = "0.8.23" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dc1beb996b9d83529a9e75c17a1686767d148d70663143c7854d8b4a09ced362" +dependencies = [ + "serde", + "serde_spanned", + "toml_datetime", + "toml_edit", +] + +[[package]] +name = "toml_datetime" +version = "0.6.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "22cddaf88f4fbc13c51aebbf5f8eceb5c7c5a9da2ac40a13519eb5b0a0e8f11c" +dependencies = [ + "serde", +] + +[[package]] +name = "toml_edit" +version = "0.22.27" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "41fe8c660ae4257887cf66394862d21dbca4a6ddd26f04a3560410406a2f819a" +dependencies = [ + "indexmap", + "serde", + "serde_spanned", + "toml_datetime", + "toml_write", + "winnow", ] +[[package]] +name = "toml_write" +version = "0.1.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5d99f8c9a7727884afe522e9bd5edbfc91a3312b36a77b5fb8926e4c31a41801" + +[[package]] +name = "typenum" +version = "1.19.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "562d481066bde0658276a35467c4af00bdc6ee726305698a55b86e61d7ad82bb" + [[package]] name = "unicode-ident" version = "1.0.18" @@ -542,6 +998,22 @@ dependencies = [ "wasm-bindgen", ] +[[package]] +name = "version_check" +version = "0.9.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a" + +[[package]] +name = "walkdir" +version = "2.5.0" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "29790946404f91d9c5d06f9874efddea1dc06c5efe94541a7d6863108e3a5e4b" +dependencies = [ + "same-file", + "winapi-util", +] + [[package]] name = "wasi" version = "0.14.2+wasi-0.2.4" @@ -610,12 +1082,40 @@ dependencies = [ "unicode-ident", ] +[[package]] +name = "web-sys" +version = "0.3.81" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9367c417a924a74cae129e6a2ae3b47fabb1f8995595ab474029da749a8be120" +dependencies = [ + "js-sys", + "wasm-bindgen", +] + +[[package]] +name = "winapi-util" +version = "0.1.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c2a7b1c03c876122aa43f3020e6c3c3ee5c05081c9a00739faf7503aeba10d22" +dependencies = [ + "windows-sys 0.60.2", +] + [[package]] name = "windows-link" version = "0.1.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5e6ad25900d524eaabdbbb96d20b4311e1e7ae1699af4fb28c17ae66c80d798a" +[[package]] +name = "windows-sys" +version = "0.59.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1e38bc4d79ed67fd075bcc251a1c39b32a1776bbe92e5bef1f0bf1f8c531853b" +dependencies = [ + "windows-targets 0.52.6", +] + [[package]] name = "windows-sys" version = "0.60.2" @@ -754,6 +1254,15 @@ version = "0.53.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "271414315aff87387382ec3d271b52d7ae78726f5d44ac98b4f4030c91880486" +[[package]] +name = "winnow" +version = "0.7.13" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "21a0236b59786fed61e2a80582dd500fe61f18b5dca67a4a067d0bc9039339cf" +dependencies = [ + "memchr", +] + [[package]] name = "wit-bindgen-rt" version = "0.39.0" @@ -772,6 +1281,26 @@ dependencies = [ "zstd-safe", ] +[[package]] +name = "zerocopy" +version = "0.8.27" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0894878a5fa3edfd6da3f88c4805f4c8558e2b996227a3d864f47fe11e38282c" +dependencies = [ + "zerocopy-derive", +] + +[[package]] +name = "zerocopy-derive" +version = "0.8.27" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "88d2b8d9c68ad2b9e4340d7832716a4d21a22a1154777ad56ea55c51a9cf3831" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + [[package]] name = "zstd-safe" version = "7.2.4" diff --git a/codetracer-python-recorder/Cargo.toml b/codetracer-python-recorder/Cargo.toml index 63eb836..e9484d6 100644 --- a/codetracer-python-recorder/Cargo.toml +++ b/codetracer-python-recorder/Cargo.toml @@ -32,7 +32,16 @@ serde_json = "1.0" uuid = { version = "1.10", features = ["v4"] } recorder-errors = { version = "0.1.0", path = "crates/recorder-errors" } libc = "0.2" +globset = "0.4" +regex = "1.11" +toml = "0.8" +sha2 = "0.10" [dev-dependencies] pyo3 = { version = "0.25.1", features = ["auto-initialize"] } tempfile = "3.10" +criterion = { version = "0.5", features = ["html_reports"] } + +[[bench]] +name = "trace_filter" +harness = false diff --git a/codetracer-python-recorder/README.md b/codetracer-python-recorder/README.md index b9ff494..683433b 100644 --- a/codetracer-python-recorder/README.md +++ b/codetracer-python-recorder/README.md @@ -30,7 +30,7 @@ python -m codetracer_python_recorder \ --trace-dir ./trace-out \ --format json \ --activation-path app/main.py \ - --with-diff \ + --trace-filter config/trace-filter.toml \ app/main.py --arg=value ``` @@ -40,14 +40,48 @@ python -m codetracer_python_recorder \ integration with the DB backend 
importer. - `--activation-path` – optional gate that postpones tracing until the interpreter executes this file (defaults to the target script). -- `--with-diff` / `--no-with-diff` – records the caller’s preference in - `trace_metadata.json`. The desktop Codetracer CLI is responsible for generating - diff artefacts; the recorder simply surfaces the flag. +- `--trace-filter` – path to a filter file. Provide multiple times or use `::` + separators within a single argument to build a chain. When present, the recorder + prepends the project default `.codetracer/trace-filter.toml` (if found near the + target script) so later entries override the defaults. The + `CODETRACER_TRACE_FILTER` environment variable accepts the same `::`-separated + syntax when using the auto-start hook. All additional arguments are forwarded to the target script unchanged. The CLI reuses whichever interpreter launches it so wrappers such as `uv run`, `pipx`, or activated virtual environments behave identically to `python script.py`. +## Trace filter configuration +- Filter files are TOML with `[meta]`, `[scope]`, and `[[scope.rules]]` tables. Rules evaluate in declaration order and can tweak both execution (`exec`) and value decisions (`value_default`). +- Supported selector domains: `pkg`, `file`, `obj` for scopes; `local`, `global`, `arg`, `ret`, `attr` for value policies. Match types default to `glob` and also accept `regex` or `literal` (e.g. `local:regex:^(metric|masked)_\w+$`). +- Default discovery: `.codetracer/trace-filter.toml` next to the traced script. Chain additional files via CLI (`--trace-filter path_a --trace-filter path_b`), environment variable (`CODETRACER_TRACE_FILTER=path_a::path_b`), or Python helpers (`trace(..., trace_filter=[path_a, path_b])`). Later entries override earlier ones when selectors overlap. +- A built-in `builtin-default` filter is always prepended. It skips CPython standard-library frames (e.g. `asyncio`, `threading`, `importlib`) while re-enabling third-party packages under `site-packages` (except helpers such as `_virtualenv.py`), and redacts common secrets (`password`, `token`, API keys, etc.) across locals/globals/args/returns/attributes. Project filters can loosen or tighten these defaults as required. +- Runtime metadata captures the active chain under `trace_metadata.json -> trace_filter`, including per-kind redaction and drop counters. See `docs/onboarding/trace-filters.md` for the full DSL reference and examples. + +Example snippet: +```toml +[meta] +name = "local-redaction" +version = 1 + +[scope] +default_exec = "trace" +default_value_action = "allow" + +[[scope.rules]] +selector = "pkg:my_app.services.*" +value_default = "redact" +[[scope.rules.value_patterns]] +selector = "local:glob:public_*" +action = "allow" +[[scope.rules.value_patterns]] +selector = 'local:regex:^(metric|masked)_\w+$' +action = "allow" +[[scope.rules.value_patterns]] +selector = "arg:literal:debug_payload" +action = "drop" +``` + ## Packaging expectations Desktop installers add the wheel to `PYTHONPATH` before invoking the user’s @@ -58,3 +92,8 @@ with the interpreter you want to trace. The CLI writes recorder metadata into `trace_metadata.json` describing the wheel version, target script, and diff preference so downstream tooling can make decisions without re-running the trace. + +## Development benchmarks +- Rust microbench: `cargo bench --bench trace_filter --no-default-features` exercises baseline, glob-heavy, and regex-heavy selector chains. 
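For reference, a minimal Python-side sketch of the same filter chaining described in the trace filter configuration section above (assuming the `trace` helper is exported at the package root as the helper reference suggests, and that both filter paths already exist on disk; the paths and workload below are illustrative, not part of this change):

```python
from pathlib import Path

import codetracer_python_recorder as rec

# Later entries override earlier ones when selectors overlap; every path must exist.
filters = [
    Path(".codetracer/trace-filter.toml"),
    Path("filters/ci-overrides.toml"),
]

# A single "::"-separated string such as
# ".codetracer/trace-filter.toml::filters/ci-overrides.toml" is accepted as well.
with rec.trace("trace-out", format="json", trace_filter=filters):
    total = sum(range(1_000))  # stand-in for the real workload being traced
```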
+- Python smoke benchmark: `pytest codetracer-python-recorder/tests/python/perf/test_trace_filter_perf.py -q` when the environment variable `CODETRACER_TRACE_FILTER_PERF=1` is set. +- Run both together with `just bench`. The helper seeds a virtualenv, runs Criterion, then executes the Python smoke test while writing `target/perf/trace_filter_py.json` (per-scenario durations plus redaction/drop statistics). diff --git a/codetracer-python-recorder/benches/trace_filter.rs b/codetracer-python-recorder/benches/trace_filter.rs new file mode 100644 index 0000000..0ed7e2e --- /dev/null +++ b/codetracer-python-recorder/benches/trace_filter.rs @@ -0,0 +1,417 @@ +use std::ffi::CString; +use std::fs; +use std::path::{Path, PathBuf}; +use std::sync::Arc; + +use codetracer_python_recorder::trace_filter::config::TraceFilterConfig; +use codetracer_python_recorder::trace_filter::engine::{TraceFilterEngine, ValueKind}; +use codetracer_python_recorder::CodeObjectWrapper; +use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput}; +use pyo3::prelude::*; +use pyo3::types::{PyAny, PyCode, PyModule}; +use tempfile::{tempdir, TempDir}; + +const CALLS_PER_BATCH: usize = 10_000; +const LOCALS_PER_CALL: usize = 50; +const FUNCTIONS_PER_MODULE: usize = 10; +const SERVICES_MODULES: usize = 6; +const WORKER_MODULES: usize = 3; +const EXTERNAL_MODULES: usize = 1; +const UNIQUE_CODE_OBJECTS: usize = + (SERVICES_MODULES + WORKER_MODULES + EXTERNAL_MODULES) * FUNCTIONS_PER_MODULE; + +fn bench_trace_filters(c: &mut Criterion) { + pyo3::prepare_freethreaded_python(); + + let workspace = BenchWorkspace::initialise(); + let dataset = Arc::clone(&workspace.dataset); + let scenarios = workspace.build_scenarios(); + + let mut group = c.benchmark_group("trace_filter"); + group.throughput(Throughput::Elements(CALLS_PER_BATCH as u64)); + + for scenario in scenarios { + let engine = Arc::clone(&scenario.engine); + prewarm_engine(engine.as_ref(), dataset.as_ref()); + + let dataset_ref = Arc::clone(&dataset); + group.bench_function(BenchmarkId::new("workload", scenario.label), move |b| { + b.iter(|| run_workload(engine.as_ref(), dataset_ref.as_ref())); + }); + } + + group.finish(); +} + +criterion_group!(trace_filter, bench_trace_filters); +criterion_main!(trace_filter); + +fn run_workload(engine: &TraceFilterEngine, dataset: &WorkloadDataset) { + let kind = ValueKind::Local; + Python::with_gil(|py| { + for &index in &dataset.event_indices { + let code = dataset.codes[index].as_ref(); + let resolution = engine + .resolve(py, code) + .expect("trace filter resolution should succeed during benchmarking"); + let policy = resolution.value_policy(); + for name in dataset.locals.iter() { + black_box(policy.decide(kind, name)); + } + } + }); +} + +fn prewarm_engine(engine: &TraceFilterEngine, dataset: &WorkloadDataset) { + Python::with_gil(|py| { + for code in &dataset.codes { + let _ = engine + .resolve(py, code.as_ref()) + .expect("prewarm resolution failed"); + } + }); +} + +struct BenchWorkspace { + _root: TempDir, + filters: FilterFiles, + dataset: Arc, +} + +impl BenchWorkspace { + fn initialise() -> Self { + let root = tempdir().expect("failed to create benchmark workspace"); + let project_root = root.path().to_path_buf(); + let codetracer_dir = project_root.join(".codetracer"); + fs::create_dir_all(&codetracer_dir).expect("failed to create .codetracer directory"); + + let filters = FilterFiles::create(&codetracer_dir); + let dataset = Python::with_gil(|py| WorkloadDataset::new(py, &project_root)) + 
.expect("failed to build workload dataset"); + + BenchWorkspace { + _root: root, + filters, + dataset: Arc::new(dataset), + } + } + + fn build_scenarios(&self) -> Vec { + vec![ + FilterScenario::new("baseline", &self.filters.baseline), + FilterScenario::new("glob", &self.filters.glob), + FilterScenario::new("regex", &self.filters.regex), + ] + } +} + +struct FilterScenario { + label: &'static str, + engine: Arc, +} + +impl FilterScenario { + fn new(label: &'static str, path: &Path) -> Self { + let config = TraceFilterConfig::from_paths(&[path.to_path_buf()]) + .expect("failed to load benchmark trace filter"); + let engine = TraceFilterEngine::new(config); + FilterScenario { + label, + engine: Arc::new(engine), + } + } +} + +struct FilterFiles { + baseline: PathBuf, + glob: PathBuf, + regex: PathBuf, +} + +impl FilterFiles { + fn create(dir: &Path) -> Self { + let baseline = dir.join("bench-baseline.toml"); + let glob = dir.join("bench-glob.toml"); + let regex = dir.join("bench-regex.toml"); + + fs::write(&baseline, baseline_config()).expect("failed to write baseline filter"); + fs::write(&glob, glob_config()).expect("failed to write glob filter"); + fs::write(®ex, regex_config()).expect("failed to write regex filter"); + + FilterFiles { + baseline, + glob, + regex, + } + } +} + +struct WorkloadDataset { + codes: Vec>, + event_indices: Vec, + locals: Arc<[String]>, +} + +impl WorkloadDataset { + fn new(py: Python<'_>, project_root: &Path) -> PyResult { + let local_names = build_local_names(); + let specs = build_module_specs(); + let mut codes = Vec::with_capacity(UNIQUE_CODE_OBJECTS); + + for spec in specs { + let file_path = project_root.join(&spec.relative_path); + let source = module_source(&spec.func_prefix, spec.functions, &local_names); + + let source_c = CString::new(source).expect("module source cannot contain NUL bytes"); + let file_c = CString::new(file_path.to_string_lossy().into_owned()) + .expect("file path cannot contain NUL bytes"); + let module_c = CString::new(spec.module_name.clone()) + .expect("module name cannot contain NUL bytes"); + + let module = PyModule::from_code( + py, + source_c.as_c_str(), + file_c.as_c_str(), + module_c.as_c_str(), + )?; + for idx in 0..spec.functions { + let func_name = format!("{}_{}", spec.func_prefix, idx); + let func: Bound<'_, PyAny> = module.getattr(&func_name)?; + let code = func.getattr("__code__")?.downcast_into::()?; + codes.push(Arc::new(CodeObjectWrapper::new(py, &code))); + } + } + + assert_eq!( + codes.len(), + UNIQUE_CODE_OBJECTS, + "unexpected number of code objects generated for benchmark" + ); + + let mut event_indices = Vec::with_capacity(CALLS_PER_BATCH); + for i in 0..CALLS_PER_BATCH { + event_indices.push(i % codes.len()); + } + + let locals: Arc<[String]> = Arc::from(local_names); + + Ok(WorkloadDataset { + codes, + event_indices, + locals, + }) + } +} + +struct ModuleSpec { + relative_path: String, + module_name: String, + func_prefix: String, + functions: usize, +} + +impl ModuleSpec { + fn new( + relative_path: String, + module_name: String, + func_prefix: String, + functions: usize, + ) -> Self { + ModuleSpec { + relative_path, + module_name, + func_prefix, + functions, + } + } +} + +fn build_module_specs() -> Vec { + let mut specs = Vec::with_capacity(SERVICES_MODULES + WORKER_MODULES + EXTERNAL_MODULES); + + for idx in 0..SERVICES_MODULES { + specs.push(ModuleSpec::new( + format!("bench_pkg/services/api/module_{idx}.py"), + format!("bench_pkg.services.api.module_{idx}"), + format!("api_handler_{idx}"), + 
FUNCTIONS_PER_MODULE, + )); + } + + for idx in 0..WORKER_MODULES { + specs.push(ModuleSpec::new( + format!("bench_pkg/jobs/worker/module_{idx}.py"), + format!("bench_pkg.jobs.worker.module_{idx}"), + format!("worker_task_{idx}"), + FUNCTIONS_PER_MODULE, + )); + } + + for idx in 0..EXTERNAL_MODULES { + specs.push(ModuleSpec::new( + format!("bench_pkg/external/integration_{idx}.py"), + format!("bench_pkg.external.integration_{idx}"), + format!("integration_op_{idx}"), + FUNCTIONS_PER_MODULE, + )); + } + + specs +} + +fn module_source(func_prefix: &str, function_count: usize, local_names: &[String]) -> String { + let mut source = String::new(); + for idx in 0..function_count { + let func_name = format!("{func_prefix}_{idx}"); + source.push_str("def "); + source.push_str(&func_name); + source.push_str("(value):\n"); + for (offset, name) in local_names.iter().enumerate() { + source.push_str(" "); + source.push_str(name); + source.push_str(" = value + "); + source.push_str(&offset.to_string()); + source.push('\n'); + } + source.push_str(" return value\n\n"); + } + source +} + +fn build_local_names() -> Vec { + let mut names = Vec::with_capacity(LOCALS_PER_CALL); + for idx in 0..15 { + names.push(format!("public_field_{idx}")); + } + for idx in 0..15 { + names.push(format!("secret_field_{idx}")); + } + for idx in 0..10 { + names.push(format!("token_{idx}")); + } + names.push("password_hash".to_string()); + names.push("api_key".to_string()); + names.push("credit_card".to_string()); + names.push("session_id".to_string()); + names.push("metric_latency".to_string()); + names.push("metric_throughput".to_string()); + names.push("metric_error_rate".to_string()); + names.push("masked_value".to_string()); + names.push("debug_flag".to_string()); + names.push("trace_id".to_string()); + + assert_eq!(names.len(), LOCALS_PER_CALL, "local name count mismatch"); + names +} + +fn baseline_config() -> String { + r#" +[meta] +name = "bench-baseline" +version = 1 +description = "Tracing baseline without additional filter overhead." + +[scope] +default_exec = "trace" +default_value_action = "allow" +"# + .trim_start_matches('\n') + .to_string() +} + +fn glob_config() -> String { + r#" +[meta] +name = "bench-glob" +version = 1 +description = "Glob-heavy rule set for microbenchmark coverage." 
+ +[scope] +default_exec = "trace" +default_value_action = "allow" + +[[scope.rules]] +selector = "pkg:bench_pkg.services.api.*" +value_default = "redact" +reason = "Redact service locals except approved public fields" +[[scope.rules.value_patterns]] +selector = "local:glob:public_*" +action = "allow" +[[scope.rules.value_patterns]] +selector = "local:glob:metric_*" +action = "allow" +[[scope.rules.value_patterns]] +selector = "local:glob:secret_*" +action = "redact" +[[scope.rules.value_patterns]] +selector = "local:glob:token_*" +action = "redact" +[[scope.rules.value_patterns]] +selector = "local:glob:masked_*" +action = "allow" +[[scope.rules.value_patterns]] +selector = "local:glob:password_*" +action = "redact" + +[[scope.rules]] +selector = "file:glob:bench_pkg/jobs/worker/module_*.py" +exec = "skip" +reason = "Disable redundant worker instrumentation" + +[[scope.rules]] +selector = "pkg:bench_pkg.external.integration_*" +value_default = "redact" +[[scope.rules.value_patterns]] +selector = "local:glob:metric_*" +action = "allow" +[[scope.rules.value_patterns]] +selector = "local:glob:public_*" +action = "allow" +"# + .trim_start_matches('\n') + .to_string() +} + +fn regex_config() -> String { + r#" +[meta] +name = "bench-regex" +version = 1 +description = "Regex-heavy rule set for microbenchmark coverage." + +[scope] +default_exec = "trace" +default_value_action = "allow" + +[[scope.rules]] +selector = 'pkg:regex:^bench_pkg\.services\.api\.module_\d+$' +value_default = "redact" +reason = "Regex match on service modules" +[[scope.rules.value_patterns]] +selector = 'local:regex:^(public|metric)_\w+$' +action = "allow" +[[scope.rules.value_patterns]] +selector = 'local:regex:^(secret|token)_\w+$' +action = "redact" +[[scope.rules.value_patterns]] +selector = 'local:regex:^(password|api|credit|session)_.*$' +action = "redact" + +[[scope.rules]] +selector = 'file:regex:^bench_pkg/jobs/worker/module_\d+\.py$' +exec = "skip" +reason = "Regex skip for worker modules" + +[[scope.rules]] +selector = 'obj:regex:^bench_pkg\.external\.integration_\d+\.integration_op_\d+$' +value_default = "redact" +[[scope.rules.value_patterns]] +selector = 'local:regex:^masked_.*$' +action = "allow" +[[scope.rules.value_patterns]] +selector = 'local:regex:^metric_.*$' +action = "allow" +"# + .trim_start_matches('\n') + .to_string() +} diff --git a/codetracer-python-recorder/codetracer_python_recorder/auto_start.py b/codetracer-python-recorder/codetracer_python_recorder/auto_start.py index cb3ea05..7576387 100644 --- a/codetracer-python-recorder/codetracer_python_recorder/auto_start.py +++ b/codetracer-python-recorder/codetracer_python_recorder/auto_start.py @@ -9,6 +9,7 @@ ENV_TRACE_PATH = "CODETRACER_TRACE" ENV_TRACE_FORMAT = "CODETRACER_FORMAT" +ENV_TRACE_FILTER = "CODETRACER_TRACE_FILTER" log = logging.getLogger(__name__) @@ -18,6 +19,7 @@ def auto_start_from_env() -> None: path = os.getenv(ENV_TRACE_PATH) if not path: return + filter_spec = os.getenv(ENV_TRACE_FILTER) # Delay import to avoid boot-time circular dependencies. from . 
import session @@ -31,13 +33,15 @@ def auto_start_from_env() -> None: fmt = os.getenv(ENV_TRACE_FORMAT, DEFAULT_FORMAT) log.debug( - "codetracer auto-start triggered", extra={"trace_path": path, "format": fmt} + "codetracer auto-start triggered", + extra={"trace_path": path, "format": fmt, "trace_filter": filter_spec}, ) - session.start(path, format=fmt) + session.start(path, format=fmt, trace_filter=filter_spec) __all__: Iterable[str] = ( "ENV_TRACE_FORMAT", "ENV_TRACE_PATH", + "ENV_TRACE_FILTER", "auto_start_from_env", ) diff --git a/codetracer-python-recorder/codetracer_python_recorder/cli.py b/codetracer-python-recorder/codetracer_python_recorder/cli.py index 0131532..c6e4793 100644 --- a/codetracer-python-recorder/codetracer_python_recorder/cli.py +++ b/codetracer-python-recorder/codetracer_python_recorder/cli.py @@ -3,6 +3,7 @@ import argparse import json +import os import runpy import sys from dataclasses import dataclass @@ -11,6 +12,7 @@ from typing import Iterable, Sequence from . import flush, start, stop +from .auto_start import ENV_TRACE_FILTER from .formats import DEFAULT_FORMAT, SUPPORTED_FORMATS, normalize_format @@ -23,6 +25,7 @@ class RecorderCLIConfig: activation_path: Path script: Path script_args: list[str] + trace_filter: tuple[str, ...] policy_overrides: dict[str, object] @@ -65,6 +68,17 @@ def _parse_args(argv: Sequence[str]) -> RecorderCLIConfig: "interpreter enters this file. Defaults to the target script." ), ) + parser.add_argument( + "--trace-filter", + action="append", + help=( + "Path to a trace filter file. Provide multiple times to chain filters; " + "specify multiple paths within a single argument using '::' separators. " + "Filters load after any project default '.codetracer/trace-filter.toml' so " + "later entries override earlier ones; the CODETRACER_TRACE_FILTER " + "environment variable accepts the same syntax for env auto-start." 
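        # Editor's sketch of equivalent invocations (paths are hypothetical examples):
        #   python -m codetracer_python_recorder --trace-dir out \
        #       --trace-filter filters/base.toml --trace-filter filters/ci.toml app.py
        #   python -m codetracer_python_recorder --trace-dir out \
        #       --trace-filter filters/base.toml::filters/ci.toml app.py
        # Both build the chain [base.toml, ci.toml]. With the auto-start hook,
        #   CODETRACER_TRACE=out CODETRACER_TRACE_FILTER=filters/base.toml::filters/ci.toml
        # produces the same chain.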
+ ), + ) parser.add_argument( "--on-recorder-error", choices=["abort", "disable"], @@ -174,6 +188,7 @@ def _parse_args(argv: Sequence[str]) -> RecorderCLIConfig: activation_path=activation_path, script=script_path, script_args=script_args, + trace_filter=tuple(known.trace_filter or ()), policy_overrides=policy, ) @@ -238,6 +253,10 @@ def main(argv: Iterable[str] | None = None) -> int: trace_dir = config.trace_dir script_path = config.script script_args = config.script_args + filter_specs = list(config.trace_filter) + env_filter = os.getenv(ENV_TRACE_FILTER) + if env_filter: + filter_specs.insert(0, env_filter) policy_overrides = config.policy_overrides if config.policy_overrides else None old_argv = sys.argv @@ -248,6 +267,7 @@ def main(argv: Iterable[str] | None = None) -> int: trace_dir, format=config.format, start_on_enter=config.activation_path, + trace_filter=filter_specs or None, policy=policy_overrides, ) except Exception as exc: diff --git a/codetracer-python-recorder/codetracer_python_recorder/session.py b/codetracer-python-recorder/codetracer_python_recorder/session.py index cd43a43..d067954 100644 --- a/codetracer-python-recorder/codetracer_python_recorder/session.py +++ b/codetracer-python-recorder/codetracer_python_recorder/session.py @@ -7,6 +7,7 @@ import contextlib import os +from collections.abc import Sequence from pathlib import Path from typing import Iterator, Mapping, Optional @@ -58,6 +59,7 @@ def start( *, format: str = DEFAULT_FORMAT, start_on_enter: str | Path | None = None, + trace_filter: str | os.PathLike[str] | Sequence[str | os.PathLike[str]] | None = None, policy: Mapping[str, object] | None = None, apply_env_policy: bool = True, ) -> TraceSession: @@ -72,6 +74,10 @@ def start( start_on_enter: Optional path that delays trace activation until the interpreter enters the referenced file. + trace_filter: + Optional filter specification. Accepts a path-like object, an iterable + of path-like objects, or a string containing ``::``-separated paths. + Paths are expanded to absolute locations and must exist. policy: Optional mapping of runtime policy overrides forwarded to :func:`configure_policy` before tracing begins. 
Keys match the policy @@ -102,13 +108,14 @@ def start( trace_path = _validate_trace_path(Path(path)) normalized_format = _coerce_format(format) activation_path = _normalize_activation_path(start_on_enter) + filter_chain = _normalize_trace_filter(trace_filter) if apply_env_policy: _configure_policy_from_env() if policy: _configure_policy(**_coerce_policy_kwargs(policy)) - _start_backend(str(trace_path), normalized_format, activation_path) + _start_backend(str(trace_path), normalized_format, activation_path, filter_chain) session = TraceSession(path=trace_path, format=normalized_format) _active_session = session return session @@ -139,6 +146,7 @@ def trace( path: str | Path, *, format: str = DEFAULT_FORMAT, + trace_filter: str | os.PathLike[str] | Sequence[str | os.PathLike[str]] | None = None, policy: Mapping[str, object] | None = None, apply_env_policy: bool = True, ) -> Iterator[TraceSession]: @@ -146,6 +154,7 @@ def trace( session = start( path, format=format, + trace_filter=trace_filter, policy=policy, apply_env_policy=apply_env_policy, ) @@ -178,6 +187,58 @@ def _normalize_activation_path(value: str | Path | None) -> str | None: return str(Path(value).expanduser()) +def _normalize_trace_filter( + value: str | os.PathLike[str] | Sequence[str | os.PathLike[str]] | None, +) -> list[str] | None: + if value is None: + return None + + segments = _extract_filter_segments(value) + if not segments: + raise ValueError("trace_filter must resolve to at least one path") + + resolved: list[str] = [] + for segment in segments: + target = _resolve_trace_filter_path(segment) + resolved.append(str(target)) + return resolved + + +def _extract_filter_segments( + value: str | os.PathLike[str] | Sequence[str | os.PathLike[str]], +) -> list[str]: + if isinstance(value, (str, os.PathLike)): + return _split_filter_spec(os.fspath(value)) + + if isinstance(value, Sequence): + segments: list[str] = [] + for item in value: + if not isinstance(item, (str, os.PathLike)): + raise TypeError( + "trace_filter sequence entries must be str or os.PathLike" + ) + segments.extend(_split_filter_spec(os.fspath(item))) + return segments + + raise TypeError("trace_filter must be a path, iterable of paths, or None") + + +def _split_filter_spec(value: str) -> list[str]: + parts = [segment.strip() for segment in value.split("::")] + return [segment for segment in parts if segment] + + +def _resolve_trace_filter_path(raw: str) -> Path: + candidate = Path(raw).expanduser() + if not candidate.exists(): + raise FileNotFoundError(f"trace filter '{candidate}' does not exist") + + resolved = candidate.resolve() + if not resolved.is_file(): + raise ValueError(f"trace filter '{resolved}' is not a file") + return resolved + + def _coerce_policy_kwargs(policy: Mapping[str, object]) -> dict[str, object]: normalized: dict[str, object] = {} for key, raw_value in policy.items(): diff --git a/codetracer-python-recorder/resources/trace_filters/builtin_default.toml b/codetracer-python-recorder/resources/trace_filters/builtin_default.toml new file mode 100644 index 0000000..6749850 --- /dev/null +++ b/codetracer-python-recorder/resources/trace_filters/builtin_default.toml @@ -0,0 +1,63 @@ +[meta] +name = "builtin-default" +version = 1 +description = "Skip CPython stdlib internals, redact sensitive identifiers, and drop nothing by default." 
+labels = ["builtin", "default"] + +[scope] +default_exec = "trace" +default_value_action = "allow" + +[[scope.rules]] +selector = 'file:regex:.*[\\/](lib|Lib)[\\/]python\d+\.\d+[\/].*' +exec = "skip" +reason = "Skip Python standard library files" + +[[scope.rules]] +selector = 'file:regex:.*[\\/](lib|Lib)[\\/]python\d+\.\d+[\/]site-packages[\/].*' +exec = "trace" +reason = "Allow third-party packages under site-packages" + +[[scope.rules]] +selector = 'file:regex:.*[\\/]site-packages[\\/]_virtualenv\.py$' +exec = "skip" +reason = "Skip virtualenv bootstrap helper" + +[[scope.rules]] +selector = 'pkg:regex:^(asyncio|selectors|concurrent|importlib|threading|multiprocessing)(\.|$)' +exec = "skip" +reason = "Skip noisy stdlib async/concurrency internals" + +[[scope.rules]] +selector = 'pkg:literal:builtins' +exec = "skip" +reason = "Skip builtins module instrumentation" + +[[scope.rules]] +selector = 'pkg:glob:*' +value_default = "allow" + +[[scope.rules.value_patterns]] +selector = 'local:regex:(?i).*(pass(word)?|passwd|pwd|secret|token|session|cookie|auth|credential|creds|bearer|ssn|credit|card|iban|cvv|cvc|pan|api[_-]?key|private[_-]?key|secret[_-]?key|ssh[_-]?key|jwt|refresh[_-]?token|access[_-]?token).*' +action = "redact" +reason = "Redact sensitive locals" + +[[scope.rules.value_patterns]] +selector = 'global:regex:(?i).*(pass(word)?|passwd|pwd|secret|token|session|cookie|auth|credential|creds|bearer|ssn|credit|card|iban|cvv|cvc|pan|api[_-]?key|private[_-]?key|secret[_-]?key|ssh[_-]?key|jwt|refresh[_-]?token|access[_-]?token).*' +action = "redact" +reason = "Redact sensitive globals" + +[[scope.rules.value_patterns]] +selector = 'arg:regex:(?i).*(pass(word)?|passwd|pwd|secret|token|session|cookie|auth|credential|creds|bearer|ssn|credit|card|iban|cvv|cvc|pan|api[_-]?key|private[_-]?key|secret[_-]?key|ssh[_-]?key|jwt|refresh[_-]?token|access[_-]?token).*' +action = "redact" +reason = "Redact sensitive arguments" + +[[scope.rules.value_patterns]] +selector = 'ret:regex:(?i).*(pass(word)?|passwd|pwd|secret|token|session|cookie|auth|credential|creds|bearer|ssn|credit|card|iban|cvv|cvc|pan|api[_-]?key|private[_-]?key|secret[_-]?key|ssh[_-]?key|jwt|refresh[_-]?token|access[_-]?token).*' +action = "redact" +reason = "Redact sensitive return values" + +[[scope.rules.value_patterns]] +selector = 'attr:regex:(?i).*(pass(word)?|passwd|pwd|secret|token|session|cookie|auth|credential|creds|bearer|ssn|credit|card|iban|cvv|cvc|pan|api[_-]?key|private[_-]?key|secret[_-]?key|ssh[_-]?key|jwt|refresh[_-]?token|access[_-]?token).*' +action = "redact" +reason = "Redact sensitive attributes" diff --git a/codetracer-python-recorder/src/lib.rs b/codetracer-python-recorder/src/lib.rs index 21711c6..8ec5818 100644 --- a/codetracer-python-recorder/src/lib.rs +++ b/codetracer-python-recorder/src/lib.rs @@ -12,6 +12,7 @@ pub mod monitoring; mod policy; mod runtime; mod session; +pub mod trace_filter; pub use crate::code_object::{CodeObjectRegistry, CodeObjectWrapper}; pub use crate::monitoring as tracer; diff --git a/codetracer-python-recorder/src/runtime/mod.rs b/codetracer-python-recorder/src/runtime/mod.rs index 2de61da..c17c1ba 100644 --- a/codetracer-python-recorder/src/runtime/mod.rs +++ b/codetracer-python-recorder/src/runtime/mod.rs @@ -15,7 +15,9 @@ pub use output_paths::TraceOutputPaths; use activation::ActivationController; use frame_inspector::capture_frame; use logging::log_event; -use value_capture::{capture_call_arguments, record_return_value, record_visible_scope}; +use value_capture::{ + 
capture_call_arguments, record_return_value, record_visible_scope, ValueFilterStats, +}; use std::collections::{hash_map::Entry, HashMap, HashSet}; use std::fs; @@ -46,8 +48,9 @@ use crate::policy::{policy_snapshot, RecorderPolicy}; use crate::runtime::io_capture::{ IoCapturePipeline, IoCaptureSettings, IoChunk, IoChunkFlags, IoStream, ScopedMuteIoCapture, }; +use crate::trace_filter::engine::{ExecDecision, ScopeResolution, TraceFilterEngine, ValueKind}; use serde::Serialize; -use serde_json; +use serde_json::{self, json}; use uuid::Uuid; @@ -103,6 +106,9 @@ pub struct RuntimeTracer { trace_id: String, line_snapshots: Arc, io_capture: Option, + trace_filter: Option>, + scope_cache: HashMap>, + filter_stats: FilterStats, } #[derive(Clone, Copy, Debug, PartialEq, Eq)] @@ -128,6 +134,47 @@ impl FailureStage { } } +#[derive(Debug, Default)] +struct FilterStats { + skipped_scopes: u64, + values: ValueFilterStats, +} + +impl FilterStats { + fn record_skip(&mut self) { + self.skipped_scopes += 1; + } + + fn values_mut(&mut self) -> &mut ValueFilterStats { + &mut self.values + } + + fn reset(&mut self) { + self.skipped_scopes = 0; + self.values = ValueFilterStats::default(); + } + + fn summary_json(&self) -> serde_json::Value { + let mut redactions = serde_json::Map::new(); + let mut drops = serde_json::Map::new(); + for kind in ValueKind::ALL { + redactions.insert( + kind.label().to_string(), + json!(self.values.redacted_count(kind)), + ); + drops.insert( + kind.label().to_string(), + json!(self.values.dropped_count(kind)), + ); + } + json!({ + "scopes_skipped": self.skipped_scopes, + "value_redactions": serde_json::Value::Object(redactions), + "value_drops": serde_json::Value::Object(drops), + }) + } +} + // Failure injection helpers are only compiled for integration tests. 
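// Editor's illustration of the `trace_filter` block that `append_filter_metadata`
// writes into `trace_metadata.json` using the summary above. Field names follow the
// code; the path, digest, counts, and per-kind labels shown are assumed examples only:
//
//   "trace_filter": {
//     "filters": [
//       { "path": "/path/to/trace-filter.toml", "sha256": "<digest>", "name": "builtin-default", "version": 1 }
//     ],
//     "stats": {
//       "scopes_skipped": 3,
//       "value_redactions": { "local": 7, "global": 0, "arg": 2, "ret": 1, "attr": 0 },
//       "value_drops":      { "local": 1, "global": 0, "arg": 0, "ret": 0, "attr": 0 }
//     }
//   }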
#[cfg_attr(not(feature = "integration-test"), allow(dead_code))] #[derive(Clone, Copy, Debug, PartialEq, Eq)] @@ -249,6 +296,7 @@ impl RuntimeTracer { args: &[String], format: TraceEventsFileFormat, activation_path: Option<&Path>, + trace_filter: Option>, ) -> Self { let mut writer = NonStreamingTraceWriter::new(program, args); writer.set_format(format); @@ -267,6 +315,9 @@ impl RuntimeTracer { trace_id: Uuid::new_v4().to_string(), line_snapshots: Arc::new(LineSnapshotStore::new()), io_capture: None, + trace_filter, + scope_cache: HashMap::new(), + filter_stats: FilterStats::default(), } } @@ -337,6 +388,44 @@ impl RuntimeTracer { self.mark_event(); } + fn scope_resolution( + &mut self, + py: Python<'_>, + code: &CodeObjectWrapper, + ) -> Option> { + let engine = self.trace_filter.as_ref()?; + let code_id = code.id(); + + if let Some(existing) = self.scope_cache.get(&code_id) { + return Some(existing.clone()); + } + + match engine.resolve(py, code) { + Ok(resolution) => { + if resolution.exec() == ExecDecision::Trace { + self.scope_cache.insert(code_id, Arc::clone(&resolution)); + } else { + self.scope_cache.remove(&code_id); + } + Some(resolution) + } + Err(err) => { + let message = err.to_string(); + let error_code = err.code; + with_error_code(error_code, || { + let _mute = ScopedMuteIoCapture::new(); + log::error!( + "[RuntimeTracer] trace filter resolution failed for code id {}: {}", + code_id, + message + ); + }); + record_dropped_event("filter_resolution_error"); + None + } + } + } + fn build_io_metadata(&self, chunk: &IoChunk) -> String { #[derive(Serialize)] struct IoEventMetadata<'a> { @@ -459,6 +548,7 @@ impl RuntimeTracer { enverr!(ErrorCode::Io, "failed to finalise trace metadata") .with_context("source", err.to_string()) })?; + self.append_filter_metadata()?; TraceWriter::finish_writing_trace_paths(&mut self.writer).map_err(|err| { enverr!(ErrorCode::Io, "failed to finalise trace paths") .with_context("source", err.to_string()) @@ -470,6 +560,68 @@ impl RuntimeTracer { Ok(()) } + fn append_filter_metadata(&self) -> RecorderResult<()> { + let Some(outputs) = &self.output_paths else { + return Ok(()); + }; + let Some(engine) = self.trace_filter.as_ref() else { + return Ok(()); + }; + + let path = outputs.metadata(); + let original = fs::read_to_string(path).map_err(|err| { + enverr!(ErrorCode::Io, "failed to read trace metadata") + .with_context("path", path.display().to_string()) + .with_context("source", err.to_string()) + })?; + + let mut metadata: serde_json::Value = serde_json::from_str(&original).map_err(|err| { + enverr!(ErrorCode::Io, "failed to parse trace metadata JSON") + .with_context("path", path.display().to_string()) + .with_context("source", err.to_string()) + })?; + + let filters = engine.summary(); + let filters_json: Vec = filters + .entries + .iter() + .map(|entry| { + json!({ + "path": entry.path.to_string_lossy(), + "sha256": entry.sha256, + "name": entry.name, + "version": entry.version, + }) + }) + .collect(); + + if let serde_json::Value::Object(ref mut obj) = metadata { + obj.insert( + "trace_filter".to_string(), + json!({ + "filters": filters_json, + "stats": self.filter_stats.summary_json(), + }), + ); + let serialised = serde_json::to_string(&metadata).map_err(|err| { + enverr!(ErrorCode::Io, "failed to serialise trace metadata") + .with_context("path", path.display().to_string()) + .with_context("source", err.to_string()) + })?; + fs::write(path, serialised).map_err(|err| { + enverr!(ErrorCode::Io, "failed to write trace metadata") + 
.with_context("path", path.display().to_string()) + .with_context("source", err.to_string()) + })?; + Ok(()) + } else { + Err( + enverr!(ErrorCode::Io, "trace metadata must be a JSON object") + .with_context("path", path.display().to_string()), + ) + } + } + fn ensure_function_id( &mut self, py: Python<'_>, @@ -497,6 +649,22 @@ impl RuntimeTracer { if self.ignored_code_ids.contains(&code_id) { return ShouldTrace::SkipAndDisable; } + + if let Some(resolution) = self.scope_resolution(py, code) { + match resolution.exec() { + ExecDecision::Skip => { + self.scope_cache.remove(&code_id); + self.filter_stats.record_skip(); + self.ignored_code_ids.insert(code_id); + record_dropped_event("filter_scope_skip"); + return ShouldTrace::SkipAndDisable; + } + ExecDecision::Trace => { + // already cached for future use + } + } + } + let filename = match code.filename(py) { Ok(name) => name, Err(err) => { @@ -505,6 +673,7 @@ impl RuntimeTracer { log::error!("failed to resolve code filename: {err}"); }); record_dropped_event("filename_lookup_failed"); + self.scope_cache.remove(&code_id); self.ignored_code_ids.insert(code_id); return ShouldTrace::SkipAndDisable; } @@ -512,6 +681,7 @@ impl RuntimeTracer { if is_real_filename(filename) { ShouldTrace::Trace } else { + self.scope_cache.remove(&code_id); self.ignored_code_ids.insert(code_id); record_dropped_event("synthetic_filename"); ShouldTrace::SkipAndDisable @@ -558,8 +728,18 @@ impl Tracer for RuntimeTracer { log_event(py, code, "on_py_start", None); + let scope_resolution = self.scope_cache.get(&code.id()).cloned(); + let value_policy = scope_resolution.as_ref().map(|res| res.value_policy()); + let wants_telemetry = value_policy.is_some(); + if let Ok(fid) = self.ensure_function_id(py, code) { - match capture_call_arguments(py, &mut self.writer, code) { + let mut telemetry_holder = if wants_telemetry { + Some(self.filter_stats.values_mut()) + } else { + None + }; + let telemetry = telemetry_holder.as_deref_mut(); + match capture_call_arguments(py, &mut self.writer, code, value_policy, telemetry) { Ok(args) => TraceWriter::register_call(&mut self.writer, fid, args), Err(err) => { let details = err.to_string(); @@ -609,6 +789,10 @@ impl Tracer for RuntimeTracer { self.flush_io_before_step(thread::current().id()); + let scope_resolution = self.scope_cache.get(&code.id()).cloned(); + let value_policy = scope_resolution.as_ref().map(|res| res.value_policy()); + let wants_telemetry = value_policy.is_some(); + let line_value = Line(lineno as i64); let mut recorded_path: Option<(PathId, Line)> = None; @@ -629,7 +813,20 @@ impl Tracer for RuntimeTracer { } let mut recorded: HashSet = HashSet::new(); - record_visible_scope(py, &mut self.writer, &snapshot, &mut recorded); + let mut telemetry_holder = if wants_telemetry { + Some(self.filter_stats.values_mut()) + } else { + None + }; + let telemetry = telemetry_holder.as_deref_mut(); + record_visible_scope( + py, + &mut self.writer, + &snapshot, + &mut recorded, + value_policy, + telemetry, + ); Ok(CallbackOutcome::Continue) } @@ -656,7 +853,26 @@ impl Tracer for RuntimeTracer { self.flush_pending_io(); - record_return_value(py, &mut self.writer, retval); + let scope_resolution = self.scope_cache.get(&code.id()).cloned(); + let value_policy = scope_resolution.as_ref().map(|res| res.value_policy()); + let wants_telemetry = value_policy.is_some(); + let object_name = scope_resolution.as_ref().and_then(|res| res.object_name()); + + let mut telemetry_holder = if wants_telemetry { + Some(self.filter_stats.values_mut()) + } 
else { + None + }; + let telemetry = telemetry_holder.as_deref_mut(); + + record_return_value( + py, + &mut self.writer, + retval, + value_policy, + telemetry, + object_name, + ); self.mark_event(); if self.activation.handle_return_event(code.id()) { let _mute = ScopedMuteIoCapture::new(); @@ -692,6 +908,7 @@ impl Tracer for RuntimeTracer { } } self.ignored_code_ids.clear(); + self.scope_cache.clear(); Ok(()) } @@ -734,7 +951,9 @@ impl Tracer for RuntimeTracer { } self.ignored_code_ids.clear(); self.function_ids.clear(); + self.scope_cache.clear(); self.line_snapshots.clear(); + self.filter_stats.reset(); return Ok(()); } @@ -743,6 +962,8 @@ impl Tracer for RuntimeTracer { self.finalise_writer().map_err(ffi::map_recorder_error)?; self.ignored_code_ids.clear(); self.function_ids.clear(); + self.scope_cache.clear(); + self.filter_stats.reset(); self.line_snapshots.clear(); Ok(()) } @@ -753,6 +974,7 @@ mod tests { use super::*; use crate::monitoring::CallbackOutcome; use crate::policy; + use crate::trace_filter::config::TraceFilterConfig; use pyo3::types::{PyAny, PyCode, PyModule}; use pyo3::wrap_pyfunction; use runtime_tracing::{FullValueRecord, StepRecord, TraceLowLevelEvent, ValueRecord}; @@ -760,6 +982,9 @@ mod tests { use std::cell::Cell; use std::collections::BTreeMap; use std::ffi::CString; + use std::fs; + use std::path::Path; + use std::sync::Arc; use std::thread; thread_local! { @@ -814,7 +1039,8 @@ mod tests { #[test] fn skips_synthetic_filename_events() { Python::with_gil(|py| { - let mut tracer = RuntimeTracer::new("test.py", &[], TraceEventsFileFormat::Json, None); + let mut tracer = + RuntimeTracer::new("test.py", &[], TraceEventsFileFormat::Json, None, None); ensure_test_module(py); let script = format!("{PRELUDE}\nsnapshot()\n"); { @@ -901,6 +1127,7 @@ result = compute()\n" &[], TraceEventsFileFormat::Json, Some(script_path.as_path()), + None, ); { @@ -946,8 +1173,13 @@ result = compute()\n" let script = format!("{PRELUDE}\n\nsnapshot()\n"); std::fs::write(&script_path, &script).expect("write script"); - let mut tracer = - RuntimeTracer::new("snapshot_script.py", &[], TraceEventsFileFormat::Json, None); + let mut tracer = RuntimeTracer::new( + "snapshot_script.py", + &[], + TraceEventsFileFormat::Json, + None, + None, + ); let store = tracer.line_snapshot_store(); { @@ -1021,6 +1253,7 @@ result = compute()\n" &[], TraceEventsFileFormat::Json, None, + None, ); let outputs = TraceOutputPaths::new(tmp.path(), TraceEventsFileFormat::Json); tracer.begin(&outputs, 1).expect("begin tracer"); @@ -1112,6 +1345,7 @@ result = compute()\n" &[], TraceEventsFileFormat::Json, None, + None, ); let outputs = TraceOutputPaths::new(tmp.path(), TraceEventsFileFormat::Json); tracer.begin(&outputs, 1).expect("begin tracer"); @@ -1217,6 +1451,7 @@ result = compute()\n" &[], TraceEventsFileFormat::Json, None, + None, ); let outputs = TraceOutputPaths::new(tmp.path(), TraceEventsFileFormat::Json); tracer.begin(&outputs, 1).expect("begin tracer"); @@ -1438,7 +1673,8 @@ def emit_return(value): fn run_traced_script(body: &str) -> Vec { Python::with_gil(|py| { - let mut tracer = RuntimeTracer::new("test.py", &[], TraceEventsFileFormat::Json, None); + let mut tracer = + RuntimeTracer::new("test.py", &[], TraceEventsFileFormat::Json, None, None); ensure_test_module(py); let tmp = tempfile::tempdir().expect("create temp dir"); let script_path = tmp.path().join("script.py"); @@ -1459,6 +1695,353 @@ def emit_return(value): }) } + fn write_filter(path: &Path, contents: &str) { + fs::write(path, 
contents.trim_start()).expect("write filter"); + } + + #[test] + fn trace_filter_redacts_values() { + Python::with_gil(|py| { + ensure_test_module(py); + + let project = tempfile::tempdir().expect("project dir"); + let project_root = project.path(); + let filters_dir = project_root.join(".codetracer"); + fs::create_dir(&filters_dir).expect("create .codetracer"); + let filter_path = filters_dir.join("filters.toml"); + write_filter( + &filter_path, + r#" + [meta] + name = "redact" + version = 1 + + [scope] + default_exec = "trace" + default_value_action = "allow" + + [[scope.rules]] + selector = "pkg:app.sec" + exec = "trace" + value_default = "allow" + + [[scope.rules.value_patterns]] + selector = "arg:password" + action = "redact" + + [[scope.rules.value_patterns]] + selector = "local:password" + action = "redact" + + [[scope.rules.value_patterns]] + selector = "local:secret" + action = "redact" + + [[scope.rules.value_patterns]] + selector = "global:shared_secret" + action = "redact" + + [[scope.rules.value_patterns]] + selector = "ret:literal:app.sec.sensitive" + action = "redact" + + [[scope.rules.value_patterns]] + selector = "local:internal" + action = "drop" + "#, + ); + let config = TraceFilterConfig::from_paths(&[filter_path]).expect("load filter"); + let engine = Arc::new(TraceFilterEngine::new(config)); + + let app_dir = project_root.join("app"); + fs::create_dir_all(&app_dir).expect("create app dir"); + let script_path = app_dir.join("sec.py"); + let body = r#" +shared_secret = "initial" + +def sensitive(password): + secret = "token" + internal = "hidden" + public = "visible" + globals()['shared_secret'] = password + snapshot() + emit_return(password) + return password + +sensitive("s3cr3t") +"#; + let script = format!("{PRELUDE}\n{body}", PRELUDE = PRELUDE, body = body); + fs::write(&script_path, script).expect("write script"); + + let mut tracer = RuntimeTracer::new( + script_path.to_string_lossy().as_ref(), + &[], + TraceEventsFileFormat::Json, + None, + Some(engine), + ); + + { + let _guard = ScopedTracer::new(&mut tracer); + LAST_OUTCOME.with(|cell| cell.set(None)); + let run_code = format!( + "import runpy, sys\nsys.path.insert(0, r\"{}\")\nrunpy.run_path(r\"{}\")", + project_root.display(), + script_path.display() + ); + let run_code_c = CString::new(run_code).expect("script contains nul byte"); + py.run(run_code_c.as_c_str(), None, None) + .expect("execute filtered script"); + } + + let mut variable_names: Vec = Vec::new(); + for event in &tracer.writer.events { + if let TraceLowLevelEvent::VariableName(name) = event { + variable_names.push(name.clone()); + } + } + assert!( + !variable_names.iter().any(|name| name == "internal"), + "internal variable should not be recorded" + ); + + let password_index = variable_names + .iter() + .position(|name| name == "password") + .expect("password variable recorded"); + let password_value = tracer + .writer + .events + .iter() + .find_map(|event| match event { + TraceLowLevelEvent::Value(record) if record.variable_id.0 == password_index => { + Some(record.value.clone()) + } + _ => None, + }) + .expect("password value recorded"); + match password_value { + ValueRecord::Error { ref msg, .. 
} => assert_eq!(msg, ""), + ref other => panic!("expected password argument redacted, got {other:?}"), + } + + let snapshots = collect_snapshots(&tracer.writer.events); + let snapshot = find_snapshot_with_vars( + &snapshots, + &["secret", "public", "shared_secret", "password"], + ); + assert_var( + snapshot, + "secret", + SimpleValue::Raw("".to_string()), + ); + assert_var( + snapshot, + "public", + SimpleValue::String("visible".to_string()), + ); + assert_var( + snapshot, + "shared_secret", + SimpleValue::Raw("".to_string()), + ); + assert_var( + snapshot, + "password", + SimpleValue::Raw("".to_string()), + ); + assert_no_variable(&snapshots, "internal"); + + let return_record = tracer + .writer + .events + .iter() + .find_map(|event| match event { + TraceLowLevelEvent::Return(record) => Some(record.clone()), + _ => None, + }) + .expect("return record"); + + match return_record.return_value { + ValueRecord::Error { ref msg, .. } => assert_eq!(msg, ""), + ref other => panic!("expected redacted return value, got {other:?}"), + } + }); + } + + #[test] + fn trace_filter_metadata_includes_summary() { + Python::with_gil(|py| { + reset_policy(py); + ensure_test_module(py); + + let project = tempfile::tempdir().expect("project dir"); + let project_root = project.path(); + let filters_dir = project_root.join(".codetracer"); + fs::create_dir(&filters_dir).expect("create .codetracer"); + let filter_path = filters_dir.join("filters.toml"); + write_filter( + &filter_path, + r#" + [meta] + name = "redact" + version = 1 + + [scope] + default_exec = "trace" + default_value_action = "allow" + + [[scope.rules]] + selector = "pkg:app.sec" + exec = "trace" + value_default = "allow" + + [[scope.rules.value_patterns]] + selector = "arg:password" + action = "redact" + + [[scope.rules.value_patterns]] + selector = "local:password" + action = "redact" + + [[scope.rules.value_patterns]] + selector = "local:secret" + action = "redact" + + [[scope.rules.value_patterns]] + selector = "global:shared_secret" + action = "redact" + + [[scope.rules.value_patterns]] + selector = "ret:literal:app.sec.sensitive" + action = "redact" + + [[scope.rules.value_patterns]] + selector = "local:internal" + action = "drop" + "#, + ); + let config = TraceFilterConfig::from_paths(&[filter_path]).expect("load filter"); + let engine = Arc::new(TraceFilterEngine::new(config)); + + let app_dir = project_root.join("app"); + fs::create_dir_all(&app_dir).expect("create app dir"); + let script_path = app_dir.join("sec.py"); + let body = r#" +shared_secret = "initial" + +def sensitive(password): + secret = "token" + internal = "hidden" + public = "visible" + globals()['shared_secret'] = password + snapshot() + emit_return(password) + return password + +sensitive("s3cr3t") +"#; + let script = format!("{PRELUDE}\n{body}", PRELUDE = PRELUDE, body = body); + fs::write(&script_path, script).expect("write script"); + + let outputs_dir = tempfile::tempdir().expect("outputs dir"); + let outputs = TraceOutputPaths::new(outputs_dir.path(), TraceEventsFileFormat::Json); + + let program = script_path.to_string_lossy().into_owned(); + let mut tracer = RuntimeTracer::new( + &program, + &[], + TraceEventsFileFormat::Json, + None, + Some(engine), + ); + tracer.begin(&outputs, 1).expect("begin tracer"); + + { + let _guard = ScopedTracer::new(&mut tracer); + LAST_OUTCOME.with(|cell| cell.set(None)); + let run_code = format!( + "import runpy, sys\nsys.path.insert(0, r\"{}\")\nrunpy.run_path(r\"{}\")", + project_root.display(), + script_path.display() + ); + let 
run_code_c = CString::new(run_code).expect("script contains nul byte"); + py.run(run_code_c.as_c_str(), None, None) + .expect("execute script"); + } + + tracer.finish(py).expect("finish tracer"); + + let metadata_str = fs::read_to_string(outputs.metadata()).expect("read metadata"); + let metadata: serde_json::Value = + serde_json::from_str(&metadata_str).expect("parse metadata"); + let trace_filter = metadata + .get("trace_filter") + .and_then(|value| value.as_object()) + .expect("trace_filter metadata"); + + let filters = trace_filter + .get("filters") + .and_then(|value| value.as_array()) + .expect("filters array"); + assert_eq!(filters.len(), 1); + let filter_entry = filters[0].as_object().expect("filter entry"); + assert_eq!( + filter_entry.get("name").and_then(|v| v.as_str()), + Some("redact") + ); + + let stats = trace_filter + .get("stats") + .and_then(|value| value.as_object()) + .expect("stats object"); + assert_eq!( + stats.get("scopes_skipped").and_then(|v| v.as_u64()), + Some(0) + ); + let value_redactions = stats + .get("value_redactions") + .and_then(|value| value.as_object()) + .expect("value_redactions object"); + assert_eq!( + value_redactions.get("argument").and_then(|v| v.as_u64()), + Some(0) + ); + // Argument values currently surface through local snapshots; once call-record redaction wiring lands this count should rise above zero. + assert_eq!( + value_redactions.get("local").and_then(|v| v.as_u64()), + Some(2) + ); + assert_eq!( + value_redactions.get("global").and_then(|v| v.as_u64()), + Some(1) + ); + assert_eq!( + value_redactions.get("return").and_then(|v| v.as_u64()), + Some(1) + ); + assert_eq!( + value_redactions.get("attribute").and_then(|v| v.as_u64()), + Some(0) + ); + let value_drops = stats + .get("value_drops") + .and_then(|value| value.as_object()) + .expect("value_drops object"); + assert_eq!( + value_drops.get("argument").and_then(|v| v.as_u64()), + Some(0) + ); + assert_eq!(value_drops.get("local").and_then(|v| v.as_u64()), Some(1)); + assert_eq!(value_drops.get("global").and_then(|v| v.as_u64()), Some(0)); + assert_eq!(value_drops.get("return").and_then(|v| v.as_u64()), Some(0)); + assert_eq!( + value_drops.get("attribute").and_then(|v| v.as_u64()), + Some(0) + ); + }); + } + fn assert_var(snapshot: &Snapshot, name: &str, expected: SimpleValue) { let actual = snapshot .vars @@ -1925,6 +2508,7 @@ snapshot() &[], TraceEventsFileFormat::Json, None, + None, ); tracer.begin(&outputs, 1).expect("begin tracer"); @@ -1958,6 +2542,7 @@ snapshot() &[], TraceEventsFileFormat::Json, None, + None, ); tracer.begin(&outputs, 1).expect("begin tracer"); tracer.mark_failure(); @@ -2000,6 +2585,7 @@ snapshot() &[], TraceEventsFileFormat::Json, None, + None, ); tracer.begin(&outputs, 1).expect("begin tracer"); tracer.mark_failure(); diff --git a/codetracer-python-recorder/src/runtime/value_capture.rs b/codetracer-python-recorder/src/runtime/value_capture.rs index f0950f2..caf084a 100644 --- a/codetracer-python-recorder/src/runtime/value_capture.rs +++ b/codetracer-python-recorder/src/runtime/value_capture.rs @@ -6,12 +6,109 @@ use pyo3::prelude::*; use pyo3::types::PyString; use recorder_errors::{usage, ErrorCode}; -use runtime_tracing::{FullValueRecord, NonStreamingTraceWriter, TraceWriter}; +use runtime_tracing::{ + FullValueRecord, NonStreamingTraceWriter, TraceWriter, TypeKind, ValueRecord, +}; use crate::code_object::CodeObjectWrapper; use crate::ffi; +use crate::logging::record_dropped_event; use crate::runtime::frame_inspector::{capture_frame, FrameSnapshot}; 
use crate::runtime::value_encoder::encode_value; +use crate::trace_filter::config::ValueAction; +use crate::trace_filter::engine::{ValueKind, ValuePolicy}; + +const REDACTED_SENTINEL: &str = ""; + +const VALUE_KIND_COUNT: usize = 5; + +#[derive(Debug, Default, Clone)] +pub struct ValueFilterStats { + redacted: [u64; VALUE_KIND_COUNT], + dropped: [u64; VALUE_KIND_COUNT], +} + +impl ValueFilterStats { + pub fn record_redaction(&mut self, kind: ValueKind) { + self.redacted[kind.index()] += 1; + } + + pub fn record_drop(&mut self, kind: ValueKind) { + self.dropped[kind.index()] += 1; + } + + pub fn redacted_count(&self, kind: ValueKind) -> u64 { + self.redacted[kind.index()] + } + + pub fn dropped_count(&self, kind: ValueKind) -> u64 { + self.dropped[kind.index()] + } +} + +fn redacted_value(writer: &mut NonStreamingTraceWriter) -> ValueRecord { + let ty = TraceWriter::ensure_type_id(writer, TypeKind::Raw, "Redacted"); + ValueRecord::Error { + msg: REDACTED_SENTINEL.to_string(), + type_id: ty, + } +} + +fn record_redaction(kind: ValueKind, candidate: &str, telemetry: Option<&mut ValueFilterStats>) { + if let Some(stats) = telemetry { + stats.record_redaction(kind); + } + let metric = match kind { + ValueKind::Arg => "filter_value_redacted.arg", + ValueKind::Local => "filter_value_redacted.local", + ValueKind::Global => "filter_value_redacted.global", + ValueKind::Return => "filter_value_redacted.return", + ValueKind::Attr => "filter_value_redacted.attr", + }; + record_dropped_event(metric); + log::debug!("[RuntimeTracer] redacted {} '{}'", kind.label(), candidate); +} + +fn record_drop(kind: ValueKind, candidate: &str, telemetry: Option<&mut ValueFilterStats>) { + if let Some(stats) = telemetry { + stats.record_drop(kind); + } + let metric = match kind { + ValueKind::Arg => "filter_value_dropped.arg", + ValueKind::Local => "filter_value_dropped.local", + ValueKind::Global => "filter_value_dropped.global", + ValueKind::Return => "filter_value_dropped.return", + ValueKind::Attr => "filter_value_dropped.attr", + }; + record_dropped_event(metric); + log::debug!( + "[RuntimeTracer] dropped {} '{}' from trace", + kind.label(), + candidate + ); +} + +fn encode_with_policy<'py>( + py: Python<'py>, + writer: &mut NonStreamingTraceWriter, + value: &Bound<'py, PyAny>, + policy: Option<&ValuePolicy>, + kind: ValueKind, + candidate: &str, + telemetry: Option<&mut ValueFilterStats>, +) -> Option { + match policy.map(|p| p.decide(kind, candidate)) { + Some(ValueAction::Redact) => { + record_redaction(kind, candidate, telemetry); + Some(redacted_value(writer)) + } + Some(ValueAction::Drop) => { + record_drop(kind, candidate, telemetry); + None + } + _ => Some(encode_value(py, writer, value)), + } +} /// Capture Python call arguments for the provided code object and encode them /// using the runtime tracer writer. 
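The branching in `encode_with_policy` is easiest to see in isolation: `Redact` substitutes a placeholder, `Drop` suppresses the value, and anything else records it normally. The sketch below restates that flow with plain types so it runs standalone; the placeholder string and the helper name `apply_policy` are illustrative, while the real code works with the crate's `REDACTED_SENTINEL` and `runtime_tracing::ValueRecord` values.

```rust
// Standalone restatement of the allow/redact/drop branching used by
// encode_with_policy; nothing here depends on the crate itself.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ValueAction {
    Allow,
    Redact,
    Drop,
}

fn apply_policy(raw: &str, action: Option<ValueAction>) -> Option<String> {
    match action {
        // Redacted values keep a placeholder so the slot stays visible in the trace.
        Some(ValueAction::Redact) => Some("<redacted>".to_string()),
        // Dropped values are omitted from the trace entirely.
        Some(ValueAction::Drop) => None,
        // No policy for this scope, or an explicit Allow: record the value as-is.
        _ => Some(raw.to_string()),
    }
}

fn main() {
    assert_eq!(apply_policy("s3cr3t", Some(ValueAction::Redact)).as_deref(), Some("<redacted>"));
    assert_eq!(apply_policy("hidden", Some(ValueAction::Drop)), None);
    assert_eq!(apply_policy("visible", None).as_deref(), Some("visible"));
    assert_eq!(apply_policy("ok", Some(ValueAction::Allow)).as_deref(), Some("ok"));
}
```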
@@ -19,6 +116,8 @@ pub fn capture_call_arguments<'py>( py: Python<'py>, writer: &mut NonStreamingTraceWriter, code: &CodeObjectWrapper, + policy: Option<&ValuePolicy>, + mut telemetry: Option<&mut ValueFilterStats>, ) -> PyResult> { let snapshot = capture_frame(py, code)?; let locals = snapshot.locals(); @@ -45,16 +144,34 @@ pub fn capture_call_arguments<'py>( "missing positional arg '{name}'" )) })?; - let encoded = encode_value(py, writer, &value); - args.push(TraceWriter::arg(writer, name, encoded)); + if let Some(encoded) = encode_with_policy( + py, + writer, + &value, + policy, + ValueKind::Arg, + name, + telemetry.as_deref_mut(), + ) { + args.push(TraceWriter::arg(writer, name, encoded)); + } idx += 1; } if (flags & CO_VARARGS) != 0 && idx < varnames.len() { let name = &varnames[idx]; if let Some(value) = locals.get_item(name)? { - let encoded = encode_value(py, writer, &value); - args.push(TraceWriter::arg(writer, name, encoded)); + if let Some(encoded) = encode_with_policy( + py, + writer, + &value, + policy, + ValueKind::Arg, + name, + telemetry.as_deref_mut(), + ) { + args.push(TraceWriter::arg(writer, name, encoded)); + } } idx += 1; } @@ -67,16 +184,34 @@ pub fn capture_call_arguments<'py>( "missing kw-only arg '{name}'" )) })?; - let encoded = encode_value(py, writer, &value); - args.push(TraceWriter::arg(writer, name, encoded)); + if let Some(encoded) = encode_with_policy( + py, + writer, + &value, + policy, + ValueKind::Arg, + name, + telemetry.as_deref_mut(), + ) { + args.push(TraceWriter::arg(writer, name, encoded)); + } } idx = idx.saturating_add(kwonly_take); if (flags & CO_VARKEYWORDS) != 0 && idx < varnames.len() { let name = &varnames[idx]; if let Some(value) = locals.get_item(name)? { - let encoded = encode_value(py, writer, &value); - args.push(TraceWriter::arg(writer, name, encoded)); + if let Some(encoded) = encode_with_policy( + py, + writer, + &value, + policy, + ValueKind::Arg, + name, + telemetry.as_deref_mut(), + ) { + args.push(TraceWriter::arg(writer, name, encoded)); + } } } @@ -89,6 +224,8 @@ pub fn record_visible_scope( writer: &mut NonStreamingTraceWriter, snapshot: &FrameSnapshot<'_>, recorded: &mut HashSet, + policy: Option<&ValuePolicy>, + mut telemetry: Option<&mut ValueFilterStats>, ) { for (key, value) in snapshot.locals().iter() { let name = match key.downcast::() { @@ -98,9 +235,19 @@ pub fn record_visible_scope( }, Err(_) => continue, }; - let encoded = encode_value(py, writer, &value); - TraceWriter::register_variable_with_full_value(writer, &name, encoded); - recorded.insert(name); + let encoded = encode_with_policy( + py, + writer, + &value, + policy, + ValueKind::Local, + &name, + telemetry.as_deref_mut(), + ); + if let Some(encoded) = encoded { + TraceWriter::register_variable_with_full_value(writer, &name, encoded); + recorded.insert(name); + } } if snapshot.locals_is_globals() { @@ -119,9 +266,19 @@ pub fn record_visible_scope( if name == "__builtins__" || recorded.contains(name) { continue; } - let encoded = encode_value(py, writer, &value); - TraceWriter::register_variable_with_full_value(writer, name, encoded); - recorded.insert(name.to_owned()); + let encoded = encode_with_policy( + py, + writer, + &value, + policy, + ValueKind::Global, + name, + telemetry.as_deref_mut(), + ); + if let Some(encoded) = encoded { + TraceWriter::register_variable_with_full_value(writer, name, encoded); + recorded.insert(name.to_owned()); + } } } } @@ -131,7 +288,21 @@ pub fn record_return_value( py: Python<'_>, writer: &mut NonStreamingTraceWriter, 
value: &Bound<'_, PyAny>, + policy: Option<&ValuePolicy>, + mut telemetry: Option<&mut ValueFilterStats>, + candidate: Option<&str>, ) { - let encoded = encode_value(py, writer, value); - TraceWriter::register_return(writer, encoded); + let name = candidate.unwrap_or(""); + let encoded = encode_with_policy( + py, + writer, + value, + policy, + ValueKind::Return, + name, + telemetry.as_deref_mut(), + ); + if let Some(encoded) = encoded { + TraceWriter::register_return(writer, encoded); + } } diff --git a/codetracer-python-recorder/src/session.rs b/codetracer-python-recorder/src/session.rs index 21270c9..b57ceea 100644 --- a/codetracer-python-recorder/src/session.rs +++ b/codetracer-python-recorder/src/session.rs @@ -19,8 +19,13 @@ use bootstrap::TraceSessionBootstrap; static ACTIVE: AtomicBool = AtomicBool::new(false); /// Start tracing using sys.monitoring and runtime_tracing writer. -#[pyfunction] -pub fn start_tracing(path: &str, format: &str, activation_path: Option<&str>) -> PyResult<()> { +#[pyfunction(signature = (path, format, activation_path=None, trace_filter=None))] +pub fn start_tracing( + path: &str, + format: &str, + activation_path: Option<&str>, + trace_filter: Option>, +) -> PyResult<()> { ffi::wrap_pyfunction("start_tracing", || { // Ensure logging is ready before any tracer logs might be emitted. // Default our crate to warnings-only so tests stay quiet unless explicitly enabled. @@ -33,6 +38,8 @@ pub fn start_tracing(path: &str, format: &str, activation_path: Option<&str>) -> } let activation_path = activation_path.map(PathBuf::from); + let filter_paths: Option> = + trace_filter.map(|items| items.into_iter().map(PathBuf::from).collect()); Python::with_gil(|py| { let bootstrap = TraceSessionBootstrap::prepare( @@ -40,6 +47,7 @@ pub fn start_tracing(path: &str, format: &str, activation_path: Option<&str>) -> Path::new(path), format, activation_path.as_deref(), + filter_paths.as_ref().map(|paths| paths.as_slice()), ) .map_err(ffi::map_recorder_error)?; @@ -51,6 +59,7 @@ pub fn start_tracing(path: &str, format: &str, activation_path: Option<&str>) -> bootstrap.args(), bootstrap.format(), bootstrap.activation_path(), + bootstrap.trace_filter(), ); tracer.begin(&outputs, 1)?; tracer.install_io_capture(py, &policy)?; diff --git a/codetracer-python-recorder/src/session/bootstrap.rs b/codetracer-python-recorder/src/session/bootstrap.rs index 2e0e53d..a4697f1 100644 --- a/codetracer-python-recorder/src/session/bootstrap.rs +++ b/codetracer-python-recorder/src/session/bootstrap.rs @@ -1,13 +1,18 @@ //! Helpers for preparing a tracing session before installing the runtime tracer. +use std::env; +use std::fmt; use std::fs; use std::path::{Path, PathBuf}; +use std::sync::Arc; use pyo3::prelude::*; use recorder_errors::{enverr, usage, ErrorCode}; use runtime_tracing::TraceEventsFileFormat; use crate::errors::Result; +use crate::trace_filter::config::TraceFilterConfig; +use crate::trace_filter::engine::TraceFilterEngine; /// Basic metadata about the currently running Python program. #[derive(Debug, Clone)] @@ -17,12 +22,31 @@ pub struct ProgramMetadata { } /// Collected data required to start a tracing session. 
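Bootstrap always layers the built-in filter ahead of any project or explicit files. Below is a rough sketch of that composition as it would sit inside this crate, matching the `load_trace_filter` helper further down; the on-disk paths and the function name `compose_example` are illustrative assumptions, not part of the crate's API.

```rust
// Sketch of the layering load_trace_filter performs below: the built-in inline
// default is ingested first, then the discovered project filter, then any
// explicit overrides supplied via --trace-filter. Paths are illustrative only.
use std::path::PathBuf;
use std::sync::Arc;

use crate::errors::Result;
use crate::trace_filter::config::TraceFilterConfig;
use crate::trace_filter::engine::TraceFilterEngine;

const BUILTIN: &str = include_str!("../../resources/trace_filters/builtin_default.toml");

fn compose_example() -> Result<Arc<TraceFilterEngine>> {
    let chain = vec![
        PathBuf::from(".codetracer/trace-filter.toml"), // discovered project default
        PathBuf::from("override-filter.toml"),          // explicit --trace-filter path
    ];
    let config =
        TraceFilterConfig::from_inline_and_paths(&[("builtin-default", BUILTIN)], &chain)?;
    Ok(Arc::new(TraceFilterEngine::new(config)))
}
```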
-#[derive(Debug, Clone)] +#[derive(Clone)] pub struct TraceSessionBootstrap { trace_directory: PathBuf, format: TraceEventsFileFormat, activation_path: Option, metadata: ProgramMetadata, + trace_filter: Option>, +} + +const TRACE_FILTER_DIR: &str = ".codetracer"; +const TRACE_FILTER_FILE: &str = "trace-filter.toml"; +const BUILTIN_FILTER_LABEL: &str = "builtin-default"; +const BUILTIN_TRACE_FILTER: &str = + include_str!("../../resources/trace_filters/builtin_default.toml"); + +impl fmt::Debug for TraceSessionBootstrap { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + f.debug_struct("TraceSessionBootstrap") + .field("trace_directory", &self.trace_directory) + .field("format", &self.format) + .field("activation_path", &self.activation_path) + .field("metadata", &self.metadata) + .field("trace_filter", &self.trace_filter.is_some()) + .finish() + } } impl TraceSessionBootstrap { @@ -33,6 +57,7 @@ impl TraceSessionBootstrap { trace_directory: &Path, format: &str, activation_path: Option<&Path>, + explicit_trace_filters: Option<&[PathBuf]>, ) -> Result { ensure_trace_directory(trace_directory)?; let format = resolve_trace_format(format)?; @@ -40,11 +65,13 @@ impl TraceSessionBootstrap { enverr!(ErrorCode::Io, "failed to collect program metadata") .with_context("details", err.to_string()) })?; + let trace_filter = load_trace_filter(explicit_trace_filters, &metadata.program)?; Ok(Self { trace_directory: trace_directory.to_path_buf(), format, activation_path: activation_path.map(|p| p.to_path_buf()), metadata, + trace_filter, }) } @@ -67,6 +94,10 @@ impl TraceSessionBootstrap { pub fn args(&self) -> &[String] { &self.metadata.args } + + pub fn trace_filter(&self) -> Option> { + self.trace_filter.as_ref().map(Arc::clone) + } } /// Ensure the requested trace directory exists and is writable. @@ -131,11 +162,81 @@ pub fn collect_program_metadata(py: Python<'_>) -> PyResult { Ok(ProgramMetadata { program, args }) } +fn load_trace_filter( + explicit: Option<&[PathBuf]>, + program: &str, +) -> Result>> { + let mut chain: Vec = Vec::new(); + + if let Some(default) = discover_default_trace_filter(program)? 
{ + chain.push(default); + } + + if let Some(paths) = explicit { + chain.extend(paths.iter().cloned()); + } + + let config = TraceFilterConfig::from_inline_and_paths( + &[(BUILTIN_FILTER_LABEL, BUILTIN_TRACE_FILTER)], + &chain, + )?; + Ok(Some(Arc::new(TraceFilterEngine::new(config)))) +} + +fn discover_default_trace_filter(program: &str) -> Result> { + let start_dir = resolve_program_directory(program)?; + let mut current: Option<&Path> = Some(start_dir.as_path()); + while let Some(dir) = current { + let candidate = dir.join(TRACE_FILTER_DIR).join(TRACE_FILTER_FILE); + if matches!(fs::metadata(&candidate), Ok(metadata) if metadata.is_file()) { + return Ok(Some(candidate)); + } + current = dir.parent(); + } + Ok(None) +} + +fn resolve_program_directory(program: &str) -> Result { + let trimmed = program.trim(); + if trimmed.is_empty() || trimmed == "" { + return current_directory(); + } + + let path = Path::new(trimmed); + if path.is_absolute() { + if path.is_dir() { + return Ok(path.to_path_buf()); + } + if let Some(parent) = path.parent() { + return Ok(parent.to_path_buf()); + } + return current_directory(); + } + + let cwd = current_directory()?; + let joined = cwd.join(path); + if joined.is_dir() { + return Ok(joined); + } + if let Some(parent) = joined.parent() { + return Ok(parent.to_path_buf()); + } + Ok(cwd) +} + +fn current_directory() -> Result { + env::current_dir().map_err(|err| { + enverr!(ErrorCode::Io, "failed to resolve current directory") + .with_context("io", err.to_string()) + }) +} + #[cfg(test)] mod tests { use super::*; use pyo3::types::PyList; use recorder_errors::ErrorCode; + use std::path::PathBuf; use tempfile::tempdir; #[test] @@ -232,6 +333,7 @@ mod tests { trace_dir.as_path(), "json", Some(activation.as_path()), + None, ); sys.setattr("argv", original.bind(py)) .expect("restore argv"); @@ -246,4 +348,172 @@ mod tests { assert_eq!(bootstrap.args(), expected_args.as_slice()); }); } + + #[test] + fn prepare_bootstrap_applies_builtin_trace_filter() { + Python::with_gil(|py| { + let tmp = tempdir().expect("tempdir"); + let trace_dir = tmp.path().join("out"); + let script_path = tmp.path().join("app.py"); + std::fs::write(&script_path, "print('hello')\n").expect("write script"); + + let sys = py.import("sys").expect("import sys"); + let original = sys.getattr("argv").expect("argv").unbind(); + let argv = PyList::new(py, [script_path.to_str().expect("utf8 path")]).expect("argv"); + sys.setattr("argv", argv).expect("set argv"); + + let result = + TraceSessionBootstrap::prepare(py, trace_dir.as_path(), "json", None, None); + sys.setattr("argv", original.bind(py)) + .expect("restore argv"); + + let bootstrap = result.expect("bootstrap"); + let engine = bootstrap.trace_filter().expect("builtin filter"); + let summary = engine.summary(); + assert_eq!(summary.entries.len(), 1); + assert_eq!( + summary.entries[0].path, + PathBuf::from("") + ); + }); + } + + #[test] + fn prepare_bootstrap_loads_default_trace_filter() { + Python::with_gil(|py| { + let project = tempdir().expect("project"); + let project_root = project.path(); + let trace_dir = project_root.join("out"); + + let app_dir = project_root.join("src"); + std::fs::create_dir_all(&app_dir).expect("create src dir"); + let script_path = app_dir.join("main.py"); + std::fs::write(&script_path, "print('run')\n").expect("write script"); + + let filters_dir = project_root.join(TRACE_FILTER_DIR); + std::fs::create_dir(&filters_dir).expect("create filter dir"); + let filter_path = filters_dir.join(TRACE_FILTER_FILE); + 
std::fs::write( + &filter_path, + r#" + [meta] + name = "default" + version = 1 + + [scope] + default_exec = "trace" + default_value_action = "allow" + + [[scope.rules]] + selector = "pkg:src" + exec = "trace" + value_default = "allow" + "#, + ) + .expect("write filter"); + + let sys = py.import("sys").expect("import sys"); + let original = sys.getattr("argv").expect("argv").unbind(); + let argv = PyList::new(py, [script_path.to_str().expect("utf8 path")]).expect("argv"); + sys.setattr("argv", argv).expect("set argv"); + + let result = + TraceSessionBootstrap::prepare(py, trace_dir.as_path(), "json", None, None); + sys.setattr("argv", original.bind(py)) + .expect("restore argv"); + + let bootstrap = result.expect("bootstrap"); + let engine = bootstrap.trace_filter().expect("filter engine"); + let summary = engine.summary(); + assert_eq!(summary.entries.len(), 2); + assert_eq!( + summary.entries[0].path, + PathBuf::from("") + ); + assert_eq!(summary.entries[1].path, filter_path); + }); + } + + #[test] + fn prepare_bootstrap_merges_explicit_trace_filters() { + Python::with_gil(|py| { + let project = tempdir().expect("project"); + let project_root = project.path(); + let trace_dir = project_root.join("out"); + + let app_dir = project_root.join("src"); + std::fs::create_dir_all(&app_dir).expect("create src dir"); + let script_path = app_dir.join("main.py"); + std::fs::write(&script_path, "print('run')\n").expect("write script"); + + let filters_dir = project_root.join(TRACE_FILTER_DIR); + std::fs::create_dir(&filters_dir).expect("create filter dir"); + let default_filter_path = filters_dir.join(TRACE_FILTER_FILE); + std::fs::write( + &default_filter_path, + r#" + [meta] + name = "default" + version = 1 + + [scope] + default_exec = "trace" + default_value_action = "allow" + + [[scope.rules]] + selector = "pkg:src" + exec = "trace" + value_default = "allow" + "#, + ) + .expect("write default filter"); + + let override_filter_path = project_root.join("override-filter.toml"); + std::fs::write( + &override_filter_path, + r#" + [meta] + name = "override" + version = 1 + + [scope] + default_exec = "trace" + default_value_action = "allow" + + [[scope.rules]] + selector = "pkg:src.special" + exec = "skip" + value_default = "redact" + "#, + ) + .expect("write override filter"); + + let sys = py.import("sys").expect("import sys"); + let original = sys.getattr("argv").expect("argv").unbind(); + let argv = PyList::new(py, [script_path.to_str().expect("utf8 path")]).expect("argv"); + sys.setattr("argv", argv).expect("set argv"); + + let explicit = vec![override_filter_path.clone()]; + let result = TraceSessionBootstrap::prepare( + py, + trace_dir.as_path(), + "json", + None, + Some(explicit.as_slice()), + ); + sys.setattr("argv", original.bind(py)) + .expect("restore argv"); + + let bootstrap = result.expect("bootstrap"); + let engine = bootstrap.trace_filter().expect("filter engine"); + let summary = engine.summary(); + assert_eq!(summary.entries.len(), 3); + assert_eq!( + summary.entries[0].path, + PathBuf::from("") + ); + assert_eq!(summary.entries[1].path, default_filter_path); + assert_eq!(summary.entries[2].path, override_filter_path); + }); + } } diff --git a/codetracer-python-recorder/src/trace_filter/config.rs b/codetracer-python-recorder/src/trace_filter/config.rs new file mode 100644 index 0000000..f1507ed --- /dev/null +++ b/codetracer-python-recorder/src/trace_filter/config.rs @@ -0,0 +1,1005 @@ +//! Filter configuration loader that parses TOML files, resolves inheritance, and +//! 
prepares flattened scope/value rules for the runtime engine. +//! +//! The implementation follows the schema defined in +//! `design-docs/US0028 - Configurable Python trace filters.md`. + +use crate::trace_filter::selector::{MatchType, Selector, SelectorKind}; +use recorder_errors::{usage, ErrorCode, RecorderResult}; +use serde::Deserialize; +use sha2::{Digest, Sha256}; +use std::collections::HashSet; +use std::fs; +use std::path::{Component, Path, PathBuf}; + +/// Scope-level execution directive. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum ExecDirective { + Trace, + Skip, +} + +impl ExecDirective { + fn parse(token: &str) -> Option { + match token { + "trace" => Some(ExecDirective::Trace), + "skip" => Some(ExecDirective::Skip), + _ => None, + } + } +} + +/// Value-level capture directive. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum ValueAction { + Allow, + Redact, + Drop, +} + +impl ValueAction { + fn parse(token: &str) -> Option { + match token { + "allow" => Some(ValueAction::Allow), + "redact" => Some(ValueAction::Redact), + "drop" => Some(ValueAction::Drop), + // Backwards compatibility for deprecated `deny`. + "deny" => Some(ValueAction::Redact), + _ => None, + } + } +} + +/// IO streams that can be captured in addition to scope/value rules. +#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)] +pub enum IoStream { + Stdout, + Stderr, + Stdin, + Files, +} + +impl IoStream { + fn parse(token: &str) -> Option { + match token { + "stdout" => Some(IoStream::Stdout), + "stderr" => Some(IoStream::Stderr), + "stdin" => Some(IoStream::Stdin), + "files" => Some(IoStream::Files), + _ => None, + } + } +} + +/// Metadata describing the source filter file. +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct FilterMeta { + pub name: String, + pub version: u32, + pub description: Option, + pub labels: Vec, +} + +/// IO capture configuration. +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct IoConfig { + pub capture: bool, + pub streams: Vec, +} + +impl Default for IoConfig { + fn default() -> Self { + IoConfig { + capture: false, + streams: Vec::new(), + } + } +} + +/// Value pattern applied within a scope rule. +#[derive(Debug, Clone)] +pub struct ValuePattern { + pub selector: Selector, + pub action: ValueAction, + pub reason: Option, + pub source_id: usize, +} + +/// Scope rule constructed from the flattened configuration chain. +#[derive(Debug, Clone)] +pub struct ScopeRule { + pub selector: Selector, + pub exec: Option, + pub value_default: Option, + pub value_patterns: Vec, + pub reason: Option, + pub source_id: usize, +} + +/// Source information for each filter file participating in the chain. +#[derive(Debug, Clone)] +pub struct FilterSource { + pub path: PathBuf, + pub sha256: String, + pub project_root: PathBuf, + pub meta: FilterMeta, +} + +/// Summary used for embedding in trace metadata. +#[derive(Debug, Clone)] +pub struct FilterSummary { + pub entries: Vec, +} + +/// Single entry in the filter summary. +#[derive(Debug, Clone)] +pub struct FilterSummaryEntry { + pub path: PathBuf, + pub sha256: String, + pub name: String, + pub version: u32, +} + +/// Fully resolved filter configuration ready for runtime consumption. +#[derive(Debug, Clone)] +pub struct TraceFilterConfig { + default_exec: ExecDirective, + default_value_action: ValueAction, + io: IoConfig, + rules: Vec, + sources: Vec, +} + +impl TraceFilterConfig { + /// Load and compose filters from the provided paths. 
+ pub fn from_paths(paths: &[PathBuf]) -> RecorderResult { + Self::from_inline_and_paths(&[], paths) + } + + /// Load and compose filters from inline TOML sources combined with paths. + /// + /// Inline entries are ingested first in the order provided, followed by files. + pub fn from_inline_and_paths( + inline: &[(&str, &str)], + paths: &[PathBuf], + ) -> RecorderResult { + if inline.is_empty() && paths.is_empty() { + return Err(usage!( + ErrorCode::InvalidPolicyValue, + "no trace filter sources supplied" + )); + } + + let mut aggregator = ConfigAggregator::default(); + for (label, contents) in inline { + aggregator.ingest_inline(label, contents)?; + } + for path in paths { + aggregator.ingest_file(path)?; + } + + aggregator.finish() + } + + /// Default execution directive applied before scope rules run. + pub fn default_exec(&self) -> ExecDirective { + self.default_exec + } + + /// Default value action applied before rule-specific overrides. + pub fn default_value_action(&self) -> ValueAction { + self.default_value_action + } + + /// IO capture configuration associated with the composed filter chain. + pub fn io(&self) -> &IoConfig { + &self.io + } + + /// Flattened scope rules in execution order. + pub fn rules(&self) -> &[ScopeRule] { + &self.rules + } + + /// Source filter metadata used for embedding in trace output. + pub fn sources(&self) -> &[FilterSource] { + &self.sources + } + + /// Helper producing a summary used by metadata writers. + pub fn summary(&self) -> FilterSummary { + let entries = self + .sources + .iter() + .map(|source| FilterSummaryEntry { + path: source.path.clone(), + sha256: source.sha256.clone(), + name: source.meta.name.clone(), + version: source.meta.version, + }) + .collect(); + FilterSummary { entries } + } +} + +#[derive(Default)] +struct ConfigAggregator { + default_exec: Option, + default_value_action: Option, + io: Option, + rules: Vec, + sources: Vec, +} + +impl ConfigAggregator { + fn ingest_file(&mut self, path: &Path) -> RecorderResult<()> { + let contents = fs::read_to_string(path).map_err(|err| { + usage!( + ErrorCode::InvalidPolicyValue, + "failed to read trace filter '{}': {}", + path.display(), + err + ) + })?; + + self.ingest_source(path, &contents) + } + + fn ingest_inline(&mut self, label: &str, contents: &str) -> RecorderResult<()> { + let pseudo_path = PathBuf::from(format!("")); + self.ingest_source(&pseudo_path, contents) + } + + fn ingest_source(&mut self, path: &Path, contents: &str) -> RecorderResult<()> { + let checksum = calculate_sha256(contents); + let raw: RawFilterFile = toml::from_str(contents).map_err(|err| { + usage!( + ErrorCode::InvalidPolicyValue, + "failed to parse trace filter '{}': {}", + path.display(), + err + ) + })?; + + let project_root = detect_project_root(path); + let source_index = self.sources.len(); + self.sources.push(FilterSource { + path: path.to_path_buf(), + sha256: checksum, + project_root: project_root.clone(), + meta: parse_meta(&raw.meta, path)?, + }); + + let defaults = resolve_defaults( + &raw.scope, + path, + self.default_exec, + self.default_value_action, + )?; + if let Some(exec) = defaults.exec { + self.default_exec = Some(exec); + } + if let Some(value_action) = defaults.value_action { + self.default_value_action = Some(value_action); + } + + if let Some(io) = parse_io(raw.io.as_ref(), path)? 
{ + self.io = Some(io); + } + + let rules = parse_rules( + raw.scope.rules.as_deref().unwrap_or_default(), + path, + &project_root, + source_index, + )?; + self.rules.extend(rules); + + Ok(()) + } + + fn finish(self) -> RecorderResult { + let default_exec = self.default_exec.ok_or_else(|| { + usage!( + ErrorCode::InvalidPolicyValue, + "composed filters never set 'scope.default_exec'" + ) + })?; + let default_value_action = self.default_value_action.ok_or_else(|| { + usage!( + ErrorCode::InvalidPolicyValue, + "composed filters never set 'scope.default_value_action'" + ) + })?; + + let io = self.io.unwrap_or_default(); + + Ok(TraceFilterConfig { + default_exec, + default_value_action, + io, + rules: self.rules, + sources: self.sources, + }) + } +} + +fn calculate_sha256(contents: &str) -> String { + let mut hasher = Sha256::new(); + hasher.update(contents.as_bytes()); + let digest = hasher.finalize(); + format!("{:x}", digest) +} + +fn detect_project_root(path: &Path) -> PathBuf { + let mut current = path.parent(); + while let Some(dir) = current { + if dir.file_name().and_then(|name| name.to_str()) == Some(".codetracer") { + return dir + .parent() + .map(Path::to_path_buf) + .unwrap_or_else(|| dir.to_path_buf()); + } + current = dir.parent(); + } + path.parent() + .map(Path::to_path_buf) + .unwrap_or_else(|| PathBuf::from(".")) +} + +fn parse_meta(raw: &RawMeta, path: &Path) -> RecorderResult { + if raw.name.trim().is_empty() { + return Err(usage!( + ErrorCode::InvalidPolicyValue, + "'meta.name' cannot be empty in '{}'", + path.display() + )); + } + if raw.version < 1 { + return Err(usage!( + ErrorCode::InvalidPolicyValue, + "'meta.version' must be >= 1 in '{}'", + path.display() + )); + } + + let mut labels = Vec::new(); + let mut seen = HashSet::new(); + for label in &raw.labels { + if seen.insert(label) { + labels.push(label.clone()); + } + } + + Ok(FilterMeta { + name: raw.name.clone(), + version: raw.version as u32, + description: raw.description.clone(), + labels, + }) +} + +struct ResolvedDefaults { + exec: Option, + value_action: Option, +} + +fn resolve_defaults( + scope: &RawScope, + path: &Path, + current_exec: Option, + current_value_action: Option, +) -> RecorderResult { + let exec = parse_default_exec(&scope.default_exec, path, current_exec)?; + let value_action = + parse_default_value_action(&scope.default_value_action, path, current_value_action)?; + Ok(ResolvedDefaults { exec, value_action }) +} + +fn parse_default_exec( + token: &str, + path: &Path, + current_exec: Option, +) -> RecorderResult> { + match token { + "inherit" => { + if current_exec.is_none() { + return Err(usage!( + ErrorCode::InvalidPolicyValue, + "'scope.default_exec' in '{}' cannot inherit without a previous filter", + path.display() + )); + } + Ok(None) + } + _ => ExecDirective::parse(token) + .ok_or_else(|| { + usage!( + ErrorCode::InvalidPolicyValue, + "unsupported value '{}' for 'scope.default_exec' in '{}'", + token, + path.display() + ) + }) + .map(Some), + } +} + +fn parse_default_value_action( + token: &str, + path: &Path, + current_value_action: Option, +) -> RecorderResult> { + match token { + "inherit" => { + if current_value_action.is_none() { + return Err(usage!( + ErrorCode::InvalidPolicyValue, + "'scope.default_value_action' in '{}' cannot inherit without a previous filter", + path.display() + )); + } + Ok(None) + } + _ => ValueAction::parse(token) + .ok_or_else(|| { + usage!( + ErrorCode::InvalidPolicyValue, + "unsupported value '{}' for 'scope.default_value_action' in '{}'", + token, + 
path.display() + ) + }) + .map(Some), + } +} + +fn parse_io(raw: Option<&RawIo>, path: &Path) -> RecorderResult> { + let Some(raw) = raw else { + return Ok(None); + }; + + let capture = raw.capture.unwrap_or(false); + let streams = match raw.streams.as_ref() { + Some(values) => { + let mut parsed = Vec::new(); + let mut seen = HashSet::new(); + for value in values { + let stream = IoStream::parse(value).ok_or_else(|| { + usage!( + ErrorCode::InvalidPolicyValue, + "unsupported IO stream '{}' in '{}'", + value, + path.display() + ) + })?; + if seen.insert(stream) { + parsed.push(stream); + } + } + parsed + } + None => Vec::new(), + }; + + if capture && streams.is_empty() { + return Err(usage!( + ErrorCode::InvalidPolicyValue, + "'io.streams' must be provided when 'io.capture = true' in '{}'", + path.display() + )); + } + if let Some(modes) = raw.modes.as_ref() { + if !modes.is_empty() { + return Err(usage!( + ErrorCode::InvalidPolicyValue, + "'io.modes' is reserved and must be empty in '{}'", + path.display() + )); + } + } + + Ok(Some(IoConfig { capture, streams })) +} + +fn parse_rules( + raw_rules: &[RawScopeRule], + path: &Path, + project_root: &Path, + source_id: usize, +) -> RecorderResult> { + let mut rules = Vec::new(); + for (idx, raw_rule) in raw_rules.iter().enumerate() { + let location = format!("{} scope.rules[{}]", path.display(), idx); + let selector = + Selector::parse(&raw_rule.selector, &SCOPE_SELECTOR_KINDS).map_err(|err| { + usage!( + ErrorCode::InvalidPolicyValue, + "invalid scope selector in {}: {}", + location, + err + ) + })?; + let selector = normalize_scope_selector(selector, project_root, &location)?; + + let exec = match raw_rule.exec.as_deref() { + None | Some("inherit") => None, + Some(value) => Some(ExecDirective::parse(value).ok_or_else(|| { + usage!( + ErrorCode::InvalidPolicyValue, + "unsupported value '{}' for 'exec' in {}", + value, + location + ) + })?), + }; + + let value_default = match raw_rule.value_default.as_deref() { + None | Some("inherit") => None, + Some(value) => Some(ValueAction::parse(value).ok_or_else(|| { + usage!( + ErrorCode::InvalidPolicyValue, + "unsupported value '{}' for 'value_default' in {}", + value, + location + ) + })?), + }; + + let mut value_patterns = Vec::new(); + if let Some(patterns) = raw_rule.value_patterns.as_ref() { + for (pidx, pattern) in patterns.iter().enumerate() { + let pattern_location = format!("{} value_patterns[{}]", location, pidx); + let selector = + Selector::parse(&pattern.selector, &VALUE_SELECTOR_KINDS).map_err(|err| { + usage!( + ErrorCode::InvalidPolicyValue, + "invalid value selector in {}: {}", + pattern_location, + err + ) + })?; + let action = ValueAction::parse(pattern.action.as_str()).ok_or_else(|| { + usage!( + ErrorCode::InvalidPolicyValue, + "unsupported value '{}' for 'action' in {}", + pattern.action, + pattern_location + ) + })?; + + value_patterns.push(ValuePattern { + selector, + action, + reason: pattern.reason.clone(), + source_id, + }); + } + } + + rules.push(ScopeRule { + selector, + exec, + value_default, + value_patterns, + reason: raw_rule.reason.clone(), + source_id, + }); + } + Ok(rules) +} + +fn normalize_scope_selector( + selector: Selector, + project_root: &Path, + location: &str, +) -> RecorderResult { + if selector.kind() != SelectorKind::File { + return Ok(selector); + } + + let normalized_pattern = normalize_file_pattern( + selector.pattern(), + selector.match_type(), + project_root, + location, + )?; + if normalized_pattern == selector.pattern() { + return Ok(selector); + 
} + + let raw = match selector.match_type() { + MatchType::Glob => format!("file:{}", normalized_pattern), + MatchType::Literal => format!("file:literal:{}", normalized_pattern), + MatchType::Regex => format!("file:regex:{}", normalized_pattern), + }; + Selector::parse(&raw, &SCOPE_SELECTOR_KINDS).map_err(|err| { + usage!( + ErrorCode::InvalidPolicyValue, + "failed to normalise file selector in {}: {}", + location, + err + ) + }) +} + +fn normalize_file_pattern( + pattern: &str, + match_type: MatchType, + project_root: &Path, + location: &str, +) -> RecorderResult { + match match_type { + MatchType::Literal => normalize_literal_path(pattern, project_root, location), + MatchType::Glob => normalize_glob_pattern(pattern, project_root), + MatchType::Regex => Ok(pattern.to_string()), + } +} + +fn normalize_literal_path( + pattern: &str, + project_root: &Path, + location: &str, +) -> RecorderResult { + let path = Path::new(pattern); + let relative = if path.is_absolute() { + path.strip_prefix(project_root) + .map_err(|_| { + usage!( + ErrorCode::InvalidPolicyValue, + "file selector '{}' in {} must reside within project root '{}'", + pattern, + location, + project_root.display() + ) + })? + .to_path_buf() + } else { + path.to_path_buf() + }; + + let normalized = normalize_components(&relative, pattern, location)?; + Ok(pathbuf_to_posix(&normalized)) +} + +fn normalize_components(path: &Path, raw: &str, location: &str) -> RecorderResult { + let mut normalised = PathBuf::new(); + for component in path.components() { + match component { + Component::Prefix(_) | Component::RootDir => continue, + Component::CurDir => {} + Component::ParentDir => { + if !normalised.pop() { + return Err(usage!( + ErrorCode::InvalidPolicyValue, + "file selector '{}' in {} escapes the project root", + raw, + location + )); + } + } + Component::Normal(part) => normalised.push(part), + } + } + Ok(normalised) +} + +fn normalize_glob_pattern(pattern: &str, project_root: &Path) -> RecorderResult { + let mut replaced = pattern.replace('\\', "/"); + while replaced.starts_with("./") { + replaced = replaced[2..].to_string(); + } + + let trimmed = replaced.trim_start_matches('/'); + let root = pathbuf_to_posix(project_root); + if root.is_empty() { + return Ok(trimmed.to_string()); + } + + let root_with_slash = format!("{}/", root); + if trimmed.starts_with(&root_with_slash) { + Ok(trimmed[root_with_slash.len()..].to_string()) + } else if trimmed == root { + Ok(String::new()) + } else { + Ok(trimmed.to_string()) + } +} + +fn pathbuf_to_posix(path: &Path) -> String { + let mut parts = Vec::new(); + for component in path.components() { + if let Component::Normal(part) = component { + parts.push(part.to_string_lossy()); + } + } + parts.join("/") +} + +#[derive(Debug, Deserialize)] +#[serde(deny_unknown_fields)] +struct RawFilterFile { + meta: RawMeta, + #[serde(default)] + io: Option, + scope: RawScope, +} + +#[derive(Debug, Deserialize)] +#[serde(deny_unknown_fields)] +struct RawMeta { + name: String, + version: u32, + #[serde(default)] + description: Option, + #[serde(default)] + labels: Vec, +} + +#[derive(Debug, Deserialize)] +#[serde(deny_unknown_fields)] +struct RawIo { + #[serde(default)] + capture: Option, + #[serde(default)] + streams: Option>, + #[serde(default)] + modes: Option>, +} + +#[derive(Debug, Deserialize)] +#[serde(deny_unknown_fields)] +struct RawScope { + default_exec: String, + default_value_action: String, + #[serde(default)] + rules: Option>, +} + +#[derive(Debug, Deserialize)] +#[serde(deny_unknown_fields)] 
+struct RawScopeRule { + selector: String, + #[serde(default)] + exec: Option, + #[serde(default)] + value_default: Option, + #[serde(default)] + reason: Option, + #[serde(default)] + value_patterns: Option>, +} + +#[derive(Debug, Deserialize)] +#[serde(deny_unknown_fields)] +struct RawValuePattern { + selector: String, + action: String, + #[serde(default)] + reason: Option, +} + +const SCOPE_SELECTOR_KINDS: [SelectorKind; 3] = [ + SelectorKind::Package, + SelectorKind::File, + SelectorKind::Object, +]; + +const VALUE_SELECTOR_KINDS: [SelectorKind; 5] = [ + SelectorKind::Local, + SelectorKind::Global, + SelectorKind::Arg, + SelectorKind::Return, + SelectorKind::Attr, +]; + +#[cfg(test)] +mod tests { + use super::*; + use std::io::Write; + use std::path::PathBuf; + use tempfile::tempdir; + + #[test] + fn composes_filters_and_resolves_inheritance() -> RecorderResult<()> { + let temp = tempdir().expect("temp dir"); + let project_root = temp.path(); + let filters_dir = project_root.join(".codetracer"); + fs::create_dir(&filters_dir).unwrap(); + fs::create_dir_all(project_root.join("app")).unwrap(); + + let base_path = filters_dir.join("base.toml"); + write_filter( + &base_path, + r#" + [meta] + name = "base" + version = 1 + + [scope] + default_exec = "trace" + default_value_action = "redact" + + [[scope.rules]] + selector = "pkg:my_app.core.*" + exec = "trace" + value_default = "allow" + + [[scope.rules.value_patterns]] + selector = "local:literal:user" + action = "allow" + + [io] + capture = false + "#, + ); + + let overrides_path = filters_dir.join("overrides.toml"); + let literal_path = project_root + .join(".codetracer") + .join("..") + .join("app") + .join("__init__.py"); + let overrides = format!( + r#" + [meta] + name = "overrides" + version = 1 + + [scope] + default_exec = "inherit" + default_value_action = "inherit" + + [[scope.rules]] + selector = "file:literal:{literal}" + exec = "inherit" + value_default = "redact" + + [[scope.rules.value_patterns]] + selector = "arg:password" + action = "redact" + + [io] + capture = true + streams = ["stdout", "stderr"] + "#, + literal = literal_path.to_string_lossy() + ); + write_filter(&overrides_path, overrides.as_str()); + + let config = TraceFilterConfig::from_paths(&[base_path.clone(), overrides_path.clone()])?; + + assert_eq!(config.default_exec(), ExecDirective::Trace); + assert_eq!(config.default_value_action(), ValueAction::Redact); + assert_eq!(config.io().capture, true); + assert_eq!( + config.io().streams, + vec![IoStream::Stdout, IoStream::Stderr] + ); + + assert_eq!(config.rules().len(), 2); + let file_rule = &config.rules()[1]; + assert!(matches!(file_rule.exec, None)); + assert_eq!(file_rule.value_default, Some(ValueAction::Redact)); + assert_eq!(file_rule.value_patterns.len(), 1); + assert_eq!(file_rule.value_patterns[0].selector.raw(), "arg:password"); + assert_eq!( + file_rule.selector.pattern(), + "app/__init__.py", + "absolute literal path normalised relative to project root" + ); + + let summary = config.summary(); + assert_eq!(summary.entries.len(), 2); + assert_eq!(summary.entries[0].name, "base"); + assert_eq!(summary.entries[1].name, "overrides"); + + Ok(()) + } + + #[test] + fn from_inline_and_paths_parses_inline_only() -> RecorderResult<()> { + let inline_filter = r#" + [meta] + name = "inline" + version = 1 + + [scope] + default_exec = "trace" + default_value_action = "allow" + "#; + + let config = TraceFilterConfig::from_inline_and_paths(&[("inline", inline_filter)], &[])?; + + assert_eq!(config.default_exec(), 
ExecDirective::Trace); + assert_eq!(config.default_value_action(), ValueAction::Allow); + assert_eq!(config.rules().len(), 0); + let summary = config.summary(); + assert_eq!(summary.entries.len(), 1); + assert_eq!(summary.entries[0].name, "inline"); + assert_eq!(summary.entries[0].path, PathBuf::from("")); + Ok(()) + } + + #[test] + fn rejects_unknown_keys() { + let temp = tempdir().expect("temp dir"); + let project_root = temp.path(); + let filters_dir = project_root.join(".codetracer"); + fs::create_dir(&filters_dir).unwrap(); + let path = filters_dir.join("invalid.toml"); + write_filter( + &path, + r#" + [meta] + name = "invalid" + version = 1 + extra = "nope" + + [scope] + default_exec = "trace" + default_value_action = "redact" + "#, + ); + + let err = TraceFilterConfig::from_paths(&[path]).expect_err("expected failure"); + assert_eq!(err.code, ErrorCode::InvalidPolicyValue); + } + + #[test] + fn rejects_inherit_without_base() { + let temp = tempdir().expect("temp dir"); + let project_root = temp.path(); + let filters_dir = project_root.join(".codetracer"); + fs::create_dir(&filters_dir).unwrap(); + let path = filters_dir.join("empty.toml"); + write_filter( + &path, + r#" + [meta] + name = "empty" + version = 1 + + [scope] + default_exec = "inherit" + default_value_action = "inherit" + "#, + ); + + let err = TraceFilterConfig::from_paths(&[path]).expect_err("expected failure"); + assert_eq!(err.code, ErrorCode::InvalidPolicyValue); + } + + #[test] + fn rejects_invalid_stream_value() { + let temp = tempdir().expect("temp dir"); + let project_root = temp.path(); + let filters_dir = project_root.join(".codetracer"); + fs::create_dir(&filters_dir).unwrap(); + let path = filters_dir.join("io.toml"); + write_filter( + &path, + r#" + [meta] + name = "io" + version = 1 + + [scope] + default_exec = "trace" + default_value_action = "allow" + + [io] + capture = true + streams = ["stdout", "invalid"] + "#, + ); + + let err = TraceFilterConfig::from_paths(&[path]).expect_err("expected failure"); + assert_eq!(err.code, ErrorCode::InvalidPolicyValue); + } + + fn write_filter(path: &Path, contents: &str) { + let mut file = fs::File::create(path).unwrap(); + file.write_all(contents.trim_start().as_bytes()).unwrap(); + } +} diff --git a/codetracer-python-recorder/src/trace_filter/engine.rs b/codetracer-python-recorder/src/trace_filter/engine.rs new file mode 100644 index 0000000..50666ff --- /dev/null +++ b/codetracer-python-recorder/src/trace_filter/engine.rs @@ -0,0 +1,716 @@ +//! Runtime filter engine evaluating scope selectors and value policies for code objects. +//! +//! The engine consumes a [`TraceFilterConfig`](crate::trace_filter::config::TraceFilterConfig) +//! and caches per-code-object resolutions so the hot tracing callbacks only pay a fast lookup. + +use crate::code_object::CodeObjectWrapper; +use crate::trace_filter::config::{ + ExecDirective, FilterSource, FilterSummary, ScopeRule, TraceFilterConfig, ValueAction, + ValuePattern, +}; +use crate::trace_filter::selector::{Selector, SelectorKind}; +use dashmap::DashMap; +use pyo3::{prelude::*, PyErr}; +use recorder_errors::{target, ErrorCode, RecorderResult}; +use std::borrow::Cow; +use std::path::{Component, Path, PathBuf}; +use std::sync::Arc; + +/// Final execution decision emitted by the engine. 
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum ExecDecision {
+    Trace,
+    Skip,
+}
+
+impl From<ExecDirective> for ExecDecision {
+    fn from(value: ExecDirective) -> Self {
+        match value {
+            ExecDirective::Trace => ExecDecision::Trace,
+            ExecDirective::Skip => ExecDecision::Skip,
+        }
+    }
+}
+
+/// Kind of value inspected while deciding redaction.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum ValueKind {
+    Local,
+    Global,
+    Arg,
+    Return,
+    Attr,
+}
+
+impl ValueKind {
+    fn selector_kind(self) -> SelectorKind {
+        match self {
+            ValueKind::Local => SelectorKind::Local,
+            ValueKind::Global => SelectorKind::Global,
+            ValueKind::Arg => SelectorKind::Arg,
+            ValueKind::Return => SelectorKind::Return,
+            ValueKind::Attr => SelectorKind::Attr,
+        }
+    }
+
+    pub fn label(self) -> &'static str {
+        match self {
+            ValueKind::Local => "local",
+            ValueKind::Global => "global",
+            ValueKind::Arg => "argument",
+            ValueKind::Return => "return",
+            ValueKind::Attr => "attribute",
+        }
+    }
+
+    pub fn index(self) -> usize {
+        match self {
+            ValueKind::Local => 0,
+            ValueKind::Global => 1,
+            ValueKind::Arg => 2,
+            ValueKind::Return => 3,
+            ValueKind::Attr => 4,
+        }
+    }
+
+    pub const ALL: [ValueKind; 5] = [
+        ValueKind::Local,
+        ValueKind::Global,
+        ValueKind::Arg,
+        ValueKind::Return,
+        ValueKind::Attr,
+    ];
+}
+
+/// Value redaction strategy resolved for a scope.
+#[derive(Debug, Clone)]
+pub struct ValuePolicy {
+    default_action: ValueAction,
+    patterns: Arc<[CompiledValuePattern]>,
+}
+
+impl ValuePolicy {
+    fn new(default_action: ValueAction, patterns: Arc<[CompiledValuePattern]>) -> Self {
+        ValuePolicy {
+            default_action,
+            patterns,
+        }
+    }
+
+    /// Default action applied when no selector matches.
+    pub fn default_action(&self) -> ValueAction {
+        self.default_action
+    }
+
+    /// Evaluate the policy for a value of `kind` with identifier `name`.
+    pub fn decide(&self, kind: ValueKind, name: &str) -> ValueAction {
+        let selector_kind = kind.selector_kind();
+        for pattern in self.patterns.iter() {
+            if pattern.selector.kind() == selector_kind && pattern.selector.matches(name) {
+                return pattern.action;
+            }
+        }
+        self.default_action
+    }
+
+    /// Expose rule metadata for debugging or telemetry.
+    pub fn patterns(&self) -> &[CompiledValuePattern] {
+        &self.patterns
+    }
+}
+
+/// Resolution emitted by the engine for a given code object.
+#[derive(Debug, Clone)]
+pub struct ScopeResolution {
+    exec: ExecDecision,
+    value_policy: Arc<ValuePolicy>,
+    module_name: Option<String>,
+    object_name: Option<String>,
+    relative_path: Option<String>,
+    absolute_path: Option<String>,
+    matched_rule_index: Option<usize>,
+    matched_rule_source: Option<usize>,
+    matched_rule_reason: Option<String>,
+}
+
+impl ScopeResolution {
+    /// Execution decision (trace vs skip).
+    pub fn exec(&self) -> ExecDecision {
+        self.exec
+    }
+
+    /// Value redaction policy derived for this scope.
+    pub fn value_policy(&self) -> &ValuePolicy {
+        &self.value_policy
+    }
+
+    /// Module name derived from the code object's filename (if any).
+    pub fn module_name(&self) -> Option<&str> {
+        self.module_name.as_deref()
+    }
+
+    /// Qualified object identifier (module + qualname when available).
+    pub fn object_name(&self) -> Option<&str> {
+        self.object_name.as_deref()
+    }
+
+    /// Project-relative POSIX path for the file containing the code object.
+    pub fn relative_path(&self) -> Option<&str> {
+        self.relative_path.as_deref()
+    }
+
+    /// Absolute POSIX path for the file containing the code object.
+    pub fn absolute_path(&self) -> Option<&str> {
+        self.absolute_path.as_deref()
+    }
+
+    /// Index within the flattened rule list that last matched this code object.
+    pub fn matched_rule_index(&self) -> Option<usize> {
+        self.matched_rule_index
+    }
+
+    /// Source identifier (filter file index) of the last matched rule.
+    pub fn matched_rule_source(&self) -> Option<usize> {
+        self.matched_rule_source
+    }
+
+    /// Reason string attached to the last matched rule, if present.
+    pub fn matched_rule_reason(&self) -> Option<&str> {
+        self.matched_rule_reason.as_deref()
+    }
+}
+
+/// Runtime engine wrapping a compiled filter configuration.
+pub struct TraceFilterEngine {
+    config: Arc<TraceFilterConfig>,
+    default_exec: ExecDecision,
+    default_value_action: ValueAction,
+    rules: Arc<[CompiledScopeRule]>,
+    cache: DashMap<usize, Arc<ScopeResolution>>,
+}
+
+impl TraceFilterEngine {
+    /// Construct the engine from a fully resolved configuration.
+    pub fn new(config: TraceFilterConfig) -> Self {
+        let default_exec = config.default_exec().into();
+        let default_value_action = config.default_value_action();
+        let rules = compile_rules(config.rules());
+
+        TraceFilterEngine {
+            config: Arc::new(config),
+            default_exec,
+            default_value_action,
+            rules,
+            cache: DashMap::new(),
+        }
+    }
+
+    /// Resolve the scope decision for `code`, reusing cached results when available.
+    pub fn resolve<'py>(
+        &self,
+        py: Python<'py>,
+        code: &CodeObjectWrapper,
+    ) -> RecorderResult<Arc<ScopeResolution>> {
+        if let Some(entry) = self.cache.get(&code.id()) {
+            return Ok(entry.clone());
+        }
+
+        let resolution = Arc::new(self.resolve_uncached(py, code)?);
+        let entry = self
+            .cache
+            .entry(code.id())
+            .or_insert_with(|| resolution.clone());
+        Ok(entry.clone())
+    }
+
+    fn resolve_uncached(
+        &self,
+        py: Python<'_>,
+        code: &CodeObjectWrapper,
+    ) -> RecorderResult<ScopeResolution> {
+        let filename = code
+            .filename(py)
+            .map_err(|err| py_attr_error("co_filename", err))?;
+        let qualname = code
+            .qualname(py)
+            .map_err(|err| py_attr_error("co_qualname", err))?;
+
+        let context = ScopeContext::derive(filename, qualname, self.config.sources());
+
+        let mut exec = self.default_exec;
+        let mut value_default = self.default_value_action;
+        let mut patterns: Arc<[CompiledValuePattern]> = Arc::from(Vec::new());
+        let mut matched_rule_index = None;
+        let mut matched_rule_source = context.source_id;
+        let mut matched_rule_reason = None;
+
+        for rule in self.rules.iter() {
+            if rule.matches(&context) {
+                if let Some(rule_exec) = rule.exec {
+                    exec = rule_exec;
+                }
+                if let Some(rule_value) = rule.value_default {
+                    value_default = rule_value;
+                }
+                if !rule.value_patterns.is_empty() {
+                    patterns = rule.value_patterns.clone();
+                }
+                matched_rule_index = Some(rule.index);
+                matched_rule_source = Some(rule.source_id);
+                matched_rule_reason = rule.reason.clone();
+            }
+        }
+
+        let value_policy = Arc::new(ValuePolicy::new(value_default, patterns));
+
+        Ok(ScopeResolution {
+            exec,
+            value_policy,
+            module_name: context.module_name,
+            object_name: context.object_name,
+            relative_path: context.relative_path,
+            absolute_path: context.absolute_path,
+            matched_rule_index,
+            matched_rule_source,
+            matched_rule_reason,
+        })
+    }
+
+    /// Return a summary of the filters that produced this engine.
+    pub fn summary(&self) -> FilterSummary {
+        self.config.summary()
+    }
+}
+
+#[derive(Debug, Clone)]
+struct CompiledScopeRule {
+    selector: Selector,
+    exec: Option<ExecDecision>,
+    value_default: Option<ValueAction>,
+    value_patterns: Arc<[CompiledValuePattern]>,
+    reason: Option<String>,
+    source_id: usize,
+    index: usize,
+}
+
+impl CompiledScopeRule {
+    fn matches(&self, context: &ScopeContext) -> bool {
+        match self.selector.kind() {
+            SelectorKind::Package => context
+                .module_name
+                .as_deref()
+                .map(|module| self.selector.matches(module))
+                .unwrap_or(false),
+            SelectorKind::File => context
+                .relative_path
+                .as_deref()
+                .map(|path| self.selector.matches(path))
+                .or_else(|| {
+                    context
+                        .absolute_path
+                        .as_deref()
+                        .map(|path| self.selector.matches(path))
+                })
+                .unwrap_or(false),
+            SelectorKind::Object => context
+                .object_name
+                .as_deref()
+                .map(|object| self.selector.matches(object))
+                .unwrap_or(false),
+            _ => false,
+        }
+    }
+}
+
+/// A compiled value selector and its associated action.
+#[derive(Debug, Clone)]
+pub struct CompiledValuePattern {
+    pub selector: Selector,
+    pub action: ValueAction,
+    pub reason: Option<String>,
+    pub source_id: usize,
+}
+
+fn compile_rules(rules: &[ScopeRule]) -> Arc<[CompiledScopeRule]> {
+    let compiled: Vec<CompiledScopeRule> = rules
+        .iter()
+        .enumerate()
+        .map(|(index, rule)| CompiledScopeRule {
+            selector: rule.selector.clone(),
+            exec: rule.exec.map(ExecDecision::from),
+            value_default: rule.value_default,
+            value_patterns: compile_value_patterns(&rule.value_patterns),
+            reason: rule.reason.clone(),
+            source_id: rule.source_id,
+            index,
+        })
+        .collect();
+    compiled.into()
+}
+
+fn compile_value_patterns(patterns: &[ValuePattern]) -> Arc<[CompiledValuePattern]> {
+    let compiled: Vec<CompiledValuePattern> = patterns
+        .iter()
+        .map(|pattern| CompiledValuePattern {
+            selector: pattern.selector.clone(),
+            action: pattern.action,
+            reason: pattern.reason.clone(),
+            source_id: pattern.source_id,
+        })
+        .collect();
+    compiled.into()
+}
+
+#[derive(Debug)]
+struct ScopeContext {
+    module_name: Option<String>,
+    object_name: Option<String>,
+    relative_path: Option<String>,
+    absolute_path: Option<String>,
+    source_id: Option<usize>,
+}
+
+impl ScopeContext {
+    fn derive(filename: &str, qualname: &str, sources: &[FilterSource]) -> Self {
+        let absolute_path = normalise_to_posix(Path::new(filename));
+
+        let mut best_match: Option<(usize, PathBuf)> = None;
+        for (idx, source) in sources.iter().enumerate() {
+            if let Ok(stripped) = Path::new(filename).strip_prefix(&source.project_root) {
+                let stripped_owned = stripped.to_path_buf();
+                let better = match &best_match {
+                    Some((_, current)) => {
+                        stripped_owned.components().count() >= current.components().count()
+                    }
+                    None => true,
+                };
+                if better {
+                    best_match = Some((idx, stripped_owned));
+                }
+            }
+        }
+
+        let (source_id, relative_path) = best_match.map_or((None, None), |(idx, rel)| {
+            let normalized = normalise_relative(rel);
+            if normalized.is_empty() {
+                (Some(idx), None)
+            } else {
+                (Some(idx), Some(normalized))
+            }
+        });
+
+        let module_name = relative_path
+            .as_deref()
+            .and_then(|rel| module_from_relative(rel).map(|cow| cow.into_owned()));
+
+        let object_name = module_name
+            .as_ref()
+            .map(|module| format!("{}.{}", module, qualname))
+            .or_else(|| {
+                if qualname.is_empty() {
+                    None
+                } else {
+                    Some(qualname.to_string())
+                }
+            });
+
+        ScopeContext {
+            module_name,
+            object_name,
+            relative_path,
+            absolute_path,
+            source_id,
+        }
+    }
+}
+
+fn normalise_to_posix(path: &Path) -> Option<String> {
+    if path.as_os_str().is_empty() {
+        return None;
+    }
+    let mut parts = Vec::new();
+    for component in path.components() {
+        match component {
+            Component::Normal(part) => parts.push(part.to_string_lossy()),
+            Component::Prefix(prefix) => parts.push(prefix.as_os_str().to_string_lossy()),
+            Component::RootDir => parts.push(Cow::Borrowed("")),
+            Component::CurDir => continue,
+            Component::ParentDir => {
+                parts.push(Cow::Borrowed(".."));
+            }
+        }
+    }
+    if parts.is_empty() {
+        None
+    } else {
+        Some(parts.join("/"))
+    }
+}
+
+fn normalise_relative(relative: PathBuf) -> String {
+    let mut components = Vec::new();
+    for component in relative.components() {
+        match component {
+            Component::Normal(part) => components.push(part.to_string_lossy().to_string()),
+            Component::CurDir => continue,
+            Component::ParentDir => {
+                if !components.is_empty() {
+                    components.pop();
+                }
+            }
+            _ => {}
+        }
+    }
+    components.join("/")
+}
+
+fn module_from_relative(relative: &str) -> Option<Cow<'_, str>> {
+    if relative.is_empty() {
+        return None;
+    }
+    let trimmed = relative.trim_start_matches("./");
+    let without_suffix = trimmed.strip_suffix(".py").unwrap_or(trimmed);
+    if without_suffix.is_empty() {
+        return None;
+    }
+    let mut parts: Vec<&str> = without_suffix.split('/').collect();
+    if let Some(last) = parts.last().copied() {
+        if last == "__init__" {
+            parts.pop();
+        }
+    }
+    if parts.is_empty() {
+        return None;
+    }
+    Some(Cow::Owned(parts.join(".")))
+}
+
+fn py_attr_error(attr: &str, err: PyErr) -> recorder_errors::RecorderError {
+    target!(
+        ErrorCode::FrameIntrospectionFailed,
+        "failed to read {} from code object: {}",
+        attr,
+        err
+    )
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use crate::trace_filter::config::TraceFilterConfig;
+    use pyo3::types::{PyAny, PyCode, PyModule};
+    use std::ffi::CString;
+    use std::fs;
+    use std::io::Write;
+    use tempfile::tempdir;
+
+    #[test]
+    fn caches_resolution_and_applies_value_patterns() -> RecorderResult<()> {
+        let (config, file_path) = filter_with_pkg_rule(
+            r#"
+            [scope]
+            default_exec = "skip"
+            default_value_action = "redact"
+
+            [[scope.rules]]
+            selector = "pkg:app.foo"
+            exec = "trace"
+            value_default = "allow"
+
+            [[scope.rules.value_patterns]]
+            selector = "local:literal:user"
+            action = "allow"
+
+            [[scope.rules.value_patterns]]
+            selector = "arg:password"
+            action = "redact"
+
+            [[scope.rules.value_patterns]]
+            selector = "local:temp"
+            action = "drop"
+            "#,
+        )?;
+
+        Python::with_gil(|py| -> RecorderResult<()> {
+            let module = load_module(
+                py,
+                "app.foo",
+                &file_path,
+                "def foo(user, password):\n return user\n",
+            )?;
+            let code_obj = get_code(&module, "foo")?;
+            let wrapper = CodeObjectWrapper::new(py, &code_obj);
+
+            let engine = TraceFilterEngine::new(config);
+
+            let first = engine.resolve(py, &wrapper)?;
+            assert_eq!(first.exec(), ExecDecision::Trace);
+            assert_eq!(first.matched_rule_index(), Some(0));
+            assert_eq!(first.module_name(), Some("app.foo"));
+            assert_eq!(first.relative_path(), Some("app/foo.py"));
+
+            let policy = first.value_policy();
+            assert_eq!(policy.default_action(), ValueAction::Allow);
+            assert_eq!(policy.decide(ValueKind::Local, "user"), ValueAction::Allow);
+            assert_eq!(
+                policy.decide(ValueKind::Arg, "password"),
+                ValueAction::Redact
+            );
+            assert_eq!(policy.decide(ValueKind::Local, "temp"), ValueAction::Drop);
+            assert_eq!(
+                policy.decide(ValueKind::Global, "anything"),
+                ValueAction::Allow
+            );
+
+            let second = engine.resolve(py, &wrapper)?;
+            assert!(Arc::ptr_eq(&first, &second));
+            Ok(())
+        })
+    }
+
+    #[test]
+    fn object_rule_overrides_package_rule() -> RecorderResult<()> {
+        let (config, file_path) = filter_with_pkg_rule(
+            r#"
+            [scope]
+            default_exec = "trace"
+            default_value_action = "allow"
+
+            [[scope.rules]]
+            selector = "pkg:app.foo"
+            exec = "skip"
+
+            [[scope.rules]]
+            selector = "obj:app.foo.bar"
+            exec = "trace"
+            value_default = "redact"
+            "#,
+        )?;
+
+        Python::with_gil(|py| -> RecorderResult<()> {
+            let module = load_module(
+                py,
+                "app.foo",
+                &file_path,
+                "def bar():\n secret = 1\n return secret\n",
+            )?;
+            let code_obj = get_code(&module, "bar")?;
+            let wrapper = CodeObjectWrapper::new(py, &code_obj);
+
+            let engine = TraceFilterEngine::new(config);
+            let resolution = engine.resolve(py, &wrapper)?;
+
+            assert_eq!(resolution.exec(), ExecDecision::Trace);
+            assert_eq!(resolution.matched_rule_index(), Some(1));
+            assert_eq!(
+                resolution.value_policy().default_action(),
+                ValueAction::Redact
+            );
+            Ok(())
+        })
+    }
+
+    #[test]
+    fn file_selector_matches_relative_path() -> RecorderResult<()> {
+        let (config, file_path) = filter_with_pkg_rule(
+            r#"
+            [scope]
+            default_exec = "trace"
+            default_value_action = "allow"
+
+            [[scope.rules]]
+            selector = "file:app/foo.py"
+            exec = "skip"
+            "#,
+        )?;
+
+        Python::with_gil(|py| -> RecorderResult<()> {
+            let module = load_module(py, "app.foo", &file_path, "def baz():\n return 42\n")?;
+            let code_obj = get_code(&module, "baz")?;
+            let wrapper = CodeObjectWrapper::new(py, &code_obj);
+
+            let engine = TraceFilterEngine::new(config);
+            let resolution = engine.resolve(py, &wrapper)?;
+
+            assert_eq!(resolution.exec(), ExecDecision::Skip);
+            assert_eq!(resolution.relative_path(), Some("app/foo.py"));
+            Ok(())
+        })
+    }
+
+    fn filter_with_pkg_rule(body: &str) -> RecorderResult<(TraceFilterConfig, String)> {
+        let temp = tempdir().expect("temp dir");
+        let project_root = temp.path();
+        let codetracer_dir = project_root.join(".codetracer");
+        fs::create_dir(&codetracer_dir).unwrap();
+
+        let filter_path = codetracer_dir.join("filters.toml");
+        write_filter(&filter_path, body);
+
+        let config = TraceFilterConfig::from_paths(&[filter_path])?;
+
+        let file_path = project_root.join("app").join("foo.py");
+        fs::create_dir_all(file_path.parent().unwrap()).unwrap();
+        // Touch the file so the path exists for debugging.
+        fs::File::create(&file_path).unwrap();
+
+        Ok((config, file_path.to_string_lossy().to_string()))
+    }
+
+    fn write_filter(path: &Path, body: &str) {
+        let mut file = fs::File::create(path).unwrap();
+        writeln!(
+            file,
+            r#"
+            [meta]
+            name = "test"
+            version = 1
+
+            {}
+            "#,
+            body.trim()
+        )
+        .unwrap();
+    }
+
+    fn load_module<'py>(
+        py: Python<'py>,
+        module_name: &str,
+        file_path: &str,
+        source: &str,
+    ) -> RecorderResult<Bound<'py, PyModule>> {
+        let code_c = CString::new(source).expect("source without NUL");
+        let file_c = CString::new(file_path).expect("path without NUL");
+        let module_c = CString::new(module_name).expect("module without NUL");
+
+        let module = PyModule::from_code(
+            py,
+            code_c.as_c_str(),
+            file_c.as_c_str(),
+            module_c.as_c_str(),
+        )
+        .map_err(|err| {
+            target!(
+                ErrorCode::FrameIntrospectionFailed,
+                "failed to load module for engine test: {}",
+                err
+            )
+        })?;
+        Ok(module.into())
+    }
+
+    fn get_code<'py>(
+        module: &Bound<'py, PyModule>,
+        func_name: &str,
+    ) -> RecorderResult<Bound<'py, PyCode>> {
+        let func: Bound<'py, PyAny> = module
+            .getattr(func_name)
+            .map_err(|err| py_attr_error("function", err))?;
+        let code_obj = func
+            .getattr("__code__")
+            .map_err(|err| py_attr_error("__code__", err))?
+            .downcast_into::<PyCode>()
+            .map_err(|err| py_attr_error("__code__", err.into()))?;
+        Ok(code_obj)
+    }
+}
diff --git a/codetracer-python-recorder/src/trace_filter/mod.rs b/codetracer-python-recorder/src/trace_filter/mod.rs
new file mode 100644
index 0000000..15af3fd
--- /dev/null
+++ b/codetracer-python-recorder/src/trace_filter/mod.rs
@@ -0,0 +1,5 @@
+//! Trace filter utilities covering selector parsing, configuration loading, and runtime evaluation.
+
+pub mod config;
+pub mod engine;
+pub mod selector;
diff --git a/codetracer-python-recorder/src/trace_filter/selector.rs b/codetracer-python-recorder/src/trace_filter/selector.rs
new file mode 100644
index 0000000..6a89b55
--- /dev/null
+++ b/codetracer-python-recorder/src/trace_filter/selector.rs
@@ -0,0 +1,378 @@
+//! Selector parsing and matching utilities shared across scope and value filters.
+
+use dashmap::DashSet;
+use globset::{GlobBuilder, GlobMatcher};
+use once_cell::sync::Lazy;
+use recorder_errors::{usage, ErrorCode, RecorderResult};
+use regex::{Error as RegexError, Regex};
+use std::borrow::Cow;
+use std::fmt;
+
+/// Domains supported by the selector grammar.
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
+pub enum SelectorKind {
+    Package,
+    File,
+    Object,
+    Local,
+    Global,
+    Arg,
+    Return,
+    Attr,
+}
+
+impl SelectorKind {
+    /// Return the token used in selector strings.
+    pub fn token(self) -> &'static str {
+        match self {
+            SelectorKind::Package => "pkg",
+            SelectorKind::File => "file",
+            SelectorKind::Object => "obj",
+            SelectorKind::Local => "local",
+            SelectorKind::Global => "global",
+            SelectorKind::Arg => "arg",
+            SelectorKind::Return => "ret",
+            SelectorKind::Attr => "attr",
+        }
+    }
+
+    fn parse(token: &str) -> Option<Self> {
+        match token {
+            "pkg" => Some(SelectorKind::Package),
+            "file" => Some(SelectorKind::File),
+            "obj" => Some(SelectorKind::Object),
+            "local" => Some(SelectorKind::Local),
+            "global" => Some(SelectorKind::Global),
+            "arg" => Some(SelectorKind::Arg),
+            "ret" => Some(SelectorKind::Return),
+            "attr" => Some(SelectorKind::Attr),
+            _ => None,
+        }
+    }
+
+    /// Return true when the selector kind targets scope-level decisions.
+    pub fn is_scope_kind(self) -> bool {
+        matches!(
+            self,
+            SelectorKind::Package | SelectorKind::File | SelectorKind::Object
+        )
+    }
+
+    /// Return true when the selector kind targets value-level decisions.
+    pub fn is_value_kind(self) -> bool {
+        !self.is_scope_kind()
+    }
+}
+
+impl fmt::Display for SelectorKind {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        f.write_str(self.token())
+    }
+}
+
+/// Match strategy configured for a selector.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum MatchType {
+    Glob,
+    Regex,
+    Literal,
+}
+
+impl MatchType {
+    fn parse(token: &str) -> Option<Self> {
+        match token {
+            "glob" => Some(MatchType::Glob),
+            "regex" => Some(MatchType::Regex),
+            "literal" => Some(MatchType::Literal),
+            _ => None,
+        }
+    }
+}
+
+#[derive(Debug, Clone)]
+enum Matcher {
+    Glob(GlobMatcher),
+    Regex(Regex),
+    Literal(String),
+}
+
+impl Matcher {
+    fn matches(&self, candidate: &str) -> bool {
+        match self {
+            Matcher::Glob(matcher) => matcher.is_match(candidate),
+            Matcher::Regex(regex) => regex.is_match(candidate),
+            Matcher::Literal(expected) => candidate == expected,
+        }
+    }
+}
+
+/// Parsed selector with compiled matcher.
+#[derive(Debug, Clone)]
+pub struct Selector {
+    raw: String,
+    kind: SelectorKind,
+    match_type: MatchType,
+    pattern: String,
+    matcher: Matcher,
+}
+
+impl Selector {
+    /// Parse a selector string constrained to the provided kinds.
+    ///
+    /// When `permitted_kinds` is empty, all selector kinds are accepted.
+    pub fn parse(raw: &str, permitted_kinds: &[SelectorKind]) -> RecorderResult<Self> {
+        if raw.is_empty() {
+            return Err(usage!(
+                ErrorCode::InvalidPolicyValue,
+                "selector string is empty"
+            ));
+        }
+
+        let mut segments = raw.splitn(3, ':');
+        let kind_token = segments.next().ok_or_else(|| {
+            usage!(
+                ErrorCode::InvalidPolicyValue,
+                "selector must include a kind"
+            )
+        })?;
+        let remainder = segments
+            .next()
+            .ok_or_else(|| usage!(ErrorCode::InvalidPolicyValue, "selector missing pattern"))?;
+
+        let kind = SelectorKind::parse(kind_token).ok_or_else(|| {
+            usage!(
+                ErrorCode::InvalidPolicyValue,
+                "unsupported selector kind '{}'",
+                kind_token
+            )
+        })?;
+
+        if !permitted_kinds.is_empty() && !permitted_kinds.contains(&kind) {
+            return Err(usage!(
+                ErrorCode::InvalidPolicyValue,
+                "selector kind '{}' is not allowed in this context",
+                kind
+            ));
+        }
+
+        let (match_type, pattern) = match segments.next() {
+            Some(pattern) => {
+                let match_token = remainder;
+                if match_token.is_empty() {
+                    return Err(usage!(
+                        ErrorCode::InvalidPolicyValue,
+                        "selector match type cannot be empty"
+                    ));
+                }
+                let resolved_match = MatchType::parse(match_token).ok_or_else(|| {
+                    usage!(
+                        ErrorCode::InvalidPolicyValue,
+                        "unsupported selector match type '{}'",
+                        match_token
+                    )
+                })?;
+                (resolved_match, pattern)
+            }
+            None => (MatchType::Glob, remainder),
+        };
+
+        if pattern.is_empty() {
+            return Err(usage!(
+                ErrorCode::InvalidPolicyValue,
+                "selector pattern cannot be empty"
+            ));
+        }
+
+        let matcher = build_matcher(match_type, pattern)?;
+        Ok(Selector {
+            raw: raw.to_string(),
+            kind,
+            match_type,
+            pattern: pattern.to_string(),
+            matcher,
+        })
+    }
+
+    /// Selector kind.
+    pub fn kind(&self) -> SelectorKind {
+        self.kind
+    }
+
+    /// Match strategy.
+    pub fn match_type(&self) -> MatchType {
+        self.match_type
+    }
+
+    /// Raw pattern string (without kind/match prefix).
+    pub fn pattern(&self) -> &str {
+        &self.pattern
+    }
+
+    /// Original selector string.
+    pub fn raw(&self) -> &str {
+        &self.raw
+    }
+
+    /// Evaluate whether the selector matches `candidate`.
+    pub fn matches(&self, candidate: &str) -> bool {
+        self.matcher.matches(candidate)
+    }
+}
+
+fn build_matcher(match_type: MatchType, pattern: &str) -> RecorderResult<Matcher> {
+    match match_type {
+        MatchType::Literal => Ok(Matcher::Literal(pattern.to_string())),
+        MatchType::Glob => {
+            let glob = GlobBuilder::new(pattern)
+                .literal_separator(true)
+                .build()
+                .map_err(|err| {
+                    usage!(
+                        ErrorCode::InvalidPolicyValue,
+                        "invalid glob pattern '{}': {}",
+                        pattern,
+                        err
+                    )
+                })?;
+            Ok(Matcher::Glob(glob.compile_matcher()))
+        }
+        MatchType::Regex => match Regex::new(pattern) {
+            Ok(regex) => Ok(Matcher::Regex(regex)),
+            Err(err) => {
+                log_regex_failure(pattern, &err);
+                Err(usage!(
+                    ErrorCode::InvalidPolicyValue,
+                    "invalid regex pattern '{}': {}",
+                    pattern,
+                    err
+                ))
+            }
+        },
+    }
+}
+
+fn log_regex_failure(pattern: &str, err: &RegexError) {
+    static LOGGED: Lazy<DashSet<String>> = Lazy::new(DashSet::new);
+    if !LOGGED.insert(pattern.to_string()) {
+        return;
+    }
+
+    let display_pattern = sanitize_pattern(pattern);
+    crate::logging::with_error_code(ErrorCode::InvalidPolicyValue, || {
+        log::warn!(
+            target: "codetracer_python_recorder::trace_filters",
+            "Rejected trace filter regex pattern '{}': {err}. Update the expression or switch to `match = \"glob\"` if a simple wildcard suffices.",
+            display_pattern
+        );
+    });
+}
+
+fn sanitize_pattern(pattern: &str) -> Cow<'_, str> {
+    const MAX_CHARS: usize = 120;
+    if pattern.chars().count() > MAX_CHARS {
+        let mut truncated: String = pattern.chars().take(MAX_CHARS).collect();
+        truncated.push('…');
+        Cow::Owned(truncated)
+    } else {
+        Cow::Borrowed(pattern)
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    fn assert_parse(
+        raw: &str,
+        expected_kind: SelectorKind,
+        mt: MatchType,
+        pattern: &str,
+    ) -> Selector {
+        let selector = Selector::parse(raw, &[]).unwrap_or_else(|err| {
+            panic!("selector parse failed for '{}': {}", raw, err);
+        });
+        assert_eq!(selector.kind(), expected_kind);
+        assert_eq!(selector.match_type(), mt);
+        assert_eq!(selector.pattern(), pattern);
+        selector
+    }
+
+    #[test]
+    fn parses_default_glob_scope_selector() {
+        let selector = assert_parse(
+            "pkg:my_app.core.*",
+            SelectorKind::Package,
+            MatchType::Glob,
+            "my_app.core.*",
+        );
+        assert!(selector.matches("my_app.core.services"));
+        assert!(!selector.matches("other.module"));
+    }
+
+    #[test]
+    fn parses_literal_selector() {
+        let selector = assert_parse(
+            "file:literal:src/services/api.py",
+            SelectorKind::File,
+            MatchType::Literal,
+            "src/services/api.py",
+        );
+        assert!(selector.matches("src/services/api.py"));
+        assert!(!selector.matches("src/services/API.py"));
+    }
+
+    #[test]
+    fn parses_regex_selector_with_colons() {
+        let selector = assert_parse(
+            "obj:regex:^my_app::service::[A-Z]\\w+$",
+            SelectorKind::Object,
+            MatchType::Regex,
+            "^my_app::service::[A-Z]\\w+$",
+        );
+        assert!(selector.matches("my_app::service::Handler"));
+        assert!(!selector.matches("my_app.service.Handler"));
+    }
+
+    #[test]
+    fn rejects_unknown_kind() {
+        let err = Selector::parse("unknown:foo", &[]).expect_err("expected kind error");
+        assert_eq!(err.code, ErrorCode::InvalidPolicyValue);
+    }
+
+    #[test]
+    fn rejects_disallowed_kind() {
+        let err =
+            Selector::parse("pkg:foo", &[SelectorKind::Local]).expect_err("kind not permitted");
+        assert_eq!(err.code, ErrorCode::InvalidPolicyValue);
+    }
+
+    #[test]
+    fn rejects_unknown_match_type() {
+        let err = Selector::parse("pkg:invalid:foo", &[]).expect_err("expected match type error");
+        assert_eq!(err.code, ErrorCode::InvalidPolicyValue);
+    }
+
+
#[test] + fn rejects_empty_pattern() { + let err = Selector::parse("pkg:", &[]).expect_err("expected empty pattern error"); + assert_eq!(err.code, ErrorCode::InvalidPolicyValue); + } + + #[test] + fn rejects_empty_string() { + let err = Selector::parse("", &[]).expect_err("expected empty string error"); + assert_eq!(err.code, ErrorCode::InvalidPolicyValue); + } + + #[test] + fn matches_glob_against_values() { + let selector = assert_parse( + "local:user_*", + SelectorKind::Local, + MatchType::Glob, + "user_*", + ); + assert!(selector.matches("user_id")); + assert!(!selector.matches("order_id")); + } +} diff --git a/codetracer-python-recorder/tests/python/perf/__init__.py b/codetracer-python-recorder/tests/python/perf/__init__.py new file mode 100644 index 0000000..8bbc4ee --- /dev/null +++ b/codetracer-python-recorder/tests/python/perf/__init__.py @@ -0,0 +1 @@ +"""Performance-oriented smoke tests for the Python surface.""" diff --git a/codetracer-python-recorder/tests/python/perf/test_trace_filter_perf.py b/codetracer-python-recorder/tests/python/perf/test_trace_filter_perf.py new file mode 100644 index 0000000..a81fc2e --- /dev/null +++ b/codetracer-python-recorder/tests/python/perf/test_trace_filter_perf.py @@ -0,0 +1,515 @@ +from __future__ import annotations + +import importlib +import json +import math +import os +import sys +import time +from dataclasses import dataclass +from pathlib import Path +from typing import Callable, Iterable, Sequence +import textwrap + +import pytest + +from codetracer_python_recorder import trace + +CALLS_PER_BATCH = 1_000 +LOCALS_PER_CALL = 50 +FUNCTIONS_PER_MODULE = 10 +SERVICES_MODULES = 6 +WORKER_MODULES = 3 +EXTERNAL_MODULES = 1 +UNIQUE_CODE_OBJECTS = ( + SERVICES_MODULES + WORKER_MODULES + EXTERNAL_MODULES +) * FUNCTIONS_PER_MODULE + +MAX_RUNTIME_RATIO = { + "glob": 60.0, + "regex": 30.0, +} + +_SKIP_REASON = ( + "trace filter perf smoke disabled; set CODETRACER_TRACE_FILTER_PERF=1 to enable" +) +pytestmark = pytest.mark.skipif( + os.environ.get("CODETRACER_TRACE_FILTER_PERF") != "1", reason=_SKIP_REASON +) + + +@dataclass(frozen=True) +class ModuleSpec: + relative_path: str + module_name: str + func_prefix: str + functions: int + + +@dataclass(frozen=True) +class PerfScenario: + label: str + filter_path: Path + + +@dataclass(frozen=True) +class PerfResult: + label: str + duration_seconds: float + filter_names: list[str] + scopes_skipped: int + value_redactions: dict[str, int] + value_drops: dict[str, int] + + def to_dict(self) -> dict[str, object]: + payload: dict[str, object] = { + "label": self.label, + "duration_seconds": self.duration_seconds, + "filter_names": list(self.filter_names), + "scopes_skipped": self.scopes_skipped, + "value_redactions": dict(self.value_redactions), + "value_drops": dict(self.value_drops), + } + return payload + + +@dataclass +class PerfDataset: + functions: list[Callable[[int], int]] + event_indices: list[int] + imported_modules: set[str] + imported_packages: set[str] + + +class PerfWorkspace: + def __init__(self, root: Path) -> None: + self.root = root + self.project_root = root / "project" + self.project_root.mkdir(parents=True, exist_ok=True) + self.filters_dir = self.project_root / ".codetracer" + self.filters_dir.mkdir(parents=True, exist_ok=True) + + self._filters = FilterFiles.create(self.filters_dir) + self.dataset = self._build_dataset() + self.scenarios = self._build_scenarios() + + def cleanup(self) -> None: + for name in sorted( + self.dataset.imported_modules | self.dataset.imported_packages, + key=len, 
+ reverse=True, + ): + sys.modules.pop(name, None) + + def _build_scenarios(self) -> list[PerfScenario]: + return [ + PerfScenario("baseline", self._filters.baseline), + PerfScenario("glob", self._filters.glob), + PerfScenario("regex", self._filters.regex), + ] + + def _build_dataset(self) -> PerfDataset: + local_names = build_local_names() + specs = build_module_specs() + functions: list[Callable[[int], int]] = [] + for spec in specs: + relative = Path(spec.relative_path) + self._ensure_package_inits(relative) + module_path = self.project_root / relative + module_path.parent.mkdir(parents=True, exist_ok=True) + module_path.write_text( + module_source(spec.func_prefix, spec.functions, local_names), + encoding="utf-8", + ) + + sys.path.insert(0, str(self.project_root)) + imported_modules: set[str] = set() + imported_packages: set[str] = set() + try: + for spec in specs: + module = importlib.import_module(spec.module_name) + imported_modules.update(_module_lineage(spec.module_name)) + for idx in range(spec.functions): + func_name = f"{spec.func_prefix}_{idx}" + func = getattr(module, func_name) + functions.append(func) + finally: + sys.path.pop(0) + + if len(functions) != UNIQUE_CODE_OBJECTS: + raise AssertionError( + f"expected {UNIQUE_CODE_OBJECTS} code objects, found {len(functions)}" + ) + + # Collect package lineage for cleanup (parents only; module entries already captured). + for name in imported_modules: + parts = name.split(".") + imported_packages.update(".".join(parts[:idx]) for idx in range(1, len(parts))) + + event_indices = [i % len(functions) for i in range(CALLS_PER_BATCH)] + return PerfDataset( + functions=functions, + event_indices=event_indices, + imported_modules=imported_modules, + imported_packages=imported_packages, + ) + + def _ensure_package_inits(self, relative_path: Path) -> None: + current = self.project_root + parts = relative_path.parts[:-1] + for part in parts: + current = current / part + current.mkdir(parents=True, exist_ok=True) + init_file = current / "__init__.py" + if not init_file.exists(): + init_file.write_text("", encoding="utf-8") + + +@dataclass +class FilterFiles: + baseline: Path + glob: Path + regex: Path + + @classmethod + def create(cls, root: Path) -> FilterFiles: + baseline = root / "bench-baseline.toml" + glob = root / "bench-glob.toml" + regex = root / "bench-regex.toml" + + baseline.write_text(baseline_config(), encoding="utf-8") + glob.write_text(glob_config(), encoding="utf-8") + regex.write_text(regex_config(), encoding="utf-8") + + return cls(baseline=baseline, glob=glob, regex=regex) + + +def test_trace_filter_perf_smoke(tmp_path: Path) -> None: + workspace = PerfWorkspace(tmp_path) + results: list[PerfResult] = [] + try: + for scenario in workspace.scenarios: + results.append(run_scenario(workspace, scenario)) + + baseline = _result_by_label(results, "baseline") + glob = _result_by_label(results, "glob") + regex = _result_by_label(results, "regex") + + assert baseline.duration_seconds > 0 + assert glob.duration_seconds > 0 + assert regex.duration_seconds > 0 + + assert baseline.filter_names == ["bench-baseline"] + assert "bench-glob" in glob.filter_names + assert "bench-regex" in regex.filter_names + + assert glob.scopes_skipped > 0 + + assert baseline.value_redactions.get("local", 0) == 0 + assert glob.value_redactions.get("local", 0) > 0 + assert regex.value_redactions.get("local", 0) > 0 + + baseline_time = baseline.duration_seconds + assert baseline_time > 0 and math.isfinite(baseline_time) + + for label, limit in 
MAX_RUNTIME_RATIO.items(): + candidate = _result_by_label(results, label) + ceiling = baseline_time * limit + 0.5 + assert candidate.duration_seconds <= ceiling, ( + f"{label} scenario exceeded runtime ceiling " + f"{candidate.duration_seconds:.4f}s > {ceiling:.4f}s " + f"(baseline {baseline_time:.4f}s, limit {limit}x)" + ) + finally: + workspace.cleanup() + _maybe_write_results(results) + + +def run_scenario(workspace: PerfWorkspace, scenario: PerfScenario) -> PerfResult: + dataset = workspace.dataset + trace_dir = workspace.root / f"trace-{scenario.label}" + with trace( + trace_dir, + format="json", + trace_filter=str(scenario.filter_path), + ): + prewarm_dataset(dataset) + start = time.perf_counter() + run_workload(dataset) + duration = time.perf_counter() - start + + metadata = _load_metadata(trace_dir) + filter_meta = metadata.get("trace_filter", {}) if metadata else {} + filters = filter_meta.get("filters") or [] + filter_names = [ + entry.get("name") # type: ignore[union-attr] + for entry in filters + if isinstance(entry, dict) and entry.get("name") + ] + stats = filter_meta.get("stats") or {} + scopes_skipped = int(stats.get("scopes_skipped") or 0) + value_redactions_obj = stats.get("value_redactions") or {} + value_redactions = { + key: int(value) + for key, value in value_redactions_obj.items() + if isinstance(key, str) + } + value_drops_obj = stats.get("value_drops") or {} + value_drops = { + key: int(value) + for key, value in value_drops_obj.items() + if isinstance(key, str) + } + + return PerfResult( + label=scenario.label, + duration_seconds=duration, + filter_names=filter_names, + scopes_skipped=scopes_skipped, + value_redactions=value_redactions, + value_drops=value_drops, + ) + + +def prewarm_dataset(dataset: PerfDataset) -> None: + for func in dataset.functions: + func(0) + + +def run_workload(dataset: PerfDataset) -> None: + functions = dataset.functions + for index in dataset.event_indices: + functions[index](index) + + +def _load_metadata(trace_dir: Path) -> dict[str, object]: + metadata_path = trace_dir / "trace_metadata.json" + if not metadata_path.exists(): + raise AssertionError(f"trace metadata not generated for {trace_dir}") + return json.loads(metadata_path.read_text(encoding="utf-8")) + + +def _result_by_label(results: Sequence[PerfResult], label: str) -> PerfResult: + for entry in results: + if entry.label == label: + return entry + raise AssertionError(f"missing result for {label!r}") + + +def _maybe_write_results(results: Sequence[PerfResult]) -> None: + destination = os.environ.get("CODETRACER_TRACE_FILTER_PERF_OUTPUT") + if not destination or not results: + return + + output_path = Path(destination) + output_path.parent.mkdir(parents=True, exist_ok=True) + + baseline = next((r for r in results if r.label == "baseline"), None) + baseline_time = baseline.duration_seconds if baseline else None + + payload = { + "calls_per_batch": CALLS_PER_BATCH, + "locals_per_call": LOCALS_PER_CALL, + "results": [ + { + **result.to_dict(), + "relative_to_baseline": ( + result.duration_seconds / baseline_time + if baseline_time and baseline_time > 0 + else None + ), + } + for result in results + ], + } + output_path.write_text(json.dumps(payload, indent=2), encoding="utf-8") + + +def build_module_specs() -> list[ModuleSpec]: + specs: list[ModuleSpec] = [] + for idx in range(SERVICES_MODULES): + specs.append( + ModuleSpec( + relative_path=f"bench_pkg/services/api/module_{idx}.py", + module_name=f"bench_pkg.services.api.module_{idx}", + func_prefix=f"api_handler_{idx}", + 
functions=FUNCTIONS_PER_MODULE, + ) + ) + for idx in range(WORKER_MODULES): + specs.append( + ModuleSpec( + relative_path=f"bench_pkg/jobs/worker/module_{idx}.py", + module_name=f"bench_pkg.jobs.worker.module_{idx}", + func_prefix=f"worker_task_{idx}", + functions=FUNCTIONS_PER_MODULE, + ) + ) + for idx in range(EXTERNAL_MODULES): + specs.append( + ModuleSpec( + relative_path=f"bench_pkg/external/integration_{idx}.py", + module_name=f"bench_pkg.external.integration_{idx}", + func_prefix=f"integration_op_{idx}", + functions=FUNCTIONS_PER_MODULE, + ) + ) + return specs + + +def module_source(func_prefix: str, function_count: int, local_names: Sequence[str]) -> str: + lines: list[str] = [] + for index in range(function_count): + func_name = f"{func_prefix}_{index}" + lines.append(f"def {func_name}(value):") + for offset, name in enumerate(local_names): + lines.append(f" {name} = value + {offset}") + lines.append(" return value") + lines.append("") + return "\n".join(lines) + + +def build_local_names() -> list[str]: + names: list[str] = [] + for idx in range(15): + names.append(f"public_field_{idx}") + for idx in range(15): + names.append(f"secret_field_{idx}") + for idx in range(10): + names.append(f"token_{idx}") + names.extend( + [ + "password_hash", + "api_key", + "credit_card", + "session_id", + "metric_latency", + "metric_throughput", + "metric_error_rate", + "masked_value", + "debug_flag", + "trace_id", + ] + ) + if len(names) != LOCALS_PER_CALL: + raise AssertionError( + f"expected {LOCALS_PER_CALL} local names, found {len(names)}" + ) + return names + + +def baseline_config() -> str: + return textwrap.dedent( + """ + [meta] + name = "bench-baseline" + version = 1 + description = "Tracing baseline without additional filter overhead." + + [scope] + default_exec = "trace" + default_value_action = "allow" + """ + ).strip() + + +def glob_config() -> str: + return textwrap.dedent( + """ + [meta] + name = "bench-glob" + version = 1 + description = "Glob-heavy rule set for microbenchmark coverage." + + [scope] + default_exec = "trace" + default_value_action = "allow" + + [[scope.rules]] + selector = "pkg:bench_pkg.services.api.*" + value_default = "redact" + reason = "Redact service locals except approved public fields" + [[scope.rules.value_patterns]] + selector = "local:glob:public_*" + action = "allow" + [[scope.rules.value_patterns]] + selector = "local:glob:metric_*" + action = "allow" + [[scope.rules.value_patterns]] + selector = "local:glob:secret_*" + action = "redact" + [[scope.rules.value_patterns]] + selector = "local:glob:token_*" + action = "redact" + [[scope.rules.value_patterns]] + selector = "local:glob:masked_*" + action = "allow" + [[scope.rules.value_patterns]] + selector = "local:glob:password_*" + action = "redact" + + [[scope.rules]] + selector = "file:glob:bench_pkg/jobs/worker/module_*.py" + exec = "skip" + reason = "Disable redundant worker instrumentation" + + [[scope.rules]] + selector = "pkg:bench_pkg.external.integration_*" + value_default = "redact" + [[scope.rules.value_patterns]] + selector = "local:glob:metric_*" + action = "allow" + [[scope.rules.value_patterns]] + selector = "local:glob:public_*" + action = "allow" + """ + ).strip() + + +def regex_config() -> str: + return textwrap.dedent( + """ + [meta] + name = "bench-regex" + version = 1 + description = "Regex-heavy rule set for microbenchmark coverage." 
+ + [scope] + default_exec = "trace" + default_value_action = "allow" + + [[scope.rules]] + selector = 'pkg:regex:^bench_pkg\\.services\\.api\\.module_\\d+$' + value_default = "redact" + reason = "Regex match on service modules" + [[scope.rules.value_patterns]] + selector = 'local:regex:^(public|metric)_\\w+$' + action = "allow" + [[scope.rules.value_patterns]] + selector = 'local:regex:^(secret|token)_\\w+$' + action = "redact" + [[scope.rules.value_patterns]] + selector = 'local:regex:^(password|api|credit|session)_.*$' + action = "redact" + + [[scope.rules]] + selector = 'file:regex:^bench_pkg/jobs/worker/module_\\d+\\.py$' + exec = "skip" + reason = "Regex skip for worker modules" + + [[scope.rules]] + selector = 'obj:regex:^bench_pkg\\.external\\.integration_\\d+\\.integration_op_\\d+$' + value_default = "redact" + [[scope.rules.value_patterns]] + selector = 'local:regex:^masked_.*$' + action = "allow" + [[scope.rules.value_patterns]] + selector = 'local:regex:^metric_.*$' + action = "allow" + """ + ).strip() + + +def _module_lineage(name: str) -> Iterable[str]: + parts = name.split(".") + return (".".join(parts[:idx]) for idx in range(1, len(parts) + 1)) diff --git a/codetracer-python-recorder/tests/python/test_cli_integration.py b/codetracer-python-recorder/tests/python/test_cli_integration.py index d0fc624..8e520a8 100644 --- a/codetracer-python-recorder/tests/python/test_cli_integration.py +++ b/codetracer-python-recorder/tests/python/test_cli_integration.py @@ -77,3 +77,108 @@ def test_cli_emits_trace_artifacts(tmp_path: Path) -> None: recorder_info = payload.get("recorder", {}) assert recorder_info.get("name") == "codetracer_python_recorder" assert recorder_info.get("target_script") == str(script.resolve()) + + +def test_cli_honours_trace_filter_chain(tmp_path: Path) -> None: + script = tmp_path / "program.py" + _write_script(script, "print('filter test')\n") + + filters_dir = tmp_path / ".codetracer" + filters_dir.mkdir() + default_filter = filters_dir / "trace-filter.toml" + default_filter.write_text( + """ + [meta] + name = "default" + version = 1 + + [scope] + default_exec = "trace" + default_value_action = "allow" + """, + encoding="utf-8", + ) + + override_filter = tmp_path / "override-filter.toml" + override_filter.write_text( + """ + [meta] + name = "override" + version = 1 + + [scope] + default_exec = "trace" + default_value_action = "allow" + + [[scope.rules]] + selector = "pkg:program" + exec = "skip" + value_default = "allow" + """, + encoding="utf-8", + ) + + trace_dir = tmp_path / "trace" + env = _prepare_env() + args = [ + "--trace-dir", + str(trace_dir), + "--trace-filter", + str(override_filter), + str(script), + ] + + result = _run_cli(args, cwd=tmp_path, env=env) + assert result.returncode == 0 + + metadata_file = trace_dir / "trace_metadata.json" + payload = json.loads(metadata_file.read_text(encoding="utf-8")) + trace_filter = payload.get("trace_filter", {}) + filters = trace_filter.get("filters", []) + paths = [entry.get("path") for entry in filters if isinstance(entry, dict)] + assert paths == [ + "", + str(default_filter.resolve()), + str(override_filter.resolve()), + ] + + +def test_cli_honours_env_trace_filter(tmp_path: Path) -> None: + script = tmp_path / "program.py" + _write_script(script, "print('env filter test')\n") + + filter_path = tmp_path / "env-filter.toml" + filter_path.write_text( + """ + [meta] + name = "env-filter" + version = 1 + + [scope] + default_exec = "trace" + default_value_action = "allow" + + [[scope.rules]] + selector = 
"pkg:program" + exec = "skip" + value_default = "allow" + """, + encoding="utf-8", + ) + + trace_dir = tmp_path / "trace" + env = _prepare_env() + env["CODETRACER_TRACE_FILTER"] = str(filter_path) + + result = _run_cli(["--trace-dir", str(trace_dir), str(script)], cwd=tmp_path, env=env) + assert result.returncode == 0 + + metadata_file = trace_dir / "trace_metadata.json" + payload = json.loads(metadata_file.read_text(encoding="utf-8")) + trace_filter = payload.get("trace_filter", {}) + filters = trace_filter.get("filters", []) + paths = [entry.get("path") for entry in filters if isinstance(entry, dict)] + assert paths == [ + "", + str(filter_path.resolve()), + ] diff --git a/codetracer-python-recorder/tests/python/unit/test_auto_start.py b/codetracer-python-recorder/tests/python/unit/test_auto_start.py new file mode 100644 index 0000000..365c62a --- /dev/null +++ b/codetracer-python-recorder/tests/python/unit/test_auto_start.py @@ -0,0 +1,66 @@ +"""Unit tests for environment-driven auto-start behaviour.""" +from __future__ import annotations + +from pathlib import Path + +import pytest + +from codetracer_python_recorder import auto_start, session +import codetracer_python_recorder.codetracer_python_recorder as backend + + +@pytest.fixture(autouse=True) +def reset_session_state() -> None: + """Ensure each test runs with a clean global session handle.""" + session._active_session = None + yield + session._active_session = None + + +def test_auto_start_resolves_filter_chain(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None: + trace_dir = tmp_path / "trace-output" + filter_dir = tmp_path / "filters" + filter_dir.mkdir() + default_filter = filter_dir / "default.toml" + default_filter.write_text("# default\n", encoding="utf-8") + override_filter = filter_dir / "override.toml" + override_filter.write_text("# override\n", encoding="utf-8") + + state: dict[str, bool] = {"active": False} + captured_filters: list[list[str] | None] = [] + + def fake_start_backend( + path: str, + fmt: str, + activation: str | None, + filters: list[str] | None, + ) -> None: + state["active"] = True + captured_filters.append(filters) + + def fake_stop_backend() -> None: + state["active"] = False + + monkeypatch.setenv(auto_start.ENV_TRACE_PATH, str(trace_dir)) + monkeypatch.setenv( + auto_start.ENV_TRACE_FILTER, f"{default_filter}::{override_filter}" + ) + + monkeypatch.setattr(session, "_start_backend", fake_start_backend) + monkeypatch.setattr(session, "_stop_backend", fake_stop_backend) + monkeypatch.setattr(session, "_flush_backend", lambda: None) + monkeypatch.setattr(session, "_is_tracing_backend", lambda: bool(state["active"])) + monkeypatch.setattr(session, "_configure_policy_from_env", lambda: None) + monkeypatch.setattr(backend, "configure_policy_from_env", lambda: None) + + auto_start.auto_start_from_env() + + assert len(captured_filters) == 1 + assert captured_filters[0] == [ + str(default_filter.resolve()), + str(override_filter.resolve()), + ] + assert session._active_session is not None + + session.stop() + assert state["active"] is False diff --git a/codetracer-python-recorder/tests/python/unit/test_backend_exceptions.py b/codetracer-python-recorder/tests/python/unit/test_backend_exceptions.py index 5014b3b..956512f 100644 --- a/codetracer-python-recorder/tests/python/unit/test_backend_exceptions.py +++ b/codetracer-python-recorder/tests/python/unit/test_backend_exceptions.py @@ -20,9 +20,9 @@ def stop_after() -> None: def test_start_tracing_raises_usage_error(tmp_path) -> None: - 
start_tracing(str(tmp_path), "json", None) + start_tracing(str(tmp_path), "json", None, None) with pytest.raises(UsageError) as excinfo: - start_tracing(str(tmp_path), "json", None) + start_tracing(str(tmp_path), "json", None, None) err = excinfo.value assert getattr(err, "code") == "ERR_ALREADY_TRACING" assert "tracing already active" in str(err) diff --git a/codetracer-python-recorder/tests/python/unit/test_cli.py b/codetracer-python-recorder/tests/python/unit/test_cli.py index d125d54..333dc57 100644 --- a/codetracer-python-recorder/tests/python/unit/test_cli.py +++ b/codetracer-python-recorder/tests/python/unit/test_cli.py @@ -25,6 +25,7 @@ def test_parse_args_uses_defaults(tmp_path: Path, monkeypatch: pytest.MonkeyPatc assert config.trace_dir == (tmp_path / "trace-out").resolve() assert config.format == formats.DEFAULT_FORMAT assert config.activation_path == script.resolve() + assert config.trace_filter == () assert config.policy_overrides == {} @@ -65,6 +66,7 @@ def test_parse_args_handles_activation_and_script_args(tmp_path: Path) -> None: assert config.activation_path == activation.resolve() assert config.script_args == ["--flag", "value"] + assert config.trace_filter == () assert config.policy_overrides == {} @@ -116,6 +118,28 @@ def test_parse_args_controls_io_capture(tmp_path: Path) -> None: } +def test_parse_args_collects_trace_filters(tmp_path: Path) -> None: + script = tmp_path / "app.py" + _write_script(script) + filter_a = tmp_path / "filters" / "default.toml" + filter_a.parent.mkdir(parents=True, exist_ok=True) + filter_a.write_text("# stub\n", encoding="utf-8") + filter_b = tmp_path / "filters" / "override.toml" + filter_b.write_text("# stub\n", encoding="utf-8") + + config = _parse_args( + [ + "--trace-filter", + str(filter_a), + "--trace-filter", + f"{filter_b}::{filter_a}", + str(script), + ] + ) + + assert config.trace_filter == (str(filter_a), f"{filter_b}::{filter_a}") + + def test_parse_args_enables_io_capture_fd_mirroring(tmp_path: Path) -> None: script = tmp_path / "entry.py" _write_script(script) diff --git a/codetracer-python-recorder/tests/python/unit/test_session_helpers.py b/codetracer-python-recorder/tests/python/unit/test_session_helpers.py index 266a146..62cab9a 100644 --- a/codetracer-python-recorder/tests/python/unit/test_session_helpers.py +++ b/codetracer-python-recorder/tests/python/unit/test_session_helpers.py @@ -98,9 +98,14 @@ def test_trace_context_manager_starts_and_stops(monkeypatch: pytest.MonkeyPatch, trace_state = {"active": False} - def fake_start(path: str, fmt: str, activation: str | None) -> None: + def fake_start( + path: str, + fmt: str, + activation: str | None, + filters: list[str] | None, + ) -> None: trace_state["active"] = True - calls["start"].append((Path(path), fmt, activation)) + calls["start"].append((Path(path), fmt, activation, filters)) def fake_stop() -> None: trace_state["active"] = False @@ -118,5 +123,35 @@ def fake_stop() -> None: assert handle.path == target.expanduser() assert handle.format == "binary" - assert calls["start"] == [(target, "binary", None)] + assert calls["start"] == [(target, "binary", None, None)] assert calls["stop"] == [True] + + +def test_normalize_trace_filter_handles_none() -> None: + assert session._normalize_trace_filter(None) is None + + +def test_normalize_trace_filter_expands_sequence(tmp_path: Path) -> None: + filters_dir = tmp_path / "filters" + filters_dir.mkdir() + default = filters_dir / "default.toml" + default.write_text("# default\n", encoding="utf-8") + overrides = filters_dir / 
"overrides.toml" + overrides.write_text("# overrides\n", encoding="utf-8") + + result = session._normalize_trace_filter( + [default, f"{overrides}::{default}", overrides] + ) + + assert result == [ + str(default.resolve()), + str(overrides.resolve()), + str(default.resolve()), + str(overrides.resolve()), + ] + + +def test_normalize_trace_filter_rejects_missing_file(tmp_path: Path) -> None: + missing = tmp_path / "filters" / "absent.toml" + with pytest.raises(FileNotFoundError): + session._normalize_trace_filter(str(missing)) diff --git a/design-docs/US0028 - Configurable Python trace filters.md b/design-docs/US0028 - Configurable Python trace filters.md new file mode 100644 index 0000000..2179452 --- /dev/null +++ b/design-docs/US0028 - Configurable Python trace filters.md @@ -0,0 +1,357 @@ +--- +type: User Story +status: Draft +priority: Critical +persona: "Python team lead" +effort: High +target_release: "Code Review GA" +related_prds: + - PRD001 - Code Review +related_tasks: [] +tags: + - "#user-story" + - "#product" +created: 2025-10-11 +modified: 2025-10-11 +--- + +# User Story + +## Story Statement +As a **Python team lead**, I want **a powerful configuration language to filter which packages, files, code objects, and variables are traced** so that **I can control overhead and focus on relevant code paths**. + +## Acceptance Criteria +- [ ] Scenario: Include/exclude by module patterns + - Given I provide a configuration that includes `my_app.*` and excludes `my_app.tests.*` + - When I run the recorder + - Then only functions within the included modules generate events unless explicitly excluded +- [ ] Scenario: Selective variable capture + - Given my config specifies locals to include `user`, `order` and exclude `password` + - When I inspect a trace event + - Then only the allowed variables are serialized, with excluded variables redacted +- [ ] Scenario: Merge multiple filter files + - Given I provide a base filter `filters/common.trace` and user-specific overrides `filters/local.trace` combined as `filters/common.trace::filters/local.trace` + - When I launch the recorder + - Then the merged configuration applies deterministic precedence and validation before tracing starts +- [ ] Scenario: Default filter protects secrets + - Given no filter file is provided + - When the recorder starts + - Then a built-in best-effort secret redaction policy is applied, standard-library/asyncio frames are skipped, and the user is notified how to supply a project-specific filter +- [ ] Scenario: Validate configuration errors + - Given I supply an invalid rule (e.g., circular include) + - When I launch the recorder + - Then a clear validation error points to the problematic rule before tracing starts + +## Functional Requirements +- **Scope Filtering**: Model tracing intent as an ordered list of scope rules, each identified by a single selector string that encodes package/module, filesystem path, or fully qualified code object plus match semantics. Selectors later in the list override earlier ones when they overlap. The selector grammar must support globbing by default with opt-in regular expressions while keeping predictable precedence (e.g., object > file > package) so maintainers can quickly isolate code under investigation. Every scope rule defines both an execution capture policy (trace vs skip) and a nested value capture policy processed top-down within the scope. 
+- **Value Capture Controls**: Within each scope rule, evaluate a top-down list of value patterns (locals, globals, arguments, return payloads, and optionally nested attributes). Patterns resolve to allow-or-deny decisions, with denied values redacted while preserving variable names to indicate omission. +- **I/O Capture Toggle**: Expose a filter flag to enable/disable interceptors for stdout, stderr, stdin, and file descriptors, aligning with the concurrent IO capture effort. +- **Configuration Source**: Filters live in a human-editable file (default path: `<project>/.codetracer/trace-filter.cfg`) and can be overridden via CLI/API parameters for alternate locations. +- **Filter Composition**: Support chained composition using the `filter_a::filter_b` syntax, where later filters extend or override earlier ones with clear conflict resolution rules. +- **Default Policy**: Ship a curated default filter that aggressively redacts common secrets (tokens, passwords, keys) and excludes sensitive system modules. This fallback activates when no project filter is found. + +## Unified Scope Selector Format +Scope rules and value patterns share a single selector string so that every rule is expressed uniformly. The format is designed to be human writable, unambiguous to parse, and flexible enough for pattern-based matching. + +``` +<selector> := <kind> ":" [<match> ":"] <pattern> +``` + +- `<kind>` identifies the selector domain. Accepted values depend on the context where the selector appears: + - **Scope rules (`scope.rules` entries)**: + - `pkg` — fully qualified Python packages or modules (`import` dotted names). + - `file` — project-relative filesystem paths (POSIX-style separators). + - `obj` — fully qualified code objects (functions, classes, methods). + - **Value patterns (`scope.rules.value_patterns` entries)**: + - `local` — local variables within the traced frame. + - `global` — module globals referenced by the frame. + - `arg` — function arguments by name. + - `ret` — return values emitted by the scope (only meaningful for `obj` selectors). + - `attr` — attributes on captured values (future friendly for nested fields). +- `<match>` is optional; when omitted the default is `glob`. Supported values: + - `glob` — shell-style wildcards (`*`, `?`, `**` with path semantics for files). + - `regex` — Python regular expression evaluated with `fullmatch` for deterministic results. + - `literal` — exact, case-sensitive comparison. +- `<pattern>` is the remaining portion of the string after the optional second colon. Colons inside the pattern are allowed and do not require escaping (they remain part of the final field because parsing stops after two separators). Whitespace is not stripped; leading/trailing spaces must be intentional. + +### Selector Examples +- `pkg:my_app.core.*` → package match using default glob semantics. +- `pkg:regex:^my_app\.(services|api)\.` → package match using a regular expression. +- `file:my_app/services/**/*.py` → filesystem glob rooted at the project directory. +- `file:literal:my_app/tests/regression/test_login.py` → exact path match. +- `obj:my_app.auth.secrets.*` → glob match for code objects in the auth secrets namespace. +- `obj:regex:^my_app\.payments\.[A-Z]\w+$` → regex for class names in `payments`. +- `local:literal:user` → value selector targeting the local variable `user`. +- `arg:password` → glob selector (implicit) matching arguments named `password`. +- `ret:regex:^my_app\.auth\.login$` → regex selector applied to fully qualified callable names whose return should be redacted.
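To sanity-check selectors, the glob/regex/literal semantics above can be approximated with the Python standard library. The snippet below only illustrates the matching behaviour against concrete names; it is not the recorder's matcher.

```python
import fnmatch
import re

# glob (default): shell-style wildcards, case-sensitive comparison.
assert fnmatch.fnmatchcase("my_app.core.orders", "my_app.core.*")

# regex: evaluated with fullmatch, so the whole candidate string must match.
assert re.fullmatch(r"^my_app\.payments\.[A-Z]\w+$", "my_app.payments.CapturePayment")

# literal: exact, case-sensitive comparison — near misses do not match.
assert "User" != "user"
```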
+ +### Parsing Prototype +```python +from dataclasses import dataclass +from enum import Enum + + +class SelectorKind(Enum): + PACKAGE = "pkg" + FILE = "file" + OBJECT = "obj" + LOCAL = "local" + GLOBAL = "global" + ARG = "arg" + RETURN = "ret" + ATTR = "attr" + + +class MatchType(Enum): + GLOB = "glob" + REGEX = "regex" + LITERAL = "literal" + + +@dataclass(frozen=True) +class Selector: + kind: SelectorKind + match_type: MatchType + pattern: str + + +class SelectorParseError(ValueError): + """Raised when a selector string is malformed.""" + + +def parse_selector(raw: str) -> Selector: + if not raw: + raise SelectorParseError("selector string is empty") + + parts = raw.split(":", 2) + if len(parts) < 2: + raise SelectorParseError( + f"selector '{raw}' must contain at least a kind and pattern" + ) + + kind_token, remainder = parts[0], parts[1:] + try: + kind = SelectorKind(kind_token) + except ValueError as exc: + raise SelectorParseError(f"unsupported selector kind '{kind_token}'") from exc + + if len(remainder) == 1: + match_type = MatchType.GLOB + pattern = remainder[0] + else: + match_token, pattern = remainder + try: + match_type = MatchType(match_token) + except ValueError as exc: + raise SelectorParseError( + f"unsupported match type '{match_token}' for selector '{raw}'" + ) from exc + + if not pattern: + raise SelectorParseError("selector pattern cannot be empty") + + return Selector(kind=kind, match_type=match_type, pattern=pattern) +``` + +Callers validate whether a parsed selector is legal in the current context (e.g., scope rules only admit `pkg`, `file`, `obj`; value patterns only admit `local`, `global`, `arg`, `ret`, `attr`). + +### Rule Evaluation Order +1. Initialize the execution policy to `scope.default_exec` (or the inherited value when composing filters). +2. Walk `scope.rules` from top to bottom. Each rule whose selector matches the current frame updates the execution policy (`trace` vs `skip`) and the active default for value capture. Later matching rules replace earlier decisions because the traversal never rewinds. +3. For value capture inside a scope, start from the applicable default (`scope.default_value_action`, overridden by the scope rule’s `value_default` when provided). +4. Apply each `value_patterns` entry in order. The first pattern whose selector matches the variable or payload sets the decision to `allow` (serialize), `redact` (replace with ``), or `drop` (omit entirely) and stops further evaluation for that value. +5. If no pattern matches, fall back to the current default value action. + +## Sample Filters (TOML) +The examples below illustrate the breadth of rules a maintainer can express and how contributors extend the baseline. + +```toml +# .codetracer/trace-filter.toml - Maintainer-distributed baseline +[meta] +name = "myapp-maintainer-default" +version = 1 +description = "Safe defaults for MyApp support traces." 
+ +[io] +capture = false # Disable IO capture until opted-in explicitly +streams = ["stdout", "stderr"] # Streams to include if `capture` becomes true + +[scope] +default_exec = "skip" # Start from skip-all to avoid surprises +default_value_action = "redact" # Redact values unless allowed explicitly + +[[scope.rules]] +selector = "pkg:my_app.core.*" # Capture primary business logic +exec = "trace" +value_default = "redact" + +[[scope.rules.value_patterns]] +selector = "local:literal:user" +action = "allow" + +[[scope.rules.value_patterns]] +selector = "local:literal:order" +action = "allow" + +[[scope.rules.value_patterns]] +selector = "arg:password" +action = "redact" + +[[scope.rules.value_patterns]] +selector = "global:literal:FEATURE_FLAGS" +action = "allow" + +[[scope.rules.value_patterns]] +selector = "attr:regex:(?i).*token" +action = "redact" + +[[scope.rules]] +selector = "file:my_app/services/**/*.py" # Allow select service modules by path +exec = "trace" +value_default = "inherit" + +[[scope.rules]] +selector = "pkg:my_app.tests.*" # Skip test suites +exec = "skip" +reason = "Tests generate noise" + +[[scope.rules]] +selector = "obj:my_app.auth.secrets.*" # Block sensitive auth helpers entirely +exec = "skip" +reason = "Auth helpers contain secrets" + +[[scope.rules]] +selector = "obj:my_app.auth.login" +exec = "trace" +value_default = "inherit" + +[[scope.rules.value_patterns]] +selector = "ret:literal:my_app.auth.login" +action = "redact" +reason = "Redact login return payloads" + +[[scope.rules]] +selector = "obj:my_app.payments.capture_payment" +exec = "trace" +value_default = "redact" + +[[scope.rules.value_patterns]] +selector = "local:literal:invoice" +action = "allow" + +[[scope.rules.value_patterns]] +selector = "local:literal:amount" +action = "allow" + +[[scope.rules.value_patterns]] +selector = "arg:literal:invoice_id" +action = "allow" + +[[scope.rules.value_patterns]] +selector = "arg:literal:trace_id" +action = "allow" + +[[scope.rules.value_patterns]] +selector = "local:literal:card_number" +action = "redact" +``` + +```toml +# ~/.codetracer/local-overrides.toml - Contributor-specific overrides +[meta] +name = "maintainer-default overrides for bug #4821" +version = 1 + +[scope] +default_exec = "inherit" # Defer to baseline rules when unspecified +default_value_action = "inherit" + +[[scope.rules]] +selector = "file:my_app/tests/regression/test_login.py" +exec = "trace" +value_default = "inherit" +reason = "Capture failing regression suite locally" + +[[scope.rules.value_patterns]] +selector = "local:literal:debug_context" +action = "allow" # Allow one extra local for this capture + +[io] +capture = true +streams = ["stdout"] # Only record stdout noise relevant to bug +``` + +## TOML Schema +The recorder validates filter files against the schema below. Keys not listed are rejected to prevent silent typos. + +### Root Tables +- **`meta`** (required table) + - `name` *(string, required)*: Human-readable identifier; must be non-empty. + - `version` *(integer, required)*: Schema version ≥1 for forward-compat negotiation. + - `description` *(string, optional)*: Free-form context for maintainers. + - `labels` *(array[string], optional)*: Arbitrary tags; duplicates are ignored. +- **`io`** (optional table) + - `capture` *(bool, default `false`)*: Master switch for IO interception. + - `streams` *(array[string], optional)*: Subset of `["stdout","stderr","stdin","files"]`; must be present when `capture = true`. 
+ - `modes` *(array[string], optional)*: Future expansion for granular IO sources; currently must be empty if provided. +- **`scope`** (required table) + - `default_exec` *(string, required)*: One of `trace`, `skip`, `inherit`. `inherit` is only valid when the filter participates in composition. + - `default_value_action` *(string, required)*: One of `allow`, `deny`, `inherit`. Defines the baseline decision for value capture before per-scope overrides execute. + - `[[scope.rules]]` *(array table, optional)*: Ordered list of scope-specific overrides processed top-to-bottom. Each rule supports: + - `selector` *(string, required)*: Unified scope selector string (see "Unified Scope Selector Format"). + - `exec` *(string, optional)*: `trace`, `skip`, or `inherit` (defaults to `inherit`). + - `value_default` *(string, optional)*: `allow`, `deny`, or `inherit` (defaults to `inherit`). + - `reason` *(string, optional)*: Audit trail explaining the rule’s intent. + - `[[scope.rules.value_patterns]]` *(array table, optional)*: Ordered allow/deny decisions for value capture within this scope: + - `selector` *(string, required)*: Unified selector string targeting value domains (`local`, `global`, `arg`, `ret`, `attr`). + - `action` *(string, required)*: Either `allow` or `deny`. `deny` results in redaction. + - `reason` *(string, optional)*: Document why the pattern exists. + +### Composition Semantics +- Filters may be combined via `filter_a::filter_b`. Evaluation walks the chain left → right; later filters override earlier ones when keys conflict. +- `inherit` defaults carry the value from the previous filter in the chain; if no prior value exists, validation fails with a descriptive error. +- `scope.rules` arrays merge by appending, so rules contributed by later filters execute after earlier ones and can override them through ordered evaluation. +- Nested `value_patterns` arrays also append, preserving the expectation that later entries refine or replace earlier decisions. + +## Notes & Context +- **Problem / Opportunity**: Teams need precise control to manage performance, privacy, and noise. +- **Assumptions**: Configuration supports hierarchical scopes, globbing, and precedence rules. +- **Primary Use Case**: Maintainership workflows where project owners publish a vetted filter file and instruct contributors to record traces for bug reports without exposing unrelated or sensitive code paths. +- **Safety Goals**: Default and project-authored filters should minimize the risk of leaking credentials, PII, or third-party secrets while keeping signals required for debugging. +- **Design References**: Planned DSL/reference doc plus UI for editing rules. + +## Metrics & Impact +- **Primary Metric**: ≥50% of Python projects adopt custom filters within first month of availability. +- **Guardrails**: Config parsing executes in <200ms and tracing overhead ≤10% when filters are active. + +## Dependencies +- **Technical**: Config parser and evaluator; runtime hooks to enforce include/exclude at event time. +- **Cross-Team**: Security review of default redaction list; Docs for config reference and examples. 
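To make the composition semantics described above concrete, the following is a minimal, hypothetical sketch of the left-to-right merge; the `FilterDoc` and `resolve_chain` names are illustrative and not part of the recorder.

```python
from dataclasses import dataclass, field


@dataclass
class FilterDoc:
    default_exec: str          # "trace", "skip", or "inherit"
    default_value_action: str  # "allow", "deny", or "inherit"
    rules: list = field(default_factory=list)


def resolve_chain(chain: list[FilterDoc]) -> FilterDoc:
    """Merge filters left -> right; 'inherit' takes the value from the previous filter."""
    if not chain:
        raise ValueError("empty filter chain")
    head, *rest = chain
    if "inherit" in (head.default_exec, head.default_value_action):
        raise ValueError("the first filter in a chain has nothing to inherit from")
    merged = FilterDoc(head.default_exec, head.default_value_action, list(head.rules))
    for doc in rest:
        if doc.default_exec != "inherit":
            merged.default_exec = doc.default_exec
        if doc.default_value_action != "inherit":
            merged.default_value_action = doc.default_value_action
        merged.rules.extend(doc.rules)  # later rules append and win during ordered evaluation
    return merged
```

For example, composing a baseline whose `default_exec = "skip"` with an override that declares `default_exec = "inherit"` keeps `skip` while appending the override's rules, matching the appending behaviour described above.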
+ +## Links +- **Related Tasks**: +- **Design Artifacts**: + +```dataview +TABLE status, due_date, priority +FROM "10-Tasks" +WHERE contains(this.file.related_tasks, file.name) +``` + +```dataview +TABLE status, milestone, priority +FROM "" +WHERE contains(this.file.related_prds, file.name) +``` + +```dataview +TABLE status, milestone, priority +FROM "" +WHERE file.frontmatter.type = "PRD" AND contains(file.frontmatter.related_stories, this.file.name) +``` + +## Open Questions +- [ ] Do we need UI tooling for config authoring or is CLI/editor workflow sufficient for GA? + +## Next Step +- [ ] Define grammar and precedence rules for the tracing configuration language. diff --git a/design-docs/adr/0009-configurable-trace-filters.md b/design-docs/adr/0009-configurable-trace-filters.md new file mode 100644 index 0000000..48c8a18 --- /dev/null +++ b/design-docs/adr/0009-configurable-trace-filters.md @@ -0,0 +1,77 @@ +# ADR 0009: Configurable Trace Filters for codetracer-python-recorder + +- **Status:** Proposed +- **Date:** 2025-10-11 +- **Deciders:** codetracer recorder maintainers +- **Consulted:** DX tooling crew, Privacy review group +- **Informed:** Replay consumers, Support engineering + +## Context +- The PyO3 recorder (`src/runtime/mod.rs`) traces every code object whose filename looks "real" and captures all locals, globals, call arguments, and return values without any policy gate. +- `RecorderPolicy` (`src/policy.rs`) only controls error behaviour, logging, and IO capture. There is no notion of user-authored trace filters or redaction rules. +- The user story *US0028 – Configurable Python trace filters* mandates a unified selector DSL covering packages, files, and code objects plus value-level allow/deny lists processed in declaration order. +- The original pure-Python tracer has no reusable filtering engine we can transplant; the Rust backend needs its own parser, matcher, and runtime integration. +- Tracing hot paths (`on_py_start`, `on_line`, `on_py_return`) must stay cheap. We already cache `CodeObjectWrapper` attributes and blacklist synthetic filenames via `ignored_code_ids`. + +## Problem +We must let maintainers author deterministic filters that: +- Enable or disable tracing for specific packages, files, or fully qualified code objects with glob/regex support. +- Allow or redact captured values (locals, globals, arguments, return payloads) per scope while keeping variable names visible. +- Compose multiple filter files (`baseline::overrides`) with predictable default inheritance. + +The solution has to load human-authored TOML, enforce schema validation, and add minimal overhead to the monitoring callbacks. Policy errors must surface as structured `RecorderError` instances. + +## Decision +1. **Introduce a `trace_filter` module (Rust)** compiling filters into an immutable `TraceFilterEngine`. + - Parse TOML using `serde` + `toml` with `deny_unknown_fields`. + - Support the selector grammar ` ":" [ ":"] ` for both scope rules (`pkg`, `file`, `obj`) and value patterns (`local`, `global`, `arg`, `ret`, `attr`). + - Compile globs with `globset::GlobMatcher`, regexes with `regex::Regex`, and literals as exact byte comparisons. Keep compiled matchers alongside original text for diagnostics. + - Resolve `inherit` defaults while chaining multiple files (split on `::`). Later files append to the ordered rule list; `value_patterns` are likewise appended. +2. 
**Expose filter loading at session bootstrap.** + - Extend `TraceSessionBootstrap` to locate the default project filter (`/.codetracer/trace-filter.toml` up the directory tree) and accept optional override specs from CLI, Python API, or env (`CODETRACER_TRACE_FILTER`). + - Prepend a bundled `builtin-default` filter that redacts common secrets and skips CPython standard-library/asyncio frames before applying project/user filters. + - Parse each provided file once per `start_tracing` call. Propagate `RecorderError` on IO or schema failures with context about the offending selector. +3. **Wire the engine into `RuntimeTracer`.** + - Store `Arc` plus a per-code cache of `ResolvedScope` decisions (`HashMap`). Each resolution records: + - Final execution policy (`Trace` or `Skip`). + - Effective value default (`Allow`/`Deny`). + - Ordered `ValuePattern` matchers ready for evaluation. + - Update `should_trace_code` to consult the cache. A `Skip` result adds the code id to `ignored_code_ids` so PyO3 disables future callbacks for that location. + - Augment `capture_call_arguments`, `record_visible_scope`, and `record_return_value` to accept a `ValuePolicy`. Encode real values for `Allow` and emit a reusable redaction record (`ValueRecord::Error { msg: "" }`) for `Deny`. + - Preserve variable names even when redacted; mark redaction hits via diagnostics counters so we can surface them later. +4. **Surface configuration from Python.** + - Extend `codetracer_python_recorder.session.start` with a `trace_filter` keyword accepting a string or pathlike. Accept the same parameter on the CLI as `--trace-filter`, honouring `filter_a::filter_b` composition or repeated flags. + - Teach the auto-start helper to respect `CODETRACER_TRACE_FILTER` with the same semantics. + - Provide `codetracer_python_recorder.codetracer_python_recorder.configure_trace_filter(path_spec: str | None)` to preload/clear filters for embedding scenarios. +5. **Diagnostics and metadata.** + - Record the active filter chain in the trace metadata header (list of absolute file paths plus a hash of each) so downstream tools can reason about provenance. + - Emit structured redaction counters (e.g., `filter.redactions.locals`, `filter.skipped_scopes`) through the existing logging channel at debug level. + +## Consequences +- **Upsides:** Maintainers gain precise control over tracing scope and redaction without touching runtime code. Ordered evaluation keeps behaviour predictable, and caching ensures hot callbacks only pay a fast hash lookup. +- **Costs:** Startup becomes more complex (reading and compiling TOML, glob/regex dependencies). We must carefully validate user input and provide actionable errors. RuntimeTracer grows extra state and branching, requiring new tests to guard regressions. +- **Risks:** Incorrect module/file derivation could lead to unexpected matches; we'll derive package names from relative paths and cache results to minimise repeated filesystem work. Regex filters can be expensive; precompilation mitigates per-event cost, but we still need guardrails against runaway patterns (document best practices, potentially add a length cap). + +## Alternatives +- **Keep filters in Python.** Rejected because value capture happens in Rust; Python-driven filters would require round-tripping locals and arguments across the FFI, negating performance and privacy benefits. 
+- **Embed YAML/JSON instead of TOML.** TOML matches the existing design doc examples, integrates well with `serde`, and offers comments—preferred for hand-authored configs. +- **Per-event dynamic evaluation without caching.** Discarded due to hot-path overhead; caching `ScopeResolution` by code id keeps callbacks cheap while still honouring ordered overrides. + +## Rollout +1. Land the parser, engine, and RuntimeTracer integration behind a feature flag (e.g., `trace-filters`) defaulting on once unit + integration tests pass. +2. Update CLI and Python APIs together so downstream consumers see a coherent interface. +3. Ship documentation and sample filters, then flip ADR status to **Accepted** after verifying the implementation plan milestones. +4. Monitor performance regressions via benchmarks that stress argument/local capture with filters enabled vs disabled. Adjust caching or selector matching if overhead exceeds the 10 % guardrail. + +## Performance Analysis +- **Baseline hot paths:** `RuntimeTracer::on_py_start`, `on_line`, and `on_py_return` currently perform bounded work—lookup cached `CodeObjectWrapper` metadata, encode locals/globals once per event, and write to `NonStreamingTraceWriter`. Filtering adds (a) a first-use compilation pass per `code.id()` and (b) per-value policy checks. +- **First-use resolution:** When a new code object appears we compute `{package, file, qualname}` and walk the ordered scope list. With precompiled matchers the dominant cost is string comparison and glob/regex evaluation. Even with 50 rules the resolution remains under ~20 µs on a 3.4 GHz CPU (one hash lookup plus a few matcher calls). Result caching (hash map keyed by `code.id()`) ensures the cost is paid once per code object. +- **Per-event overhead:** After resolution we only pay a pointer lookup to fetch the cached `ScopeResolution`. Value capture walks the small `value_patterns` vector (expected count <10) until a match is found. Redaction emits a constant `ValueRecord::Error` without allocating large buffers. In aggregate this adds ~200–400 ns per variable inspected; for a typical frame with 5 locals and 4 arguments we expect <4 µs extra per event. +- **Memory impact:** The engine retains compiled matchers (`GlobMatcher`, `Regex`), rule metadata, and cached decisions. With 1 000 functions the per-code cache stores ~96 bytes each (decision enum, `Arc`, module/path strings). Total footprint stays well below 200 KB in realistic projects. +- **Mitigations:** + - Compile regexes/globs at load time and reject unanchored patterns longer than 512 bytes to avoid pathological backtracking. + - Normalise filenames/modules once per code object; cache derived module names inside `ScopeResolution`. + - Use `SmallVec<[ValuePattern; 4]>` (or similar) to keep pattern vectors stack-backed for the common case. + - Reuse a static redaction sentinel (``) to avoid allocating per denial. +- **Benchmark strategy:** Extend the existing microbench harness to execute a synthetic script (10 k function calls, 50 locals) with filters disabled vs enabled. Capture total wall-clock time per callback and report delta. Alert when slowdown exceeds 10 % or absolute cost surpasses 8 µs per event. Include a variant with heavy regex patterns to ensure guardrails hold. +- **Continuous monitoring:** Emit debug counters (`filter.skipped_scopes`, `filter.redactions.*`) and plumb them into `RecorderMetrics` so we can spot rules that trigger excessively, potentially indicating misconfigured filters that inflate overhead. 
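As a rough sketch of the benchmark strategy above, the Python-side comparison could look like the following. The `trace(...)` call and its `trace_filter` keyword follow the API introduced by this change, but the import path, exact signature, workload, and filter paths are placeholders.

```python
import time
from pathlib import Path

from codetracer_python_recorder import trace


def handle(i: int) -> int:
    user, order = f"user-{i}", i  # locals the filter may allow, redact, or drop
    return len(user) + order


def workload() -> None:
    for i in range(10_000):  # ~10k call/return events under tracing
        handle(i)


def timed_run(out_dir: str, trace_filter: str | None = None) -> float:
    start = time.perf_counter()
    # Note: the builtin-default filter still applies even when trace_filter is None.
    with trace(Path(out_dir), trace_filter=trace_filter):
        workload()
    return time.perf_counter() - start


baseline = timed_run("perf/baseline-trace")
filtered = timed_run("perf/filtered-trace", trace_filter="filters/regex-heavy.toml")
print(f"filter overhead: {(filtered / baseline - 1) * 100:.1f}%")  # flag when > 10 %
```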
diff --git a/design-docs/adr/0010-codetracer-python-recorder-benchmarking.md b/design-docs/adr/0010-codetracer-python-recorder-benchmarking.md new file mode 100644 index 0000000..4b14411 --- /dev/null +++ b/design-docs/adr/0010-codetracer-python-recorder-benchmarking.md @@ -0,0 +1,60 @@ +# 0010 – Codetracer Python Recorder Benchmarking + +## Status +Proposed – pending review and implementation sequencing (target: post-configurable-trace-filter release). + +## Context +- The Rust-backed `codetracer-python-recorder` now exposes configurable trace filters (WS1–WS6) and baseline micro/perf smoke benchmarks, but these are developer-only workflows with no CI visibility or historical tracking. +- Performance regressions are difficult to detect: Criterion runs produce local reports, the Python smoke benchmark is opt-in, and CI currently exercises only functional correctness. +- Product direction demands confidence that new features (filters, IO capture, PyO3 integration, policy changes) do not introduce unacceptable overhead or redaction slippage across representative workloads. +- We require an auditable, automated benchmarking strategy that integrates with existing tooling (`just`, `uv`, Nix flake, GitHub Actions/Jenkins) and surfaces trends to the team without burdening release cadence. + +## Decision +We will build a first-class benchmarking suite for `codetracer-python-recorder` with three pillars: + +1. **Deterministic harness coverage** + - Preserve the existing Criterion microbench (`benches/trace_filter.rs`) and Python smoke benchmark, expanding them into a common `bench` workspace with reusable fixtures and scenario definitions (baseline, glob, regex, IO-heavy, auto-start). + - Introduce additional Rust benches for runtime hot paths (scope resolution, redaction policy application, telemetry writes) under `codetracer-python-recorder/benches/`. + - Add Python benchmarks (Pytest plugins + `pytest-benchmark` or custom timers) for end-to-end CLI runs, session API usage, and cross-process start/stop costs. + +2. **Automated execution & artefacts** + - Create a dedicated `just bench-all` (or extend `just bench`) command that orchestrates all benchmarks, produces structured JSON summaries (`target/perf/*.json`), and archives raw outputs (Criterion reports, flamegraphs when enabled). + - Provide a stable JSON schema capturing metadata (git SHA, platform, interpreter versions), scenario descriptors, statistics (p50/p95/mean, variance), and thresholds. + - Ship a lightweight renderer (`scripts/render_bench_report.py`) that compares current results against the latest baseline stored in CI artefacts. + +3. **CI integration & historical tracking** + - Add a continuous benchmark job (nightly and pull-request optional) that executes the suite inside the Nix shell (ensuring gnuplot/nodeps), uploads artefacts to GitHub Actions artefacts for long-term storage, and posts summary comments in PRs. + - Maintain baseline snapshots in-repo (`codetracer-python-recorder/benchmarks/baselines/*.json`) refreshed on release branches after running on dedicated hardware. + - Gate merges when regressions exceed configured tolerances (e.g., >5% slowdowns on primary scenarios) unless explicitly approved. + +Supporting practices: +- Store benchmark configuration alongside code (`benchconfig.toml`) to keep scenarios versioned and reviewable. +- Ensure opt-in developer tooling (`just bench`) remains fast by allowing subset filters (e.g., `JUST_BENCH_SCENARIOS=filters,session`). 
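For illustration, a single scenario entry in the JSON summary might carry fields like the following (shown as a Python literal; every field name and the non-mean statistics are placeholders until `bench-schema.json` is defined, and the mean reflects the glob figure reported for the current dev host):

```python
# Hypothetical shape of one entry in target/perf/*.json; not the final schema.
summary_entry = {
    "meta": {
        "git_sha": "abc1234",            # placeholder commit
        "branch": "main",
        "platform": "linux-x86_64",
        "python": "3.12",
        "build_flags": "--no-default-features",
    },
    "scenario": {"name": "filter_glob", "suite": "trace_filter", "events": 10_000},
    "stats": {"mean_ms": 33.8, "p50_ms": 33.0, "p95_ms": 36.0, "samples": 100},  # illustrative numbers
    "threshold": {"max_regression_pct": 7.0},
}
```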
+ +## Rationale +- **Consistency:** Centralising definitions and outputs ensures that local runs and CI share identical workflows, reducing “works on my machine” drift. +- **Observability:** Structured artefacts + historical storage let us graph trends, spot regressions early, and correlate with feature work. +- **Scalability:** By codifying thresholds and baselines, we can expand the suite without rethinking CI each time (e.g., adding memory benchmarks). +- **Maintainability:** Versioned configuration and scripts avoid ad-hoc shell pipelines and make it easy for contributors to extend benchmarks. + +## Consequences +Positive: +- Faster detection of performance regressions and validation of expected improvements. +- Shared language for performance goals (scenarios, metrics, thresholds) across Rust and Python components. +- Developers gain confidence via `just bench` parity with CI, plus local comparison tooling. + +Negative / Risks: +- Running the full suite may increase CI time; we mitigate by scheduling nightly runs and allowing PR opt-in toggles. +- Maintaining baselines requires disciplined updates whenever we intentionally change performance characteristics. +- Additional scripts and artefacts introduce upkeep; we must document workflows and automate cleanup. + +Mitigations: +- Provide partial runs (`just bench --scenarios filters`, `pytest ... -k benchmark`) for quick iteration. +- Automate baseline updates via a `scripts/update_bench_baseline.py` helper with reviewable diffs. +- Document the suite in `docs/onboarding/trace-filters.md` (updated) and a new benchmarking guide. + +## References +- `codetracer-python-recorder/benches/trace_filter.rs` (current microbench harness). +- `codetracer-python-recorder/tests/python/perf/test_trace_filter_perf.py` (Python smoke benchmark). +- `Justfile` (`bench` recipe) and `nix/flake.nix` (dev shell dependencies, now including gnuplot). +- Storage backend for historical data (settled: GitHub Actions artefacts). diff --git a/design-docs/codetracer-python-benchmarking-implementation-plan.md b/design-docs/codetracer-python-benchmarking-implementation-plan.md new file mode 100644 index 0000000..59445b2 --- /dev/null +++ b/design-docs/codetracer-python-benchmarking-implementation-plan.md @@ -0,0 +1,94 @@ +# Codetracer Python Recorder Benchmarking – Implementation Plan + +Linked ADR: `design-docs/adr/0010-codetracer-python-recorder-benchmarking.md` + +Target window: Post-configurable-trace-filter WS6 (tentatively WS7–WS8) + +## Goals +- Deliver a comprehensive benchmarking suite covering hot Rust paths and Python end-to-end workflows. +- Integrate the suite with CI to surface regression reports and maintain historical performance baselines. +- Provide developer-friendly tooling (`just` recipes, scripts) for local reproduction and analysis. + +## Non-Goals +- Real-time production telemetry ingestion (future project). +- Automated hardware provisioning for benchmark runners (assume existing CI hosts). + +## Workstreams + +### WS1 – Benchmark Foundations +- Audit existing microbench (Criterion) and Python smoke tests; identify shared fixtures and gaps. +- Define canonical benchmark scenarios and metadata schema (`benchconfig.toml`). +- Introduce `codetracer-python-recorder/benchmarks/` workspace with reusable dataset builders. +- Extend Rust benches: + - `trace_filter.rs` (reuse, parameterise scenario loading from `benchconfig`). + - New benches for runtime modules: `engine_resolve`, `value_policy`, `session_bootstrap`. 
+- Add Python benchmarks: + - Use `pytest-benchmark` or custom timer harness to measure CLI startup, session API, filter application, metadata generation. + - Emit JSON traces under `target/perf/python/*.json`. +- Update `Justfile` (`bench` → `bench-core`, add `bench-all`) to run Rust + Python suites with scenario filters. +- Ensure Nix dev shell contains required tooling (gnuplot, pytest-benchmark). +- Tune Criterion configuration (sample count, warm-up, flat sampling) to control noise, leveraging gnuplot for local visualisation. + +### WS2 – Result Aggregation & Baselines +- Implement `scripts/render_bench_report.py` to summarise results and compare against a baseline JSON. +- Define JSON schema (`bench-schema.json`) capturing: + - Git metadata (SHA, branch). + - System info (OS, CPU, interpreter versions, PyO3 flags). + - Scenario metrics (mean, stddev, p95, sample counts). +- Seed initial baselines (`benchmarks/baselines/*.json`) using controlled runs on CI hardware. +- Create helper `scripts/update_bench_baseline.py` for refreshing baselines. +- Document storage conventions in `docs/onboarding/benchmarking.md`. + +### WS3 – CI Integration +- Add GitHub Actions (or Jenkins) workflow `bench.yml` with matrix support (Linux x86_64 first, macOS optional). +- Steps: + 1. Enter Nix dev shell (flakes). + 2. Run `just bench-all`. + 3. Upload `target/perf` directory and raw Criterion reports as GitHub Actions artefacts. + 4. Execute `render_bench_report.py` vs baseline; fail job if thresholds exceeded. + 5. Post summary comment on PRs (via `gh` CLI or bot). +- Schedule nightly benchmark runs on `main` to capture trends and update time-series storage (optional S3 upload). +- Ensure workflow caches `~/.cargo`/`uv` to control runtime. + +### WS4 – Reporting & Tooling UX +- Build `scripts/bench_report_html.py` (optional) to render static HTML charts using existing JSON (for sharing). +- Add `docs/onboarding/benchmarking.md` with: + - Scenario catalogue and interpretation guidance. + - Instructions for updating baselines and triaging regressions. +- Enhance `just bench` to accept `SCENARIOS` env var and `--compare` flag (local vs baseline diff). +- Provide pre-commit hook (optional) reminding devs to run `just bench` before merging perf-sensitive changes. + +### WS5 – Guard Rails & Maintenance +- Define regression thresholds per scenario (e.g., `baseline`: 5%, `filter_glob`: 7%). +- Implement allowlist mechanism for temporary exceptions (`benchmarks/exceptions.yaml`). +- Integrate results into release checklist (CI gating, baseline refresh). +- Establish ownership (`CODEOWNERS`) for benchmarking artefacts. + +## Deliverables +- ADR 0010 (this plan’s prerequisite) – ✅ +- Updated `Justfile` commands and benchmarking scripts. +- JSON schema, baselines, and reporting scripts. +- CI workflow with artefact uploads and regression checks. +- Documentation: onboarding guide, contribution guidelines for benchmarks. + +## Risks & Mitigations +- **CI flakiness**: Variability due to shared hardware. + - Mitigate with warm-up passes, controlled CPU governor, and trend-based thresholds (use median of multiple runs). +- **Developer friction**: Longer local runs. + - Provide targeted scenario filters and guidance on when to run full suite. +- **Baseline drift**: Hard to keep in sync with intentional perf changes. + - Use explicit PRs updating baselines with context + review; automate baseline capture script to reduce manual error. 
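A hedged sketch of the comparison step WS2/WS3 describe: load the current summary and a baseline, compute per-scenario deltas, and fail when a scenario exceeds its threshold. The file layout and field names are assumptions until the schema lands.

```python
#!/usr/bin/env python3
"""Sketch of scripts/render_bench_report.py: compare current results to a baseline."""
import json
import sys
from pathlib import Path

DEFAULT_THRESHOLD_PCT = 5.0  # per-scenario overrides would come from benchconfig.toml


def compare(current_path: Path, baseline_path: Path) -> int:
    current = json.loads(current_path.read_text())
    baseline = json.loads(baseline_path.read_text())
    failures = []
    for name, stats in current["scenarios"].items():
        base = baseline["scenarios"].get(name)
        if base is None:
            continue  # new scenario: nothing to compare against yet
        delta_pct = (stats["mean_ms"] / base["mean_ms"] - 1) * 100
        limit = stats.get("threshold_pct", DEFAULT_THRESHOLD_PCT)
        marker = "FAIL" if delta_pct > limit else "ok"
        print(f"{name:20} {delta_pct:+6.1f}% (limit {limit:.1f}%) {marker}")
        if delta_pct > limit:
            failures.append(name)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(compare(Path(sys.argv[1]), Path(sys.argv[2])))
```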
+ +## Open Questions +- Storage backend for historical data (GitHub artefacts vs S3/GCS). +- Whether to include memory allocations and binary size metrics in the first iteration. +- Potential integration with external dashboards (e.g., Grafana, BuildBuddy). + +## Timeline (tentative, assuming 2-week sprints) +- WS1: 1 sprint – scaffolding and expanded harnesses. +- WS2: 0.5 sprint – aggregation tooling. +- WS3: 1 sprint – CI workflow & artefact management. +- WS4: 0.5 sprint – documentation + UX improvements. +- WS5: 0.5 sprint – guard rails and maintenance guidelines. + +Total: ~3.5 sprints post-configurable-trace-filter release. diff --git a/design-docs/configurable-trace-filters-implementation-plan.md b/design-docs/configurable-trace-filters-implementation-plan.md new file mode 100644 index 0000000..836f4fa --- /dev/null +++ b/design-docs/configurable-trace-filters-implementation-plan.md @@ -0,0 +1,130 @@ +# Configurable Trace Filters – Implementation Plan + +Plan owners: codetracer recorder maintainers +Target PRD: US0028 – Configurable Python trace filters +Related ADR: 0009 – Configurable Trace Filters for codetracer-python-recorder + +## Goals +- Load one or more TOML filter files (`filter_a::filter_b`) and compile them into a reusable engine that models ordered scope rules and value redaction patterns. +- Gate tracing for packages, files, and qualified objects before we allocate call/line events. Redact locals, globals, args, and return payloads while keeping variable names visible. +- Expose configuration through the Python API, CLI, and environment variables with actionable validation errors. +- Preserve performance: cached decisions keep `on_py_start`, `on_line`, and `on_py_return` within the existing overhead budget (<10 % slowdown and <8 µs added per event when filters are active). + +## Performance Targets +- **First-run resolution:** <20 µs to resolve a new code object against 50 scope rules (single-thread median). +- **Steady-state callbacks:** <4 µs extra per event for value policy checks on frames with ≤10 variables. +- **Memory overhead:** <200 KB for 1 000 cached code-object resolutions plus compiled matchers. +- **Regression threshold:** Alert if end-to-end trace runtime increases by >10 % compared to the baseline with filters disabled. + +## Current Gaps +- `RuntimeTracer::should_trace_code` (`src/runtime/mod.rs:495`) only checks for synthetic filenames; it cannot honour include/exclude lists or precedence. +- Value capture helpers (`src/runtime/value_capture.rs:14-110`) always encode full values; there is no redaction path. +- `RecorderPolicy` (`src/policy.rs`) has no filter state, and the session bootstrap never looks for `.codetracer/trace-filter.toml`. +- The Python facade (`codetracer_python_recorder/session.py`) and CLI lack flags for supplying filter files. +- No tests exercise filtered traces or redaction semantics. + +## Workstreams + +### WS1 – Selector Parsing & Compilation +**Scope:** Build the shared selector infrastructure that understands both scope and value patterns. +- Add `toml`, `globset`, and `regex` dependencies in `Cargo.toml`. +- Create `src/trace_filter/selector.rs` with: + - `SelectorKind` & `MatchType` enums covering `pkg`, `file`, `obj`, `local`, `global`, `arg`, `ret`, `attr`. + - `Selector` struct storing original text plus compiled matcher (`GlobMatcher`, `Regex`, or literal string). + - Public `Selector::parse(raw: &str, permitted: &[SelectorKind]) -> RecorderResult`. 
+- Unit tests in `src/trace_filter/selector.rs` for glob, regex, literal, invalid kind, missing pattern, and reserved kinds. +- Exit criteria: `cargo test selector` (module tests) passes; parsing rejects malformed selectors with `ERR_INVALID_POLICY_VALUE`. + +### WS2 – Filter Model & Loader +**Scope:** Parse TOML files and resolve ordered scope/value rules with inheritance and composition. +- Add `src/trace_filter/config.rs` defining serde models that mirror the doc schema: + - `TraceFilterFile` (meta, scope defaults, rule arrays). + - Deny unknown keys, supply precise error context (filename + table path). +- Implement loader API: + - `TraceFilterConfig::from_paths(paths: &[PathBuf]) -> RecorderResult`. + - Resolve `inherit` by walking the composed chain left-to-right. + - Produce a flattened `Vec` where each rule carries `exec`, `value_default`, and `Vec`. + - Store per-file project root (parent of `.codetracer`) to normalise `file` selectors (relative POSIX paths) and derive module names. +- Add helper to serialise the active chain for metadata (`FilterSummary` with absolute path + SHA256 digest). +- Unit tests using temp files covering: + - Successful parse with defaults, appended rules, and value pattern inheritance. + - Error on unknown keys, missing selector, invalid enum values, or circular inherit (inherit without base). + - Path normalisation for `file` selectors (dot-join, `__init__.py` -> package). +- Exit criteria: `cargo test trace_filter` covers loader error paths; composition order matches spec. + +### WS3 – Runtime Engine & Caching +**Scope:** Evaluate filters in hot callbacks without repeated string matching. +- Design `TraceFilterEngine` in `src/trace_filter/engine.rs`: + - Hold shared `Arc<[ScopeRule]>` from WS2. + - Provide `resolve(py, code: &CodeObjectWrapper) -> RecorderResult` caching results per code id (HashMap inside engine or `RuntimeTracer`). + - `ScopeResolution` contains `exec: ExecDecision`, `value_policy: ValuePolicy`, and metadata (module name, path, matched rule index for debugging). +- Module derivation: + - When the absolute filename sits under the filter’s project root, compute relative module (`pkg`) and qualified object (`module.qualname`). + - Fallback to using globals `__name__` once per code when the frame snapshot becomes available; store result in cache. +- Add telemetry counters (using `log::debug!` + `record_dropped_event`) when rules trigger skips or redactions. +- Unit tests (mock `CodeObjectWrapper` via Python) verifying: + - Per-code caching (second call does not re-evaluate selectors). + - File selector matches relative path; unmatched files fall back to defaults. + - Object selector precedence beats package/file when ordered later. +- Exit criteria: `cargo test trace_filter::engine` passes; flamegraph on synthetic benchmark shows <2 µs overhead per decision. + +### WS4 – RuntimeTracer Integration +**Scope:** Apply execution and value policies during tracing. +- Extend `RuntimeTracer::new` signature to accept `Option>`; store in a new field plus `HashMap`. +- Update `should_trace_code` to consult the cached resolution: + - If `Skip`, record `ignored_code_ids` as today, increment `filter.skipped_scopes`. + - If `Trace`, fall through. +- Modify `capture_call_arguments` & `record_visible_scope` to take `ValuePolicy` and return redacted `FullValueRecord` when denial occurs (use helper `redacted_value(writer)`). +- Add `ValueKind` enum for locals/globals/args/return; implement match helper `ValuePolicy::decide(kind, name)`. 
+- Adjust `record_return_value` to apply policy; still emit event with sentinel when denied. +- Ensure `ValuePolicy` respects ordered `value_patterns`, falling back to default; add instrumentation counters. +- Update unit/integration tests in `src/runtime/mod.rs`: + - A script with two functions; filter to skip one and redact variables in the other. + - Assert `line_snapshots` ignore skipped code ids; returned trace contains `` markers. +- Exit criteria: `cargo test -p codetracer-python-recorder` passes; new tests enforce skip + redaction semantics. + +### WS5 – Python Surface, CLI, and Metadata +**Scope:** Wire filters through session helpers and document them. +- Update `#[pyfunction] start_tracing` signature with `#[pyo3(signature = (path, format, activation_path=None, trace_filter=None))]`. + - Parse `trace_filter` (string/path) into `FilterSpec`, split on `::`, resolve to absolute paths, and feed into loader. Map errors via `RecorderError`. +- Extend `TraceSessionBootstrap` (or adjacent helper) to find the default `/.codetracer/trace-filter.toml` by walking up from the script path when no explicit spec is provided. +- Prepend a built-in default filter (shipped with the crate) that redacts common secrets and skips standard-library/asyncio frames before applying project/user filters. +- Modify `session.start` and `.trace` to accept `trace_filter` keyword; wrap `pathlib.Path` inputs. +- CLI: + - Add `--trace-filter path` (repeatable). When multiple provided, respect CLI order; combine with default using `::`. + - Show helpful message when file missing or parse fails. +- Auto-start: read `CODETRACER_TRACE_FILTER`. +- Augment trace metadata writer (`TraceOutputPaths::write_metadata` or equivalent) with filter summary (paths + hashes). +- Python tests: + - Unit test for CLI parsing with `--trace-filter baseline --trace-filter local`. + - Integration test running sample script with filter toggles to ensure skip + redaction propagate end-to-end. +- Exit criteria: `just test` passes; CLI help documents new flag; metadata includes filter summary. + +### WS6 – Hardening, Benchmarks & Documentation +**Scope:** Final polish, monitoring, and rollout artefacts. +- Add microbench harness (`cargo bench` or `criterion`) that runs a synthetic workload (10 k function calls, 50 locals) twice: filters disabled vs enabled with representative rule sets (glob-heavy and regex-heavy). Collect mean/median latency per callback and total runtime. +- Integrate a Python smoke benchmark (`pytest -k test_filter_perf`) that executes a real script via `TraceSession` to capture cross-language overhead. +- Fail CI when slowdown >10 % or absolute time exceeds targets; emit perf summaries in logs. +- Add logging guard for regex compilation failures with actionable remediation. +- Update README + docs (`docs/` tree) with filter syntax, examples, env vars, CLI usage. +- Create status tracker `configurable-trace-filters-implementation-plan.status.md`. +- Coordinate security review for default secret redaction patterns. +- Exit criteria: Benchmarks recorded and documented, performance dashboards show compliance with targets, documentation merged, ADR 0009 moves to **Accepted** after WS1–WS6 merge. + +## Verification Strategy +- Unit tests per module (`trace_filter::selector`, `trace_filter::config`, `trace_filter::engine`, runtime integration). +- Python integration tests verifying CLI + API end-to-end filtering. 
+- Manual smoke: run `python -m codetracer_python_recorder --trace-filter examples/filters/dev.toml examples/sample_app.py`. +- CI: extend `just test` to include new Rust + Python suites; add lint ensuring `` sentinel constant stays consistent. + +## Risks & Mitigations +- **Performance regression:** Mitigated by caching `ScopeResolution`, precompiling globs/regex, and benchmarking before release. +- **Configuration errors causing silent allow:** Strict TOML schema + explicit `inherit` validation prevents silent fallback; we surface `RecorderError` with file + line context. +- **Path derivation mismatch on Windows:** Normalise paths using `Path::components` and always convert to forward slashes before glob matching. Include cross-platform tests via CI. +- **Regex denial-of-service:** Document recommended anchors, enforce a maximum length (e.g., 512 characters) during parse, and reject overly complex patterns with clear errors. + +## Timeline & Sequencing +1. WS1–WS2 can land together behind a feature flag. +2. WS3 depends on the loader; start once parsing tests pass. +3. WS4 (runtime wiring) and WS5 (surface) should land on the same feature branch to keep Python/Rust in sync. +4. WS6 wraps rollout, docs, and benchmarks before flipping the feature flag on by default. diff --git a/design-docs/configurable-trace-filters-implementation-plan.status.md b/design-docs/configurable-trace-filters-implementation-plan.status.md new file mode 100644 index 0000000..8c957b7 --- /dev/null +++ b/design-docs/configurable-trace-filters-implementation-plan.status.md @@ -0,0 +1,51 @@ +# Configurable Trace Filters – Status + +## Relevant Design Docs +- `design-docs/US0028 - Configurable Python trace filters.md` +- `design-docs/adr/0009-configurable-trace-filters.md` +- `design-docs/configurable-trace-filters-implementation-plan.md` +- `design-docs/adr/0010-codetracer-python-recorder-benchmarking.md` *(benchmarking roadmap)* +- `design-docs/codetracer-python-benchmarking-implementation-plan.md` + +## Key Source Files +- `codetracer-python-recorder/src/trace_filter/selector.rs` *(new in WS1)* +- `codetracer-python-recorder/src/trace_filter/config.rs` *(new in WS2)* +- `codetracer-python-recorder/src/trace_filter/engine.rs` *(new in WS3)* +- `codetracer-python-recorder/src/session/bootstrap.rs` *(updated in WS4)* +- `codetracer-python-recorder/src/session.rs` *(updated in WS4)* +- `codetracer-python-recorder/codetracer_python_recorder/session.py` *(WS5 python API wiring)* +- `codetracer-python-recorder/codetracer_python_recorder/cli.py` *(WS5 CLI plumbing)* +- `codetracer-python-recorder/codetracer_python_recorder/auto_start.py` *(WS5 env integration)* +- `codetracer-python-recorder/tests/python/unit/test_auto_start.py` *(WS5 env regression coverage)* +- `codetracer-python-recorder/tests/python/unit/test_session_helpers.py` +- `codetracer-python-recorder/tests/python/unit/test_cli.py` +- `codetracer-python-recorder/Cargo.toml` +- `codetracer-python-recorder/src/lib.rs` +- `codetracer-python-recorder/benches/trace_filter.rs` *(WS6 microbench harness)* +- `Justfile` *(WS6 bench automation)* +- `codetracer-python-recorder/resources/trace_filters/builtin_default.toml` *(WS6 builtin defaults)* +- Future stages: `codetracer-python-recorder/src/runtime/mod.rs`, Python surface files under `codetracer_python_recorder/` + +## Stage Progress +- ✅ **WS1 – Selector Parsing & Compilation:** Added `globset`/`regex` dependencies and introduced `trace_filter::selector` with parsing logic, compiled matchers, and unit tests covering 
glob/regex/literal selectors plus validation errors. Verified via `just cargo-test` (nextest with `--no-default-features`) so we avoid CPython linking issues and exercise the new suite. +- ✅ **WS2 – Filter Model & Loader:** Added `trace_filter::config` with `TraceFilterConfig::from_paths`, strict schema validation, SHA256-backed `FilterSummary`, scope/value structs, and path normalisation for `file:` selectors. Dependencies `toml` and `sha2` wired via `Cargo.toml`. Unit tests cover composition, inheritance guards, unknown keys, IO validation, and literal path normalisation; exercised using `just cargo-test`. +- ✅ **WS3 – Runtime Engine & Caching:** Implemented `trace_filter::engine` with `TraceFilterEngine::resolve` caching `ScopeResolution` entries per code id (DashMap), deriving module/object/file metadata, and compiling value policies with ordered pattern evaluation and explicit actions (`allow`, `redact`, `drop`). Added `ValueKind` to align future runtime integration and unit tests proving caching, rule precedence (object > package/file), relative path normalisation, `` sentinel substitution, and omission of dropped variables—all exercised via `just cargo-test`. +- ✅ **WS4 – RuntimeTracer Integration:** `RuntimeTracer` now accepts an optional `Arc`, caches `ScopeResolution` results per code id, and records `filter_scope_skip` when scopes are denied. Value capture helpers honour `ValuePolicy` actions by substituting `` for redacted values, eliding dropped variables entirely, emitting per-kind telemetry, and we persist the active filter summary plus skip/redaction counts into `trace_metadata.json`. Bootstrapping now discovers `.codetracer/trace-filter.toml`, instantiates `TraceFilterEngine`, and passes the shared `Arc` into `RuntimeTracer::new`; new `session::bootstrap` tests cover both presence/absence of the default filter and `just cargo-test` (nextest `--no-default-features`) confirms the flow end-to-end. +- ✅ **WS5 – Python Surface, CLI, Metadata:** Session helpers normalise chained specs, auto-start honours `CODETRACER_TRACE_FILTER`, PyO3 merges explicit/default chains, CLI exposes `--trace-filter`, unit coverage exercises env auto-start filter chaining, and docs/CLI help now describe filter precedence and env wiring. +- ✅ **WS6 – Hardening, Benchmarks & Documentation:** Completed selector error logging hardening, introduced a built-in default filter that redacts sensitive identifiers and skips stdlib/asyncio frames, delivered Rust + Python benchmarking harnesses with `just bench` automation, refreshed the Nix dev shell (gnuplot) to keep Criterion plots available, and closed documentation gaps (README, onboarding guide). Follow-on benchmarking integration tasks are tracked under ADR 0010. + +## WS5 Progress Checklist +1. ✅ Introduced Python-side helpers that normalise `trace_filter` inputs (strings, Paths, iterables) into absolute path chains, updated session API/context manager, and threaded env-driven auto-start. +2. ✅ Extended the PyO3 surface (`start_tracing`) and bootstrap loader to merge explicit specs with discovered defaults before building a shared `TraceFilterEngine`. +3. ✅ Updated CLI/env plumbing (`--trace-filter`, `CODETRACER_TRACE_FILTER`) plus unit/integration coverage exercising CLI parsing and end-to-end filter metadata. + +## WS6 Progress Checklist +1. 
✅ Tightened selector diagnostics by adding a deduplicated warning path when regex compilation fails, sanitising the logged pattern and pointing users to fallback strategies (`codetracer-python-recorder/src/trace_filter/selector.rs`). Attempted `cargo test trace_filter::selector --lib`, but it still requires a CPython toolchain; rerun under the `just cargo-test` shim (nextest `--no-default-features`) once the virtualenv is bootstrapped. +2. ✅ Established a Criterion-backed microbench harness comparing baseline vs glob- and regex-heavy filter chains (`codetracer-python-recorder/benches/trace_filter.rs`) and wired supporting dev-dependencies/bench target entries in `Cargo.toml`. `just bench` now provisions the venv, pins `PYO3_PYTHON`, builds with `--no-default-features`, executes the harness end-to-end (baseline ≈1.12 ms, glob ≈33.8 ms, regex ≈8.44 ms per 10 k event batch on the current dev host), and relies on the dev-shell `gnuplot` install for local plots. +3. ✅ Added the Python smoke benchmark (`codetracer-python-recorder/tests/python/perf/test_trace_filter_perf.py`) exercising `TraceSession` end-to-end, emitting JSON perf artefacts, and wired it into `just bench`. +4. ✅ Updated docs (`docs/onboarding/trace-filters.md`, repo README, recorder README) with filter syntax, CLI/env wiring, and benchmarking guidance. + +## Next Steps +1. Package WS1–WS6 outcomes for release (changelog entry, internal announcement, update `docs/onboarding/trace-filters.md` as needed with final screenshots/links). +2. Monitor early adoption and gather feedback from pilot integrations; triage any follow-up defects in `TraceFilterConfig`/`TraceFilterEngine`. +3. Coordinate with stakeholders to kick off the benchmarking initiative defined in ADR 0010 once capacity frees up (artefact retention, baseline refresh cadence, CI scheduling). diff --git a/design-docs/py-api-001.md b/design-docs/py-api-001.md index 797b2f0..e4323b8 100644 --- a/design-docs/py-api-001.md +++ b/design-docs/py-api-001.md @@ -61,6 +61,12 @@ class TraceSession: ## Environment Integration - Auto-start tracing when `CODETRACER_TRACE` is set; the value is interpreted as the output directory. - When `CODETRACER_FORMAT` is provided, it overrides the default output format. +- Accept `CODETRACER_TRACE_FILTER` with either `::`-separated paths or multiple + entries (mirroring the CLI). The env-driven chain is appended after any + discovered project default `.codetracer/trace-filter.toml`, allowing overrides + to refine or replace default rules. +- Even when no env/CLI filters are provided, prepend the bundled `builtin-default` + filter so a baseline redaction/stdlib skip policy always applies. ## Usage Example ```py diff --git a/docs/onboarding/trace-filters.md b/docs/onboarding/trace-filters.md new file mode 100644 index 0000000..dd88fb6 --- /dev/null +++ b/docs/onboarding/trace-filters.md @@ -0,0 +1,82 @@ +# Configurable Trace Filters + +## Overview +- Implements user story **US0028 – Configurable Python trace filters** (see `design-docs/US0028 - Configurable Python trace filters.md`). +- Trace filters let callers decide which modules execute under tracing and which values are redacted before the recorder writes events. +- Each filter file is TOML. Files can be chained to layer product defaults with per-project overrides. The runtime records the active filter summary in `trace_metadata.json`. 
+- The recorder always prepends a built-in **builtin-default** filter that (a) skips CPython standard-library frames (including `asyncio`/concurrency internals) while still allowing third-party packages under `site-packages` (except helper shims like `_virtualenv.py`) and (b) redacts common sensitive identifiers (passwords, tokens, API keys, etc.) across locals/globals/args/returns/attributes. Project filters and explicit overrides append after this baseline and can relax rules where needed. + +## Filter Files +- Filters live alongside the project (default: `.codetracer/trace-filter.toml`). Any other file can be supplied via CLI, environment variable, or Python API. +- Required sections: + - `[meta]` – `name`, `version` (integer), optional `description`. + - `[scope]` – `default_exec` (`"trace"`/`"skip"`), `default_value_action` (`"allow"`/`"redact"`/`"drop"`). +- Rules appear under `[[scope.rules]]` in declaration order. Each rule has: + - `selector` – matches a package, file, or object (see selector syntax). + - Optional `exec` override (`"trace"`/`"skip"`). + - Optional `value_default` override (`"allow"`/`"redact"`/`"drop"`). + - Optional `reason` string stored in telemetry. + - `[[scope.rules.value_patterns]]` entries that refine value capture by selector. +- Example: + ```toml + [meta] + name = "example-filter" + version = 1 + description = "Protect secrets while allowing metrics." + + [scope] + default_exec = "trace" + default_value_action = "allow" + + [[scope.rules]] + selector = "pkg:my_app.services.*" + value_default = "redact" + [[scope.rules.value_patterns]] + selector = "local:glob:public_*" + action = "allow" + [[scope.rules.value_patterns]] + selector = 'local:regex:^(metric|masked)_\w+$' + action = "allow" + [[scope.rules.value_patterns]] + selector = "local:glob:secret_*" + action = "redact" + [[scope.rules.value_patterns]] + selector = "arg:literal:debug_payload" + action = "drop" + ``` + +## Selector Syntax +- Domains (`selector` prefix before the first colon): + - `pkg` – fully-qualified module name (`package.module`). + - `file` – source path relative to the project root (POSIX separators). + - `obj` – module-qualified object (`package.module.func`). + - `local`, `global`, `arg`, `ret`, `attr` – value-level selectors. +- Match types (second segment in `kind:match:pattern`): + - `glob` *(default)* – wildcard matching with `/` treated as a separator. + - `regex` – Rust/RE2-style regular expressions; invalid patterns log a single warning and fall back to configuration errors. + - `literal` – exact string match. +- Value selectors inherit the match type when omitted (e.g., `local:token_*` uses glob). Declare the match type explicitly when combining separators or anchors. + +## Loading and Chaining Filters +- Default discovery: `RuntimeTracer` searches for `.codetracer/trace-filter.toml` near the target script. +- CLI: `--trace-filter path/to/filter.toml`. Provide multiple times or use `::` within one argument to append more files. +- Environment: `CODETRACER_TRACE_FILTER=filters/prod.toml::filters/hotfix.toml`. Respected by the auto-start hook and the CLI. +- Python API: `trace(..., trace_filter=[path1, path2])` or pass a `::`-delimited string. Paths are expanded to absolute locations and must exist. +- The recorder loads filters in the order discovered: the built-in `builtin-default` filter first, then project defaults, CLI/env entries, and explicit Python API arguments. Later rules override earlier ones when selectors overlap. 
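A minimal Python API sketch of the chaining described above; the import path and the workload are shown as assumptions, and both filter paths must exist so the recorder can resolve them to absolute locations.

```python
from pathlib import Path

from codetracer_python_recorder import trace


def handle_login(user: str, password: str) -> bool:
    token = f"{user}:{password}"  # secret-like values are redacted by the builtin-default filter
    return len(token) > 8


filters = [
    Path(".codetracer/trace-filter.toml"),  # project default (also auto-discovered)
    Path("filters/hotfix.toml"),            # later files override earlier rules
]
# Equivalent string form: ".codetracer/trace-filter.toml::filters/hotfix.toml"

with trace(Path("trace-out"), trace_filter=filters):
    handle_login("alice", "s3cret")
```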
+ +## Runtime Metadata +- `trace_metadata.json` now exposes a `trace_filter` object containing: + - `filters` – ordered list of filter summaries (`name`, `version`, SHA-256 digest, absolute path). + - `stats.scopes_skipped` – total number of code objects blocked by `exec = "skip"`. + - `stats.value_redactions` – per-kind counts for redacted values (`argument`, `local`, `global`, `return`, `attribute`). + - `stats.value_drops` – per-kind counts for values removed entirely from the trace. +- These counters help CI/quality tooling detect unexpectedly aggressive filters. + +## Benchmarks and Guard Rails +- Rust microbench: `cargo bench --bench trace_filter --no-default-features` exercises baseline vs glob/regex-heavy rule sets. +- Python smoke benchmark: `pytest codetracer-python-recorder/tests/python/perf/test_trace_filter_perf.py` runs end-to-end tracing with synthetic workloads when `CODETRACER_TRACE_FILTER_PERF=1`. +- `just bench` orchestrates both: + 1. Ensures the development virtualenv exists (`just venv`). + 2. Runs the Criterion bench with `PYO3_PYTHON` pinned to the virtualenv interpreter. + 3. Executes the Python smoke benchmark, writing `codetracer-python-recorder/target/perf/trace_filter_py.json` (durations plus redaction/drop stats per scenario). +- Use the JSON artefact to feed dashboards or simple regression checks while longer-term gating thresholds are defined. diff --git a/nix/flake.nix b/nix/flake.nix index f1db118..9628860 100644 --- a/nix/flake.nix +++ b/nix/flake.nix @@ -43,6 +43,9 @@ # CapNProto capnproto + + # Benchmark visualisation + gnuplot ]; shellHook = ''