design: Module-name resolution simplification

tzanko-matev · tzanko-matev · commit 0701527e2239 · 2025-10-27T14:46:44.000+02:00
design-docs/adr/0016-module-name-resolution-via-globals-name.md: 
design-docs/module-name-resolution-deep-review.md: 
design-docs/module-name-resolution-globals-implementation-plan.md: 

Signed-off-by: Tzanko Matev &lt;tsanko@metacraft-labs.com&gt;
diff --git a/design-docs/adr/0016-module-name-resolution-via-globals-name.md b/design-docs/adr/0016-module-name-resolution-via-globals-name.md
@@ -0,0 +1,49 @@
+# ADR 0016 – Module Name Resolution via `__name__`
+
+- **Status:** Proposed
+- **Date:** 2025-03-18
+- **Stakeholders:** Runtime team, Trace Filter maintainers
+- **Related Decisions:** ADR 0013 (Reliable Module Name Derivation), ADR 0014 (Module Call Event Naming)
+
+## Context
+
+Today both the runtime tracer and the trace filter engine depend on `ModuleIdentityResolver` to derive dotted module names from code-object filenames. The resolver walks `sys.path`, searches `sys.modules`, and applies filesystem heuristics to map absolute paths to module identifiers (`codetracer-python-recorder/src/module_identity.rs:18`). While this produces stable names, it adds complexity to hot code paths (`runtime_tracer.rs:280`) and creates divergence from conventions used elsewhere in the Python ecosystem.
+
+Python’s logging framework, import system, and standard introspection APIs all rely on the module object’s `__name__` attribute. During module execution the `py_start` callback already runs with the module frame active, meaning we can read `frame.f_globals["__name__"]` before we emit events or evaluate filters. This naturally aligns with user expectations and removes the need for filesystem heuristics.
+
+## Decision
+
+We will source module identities directly from `__name__` when a `py_start` callback observes a code object whose qualified name is `<module>`. The runtime tracer will pass that value to both the trace filter coordinator and the event encoder, ensuring call records emit `<{__name__}>` labels and filters match the same string.
+
+Key points:
+
+1. `py_start` inspects the `CodeObjectWrapper` and, for `<module>` qualnames, reads `frame.f_globals.get("__name__")`. If present and non-empty, this name becomes the module identifier for the current scope.
+2. Filters always evaluate using the `__name__` value gathered at `py_start`; they no longer attempt to strip file paths or enumerate `sys.modules`.
+3. We keep recording the absolute and relative file paths already emitted elsewhere (`TraceFilterEngine::ScopeContext` still normalises filenames for telemetry), but module-name derivation no longer depends on them.
+4. We delete `ModuleIdentityResolver`, `ModuleIdentityCache`, and associated heuristics once all call sites switch to the new flow.
+
+## Consequences
+
+**Positive**
+
+- Simplifies hot-path logic, removing `DashMap` lookups and `sys.modules` scans.
+- Harmonises trace filtering semantics with Python logging (same strings users already recognise).
+- Eliminates filesystem heuristics that were fragile on mixed-case filesystems or when `sys.path` mutated mid-run.
+
+**Negative / Risks**
+
+- Direct scripts (`python my_tool.py`) continue to report `__name__ == "__main__"`. Filters must explicitly target `__main__` in those scenarios, whereas the current resolver maps to `package.module`. We will document the behaviour change and offer guidance for users relying on path-based selectors.
+- Synthetic modules created via `runpy.run_path` or dynamic loaders may use ad-hoc `__name__` values. We rely on importers to supply meaningful identifiers, matching Python logging expectations.
+- Tests and documentation referencing filesystem-based names must be updated.
+
+**Mitigations**
+
+- Provide a compatibility flag during rollout so filter configurations can opt into the new behaviour incrementally.
+- Emit a debug log when `__name__` is missing or empty, falling back to `<module>` exactly as today.
+- Preserve path metadata in filter resolutions so existing file-based selectors continue to work.
+
+## Rollout Notes
+
+- Update existing ADR 0013/0014 statuses once this ADR is accepted and the code lands.
+- Communicate the behavioural change to downstream teams who consume `<module>` events or rely on path-derived module names.
+
diff --git a/design-docs/module-name-resolution-deep-review.md b/design-docs/module-name-resolution-deep-review.md
@@ -0,0 +1,82 @@
+# Module Name Resolution Deep Review
+
+This document walks through how CodeTracer derives dotted module names from running Python code, why the problem matters, and where the current implementation may need refinement. The goal is to make the topic approachable for readers who only have basic familiarity with Python modules.
+
+## 1. What Problem Are We Solving?
+
+Python reports the top-level body of any module with the generic qualified name `<module>`. When CodeTracer records execution, every package entry point therefore looks identical unless we derive the real dotted module name ourselves (for example, turning `<module>` into `<mypkg.tools.cli>`). Without that mapping:
+
+- trace visualisations lose context because multiple files collapse to `<module>`
+- trace filters cannot match package-level rules (e.g., `pkg:glob:mypkg.*`)
+- downstream tooling struggles to correlate events back to the source tree
+
+Our recorder must therefore infer the right module name from runtime metadata, file paths, and user configuration, even in tricky situations such as site-packages code, editable installs, or scripts executed directly.
+
+## 2. Python Module Background (Minimal Primer)
+
+The current design assumes the reader understands the following basics:
+
+- **Modules and packages** – every `.py` file is a module; a directory with an `__init__.py` is treated as a package whose submodules use dotted names like `package.module`.
+- **`sys.path`** – Python searches each entry (directories or zip files) when importing modules. Joining a path entry with the relative file path yields the dotted name.
+- **`sys.modules`** – a dictionary of module objects keyed by dotted module name. Each module typically exposes `__spec__`, `__name__`, and `__file__`, which reveal how and from where it was loaded.
+- **Code objects** – functions and module bodies have code objects whose `co_filename` stores the source path and whose `co_qualname` stores the qualified name (`<module>` for top-level bodies).
+- **Imports are idempotent** – when Python imports a module it first creates and registers the module object in `sys.modules`, then executes its body. That guarantees CodeTracer can query `sys.modules` while tracing the import.
+
+## 3. Why CodeTracer Resolves Module Names
+
+- **Trace event labelling** – `RuntimeTracer::function_name` converts `<module>` frames into `<dotted.package>` so traces remain readable (`codetracer-python-recorder/src/runtime/tracer/runtime_tracer.rs:280`).
+- **Trace filter decisions** – package selectors rely on module names, and file selectors reuse the same normalised paths. The filter engine caches per-code-object decisions together with the resolved module identity (`codetracer-python-recorder/src/trace_filter/engine.rs:183` and `:232`).
+- **File path normalisation** – both subsystems need consistent POSIX-style paths for telemetry, redaction policies, and for emitting metadata in trace files.
+- **Cross-component hints** – filter resolutions capture the module name, project-relative path, and absolute path and hand them to the runtime tracer so both parts agree on naming (`codetracer-python-recorder/src/runtime/tracer/runtime_tracer.rs:292`).
+
+## 4. Where the Logic Lives
+
+- `codetracer-python-recorder/src/module_identity.rs:18` – core resolver, caching, heuristics, and helpers (`normalise_to_posix`, `module_from_relative`, etc.).
+- `codetracer-python-recorder/src/trace_filter/engine.rs:183` – owns a `ModuleIdentityResolver` instance, builds `ScopeContext`, and stores resolved module names inside `ScopeResolution`.
+- `codetracer-python-recorder/src/runtime/tracer/runtime_tracer.rs:280` – turns `<module>` frames into `<{module}>` labels using `ModuleIdentityCache`.
+- `codetracer-python-recorder/src/runtime/tracer/filtering.rs:19` – captures filter outcomes and pipes their module hints back to the tracer.
+- Tests exercising the behaviour: Rust unit tests in `codetracer-python-recorder/src/module_identity.rs:551` and Python integration coverage in `codetracer-python-recorder/tests/python/test_monitoring_events.py:199`.
+
+The pure-Python recorder does not perform advanced module-name derivation; all of the reusable logic lives in the Rust-backed module.
+
+## 5. How the Algorithm Works End to End
+
+### 5.1 Building the resolver
+1. `ModuleIdentityResolver::new` snapshots `sys.path` under the GIL, normalises each entry to POSIX form, removes duplicates, and sorts by descending length so more specific roots win (`module_identity.rs:24`–`:227`).
+2. Each entry stays cached for the duration of the recorder session; mutations to `sys.path` after startup are not observed automatically.
+
+### 5.2 Resolving an absolute filename
+Given a normalised absolute path (e.g., `/home/app/pkg/service.py`):
+
+1. **Path-based attempt** – `module_name_from_roots` strips each known root and converts the remainder into a dotted form (`pkg/service.py → pkg.service`). This is fast and succeeds for project code and site-packages that live under a `sys.path` entry (`module_identity.rs:229`–`:238`).
+2. **Heuristics** – if the first guess looks like a raw filesystem echo (meaning we probably matched a catch-all root like `/`), the resolver searches upward for project markers (`pyproject.toml`, `.git`, etc.) and retries with that directory as the root. Failing that, it uses the immediate parent directory (`module_identity.rs:240`–`:468`).
+3. **`sys.modules` sweep** – the resolver iterates through loaded modules, comparing `__spec__.origin`, `__file__`, or `__cached__` paths (normalised to POSIX) against the target filename, accounting for `.py` vs `.pyc` differences. Any valid dotted name wins over heuristic guesses (`module_identity.rs:248`–`:335`).
+4. The winning name (or lack thereof) is cached by absolute path in a `DashMap` so future lookups avoid repeated sys.modules scans (`module_identity.rs:54`–`:60`).
+
+### 5.3 Mapping code objects to module names
+
+`ModuleIdentityCache::resolve_for_code` accepts optional hints:
+
+1. **Preferred hint** – e.g., the filter engine’s stored module name. It is accepted only if it is a valid dotted identifier (`module_identity.rs:103`–`:116`).
+2. **Relative path** – converted via `module_from_relative` when supplied (`module_identity.rs:107`–`:110`).
+3. **Absolute path** – triggers the resolver described above (`module_identity.rs:112`–`:115`).
+4. **Globals-based hint** – a last resort using `frame.f_globals["__name__"]` when available (`module_identity.rs:116`).
+
+If no hints contain an absolute path, the cache will read `co_filename`, normalise it, and resolve it once (`module_identity.rs:118`–`:127`). Results (including failures) are memoised per `code_id`.
+
+### 5.4 Feeding results into the rest of the system
+
+1. The trace filter builds a `ScopeContext` for every new code object. It records project-relative paths and module names derived from configuration roots, then calls back into the resolver if the preliminary name is missing or invalid (`trace_filter/engine.rs:420`–`:471`).
+2. The resulting `ScopeResolution` is cached and exposed to the runtime tracer via `FilterCoordinator`, providing rich hints (`runtime/tracer/filtering.rs:43`).
+3. During execution, `RuntimeTracer::function_name` reaches into the shared cache to turn `<module>` qualnames into `<pkg.module>` labels. If every heuristic fails, it safely falls back to the original `<module>` string (`runtime_tracer.rs:280`–`:305`).
+4. Both subsystems reuse the same `ModuleIdentityResolver`, ensuring trace files and filtering decisions stay consistent.
+
+## 6. Bugs, Risks, and Observations
+
+- **Prefix matching ignores path boundaries** – `strip_posix_prefix` checks `path.starts_with(base)` without verifying the next character is a separator. A root like `/opt/app` therefore incorrectly matches `/opt/application/module.py`, yielding the bogus module `lication.module`. When `sys.modules` lacks the correct entry (e.g., resolving a file before import), the resolver will cache this wrong answer (`module_identity.rs:410`–`:426`).
+- **Case sensitivity on Windows** – normalisation preserves whatever casing the OS returns. If `co_filename` and `sys.modules` report the same path with different casing, `equivalent_posix_paths` will not treat them as equal, causing the fallback to miss (`module_identity.rs:317`–`:324`). Consider lowercasing drive prefixes or using `Path::eq` semantics behind the GIL.
+- **`sys.path` mutations after startup** – the resolver snapshots roots once. If tooling modifies `sys.path` later (common in virtualenv activation scripts), we will never see the new prefix, so we fall back to heuristics or `sys.modules`. Documenting this behaviour or exposing a method to refresh the roots may avoid surprises.
+- **Project-marker heuristics hit the filesystem** – `has_project_marker` calls `exists()` for every parent directory when the fast path fails (`module_identity.rs:470`–`:493`). Because results are cached per file this is usually acceptable, but tracing thousands of unique `site-packages` files on network storage could still become expensive.
+
+Addressing the first two items would materially improve correctness; the latter two are design trade-offs worth monitoring.
+
diff --git a/design-docs/module-name-resolution-globals-implementation-plan.md b/design-docs/module-name-resolution-globals-implementation-plan.md
@@ -0,0 +1,64 @@
+# Module Name Resolution via `__name__` – Implementation Plan
+
+This plan delivers ADR 0016 by retooling module-name derivation around `frame.f_globals["__name__"]` and retiring the existing filesystem-based resolver.
+
+## Goals
+- Use `__name__` as the single source of truth for module identifiers during tracing and filtering.
+- Remove the `ModuleIdentityResolver`/`ModuleIdentityCache` complex and associated heuristics.
+- Preserve relative/absolute path metadata for selectors and telemetry.
+- Ship the change behind a feature flag so we can roll out gradually.
+
+## Non-goals
+- Changing how file selectors work (they should keep matching on normalised paths).
+- Replacing hints elsewhere that already rely on `__name__` (e.g., value redaction logic).
+- Revisiting logging or module instrumentation beyond the tracer/filter flow.
+
+## Work Breakdown
+
+### Stage 0 – Feature Flag and Compatibility Layer
+- Add a recorder policy flag `module_name_from_globals` defaulting to `false`.
+- Plumb the flag through CLI/env configuration and expose it via the Python bindings (mirroring other policy toggles).
+- Update integration tests to parameterise expectations for both modes.
+
+### Stage 1 – Capture `__name__` at `py_start`
+- In `runtime/tracer/events.rs`, augment the `on_py_start` handler to detect `<module>` code objects.
+- Fetch `frame.f_globals.get("__name__")`, validate it is a non-empty string, and store it in the scope resolution cache (likely inside `FilterCoordinator`).
+- Thread the captured value into `FilterCoordinator::resolve` so the filter engine obtains the module name without invoking the resolver.
+- Add unit tests that simulate modules with various `__name__` values (`__main__`, aliased imports, missing globals) to ensure fallbacks log appropriately.
+
+### Stage 2 – Simplify Filter Engine
+- Remove the `module_resolver` field from `TraceFilterEngine` and delete calls to `ModuleIdentityResolver::resolve_absolute` (`codetracer-python-recorder/src/trace_filter/engine.rs:183`).
+- Adjust `ScopeContext` to accept the module name supplied by the coordinator and skip path-derived module inference.
+- Keep relative and absolute path normalisation for file selectors.
+- Update Rust tests that previously expected filesystem-derived names to assert the new globals-based behaviour.
+
+### Stage 3 – Replace Runtime Function Naming
+- Eliminate `ModuleIdentityCache` from `RuntimeTracer` and switch `function_name` to prefer the cached globals-derived module name (`runtime_tracer.rs:280`).
+- Remove the `ModuleNameHints` plumbing and any resolver invocations.
+- Update the Python integration test `test_module_imports_record_package_names` to expect `<my_pkg.mod>` coming directly from `__name__`.
+
+### Stage 4 – Remove Legacy Resolver Module
+- Delete `src/module_identity.rs` and tidy references (Cargo module list, tests, exports).
+- Drop resolver-specific tests and replace them with focused coverage around globals-based extraction.
+- Update documentation and changelog to describe the new semantics. Reference ADR 0016 and mark ADR 0013 as superseded where appropriate.
+
+### Stage 5 – Flip the Feature Flag
+- After validating in CI and canary environments, change the default for `module_name_from_globals` to `true`.
+- Remove the compatibility flag once usage data confirms no regressions.
+
+## Testing Strategy
+- Rust unit tests covering `on_py_start` logic, especially fallback to `<module>` when `__name__` is absent.
+- Python integration tests for direct scripts (`__main__`), package imports, and dynamic `runpy` execution to capture expected names.
+- Regression test ensuring filters still match file-based selectors when module names change.
+- Performance check to confirm hot-path allocations decreased after removing resolver lookups.
+
+## Risks & Mitigations
+- **Filter regressions for direct scripts:** Document the `"__main__"` behaviour and add guardrails in tests; optionally add helper patterns in builtin filters.
+- **Third-party loaders with non-string `__name__`:** Validate type and fall back gracefully when extraction fails.
+- **Hidden dependencies on old naming:** Continue exposing absolute/relative paths in `ScopeResolution` so downstream tooling relying on paths keeps functioning.
+
+## Rollout Checklist
+- ADR 0016 updated to “Accepted” once the feature flag defaults to on.
+- Release notes highlight the new module-naming semantics and the opt-out flag during transition.
+- Telemetry or logging confirms we do not hit the fallback path excessively in real workloads.
+