Address residual P1+P2s from re-audit of PR #412 #422
Merged
The restored CI reviewer surfaced findings the degraded reviewer missed across all 5 prior rounds on PR #412:

P1 (REGISTRY + code comment): the claim that "R does not ship per-path predict_het on placebos either, so parity is preserved by deferral" contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)` dispatcher actually does - it forwards `predict_het` into each per-path `did_multiplegt_main` call along with `placebo`, so R may emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite both surfaces (chaisemartin_dhaultfoeuille_results.py code comment and REGISTRY.md DataFrame-integration paragraph) as an explicit Python-side deferral rather than a verified R-parity. Add a TODO row to track validating R's actual placebo predict_het output and either implementing parity or documenting the deviation explicitly.

P2 (REGISTRY rtol claim): the per-path heterogeneity R-parity paragraph claimed "rtol ~1e-6 on point estimates AND SE", but the parity tests use BETA_RTOL=1e-6 and SE_RTOL=1e-5 (one decade looser on SE). Split the claim into the two separate tolerances and note the WLS-denominator/cohort-recentering numerical drift that motivates the looser SE bound.

P2 (replicate-weight df_survey refresh): the existing test only checked finite SE; it would have passed if the new dedicated heterogeneity refresh loop failed to recompute t_stat / p_value / conf_int at the final df_survey. Strengthen the test to call `safe_inference(beta, se, df=df_survey)` on the first finite entry and assert the stored inference fields match - this anti-regression covers the dedicated post-call refresh added for path_heterogeneity_effects.

P2 (paths_of_interest survey gap): the documented composability of `paths_of_interest + heterogeneity + survey_design` was not regression-locked - all existing survey-specific tests used `by_path=k`. Add test_paths_of_interest_heterogeneity_survey_design_analytical (verify analytical Binder TSL fits, selector ordering preserved, finite SE per populated (path, l)) and test_paths_of_interest_heterogeneity_survey_n_bootstrap_gate (verify the multiplier-bootstrap gate applies under paths_of_interest too).

No estimator behavior, weighting, variance/SE, identification, or default statistical surface changed in source - documentation accuracy plus expanded regression coverage only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good - no unmitigated P0/P1 findings.
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Summary
Audit follow-up to PR #412. The restored CI reviewer surfaced findings the degraded reviewer missed across all 5 prior rounds:
P1 (REGISTRY + code comment) - the claim that "R does not ship per-path `predict_het` on placebos either, so parity is preserved by deferral" contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)` dispatcher does: it forwards `predict_het` into each per-path `did_multiplegt_main` call alongside `placebo`, so R may emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite both surfaces as an explicit Python-side deferral, NOT a verified R-parity. TODO row added to track validating R's actual output and either implementing parity or documenting the deviation explicitly.
P2 (REGISTRY rtol claim) - the per-path heterogeneity R-parity paragraph claimed `rtol ~1e-6 on point estimates AND SE`, but parity tests use `BETA_RTOL=1e-6` and `SE_RTOL=1e-5`. Split the claim and note the WLS-denominator/cohort-recentering numerical drift motivating the looser SE bound.
P2 (replicate-weight df_survey refresh test gap) - the existing test `test_per_path_heterogeneity_replicate_weights_propagates_n_valid` would have passed if the new dedicated refresh loop failed to recompute `t_stat` / `p_value` / `conf_int` at the final `df_survey`. Strengthen to call `safe_inference(beta, se, df=df_survey)` on the first finite entry and assert the stored inference fields match.
P2 (paths_of_interest survey gap) - the documented composability of `paths_of_interest + heterogeneity + survey_design` was not regression-locked (all survey-specific tests used `by_path=k`). Add two new tests: analytical Binder TSL coverage with selector-ordering preservation, and the multiplier-bootstrap gate under `paths_of_interest`.
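For illustration, the split-tolerance claim in the REGISTRY rtol item above can be sketched as two separate relative checks. Only `BETA_RTOL` and `SE_RTOL` come from this PR; the helper names and the numbers below are hypothetical, and the real parity tests may compare values differently.

```python
# Tolerances named in the PR description.
BETA_RTOL = 1e-6  # point estimates
SE_RTOL = 1e-5    # standard errors, one decade looser


def within_rtol(actual, desired, rtol):
    """Relative check with no absolute slack: |actual - desired| <= rtol * |desired|."""
    return abs(actual - desired) <= rtol * abs(desired)


def check_parity(py_beta, py_se, r_beta, r_se):
    """Hypothetical helper: judge beta and SE against their own tolerances."""
    return (
        within_rtol(py_beta, r_beta, BETA_RTOL),
        within_rtol(py_se, r_se, SE_RTOL),
    )


# Made-up reference values standing in for R output.
r_beta, r_se = 0.25, 0.1
py_beta = r_beta * (1 + 5e-7)  # relative drift inside BETA_RTOL
py_se = r_se * (1 + 5e-6)      # relative drift inside SE_RTOL only

beta_ok, se_ok = check_parity(py_beta, py_se, r_beta, r_se)
se_ok_at_beta_rtol = within_rtol(py_se, r_se, BETA_RTOL)
```

In this made-up case the SE passes at `SE_RTOL` but would fail at `BETA_RTOL`, which is why a single "rtol ~1e-6 on point estimates AND SE" statement overstated what the tests enforce.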
No estimator behavior, weighting, variance/SE, identification, or default statistical surface changed in source - documentation accuracy plus expanded regression coverage only.
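As a rough sketch of why the strengthened replicate-weight test matters: a finite-SE check cannot tell whether inference fields were recomputed at the final `df_survey`, whereas recomputing and comparing can. The code below does not use the library's `safe_inference`; `toy_inference` is a hypothetical stand-in that returns only `t_stat` and a 95% confidence interval from a small table of standard Student-t critical values, and the entry layout is assumed.

```python
import math

# Two-sided 95% Student-t critical values (standard table values).
T_CRIT_95 = {5: 2.571, 10: 2.228, 30: 2.042}


def toy_inference(beta, se, df):
    """Hypothetical stand-in for safe_inference: returns (t_stat, conf_int)."""
    if not (math.isfinite(beta) and math.isfinite(se)) or se <= 0:
        return math.nan, (math.nan, math.nan)
    crit = T_CRIT_95[df]
    return beta / se, (beta - crit * se, beta + crit * se)


def make_entry(beta, se, df):
    """Build a stored result entry with inference computed at the given df."""
    t_stat, conf_int = toy_inference(beta, se, df)
    return {"beta": beta, "se": se, "t_stat": t_stat, "conf_int": conf_int}


def inference_matches(entry, df_final, rel_tol=1e-12):
    """Recompute inference at df_final and compare with the stored fields."""
    t_stat, (lo, hi) = toy_inference(entry["beta"], entry["se"], df_final)
    return (
        math.isclose(entry["t_stat"], t_stat, rel_tol=rel_tol)
        and math.isclose(entry["conf_int"][0], lo, rel_tol=rel_tol)
        and math.isclose(entry["conf_int"][1], hi, rel_tol=rel_tol)
    )


df_survey = 10
refreshed = make_entry(0.8, 0.2, df=df_survey)  # refresh ran at the final df
stale = make_entry(0.8, 0.2, df=30)             # refresh skipped, df is stale

# A finite-SE check passes for both entries ...
both_finite = math.isfinite(refreshed["se"]) and math.isfinite(stale["se"])
# ... but only the refreshed entry matches a recomputation at df_survey.
refreshed_ok = inference_matches(refreshed, df_survey)
stale_ok = inference_matches(stale, df_survey)
```

The stale entry has the same `t_stat` but a narrower interval, so only the field-by-field comparison flags it. The `p_value` comparison is omitted here because it needs a t-distribution CDF.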
Test plan
🤖 Generated with Claude Code