Skip to content

Address residual P1+P2s from re-audit of PR #412#422

Merged
igerber merged 2 commits into
mainfrom
fix-audit-412
May 13, 2026
Merged

Address residual P1+P2s from re-audit of PR #412#422
igerber merged 2 commits into
mainfrom
fix-audit-412

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 13, 2026

Summary

Audit follow-up to PR #412. The restored CI reviewer surfaced findings the degraded reviewer missed across all 5 prior rounds:

  • P1 (REGISTRY + code comment) - the claim that "R does not ship per-path `predict_het` on placebos either, so parity is preserved by deferral" contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)` dispatcher does: it forwards `predict_het` into each per-path `did_multiplegt_main` call alongside `placebo`, so R may emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite both surfaces as an explicit Python-side deferral, NOT a verified R-parity. TODO row added to track validating R's actual output and either implementing parity or documenting the deviation explicitly.

  • P2 (REGISTRY rtol claim) - the per-path heterogeneity R-parity paragraph claimed `rtol ~1e-6 on point estimates AND SE`, but parity tests use `BETA_RTOL=1e-6` and `SE_RTOL=1e-5`. Split the claim and note the WLS-denominator/cohort-recentering numerical drift motivating the looser SE bound.

  • P2 (replicate-weight df_survey refresh test gap) - the existing test `test_per_path_heterogeneity_replicate_weights_propagates_n_valid` would have passed if the new dedicated refresh loop failed to recompute `t_stat` / `p_value` / `conf_int` at the final `df_survey`. Strengthen to call `safe_inference(beta, se, df=df_survey)` on the first finite entry and assert the stored inference fields match.

  • P2 (paths_of_interest survey gap) - the documented composability of `paths_of_interest + heterogeneity + survey_design` was not regression-locked (all survey-specific tests used `by_path=k`). Add two new tests: analytical Binder TSL coverage with selector-ordering preservation, and the multiplier-bootstrap gate under `paths_of_interest`.

No estimator behavior, weighting, variance/SE, identification, or default statistical surface changed in source - documentation accuracy plus expanded regression coverage only.

Test plan

  • CI - new tests run as part of the standard slow-test matrix.
  • Locally verified: `pytest tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity::test_paths_of_interest_heterogeneity_{survey_design_analytical,survey_n_bootstrap_gate} tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity::test_per_path_heterogeneity_replicate_weights_propagates_n_valid -m ''` all pass.

🤖 Generated with Claude Code

igerber and others added 2 commits May 12, 2026 20:14
The restored CI reviewer surfaced findings the degraded reviewer
missed across all 5 prior rounds on PR #412:

P1 (REGISTRY + code comment): the claim that "R does not ship per-path
predict_het on placebos either, so parity is preserved by deferral"
contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)`
dispatcher actually does - it forwards `predict_het` into each
per-path `did_multiplegt_main` call along with `placebo`, so R may
emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite
both surfaces (chaisemartin_dhaultfoeuille_results.py code comment and
REGISTRY.md DataFrame-integration paragraph) as an explicit Python-
side deferral rather than a verified R-parity. Add a TODO row to
track validating R's actual placebo predict_het output and either
implementing parity or documenting the deviation explicitly.

P2 (REGISTRY rtol claim): the per-path heterogeneity R-parity
paragraph claimed "rtol ~1e-6 on point estimates AND SE", but the
parity tests use BETA_RTOL=1e-6 and SE_RTOL=1e-5 (one decade looser
on SE). Split the claim into the two separate tolerances and note
the WLS-denominator/cohort-recentering numerical drift that motivates
the looser SE bound.

P2 (replicate-weight df_survey refresh): the existing test only
checked finite SE; it would have passed if the new dedicated
heterogeneity refresh loop failed to recompute t_stat / p_value /
conf_int at the final df_survey. Strengthen the test to call
`safe_inference(beta, se, df=df_survey)` on the first finite entry
and assert the stored inference fields match - this anti-regression
covers the dedicated post-call refresh added for path_heterogeneity_
effects.

P2 (paths_of_interest survey gap): the documented composability of
`paths_of_interest + heterogeneity + survey_design` was not regression-
locked - all existing survey-specific tests used `by_path=k`. Add
test_paths_of_interest_heterogeneity_survey_design_analytical (verify
analytical Binder TSL fits, selector ordering preserved, finite SE
per populated (path, l)) and test_paths_of_interest_heterogeneity_
survey_n_bootstrap_gate (verify the multiplier-bootstrap gate applies
under paths_of_interest too).

No estimator behavior, weighting, variance/SE, identification, or
default statistical surface changed in source - documentation
accuracy plus expanded regression coverage only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 7e9f1aa861fd043ea6b276b57c7ec4363d1e5bbe


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The diff does not change dCDH estimator logic, weighting, variance/SE computation, identification assumptions, or defaults; it changes documentation/comments and adds regression coverage around by_path heterogeneity.
  • Affected methodology surface: dCDH per-path heterogeneity testing (predict_het, Web Appendix Section 1.5 / Lemma 7) and the by-path placebo-row export path.
  • The registry/comment updates correctly stop claiming verified R parity for per-path placebo heterogeneity and instead document the current Python behavior as a tracked deferral. docs/methodology/REGISTRY.md:L643-L645, diff_diff/chaisemartin_dhaultfoeuille_results.py:L1862-L1903, TODO.md:L63
  • The rewritten parity-tolerance language matches the actual parity test constants (BETA_RTOL=1e-6, SE_RTOL=1e-5). tests/test_chaisemartin_dhaultfoeuille_parity.py:L1404-L1469
  • The new tests target the right regressions: final df_survey refresh on path_heterogeneity_effects, paths_of_interest + heterogeneity + survey_design, and the survey + multiplier-bootstrap gate under paths_of_interest. tests/test_chaisemartin_dhaultfoeuille.py:L10577-L10703
  • I could not execute the tests in this environment because pytest is not installed.

Methodology

  • Severity: P3. Impact: The remaining per-path placebo heterogeneity gap is now accurately documented as a Python-side deferral rather than an asserted R-parity fact; this matches the live behavior, where placebo rows in to_dataframe(level="by_path") still emit NaN het_* fields. Concrete fix: None required for approval; future parity work is already tracked. docs/methodology/REGISTRY.md:L643-L645, diff_diff/chaisemartin_dhaultfoeuille_results.py:L1862-L1903, TODO.md:L63
  • No P0/P1 methodology findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: The new TODO entry properly tracks the unresolved placebo-heterogeneity parity work, so it is mitigated under the project’s deferred-work policy rather than remaining implicit. Concrete fix: None required in this PR. TODO.md:L63

Security

  • No findings.

Documentation/Tests

  • No findings. The added tests are well-targeted to the touched risks, especially the dedicated final-df refresh loop in fit(). diff_diff/chaisemartin_dhaultfoeuille.py:L3957-L4070, tests/test_chaisemartin_dhaultfoeuille.py:L10577-L10703
  • Verification note: test execution was not possible here because neither pytest nor python -m pytest is available in the review environment.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 13, 2026
@igerber igerber merged commit 9a2435a into main May 13, 2026
31 of 32 checks passed
@igerber igerber deleted the fix-audit-412 branch May 13, 2026 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant