
Commit b832192

Add reporting statistics documentation and glossary updates
1 parent d2d1813 commit b832192

File tree: 4 files changed, +330 -2 lines changed

causalpy/reporting.py

Lines changed: 8 additions & 0 deletions
```diff
@@ -13,6 +13,14 @@
 # limitations under the License.
 """
 Reporting utilities for causal inference experiments.
+
+This module provides statistical summaries and prose reports for causal effects.
+The reporting functions automatically compute appropriate statistics based on the
+model type (Bayesian/PyMC or Frequentist/OLS).
+
+For detailed explanations of the reported statistics (HDI, ROPE, p-values, etc.)
+and their interpretation, see the documentation:
+https://causalpy.readthedocs.io/en/latest/knowledgebase/reporting_statistics.html
 """

 from dataclasses import dataclass
```

docs/source/knowledgebase/glossary.rst

Lines changed: 23 additions & 2 deletions
```diff
@@ -25,12 +25,19 @@ Glossary
     CITS
         An interrupted time series design with added comparison time series observations.

+    Confidence interval
+    CI
+        In frequentist statistics, a range of values that would contain the true parameter in a specified percentage of repeated samples. For example, a 95% confidence interval means that if we repeated the study many times, 95% of such intervals would contain the true parameter. See :doc:`reporting_statistics` for interpretation guidance and comparison with credible intervals.
+
     Confound
         Anything besides the treatment which varies across the treatment and control conditions.

     Counterfactual
         A hypothetical outcome that could or will occur under specific hypothetical circumstances.

+    Credible interval
+        In Bayesian statistics, an interval containing a specified probability of the posterior distribution. For example, a 95% credible interval contains 95% of the posterior probability mass. Unlike confidence intervals, this is a direct probability statement about the parameter. The HDI (Highest Density Interval) is a specific type of credible interval. See :doc:`reporting_statistics` for details.
+
     Difference in differences
     DiD
         Analysis where the treatment effect is estimated as a difference between treatment conditions in the differences between pre-treatment and post-treatment observations.
@@ -46,6 +53,10 @@ Glossary
     Endogenous Variable
         An endogenous variable is a variable in a regression equation such that the variable is correlated with the error term of the equation, i.e. correlated with the outcome variable (in the system). This is a problem for OLS regression estimation techniques because endogeneity violates the assumptions of the Gauss-Markov theorem.

+    HDI
+    Highest Density Interval
+        In Bayesian statistics, the narrowest credible interval containing a specified percentage of the posterior probability mass. For example, a 95% HDI is the shortest interval that contains 95% of the posterior distribution. This is the default uncertainty interval reported by CausalPy for PyMC models. See :doc:`reporting_statistics` for interpretation guidance.
+
     Local Average Treatment effect
     LATE
         Also known as the complier average causal effect (CACE), this is the effect of a treatment for subjects who comply with the experimental treatment assigned to their sample group. It is the quantity we're estimating in IV designs.
@@ -63,14 +74,20 @@ Glossary
     Panel data
         Time series data collected on multiple units where the same units are observed at each time point.

+    Posterior probability
+        In Bayesian statistics, the probability of a hypothesis or parameter value after observing the data. In CausalPy's reporting, posterior probabilities are used for hypothesis testing (e.g., the probability that a treatment effect is positive). Unlike p-values, these are direct probability statements about the hypothesis of interest. See :doc:`reporting_statistics` for examples.
+
+    Potential outcomes
+        A potential outcome is definable for a candidate or experimental unit under a treatment regime with respect to a measured outcome. The outcome Y(0) for that experimental unit is the outcome when the individual does not have the treatment. The outcome Y(1) for that experimental unit is the outcome when the individual does receive the treatment. Only one case can be observed in reality, and this is called the fundamental problem of causal inference. Seen this way, causal inference becomes a kind of imputation problem.
+
     Pretest-posttest design
         A quasi-experimental design where the treatment effect is estimated by comparing an outcome measure before and after treatment.

     Propensity scores
         An estimate of the probability of adopting a treatment status. Used in re-weighting schemes to balance observational data.

-    Potential outcomes
-        A potential outcome is definable for a candidate or experimental unit under a treatment regime with respect to a measured outcome. The outcome Y(0) for that experimental unit is the outcome when the individual does not have the treatment. The outcome Y(1) for that experimental unit is the outcome when the individual does receive the treatment. Only one case can be observed in reality, and this is called the fundamental problem of causal inference. Seen this way causal inference becomes a kind of imputation problem.
+    p-value
+        In frequentist statistics, the probability of observing data at least as extreme as what was observed, assuming the null hypothesis (typically "no effect") is true. Lower p-values indicate stronger evidence against the null hypothesis. Commonly, p < 0.05 is used as a threshold for statistical significance, though the p-value itself should be reported along with effect sizes and confidence intervals. See :doc:`reporting_statistics` for interpretation guidance and common pitfalls.

     Quasi-experiment
         An empirical comparison used to estimate the effects of a treatment where units are not assigned to conditions at random.
@@ -88,6 +105,10 @@ Glossary
     Regression kink design
         A quasi-experimental research design that estimates treatment effects by analyzing the impact of a treatment or intervention precisely at a defined threshold or "kink" point in a quantitative assignment variable (running variable). Unlike traditional regression discontinuity designs, regression kink design looks for a change in the slope of an outcome variable at the kink, instead of a discontinuity. This is useful when the assignment variable is not discrete, jumping from 0 to 1 at a threshold. Instead, regression kink designs are appropriate when there is a change in the first derivative of the assignment function at the kink point.

+    ROPE
+    Region of Practical Equivalence
+        In Bayesian causal inference, a method for testing whether an effect exceeds a minimum meaningful threshold (the "minimum effect size"). Rather than just testing if an effect differs from zero (which may be statistically significant but trivially small), ROPE analysis tests if the effect is large enough to be practically important. CausalPy reports this as ``p_rope``, the posterior probability that the effect exceeds the specified threshold. See :doc:`reporting_statistics` for usage and interpretation.
+
     Running variable
         In regression discontinuity designs, the running variable is the variable that determines the assignment of units to treatment or control conditions. This is typically a continuous variable. Examples could include a test score, age, income, or spatial location. But the running variable would not be time, which is the case in interrupted time series designs.
```

docs/source/knowledgebase/index.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -4,6 +4,7 @@
 :maxdepth: 1

 glossary
+reporting_statistics
 design_notation
 quasi_dags.ipynb
 causal_video_resources
```
docs/source/knowledgebase/reporting_statistics.md (new file)

Lines changed: 298 additions & 0 deletions
# Statistical Reporting in CausalPy

This page explains the statistical concepts used in CausalPy's reporting layer. The reporting functions automatically compute and present statistics appropriate to your model type.

## Model Types and Statistical Approaches

CausalPy supports two modeling frameworks, each with its own statistical paradigm:

| Model Framework | Statistical Approach | Statistics Reported |
|-----------------|----------------------|---------------------|
| PyMC models | Bayesian | Mean, Median, HDI, Tail Probabilities, ROPE |
| Scikit-learn models | Frequentist (OLS) | Mean, Confidence Intervals, p-values |

:::{note}
The reporting layer automatically detects which type of model you're using and generates appropriate statistics. You don't need to specify the statistical approach.
:::

---

## Bayesian Statistics (PyMC Models)

When you use PyMC models, CausalPy performs Bayesian inference, yielding posterior distributions for causal effects. The reported statistics summarize these posterior distributions.

### Point Estimates

**Mean**
- The average of the posterior distribution
- Represents the expected value of the causal effect
- **When to use:** Most commonly reported point estimate; balances all posterior information
- **Interpretation:** "The average estimated effect is X"

**Median**
- The middle value of the posterior distribution (50th percentile)
- Divides the posterior probability mass in half
- **When to use:** Preferred when the posterior is skewed; more robust to outliers
- **Interpretation:** "There's a 50% probability the effect is above/below X"

:::{important}
For symmetric posteriors, mean and median are nearly identical. For skewed posteriors, they may differ substantially. Report both to give readers a complete picture.
:::
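
To make the distinction concrete, here is a minimal sketch that computes both summaries from a skewed synthetic array standing in for real posterior draws (with real CausalPy results, both appear in the table returned by `effect_summary()`):

```python
import numpy as np

rng = np.random.default_rng(42)

# Skewed stand-in for posterior draws of a causal effect; the long
# right tail pulls the mean above the median.
posterior = rng.lognormal(mean=0.5, sigma=0.75, size=4000)

print(f"mean:   {posterior.mean():.2f}")      # expected value of the effect
print(f"median: {np.median(posterior):.2f}")  # 50th percentile, robust to the tail
```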
### Uncertainty Quantification

**HDI (Highest Density Interval)**
- A {term}`credible interval` containing the specified percentage of posterior probability (default: 95%)
- Reported as `hdi_lower` and `hdi_upper` in summary tables
- The narrowest interval containing the specified probability mass
- **Interpretation:** "We can be 95% certain the true effect lies between X and Y"
- **Key difference from CI:** This is a probability statement about the parameter itself, not about the procedure

:::{note}
The `hdi_prob` parameter controls the interval width (e.g., 0.95 for a 95% HDI, 0.90 for a 90% HDI). Wider intervals (higher probability) provide more certainty but less precision.
:::

**Example interpretation:**
```
mean: 2.5, 95% HDI: [1.2, 3.8]
```
"The estimated effect is 2.5 on average, and we can be 95% certain the true effect lies between 1.2 and 3.8."
### Hypothesis Testing

Bayesian hypothesis testing uses posterior probabilities directly, making the interpretation more intuitive than traditional p-values.

**Directional Tests**
- `p_gt_0`: {term}`Posterior probability` that the effect is greater than zero (positive effect)
- `p_lt_0`: Posterior probability that the effect is less than zero (negative effect)
- **Interpretation:** Direct probability statements about the hypothesis
- **Example:** If `p_gt_0 = 0.95`, there's a 95% probability the effect is positive

**Two-Sided Tests**
- `p_two_sided`: Two-sided posterior probability (analogous to a two-sided p-value)
- `prob_of_effect`: Probability of an effect in either direction (1 - p_two_sided)
- **When to use:** When you don't have a directional hypothesis
- **Interpretation:** `prob_of_effect = 0.95` means a 95% probability of a non-zero effect
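
All of these quantities reduce to fractions of posterior draws satisfying a condition. A minimal sketch, assuming the draws sit in a NumPy array; computing `p_two_sided` as twice the smaller tail probability is a common convention assumed here for illustration, not necessarily CausalPy's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
effect = rng.normal(loc=1.8, scale=1.0, size=4000)  # stand-in posterior draws

p_gt_0 = (effect > 0).mean()           # P(effect > 0)
p_lt_0 = (effect < 0).mean()           # P(effect < 0)
p_two_sided = 2 * min(p_gt_0, p_lt_0)  # assumed convention: twice the smaller tail
prob_of_effect = 1 - p_two_sided       # probability of an effect in either direction

print(f"p_gt_0={p_gt_0:.3f}  p_lt_0={p_lt_0:.3f}  "
      f"p_two_sided={p_two_sided:.3f}  prob_of_effect={prob_of_effect:.3f}")
```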
76+
77+
:::{note}
78+
Unlike frequentist p-values, Bayesian posterior probabilities answer the question you actually care about: "What's the probability of this hypothesis given the data?"
79+
:::
80+
81+
**Decision guidance:**
82+
- `p_gt_0 > 0.95` or `p_lt_0 > 0.95`: Strong evidence for directional effect
83+
- `prob_of_effect > 0.95`: Strong evidence for any effect (two-sided)
84+
- Values close to 0.5: Weak or no evidence for the effect
85+
86+
### Effect Size Assessment
87+
88+
**ROPE (Region of Practical Equivalence)**
89+
- Tests whether the effect exceeds a minimum meaningful threshold (`min_effect`)
90+
- Reported as `p_rope` in summary tables
91+
- **Purpose:** Distinguish statistical significance from practical significance
92+
- **Interpretation:** Probability that the effect exceeds the threshold you care about
93+
94+
**How it works:**
95+
1. You specify `min_effect` (the smallest effect size you consider meaningful)
96+
2. For directional tests: `p_rope` = P(|effect| > min_effect)
97+
3. For two-sided tests: `p_rope` = P(|effect| > min_effect)
98+
99+
**Example:**
100+
```python
101+
result.effect_summary(direction="increase", min_effect=1.0)
102+
```
103+
If `p_rope = 0.85`, there's an 85% probability the effect exceeds your meaningful threshold of 1.0.
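
Computed from posterior draws, `p_rope` is a one-liner. The formulas below follow the description above and are an illustrative assumption, not a copy of CausalPy's source:

```python
import numpy as np

rng = np.random.default_rng(2)
effect = rng.normal(loc=1.4, scale=0.8, size=4000)  # stand-in posterior draws
min_effect = 1.0                                    # smallest meaningful effect

p_rope_increase = (effect > min_effect).mean()           # directional test
p_rope_two_sided = (np.abs(effect) > min_effect).mean()  # two-sided test

print(f"P(effect > {min_effect}) = {p_rope_increase:.2f}")
print(f"P(|effect| > {min_effect}) = {p_rope_two_sided:.2f}")
```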
104+
105+
:::{important}
106+
ROPE analysis requires domain knowledge to set `min_effect`. Consider: What's the smallest effect that would justify the intervention cost? What effect size is scientifically or practically meaningful?
107+
:::
108+
109+
---
110+
111+
## Frequentist Statistics (Scikit-learn / OLS Models)
112+
113+
When you use scikit-learn models (OLS regression), CausalPy performs classical frequentist inference based on t-distributions.
114+
115+
### Point Estimates
116+
117+
**Mean / Coefficient Estimate**
118+
- The estimated causal effect from the regression model
119+
- For scalar effects (DiD, RD): the coefficient of interest
120+
- For time-series effects (ITS, SC): the average or cumulative impact
121+
- **Interpretation:** "The estimated effect is X"
122+
123+
:::{note}
124+
Unlike Bayesian estimates, frequentist point estimates don't come with a probability distribution. Uncertainty is captured through confidence intervals and standard errors.
125+
:::
126+
127+
### Uncertainty Quantification
128+
129+
**Confidence Intervals (CI)**
130+
- Reported as `ci_lower` and `ci_upper` in summary tables
131+
- Computed using t-distribution critical values at the specified significance level (default: α = 0.05 for 95% CI)
132+
- **Interpretation:** "If we repeated this experiment many times, 95% of such intervals would contain the true effect"
133+
- **Key difference from HDI:** This is a statement about the procedure, not about the parameter
134+
135+
**Standard Errors**
136+
- Measure of uncertainty in the coefficient estimate
137+
- Used to construct confidence intervals and compute p-values
138+
- Derived from the residual variance and design matrix
139+
- **Larger standard errors** → wider confidence intervals → more uncertainty
140+
141+
**Example interpretation:**
142+
```
143+
mean: 2.5, 95% CI: [1.1, 3.9]
144+
```
145+
"The estimated effect is 2.5. If we repeated this study many times, 95% of such confidence intervals would contain the true effect."
146+
147+
:::{important}
148+
**Bayesian HDI vs Frequentist CI:** While numerically similar, they have fundamentally different interpretations. The HDI makes a direct probability statement about the parameter ("95% probability the effect is in this range"), while the CI makes a statement about the procedure ("95% of such intervals would contain the true parameter").
149+
:::
150+
151+
### Hypothesis Testing
152+
153+
**p-values**
154+
- The probability of observing data at least as extreme as what we observed, assuming the null hypothesis (no effect) is true
155+
- Reported as `p_value` in summary tables
156+
- **Common threshold:** p < 0.05 is often used as evidence against the null hypothesis
157+
- **Interpretation:** Lower p-values indicate stronger evidence against no effect
158+
159+
**Correct interpretation:**
160+
- p = 0.03: "If there were truly no effect, we'd observe data this extreme only 3% of the time"
161+
- **NOT:** "There's a 97% probability of an effect" (this is a Bayesian interpretation)
162+
163+
**Common pitfalls to avoid:**
164+
1. ❌ "p = 0.06 means no effect" → The p-value doesn't prove the null hypothesis
165+
2. ❌ "p < 0.05 means the effect is important" → Statistical significance ≠ practical significance
166+
3. ❌ "p = 0.01 is better than p = 0.04" → Both provide evidence against the null; the effect size matters more
167+
4. ❌ "p > 0.05 proves no effect" → Absence of evidence is not evidence of absence
168+
169+
**Decision guidance:**
170+
- p < 0.05: Conventional threshold for "statistical significance"
171+
- p < 0.01: Stronger evidence against the null
172+
- p > 0.05: Insufficient evidence to reject the null (but doesn't prove no effect)
173+
174+
:::{note}
175+
Always report the actual p-value and effect size, not just whether p < 0.05. The magnitude and confidence interval of the effect are often more informative than the p-value alone.
176+
:::
177+
178+
**t-statistics and degrees of freedom**
179+
- t-statistic = coefficient / standard error
180+
- Measures how many standard errors the estimate is from zero
181+
- Degrees of freedom (df) = n - p, where n = sample size, p = number of parameters
182+
- Larger |t-statistics| and smaller p-values indicate stronger evidence
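
The step from coefficient to p-value is short. A sketch with hypothetical numbers in the spirit of the toy regression above:

```python
from scipy import stats

beta, se, df = 2.5, 0.7, 198  # hypothetical estimate, standard error, df

t_stat = beta / se                         # standard errors away from zero
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value from the t-distribution
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```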
183+
184+
---
185+
186+
## Choosing Between Approaches
187+
188+
### When to use Bayesian inference (PyMC models):
189+
- ✅ You want direct probability statements about effects
190+
- ✅ You have prior information to incorporate
191+
- ✅ You need uncertainty quantification for complex hierarchical models
192+
- ✅ You want to test against meaningful effect sizes (ROPE)
193+
- ✅ Small to moderate sample sizes where uncertainty matters
194+
195+
### When to use Frequentist inference (OLS models):
196+
- ✅ You need computational speed (OLS is faster)
197+
- ✅ Your audience expects classical statistical inference
198+
- ✅ Large sample sizes where approaches converge
199+
- ✅ Simple linear models without hierarchy
200+
- ✅ You want to align with traditional econometric practice
201+
202+
:::{important}
203+
Both approaches are valid and will often lead to similar conclusions, especially with larger sample sizes. The choice often depends on your field's conventions, computational constraints, and whether you value direct probabilistic interpretation (Bayesian) or long-run frequency guarantees (frequentist).
204+
:::
205+
206+
---
207+
208+
## Summary Statistics by Effect Type
209+
210+
### Scalar Effects (DiD, RD, Regression Kink)
211+
For experiments with a single causal effect parameter:
212+
213+
**Bayesian output:**
214+
- One row with: mean, median, hdi_lower, hdi_upper
215+
- Tail probabilities: p_gt_0 (or p_lt_0, or p_two_sided + prob_of_effect)
216+
- Optional: p_rope (if min_effect specified)
217+
218+
**Frequentist output:**
219+
- One row with: mean, ci_lower, ci_upper, p_value
220+
221+
### Time-Series Effects (ITS, Synthetic Control)
222+
For experiments with multiple post-treatment time points:
223+
224+
**Two aggregation levels:**
225+
1. **Average effect:** Mean effect across the post-treatment window
226+
2. **Cumulative effect:** Sum of effects across the post-treatment window
227+
228+
**Additional statistics:**
229+
- **Relative effects:** Percentage change relative to counterfactual
230+
- `relative_mean`: Effect size as percentage of counterfactual
231+
- `relative_hdi_lower` / `relative_hdi_upper` (Bayesian)
232+
- `relative_ci_lower` / `relative_ci_upper` (frequentist)
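
A sketch of both aggregation levels plus a relative effect on a toy post-treatment window. The `relative_mean` formula shown (average effect as a percentage of the average counterfactual) is an illustrative assumption, not necessarily CausalPy's exact definition:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy post-treatment window: observed outcomes vs. model counterfactual
counterfactual = 100 + rng.normal(scale=2.0, size=12)
observed = counterfactual + 3.0 + rng.normal(scale=1.0, size=12)
effect = observed - counterfactual  # per-time-point impact

average_effect = effect.mean()      # mean impact per period
cumulative_effect = effect.sum()    # total impact over the window
relative_mean = 100 * average_effect / counterfactual.mean()  # % of counterfactual

print(f"average: {average_effect:.2f}  cumulative: {cumulative_effect:.2f}  "
      f"relative: {relative_mean:.1f}%")
```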
---

## Usage Examples

### Basic usage (default Bayesian):
```python
import causalpy as cp

# Fit experiment with PyMC model
result = cp.DifferenceInDifferences(...)

# Get effect summary with default settings
summary = result.effect_summary()
print(summary.text)   # Prose interpretation
print(summary.table)  # Numerical summary
```

### With directional hypothesis:
```python
# Test for an increase
summary = result.effect_summary(direction="increase")  # Reports p_gt_0

# Test for a decrease
summary = result.effect_summary(direction="decrease")  # Reports p_lt_0

# Two-sided test
summary = result.effect_summary(direction="two-sided")  # Reports prob_of_effect
```

### With practical significance threshold:
```python
# Only care about effects > 2.0
summary = result.effect_summary(
    direction="increase",
    min_effect=2.0,  # ROPE analysis
)
# Now summary.table includes a p_rope column
```

### For time-series experiments with custom window:
```python
# ITS or Synthetic Control result
summary = result.effect_summary(
    window=(10, 20),  # Only analyze time points 10-20
    cumulative=True,  # Include cumulative effects
    relative=True,    # Include percentage changes
)
```

---

## Further Reading

For a deeper understanding of these statistical concepts:

- **Bayesian inference:** The [PyMC documentation](https://www.pymc.io/) provides excellent tutorials on Bayesian statistics
- **Causal inference:** See our {doc}`causal_written_resources` for recommended books
- **Statistical terms:** Refer to the {doc}`glossary` for concise definitions
- **Practical application:** Explore the example notebooks in our documentation showing these concepts in action

:::{seealso}
- {doc}`glossary` - Quick reference for statistical terms
- {doc}`causal_written_resources` - Books and articles on causal inference
- API documentation for the `effect_summary()` method
:::
