
Commit b832192

Add reporting statistics documentation and glossary updates
1 parent d2d1813 commit b832192

File tree: 4 files changed, +330 -2 lines changed

causalpy/reporting.py

Lines changed: 8 additions & 0 deletions
```diff
@@ -13,6 +13,14 @@
 # limitations under the License.
 """
 Reporting utilities for causal inference experiments.
+
+This module provides statistical summaries and prose reports for causal effects.
+The reporting functions automatically compute appropriate statistics based on the
+model type (Bayesian/PyMC or Frequentist/OLS).
+
+For detailed explanations of the reported statistics (HDI, ROPE, p-values, etc.)
+and their interpretation, see the documentation:
+https://causalpy.readthedocs.io/en/latest/knowledgebase/reporting_statistics.html
 """

 from dataclasses import dataclass
```

docs/source/knowledgebase/glossary.rst

Lines changed: 23 additions & 2 deletions
```diff
@@ -25,12 +25,19 @@ Glossary
     CITS
         An interrupted time series design with added comparison time series observations.

+    Confidence interval
+    CI
+        In frequentist statistics, a range of values that would contain the true parameter in a specified percentage of repeated samples. For example, a 95% confidence interval means that if we repeated the study many times, 95% of such intervals would contain the true parameter. See :doc:`reporting_statistics` for interpretation guidance and comparison with credible intervals.
+
     Confound
         Anything besides the treatment which varies across the treatment and control conditions.

     Counterfactual
         A hypothetical outcome that could or will occur under specific hypothetical circumstances.

+    Credible interval
+        In Bayesian statistics, an interval containing a specified probability of the posterior distribution. For example, a 95% credible interval contains 95% of the posterior probability mass. Unlike confidence intervals, this is a direct probability statement about the parameter. The HDI (Highest Density Interval) is a specific type of credible interval. See :doc:`reporting_statistics` for details.
+
     Difference in differences
     DiD
         Analysis where the treatment effect is estimated as a difference between treatment conditions in the differences between pre-treatment and post-treatment observations.
@@ -46,6 +53,10 @@ Glossary
     Endogenous Variable
         An endogenous variable is a variable in a regression equation such that the variable is correlated with the error term of the equation, i.e. correlated with the outcome variable (in the system). This is a problem for OLS regression estimation techniques because endogeneity violates the assumptions of the Gauss-Markov theorem.

+    HDI
+    Highest Density Interval
+        In Bayesian statistics, the narrowest credible interval containing a specified percentage of the posterior probability mass. For example, a 95% HDI is the shortest interval that contains 95% of the posterior distribution. This is the default uncertainty interval reported by CausalPy for PyMC models. See :doc:`reporting_statistics` for interpretation guidance.
+
     Local Average Treatment effect
     LATE
         Also known as the complier average causal effect (CACE), this is the effect of a treatment for subjects who comply with the experimental treatment assigned to their sample group. It is the quantity we're estimating in IV designs.
@@ -63,14 +74,20 @@ Glossary
     Panel data
         Time series data collected on multiple units where the same units are observed at each time point.

+    Posterior probability
+        In Bayesian statistics, the probability of a hypothesis or parameter value after observing the data. In CausalPy's reporting, posterior probabilities are used for hypothesis testing (e.g., the probability that a treatment effect is positive). Unlike p-values, these are direct probability statements about the hypothesis of interest. See :doc:`reporting_statistics` for examples.
+
+    Potential outcomes
+        A potential outcome is definable for a candidate or experimental unit under a treatment regime with respect to a measured outcome. The outcome Y(0) for that experimental unit is the outcome when the individual does not have the treatment. The outcome Y(1) for that experimental unit is the outcome when the individual does receive the treatment. Only one case can be observed in reality, and this is called the fundamental problem of causal inference. Seen this way, causal inference becomes a kind of imputation problem.
+
     Pretest-posttest design
         A quasi-experimental design where the treatment effect is estimated by comparing an outcome measure before and after treatment.

     Propensity scores
         An estimate of the probability of adopting a treatment status. Used in re-weighting schemes to balance observational data.

-    Potential outcomes
-        A potential outcome is definable for a candidate or experimental unit under a treatment regime with respect to a measured outcome. The outcome Y(0) for that experimental unit is the outcome when the individual does not have the treatment. The outcome Y(1) for that experimental unit is the outcome when the individual does receive the treatment. Only one case can be observed in reality, and this is called the fundamental problem of causal inference. Seen this way causal inference becomes a kind of imputation problem.
+    p-value
+        In frequentist statistics, the probability of observing data at least as extreme as what was observed, assuming the null hypothesis (typically "no effect") is true. Lower p-values indicate stronger evidence against the null hypothesis. Commonly, p < 0.05 is used as a threshold for statistical significance, though the p-value itself should be reported along with effect sizes and confidence intervals. See :doc:`reporting_statistics` for interpretation guidance and common pitfalls.

     Quasi-experiment
         An empirical comparison used to estimate the effects of a treatment where units are not assigned to conditions at random.
@@ -88,6 +105,10 @@ Glossary
     Regression kink design
         A quasi-experimental research design that estimates treatment effects by analyzing the impact of a treatment or intervention precisely at a defined threshold or "kink" point in a quantitative assignment variable (running variable). Unlike traditional regression discontinuity designs, regression kink design looks for a change in the slope of an outcome variable at the kink, instead of a discontinuity. This is useful when the assignment variable is not discrete, jumping from 0 to 1 at a threshold. Instead, regression kink designs are appropriate when there is a change in the first derivative of the assignment function at the kink point.

+    ROPE
+    Region of Practical Equivalence
+        In Bayesian causal inference, a method for testing whether an effect exceeds a minimum meaningful threshold (the "minimum effect size"). Rather than just testing if an effect differs from zero (which may be statistically significant but trivially small), ROPE analysis tests if the effect is large enough to be practically important. CausalPy reports this as ``p_rope``, the posterior probability that the effect exceeds the specified threshold. See :doc:`reporting_statistics` for usage and interpretation.
+
     Running variable
         In regression discontinuity designs, the running variable is the variable that determines the assignment of units to treatment or control conditions. This is typically a continuous variable. Examples could include a test score, age, income, or spatial location. But the running variable would not be time, which is the case in interrupted time series designs.
```

docs/source/knowledgebase/index.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -4,6 +4,7 @@
 :maxdepth: 1

 glossary
+reporting_statistics
 design_notation
 quasi_dags.ipynb
 causal_video_resources
```
docs/source/knowledgebase/reporting_statistics.md (new file)

Lines changed: 298 additions & 0 deletions
# Statistical Reporting in CausalPy

This page explains the statistical concepts used in CausalPy's reporting layer. The reporting functions automatically compute and present statistics appropriate to your model type.

## Model Types and Statistical Approaches

CausalPy supports two modeling frameworks, each with its own statistical paradigm:

| Model Framework | Statistical Approach | Statistics Reported |
|-----------------|----------------------|---------------------|
| PyMC models | Bayesian | Mean, Median, HDI, Tail Probabilities, ROPE |
| Scikit-learn models | Frequentist (OLS) | Mean, Confidence Intervals, p-values |

:::{note}
The reporting layer automatically detects which type of model you're using and generates appropriate statistics. You don't need to specify the statistical approach.
:::

---

## Bayesian Statistics (PyMC Models)

When you use PyMC models, CausalPy performs Bayesian inference, yielding posterior distributions for causal effects. The reported statistics summarize these posterior distributions.

### Point Estimates

**Mean**
- The average of the posterior distribution
- Represents the expected value of the causal effect
- **When to use:** Most commonly reported point estimate; balances all posterior information
- **Interpretation:** "The average estimated effect is X"

**Median**
- The middle value of the posterior distribution (50th percentile)
- Divides the posterior probability mass in half
- **When to use:** Preferred when the posterior is skewed; more robust to outliers
- **Interpretation:** "There's a 50% probability the effect is above/below X"

:::{important}
For symmetric posteriors, mean and median are nearly identical. For skewed posteriors, they may differ substantially. Report both to give readers a complete picture.
:::
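
To make the distinction concrete, here is a minimal sketch that computes both summaries from a skewed synthetic array standing in for real posterior draws (with real CausalPy results, both appear in the table returned by `effect_summary()`):

```python
import numpy as np

rng = np.random.default_rng(42)

# Skewed stand-in for posterior draws of a causal effect; the long
# right tail pulls the mean above the median.
posterior = rng.lognormal(mean=0.5, sigma=0.75, size=4000)

print(f"mean:   {posterior.mean():.2f}")      # expected value of the effect
print(f"median: {np.median(posterior):.2f}")  # 50th percentile, robust to the tail
```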
### Uncertainty Quantification

**HDI (Highest Density Interval)**
- A {term}`credible interval` containing the specified percentage of posterior probability (default: 95%)
- Reported as `hdi_lower` and `hdi_upper` in summary tables
- The narrowest interval containing the specified probability mass
- **Interpretation:** "We can be 95% certain the true effect lies between X and Y"
- **Key difference from CI:** This is a probability statement about the parameter itself, not about the procedure

:::{note}
The `hdi_prob` parameter controls the interval width (e.g., 0.95 for a 95% HDI, 0.90 for a 90% HDI). Wider intervals (higher probability) provide more certainty but less precision.
:::

**Example interpretation:**
```
mean: 2.5, 95% HDI: [1.2, 3.8]
```
"The estimated effect is 2.5 on average, and we can be 95% certain the true effect lies between 1.2 and 3.8."
### Hypothesis Testing

Bayesian hypothesis testing uses posterior probabilities directly, making the interpretation more intuitive than traditional p-values.

**Directional Tests**
- `p_gt_0`: {term}`Posterior probability` that the effect is greater than zero (positive effect)
- `p_lt_0`: Posterior probability that the effect is less than zero (negative effect)
- **Interpretation:** Direct probability statements about the hypothesis
- **Example:** If `p_gt_0 = 0.95`, there's a 95% probability the effect is positive

**Two-Sided Tests**
- `p_two_sided`: Two-sided posterior probability (analogous to a two-sided p-value)
- `prob_of_effect`: Probability of an effect in either direction (1 - p_two_sided)
- **When to use:** When you don't have a directional hypothesis
- **Interpretation:** `prob_of_effect = 0.95` means a 95% probability of a non-zero effect
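
All of these quantities reduce to fractions of posterior draws satisfying a condition. A minimal sketch, assuming the draws sit in a NumPy array; computing `p_two_sided` as twice the smaller tail probability is a common convention assumed here for illustration, not necessarily CausalPy's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
effect = rng.normal(loc=1.8, scale=1.0, size=4000)  # stand-in posterior draws

p_gt_0 = (effect > 0).mean()           # P(effect > 0)
p_lt_0 = (effect < 0).mean()           # P(effect < 0)
p_two_sided = 2 * min(p_gt_0, p_lt_0)  # assumed convention: twice the smaller tail
prob_of_effect = 1 - p_two_sided       # probability of an effect in either direction

print(f"p_gt_0={p_gt_0:.3f}  p_lt_0={p_lt_0:.3f}  "
      f"p_two_sided={p_two_sided:.3f}  prob_of_effect={prob_of_effect:.3f}")
```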
76+
77+
:::{note}
78+
Unlike frequentist p-values, Bayesian posterior probabilities answer the question you actually care about: "What's the probability of this hypothesis given the data?"
79+
:::
80+
81+
**Decision guidance:**
82+
- `p_gt_0 > 0.95` or `p_lt_0 > 0.95`: Strong evidence for directional effect
83+
- `prob_of_effect > 0.95`: Strong evidence for any effect (two-sided)
84+
- Values close to 0.5: Weak or no evidence for the effect
85+
86+
### Effect Size Assessment
87+
88+
**ROPE (Region of Practical Equivalence)**
89+
- Tests whether the effect exceeds a minimum meaningful threshold (`min_effect`)
90+
- Reported as `p_rope` in summary tables
91+
- **Purpose:** Distinguish statistical significance from practical significance
92+
- **Interpretation:** Probability that the effect exceeds the threshold you care about
93+
94+
**How it works:**
95+
1. You specify `min_effect` (the smallest effect size you consider meaningful)
96+
2. For directional tests: `p_rope` = P(|effect| > min_effect)
97+
3. For two-sided tests: `p_rope` = P(|effect| > min_effect)
98+
99+
**Example:**
100+
```python
101+
result.effect_summary(direction="increase", min_effect=1.0)
102+
```
103+
If `p_rope = 0.85`, there's an 85% probability the effect exceeds your meaningful threshold of 1.0.
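
Computed from posterior draws, `p_rope` is a one-liner. The formulas below follow the description above and are an illustrative assumption, not a copy of CausalPy's source:

```python
import numpy as np

rng = np.random.default_rng(2)
effect = rng.normal(loc=1.4, scale=0.8, size=4000)  # stand-in posterior draws
min_effect = 1.0                                    # smallest meaningful effect

p_rope_increase = (effect > min_effect).mean()           # directional test
p_rope_two_sided = (np.abs(effect) > min_effect).mean()  # two-sided test

print(f"P(effect > {min_effect}) = {p_rope_increase:.2f}")
print(f"P(|effect| > {min_effect}) = {p_rope_two_sided:.2f}")
```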
104+
105+
:::{important}
106+
ROPE analysis requires domain knowledge to set `min_effect`. Consider: What's the smallest effect that would justify the intervention cost? What effect size is scientifically or practically meaningful?
107+
:::
108+
109+
---
110+
111+
## Frequentist Statistics (Scikit-learn / OLS Models)
112+
113+
When you use scikit-learn models (OLS regression), CausalPy performs classical frequentist inference based on t-distributions.
114+
115+
### Point Estimates
116+
117+
**Mean / Coefficient Estimate**
118+
- The estimated causal effect from the regression model
119+
- For scalar effects (DiD, RD): the coefficient of interest
120+
- For time-series effects (ITS, SC): the average or cumulative impact
121+
- **Interpretation:** "The estimated effect is X"
122+
123+
:::{note}
124+
Unlike Bayesian estimates, frequentist point estimates don't come with a probability distribution. Uncertainty is captured through confidence intervals and standard errors.
125+
:::
126+
127+
### Uncertainty Quantification
128+
129+
**Confidence Intervals (CI)**
130+
- Reported as `ci_lower` and `ci_upper` in summary tables
131+
- Computed using t-distribution critical values at the specified significance level (default: α = 0.05 for 95% CI)
132+
- **Interpretation:** "If we repeated this experiment many times, 95% of such intervals would contain the true effect"
133+
- **Key difference from HDI:** This is a statement about the procedure, not about the parameter
134+
135+
**Standard Errors**
136+
- Measure of uncertainty in the coefficient estimate
137+
- Used to construct confidence intervals and compute p-values
138+
- Derived from the residual variance and design matrix
139+
- **Larger standard errors** → wider confidence intervals → more uncertainty
140+
141+
**Example interpretation:**
142+
```
143+
mean: 2.5, 95% CI: [1.1, 3.9]
144+
```
145+
"The estimated effect is 2.5. If we repeated this study many times, 95% of such confidence intervals would contain the true effect."
146+
147+
:::{important}
148+
**Bayesian HDI vs Frequentist CI:** While numerically similar, they have fundamentally different interpretations. The HDI makes a direct probability statement about the parameter ("95% probability the effect is in this range"), while the CI makes a statement about the procedure ("95% of such intervals would contain the true parameter").
149+
:::
150+
151+
### Hypothesis Testing
152+
153+
**p-values**
154+
- The probability of observing data at least as extreme as what we observed, assuming the null hypothesis (no effect) is true
155+
- Reported as `p_value` in summary tables
156+
- **Common threshold:** p < 0.05 is often used as evidence against the null hypothesis
157+
- **Interpretation:** Lower p-values indicate stronger evidence against no effect
158+
159+
**Correct interpretation:**
160+
- p = 0.03: "If there were truly no effect, we'd observe data this extreme only 3% of the time"
161+
- **NOT:** "There's a 97% probability of an effect" (this is a Bayesian interpretation)
162+
163+
**Common pitfalls to avoid:**
164+
1. ❌ "p = 0.06 means no effect" → The p-value doesn't prove the null hypothesis
165+
2. ❌ "p < 0.05 means the effect is important" → Statistical significance ≠ practical significance
166+
3. ❌ "p = 0.01 is better than p = 0.04" → Both provide evidence against the null; the effect size matters more
167+
4. ❌ "p > 0.05 proves no effect" → Absence of evidence is not evidence of absence
168+
169+
**Decision guidance:**
170+
- p < 0.05: Conventional threshold for "statistical significance"
171+
- p < 0.01: Stronger evidence against the null
172+
- p > 0.05: Insufficient evidence to reject the null (but doesn't prove no effect)
173+
174+
:::{note}
175+
Always report the actual p-value and effect size, not just whether p < 0.05. The magnitude and confidence interval of the effect are often more informative than the p-value alone.
176+
:::
177+
178+
**t-statistics and degrees of freedom**
179+
- t-statistic = coefficient / standard error
180+
- Measures how many standard errors the estimate is from zero
181+
- Degrees of freedom (df) = n - p, where n = sample size, p = number of parameters
182+
- Larger |t-statistics| and smaller p-values indicate stronger evidence
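
The step from coefficient to p-value is short. A sketch with hypothetical numbers in the spirit of the toy regression above:

```python
from scipy import stats

beta, se, df = 2.5, 0.7, 198  # hypothetical estimate, standard error, df

t_stat = beta / se                         # standard errors away from zero
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-sided p-value from the t-distribution
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```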
183+
184+
---
185+
186+
## Choosing Between Approaches
187+
188+
### When to use Bayesian inference (PyMC models):
189+
- ✅ You want direct probability statements about effects
190+
- ✅ You have prior information to incorporate
191+
- ✅ You need uncertainty quantification for complex hierarchical models
192+
- ✅ You want to test against meaningful effect sizes (ROPE)
193+
- ✅ Small to moderate sample sizes where uncertainty matters
194+
195+
### When to use Frequentist inference (OLS models):
196+
- ✅ You need computational speed (OLS is faster)
197+
- ✅ Your audience expects classical statistical inference
198+
- ✅ Large sample sizes where approaches converge
199+
- ✅ Simple linear models without hierarchy
200+
- ✅ You want to align with traditional econometric practice
201+
202+
:::{important}
203+
Both approaches are valid and will often lead to similar conclusions, especially with larger sample sizes. The choice often depends on your field's conventions, computational constraints, and whether you value direct probabilistic interpretation (Bayesian) or long-run frequency guarantees (frequentist).
204+
:::
205+
206+
---
207+
208+
## Summary Statistics by Effect Type
209+
210+
### Scalar Effects (DiD, RD, Regression Kink)
211+
For experiments with a single causal effect parameter:
212+
213+
**Bayesian output:**
214+
- One row with: mean, median, hdi_lower, hdi_upper
215+
- Tail probabilities: p_gt_0 (or p_lt_0, or p_two_sided + prob_of_effect)
216+
- Optional: p_rope (if min_effect specified)
217+
218+
**Frequentist output:**
219+
- One row with: mean, ci_lower, ci_upper, p_value
220+
221+
### Time-Series Effects (ITS, Synthetic Control)
222+
For experiments with multiple post-treatment time points:
223+
224+
**Two aggregation levels:**
225+
1. **Average effect:** Mean effect across the post-treatment window
226+
2. **Cumulative effect:** Sum of effects across the post-treatment window
227+
228+
**Additional statistics:**
229+
- **Relative effects:** Percentage change relative to counterfactual
230+
- `relative_mean`: Effect size as percentage of counterfactual
231+
- `relative_hdi_lower` / `relative_hdi_upper` (Bayesian)
232+
- `relative_ci_lower` / `relative_ci_upper` (frequentist)
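
A sketch of both aggregation levels plus a relative effect on a toy post-treatment window. The `relative_mean` formula shown (average effect as a percentage of the average counterfactual) is an illustrative assumption, not necessarily CausalPy's exact definition:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy post-treatment window: observed outcomes vs. model counterfactual
counterfactual = 100 + rng.normal(scale=2.0, size=12)
observed = counterfactual + 3.0 + rng.normal(scale=1.0, size=12)
effect = observed - counterfactual  # per-time-point impact

average_effect = effect.mean()      # mean impact per period
cumulative_effect = effect.sum()    # total impact over the window
relative_mean = 100 * average_effect / counterfactual.mean()  # % of counterfactual

print(f"average: {average_effect:.2f}  cumulative: {cumulative_effect:.2f}  "
      f"relative: {relative_mean:.1f}%")
```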
---

## Usage Examples

### Basic usage (default Bayesian):
```python
import causalpy as cp

# Fit experiment with PyMC model
result = cp.DifferenceInDifferences(...)

# Get effect summary with default settings
summary = result.effect_summary()
print(summary.text)   # Prose interpretation
print(summary.table)  # Numerical summary
```

### With directional hypothesis:
```python
# Test for an increase
summary = result.effect_summary(direction="increase")  # Reports p_gt_0

# Test for a decrease
summary = result.effect_summary(direction="decrease")  # Reports p_lt_0

# Two-sided test
summary = result.effect_summary(direction="two-sided")  # Reports prob_of_effect
```

### With practical significance threshold:
```python
# Only care about effects > 2.0
summary = result.effect_summary(
    direction="increase",
    min_effect=2.0,  # ROPE analysis
)
# Now summary.table includes a p_rope column
```

### For time-series experiments with custom window:
```python
# ITS or Synthetic Control result
summary = result.effect_summary(
    window=(10, 20),  # Only analyze time points 10-20
    cumulative=True,  # Include cumulative effects
    relative=True,    # Include percentage changes
)
```

---

## Further Reading

For a deeper understanding of these statistical concepts:

- **Bayesian inference:** The [PyMC documentation](https://www.pymc.io/) provides excellent tutorials on Bayesian statistics
- **Causal inference:** See our {doc}`causal_written_resources` for recommended books
- **Statistical terms:** Refer to the {doc}`glossary` for concise definitions
- **Practical application:** Explore the example notebooks in our documentation showing these concepts in action

:::{seealso}
- {doc}`glossary` - Quick reference for statistical terms
- {doc}`causal_written_resources` - Books and articles on causal inference
- API documentation for the `effect_summary()` method
:::
