## Video Lessons
::: callout-tip
### The Epistemological Divide: Explanatory versus Predictive Modeling

Before dissecting specific confounder selection techniques, it is crucial to establish the epistemological distinction that governs variable selection: the divergence between predictive and causal inference goals. These two goals are frequently conflated in practice, leading to the misapplication of algorithms designed for one purpose to the problems of the other.
#### The Goal of Prediction

In predictive modeling, the objective is to minimize the expected loss (e.g., mean squared error) between the predicted and observed outcome values. In this context, a "good" variable is one that is strongly correlated with the outcome, regardless of the direction of causality. A variable that is a consequence of the outcome (a proxy) or a mediator of the exposure can be an excellent predictor. Variable selection methods in this domain, such as standard stepwise regression, Akaike Information Criterion (AIC) minimization, or standard LASSO regularization, are designed to identify a parsimonious set of correlates that maximizes model fit and reduces prediction error.
#### The Goal of Causal Explanation

In causal inference, the objective is to isolate the specific marginal effect of an intervention (exposure) on an outcome. Here, a correlation is useful only if it reflects a structural cause-effect relationship. Including a mediator in the model, for example, will increase the $R^2$ (predictive power) but will bias the estimation of the total causal effect toward the null.

Consequently, variable selection methods optimized for prediction are often mathematically antagonistic to causal inference. Techniques that rely on "goodness-of-fit" or statistical significance can inadvertently select colliders (inducing bias) or drop weak confounders that are critical for validity. The failure to distinguish these goals is a primary source of methodological error in the medical literature, motivating the need for distinct, causally grounded selection strategies.
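
The divergence is easy to see in a toy example. The R sketch below (with entirely hypothetical variable names and coefficients of our choosing) simulates an exposure `A` whose total effect on `Y` runs partly through a mediator `M`: adding `M` to the model improves prediction while pulling the exposure coefficient away from the total causal effect.

```r
# Hypothetical illustration: a mediator helps prediction but biases the total effect.
set.seed(42)
n <- 5000
A <- rbinom(n, 1, 0.5)               # exposure
M <- 0.8 * A + rnorm(n)              # mediator on the pathway A -> M -> Y
Y <- 0.5 * A + 0.8 * M + rnorm(n)    # total effect of A = 0.5 + 0.8 * 0.8 = 1.14

summary(lm(Y ~ A))$r.squared         # lower R^2; here the A coefficient (~1.14) is the total effect
summary(lm(Y ~ A + M))$r.squared     # higher R^2 (better prediction)...
coef(lm(Y ~ A + M))["A"]             # ...but A's coefficient (~0.5) is only the direct effect
```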
### The Counterfactual Framework for Defining Causality
#### Defining the Causal Effect: Potential Outcomes
:::

::: callout-tip
### Structural and Knowledge-Based Selection Techniques
To properly address confounding, researchers need a tool to translate their subject-matter knowledge and assumptions about the world into a formal structure. **Directed Acyclic Graphs (DAGs)** serve this purpose, providing a visual language and a set of rigorous rules for identifying sources of bias and guiding statistical analysis.

:::

::: callout-tip
### Empirical Criteria for DAG-Deficient Scenarios
In many practical epidemiological investigations, particularly those involving novel exposures or complex metabolic pathways, the full causal structure is unknown. Uncertainty about the presence or direction of arrows makes the strict construction of a DAG impossible. In such "DAG-deficient" scenarios, researchers can instead rely on pragmatic heuristics or empirical criteria that aim to approximate the Backdoor Criterion under less stringent assumptions.

- **Pre-treatment Criterion**: One of the simplest and most intuitive heuristics, the Pre-treatment Criterion dictates adjusting for all covariates measured chronologically before the exposure was administered or assigned.

**Rationale:** The logic is grounded in temporal causality: a variable occurring before the exposure cannot be a downstream effect (mediator) of the exposure. Adjusting for pre-treatment variables therefore avoids the error of overadjustment via mediation.

**Critique:**

(a) While this criterion successfully avoids adjusting for mediators, it fails to protect against M-bias. A pre-treatment variable can still be a collider if it is caused by two unobserved latent variables—one linked to the exposure and one to the outcome. Adjusting for such a pre-treatment collider introduces bias.

(b) This "kitchen sink" approach often leads to the inclusion of instrumental variables (IVs)—pre-treatment variables that cause the exposure but have no independent effect on the outcome. As discussed later, adjusting for IVs inflates the variance of the estimator and can amplify bias due to residual unmeasured confounding (Z-bias).

Thus, while the Pre-treatment Criterion is a helpful starting point, it is often too crude for high-stakes causal inference.
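
A small simulation makes the M-bias failure mode concrete. In the hypothetical R sketch below (our own illustrative setup, not from the text), the true effect of the exposure `A` on the outcome `Y` is zero, yet adjusting for the pre-treatment collider `C` manufactures a spurious association.

```r
# Hypothetical M-bias simulation: U1 and U2 are unmeasured latents;
# C is a pre-treatment collider (a shared effect of U1 and U2).
set.seed(1)
n  <- 100000
U1 <- rnorm(n)                   # latent cause of the exposure and of C
U2 <- rnorm(n)                   # latent cause of the outcome and of C
C  <- U1 + U2 + rnorm(n)         # measured pre-treatment covariate (collider)
A  <- rbinom(n, 1, plogis(U1))   # exposure; its true effect on Y is zero
Y  <- U2 + rnorm(n)              # outcome

coef(lm(Y ~ A))["A"]       # ~0: the crude estimate is unbiased here
coef(lm(Y ~ A + C))["A"]   # clearly nonzero: adjustment opens A <- U1 -> C <- U2 -> Y
```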
- **Common Cause Criterion**: The Common Cause Criterion refines the selection process by narrowing the adjustment set to variables known (or suspected) to be causes of *both* the exposure and the outcome.

**Rationale:** This criterion targets the classical epidemiological definition of a confounder. By restricting selection to common causes, it theoretically avoids colliders (which are effects) and instruments (which are causes of the exposure only).

**Critique:** The major limitation of this approach is its reliance on definitive knowledge. If a researcher is unsure whether a variable causes the outcome, strict application of the criterion would lead to its exclusion. However, standard bias analysis suggests that omitting a true confounder (due to uncertainty) generally introduces more bias than including a non-confounder. The Common Cause Criterion is therefore often viewed as overly conservative, potentially leading to residual confounding in the pursuit of parsimony.

- **Disjunctive Cause Criterion**: To address the limitations of the Common Cause Criterion, VanderWeele (2019) proposed the Disjunctive Cause Criterion as a pragmatic strategy for confounder selection.

**The Rule:** Control for any pre-exposure covariate that is

(a) a cause of the exposure, **OR**
(b) a cause of the outcome, **OR**
(c) both.

**Mechanism:** This union-based approach ensures that all common causes (confounders) are included, since they satisfy the condition of being a cause of both. Including variables that are only causes of the outcome improves the precision of the estimate (reducing the standard error) without introducing bias. Including variables that are only causes of the exposure (potential instruments) risks some variance inflation, but this is often considered an acceptable trade-off to ensure no confounders are missed.

**Strength:** The primary strength of the Disjunctive Cause Criterion is its robustness to uncertainty about the full causal structure. The researcher does not need to know whether a variable affects *both* the exposure and the outcome; knowing that it affects *at least one* is sufficient for inclusion. This effectively minimizes the risk of unadjusted confounding while generally avoiding colliders (which are effects, not causes).
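
The precision claim in the mechanism above can be checked directly. In this illustrative R sketch (simulated data with names and effect sizes of our choosing), `X` causes only the outcome: adjusting for it leaves the exposure estimate unbiased but shrinks its standard error.

```r
# Hypothetical illustration: X is a cause of the outcome only (not a confounder).
set.seed(7)
n <- 2000
X <- rnorm(n)                    # cause of Y only, independent of A
A <- rbinom(n, 1, 0.5)           # exposure
Y <- 1 * A + 2 * X + rnorm(n)    # true effect of A on Y is 1

summary(lm(Y ~ A))$coef["A", c("Estimate", "Std. Error")]      # unbiased, larger SE
summary(lm(Y ~ A + X))$coef["A", c("Estimate", "Std. Error")]  # unbiased, smaller SE
```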
- **Modified Disjunctive Cause Criterion**: Refining the Disjunctive Cause Criterion further, the Modified Disjunctive Cause Criterion incorporates specific exclusions and inclusions to optimize both validity and efficiency.

**Exclude IVs:** Recognizing the variance-inflation and Z-bias risks associated with instruments, the modified criterion explicitly removes variables known to affect the exposure but not the outcome. This requires some structural knowledge but yields a more efficient estimator.

**Include Proxies:** Acknowledging that true confounders are often unmeasured, the modified criterion mandates the inclusion of measured variables that serve as *proxies* for the unmeasured common causes. Even if a proxy is not a direct cause, adjusting for it partially blocks the backdoor path transmitted through the unobserved parent variable.
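
The Z-bias rationale for excluding instruments can likewise be demonstrated by simulation. In this illustrative R sketch (a hypothetical setup of ours), `Z` is an instrument and `U` an unmeasured confounder; conditioning on `Z` makes the residual confounding bias worse, not better.

```r
# Hypothetical Z-bias simulation: Z affects only the exposure; U is unmeasured.
set.seed(3)
n <- 100000
Z <- rnorm(n)                 # instrument (causes A only)
U <- rnorm(n)                 # unmeasured confounder of A and Y
A <- 2 * Z + U + rnorm(n)
Y <- U + rnorm(n)             # true causal effect of A on Y is zero

coef(lm(Y ~ A))["A"]       # ~1/6: modest bias from the unmeasured U
coef(lm(Y ~ A + Z))["A"]   # ~1/2: adjusting for the instrument amplifies the bias
```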
Statistical methods can also be used for variable selection, but their application requires careful consideration of the research goal: prediction versus causal inference.

- **Change-in-Estimate**: The Change-in-Estimate (CIE) method operationalizes the definition of confounding: if a variable is a confounder, adjusting for it should change the estimated effect of the exposure.

**The Procedure:** The researcher begins with a "crude" model containing only the exposure and outcome. Potential confounders are added to the model one by one (or removed from a full model). If the regression coefficient for the exposure changes by more than a specified percentage (commonly 10%), the variable is deemed a confounder and retained in the model.

**The Non-Collapsibility Trap:** A critical flaw of the CIE method arises with non-collapsible effect measures, such as the OR or HR. In logistic regression, adding a covariate that is strongly associated with the outcome (but independent of the exposure) will increase the magnitude of the exposure's OR, driving it further from the null. This occurs not because of confounding bias but because of a mathematical property known as non-collapsibility. A CIE algorithm would interpret this change as evidence of confounding and select the variable, potentially leading to over-adjustment or misinterpretation of the effect measure. Thus, CIE is safer for risk differences (RDs) and risk ratios (RRs) but hazardous for ORs.
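
A quick simulation illustrates the trap. In the hedged R sketch below (hypothetical data-generating values), `X` strongly predicts the outcome but is independent of the exposure, so it is not a confounder; the conditional OR still moves away from the crude OR, which a 10% CIE rule would misread as confounding.

```r
# Hypothetical non-collapsibility demo: X is independent of A, so there is
# no confounding, yet the adjusted OR differs from the crude OR.
set.seed(5)
n <- 200000
A <- rbinom(n, 1, 0.5)                           # exposure
X <- rnorm(n)                                    # outcome risk factor, independent of A
Y <- rbinom(n, 1, plogis(-1 + 0.7 * A + 2 * X))  # true conditional log-OR for A is 0.7

exp(coef(glm(Y ~ A,     family = binomial))["A"])  # crude (marginal) OR: closer to 1
exp(coef(glm(Y ~ A + X, family = binomial))["A"])  # conditional OR: ~exp(0.7), further from 1
```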
- **Statistical Significance (Stepwise Selection)**: Stepwise selection algorithms (forward selection, backward elimination, or bidirectional search) rely on statistical significance (p-values) to determine variable inclusion.

**The Procedure:** Variables are added to the model if their association with the outcome yields a p-value below a chosen threshold (e.g., 0.05) or removed if the p-value exceeds it.

**The Confounding vs. Significance Fallacy:** The most fundamental critique of this approach is that "confounding is not a significance test." A variable can be a strong confounder—systematically biasing the effect estimate—even if its association with the outcome fails to reach statistical significance in a specific sample, particularly in small studies. Relying on p-values therefore often leads to under-adjustment and residual confounding.

**Post-Selection Inference:** Stepwise selection invalidates the statistical theory behind confidence intervals. The final model treats the selected variables as if they had been specified *a priori*, ignoring the extensive "data dredging" and multiple testing that occurred during selection. The resulting standard errors are systematically too small and the confidence intervals too narrow, creating a false sense of precision.

**Prediction vs. Causation:** Ultimately, stepwise algorithms are designed to maximize model fit (prediction). They will happily select a collider or a mediator if it is strongly correlated with the outcome, thereby maximizing $R^2$ while destroying the validity of the causal coefficient.
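
The collider problem is easy to reproduce with base R's `step()`. In this illustrative sketch (a hypothetical data-generating process of ours), the true effect of `A` on `Y` is zero, yet AIC-based forward selection keeps the collider `C` because it predicts `Y` so well, and the coefficient for `A` then appears spuriously nonzero.

```r
# Hypothetical demo: AIC-based forward selection happily picks a collider.
set.seed(11)
n   <- 5000
A   <- rnorm(n)
Y   <- rnorm(n)              # true causal effect of A on Y is zero
C   <- A + Y + rnorm(n)      # collider: a common effect of A and Y
dat <- data.frame(Y, A, C)

sel <- step(lm(Y ~ 1, data = dat), scope = ~ A + C,
            direction = "forward", trace = 0)
coef(sel)   # C is retained as a "predictor", and A's coefficient is biased away from 0
```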
- **Purposeful Selection of Covariates**: Recognizing the limitations of purely mechanical stepwise regression, a hybrid algorithm known as "Purposeful Selection" was proposed [@hosmer2013applied; @bursac2008purposeful] that combines statistical criteria with researcher judgment and confounding checks.

**The Algorithm:**

(a) **Univariate Screening:**

- Evaluate all covariates individually.
- Retain any variable with a univariate p-value $< 0.25$. This relaxed threshold is crucial; it aims to capture potential confounders that may be weak individually but strong jointly, or whose effects are masked in univariate analysis.

(b) **Multivariable Model:**

- Fit a model with all candidates identified in step (a).
- Remove variables that are not significant at traditional levels (e.g., $p < 0.05$).

(c) **Confounding Check:** This is the distinguishing feature.

- Before permanently discarding a variable, the analyst must check whether its removal induces a major change ($>15$–$20\%$) in the coefficients of the remaining variables.
- If it does, the variable is added back into the model as a confounder, regardless of its statistical significance.

(d) **Refinement and Interactions:** Excluded variables are added back one by one to check for residual significance. Finally, the model is checked for plausible interactions. A minimal sketch of this loop appears below.
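
The following R function is a minimal sketch of steps (a)–(c) for a linear model, under simplifying assumptions of ours: a data frame with outcome `Y`, exposure `A`, and numeric candidate covariates, and a confounding check that monitors only the exposure coefficient. It is illustrative, not a packaged implementation of the published algorithm.

```r
# Illustrative sketch of purposeful selection (linear-model version).
# Assumes a data frame `dat` with outcome Y, exposure A, and numeric covariates.
purposeful_select <- function(dat, p_screen = 0.25, p_keep = 0.05, cie = 0.20) {
  covs <- setdiff(names(dat), c("Y", "A"))
  # (a) Univariate screening at the relaxed p < 0.25 threshold
  keep <- covs[sapply(covs, function(v)
    coef(summary(lm(reformulate(v, "Y"), dat)))[v, 4] < p_screen)]
  # (b) + (c) Backward elimination with a change-in-estimate safeguard
  repeat {
    fit <- lm(reformulate(c("A", keep), "Y"), dat)
    pv  <- coef(summary(fit))[, 4][keep]
    if (length(pv) == 0 || max(pv) < p_keep) break   # nothing insignificant left
    worst   <- names(which.max(pv))                  # least significant candidate
    reduced <- lm(reformulate(c("A", setdiff(keep, worst)), "Y"), dat)
    change  <- abs((coef(reduced)["A"] - coef(fit)["A"]) / coef(fit)["A"])
    if (change > cie) break      # large shift in A's coefficient: keep it as a confounder
    keep <- setdiff(keep, worst) # otherwise drop it and re-fit
  }
  lm(reformulate(c("A", keep), "Y"), dat)            # final model
}
```

Note that, as the criticism below explains, the change-in-estimate check in this loop would retain a collider just as readily as a true confounder.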
**Insight:** Purposeful Selection is widely cited in epidemiology because it operationalizes the definition of confounding within the selection process. Unlike rigid stepwise regression, it prioritizes the stability of the exposure coefficient over the parsimony of the outcome model, and it forces the analyst to examine the data at each step.

**Criticism:** Purposeful Selection is now considered outdated and flawed by modern causal inference standards. Its fundamental weakness is that it remains entirely driven by statistical associations within the data rather than by a priori causal structure. The confounding check (step (c)), its distinguishing feature, is ironically its most critical flaw: the change-in-estimate criterion cannot distinguish true confounders from colliders or mediators. In the case of a collider, adjustment induces a spurious association (bias), which produces a large change in the exposure's coefficient; the algorithm misinterprets this induced bias as a sign of confounding, retains the collider, and delivers a biased final estimate. Because it is "causally blind," it is not a safeguard against causal errors and has been superseded by DAG-based approaches.

- **Machine Learning (ML)**: Algorithms such as LASSO and random forests are excellent for high-dimensional **prediction**. Their primary role in causal inference is in developing [propensity score (PS) models](propensityscore.html), which is a prediction task for the exposure model [@karim2025effective]. The goal is to create a score that balances measured covariates between the exposed and unexposed groups, mimicking randomization.

**Criticism:** Variance estimation can be poor, depending on the machine learning method used for variable selection, often resulting in poor confidence interval coverage.

- **Advanced Causal Inference Methods, often incorporating ML**:

(a) High-Dimensional Propensity Score (hdPS) [@schneeweiss2009high; @karim2025evaluating]: designed for healthcare databases, it algorithmically scans thousands of proxy variables (e.g., prior diagnoses, medications) and selects those most likely to be confounders for inclusion in the propensity score model.

(b) Machine learning versions of hdPS [@karim2025high; @karim2018can]: these models are excellent at capturing complex, non-linear relationships and interactions among covariates. See [external workshop materials here](https://ehsanx.github.io/hdPSv25/).

(c) Post-double-selection method [@belloni2014inference]: this method formally recognizes that a confounder must be related to both the exposure and the outcome. It uses a machine learning method (e.g., LASSO) to select all covariates that are predictive of the outcome, and then uses LASSO again to select all covariates that are predictive of the exposure. The final adjustment set is the union of the two lists, which algorithmically mimics the Disjunctive Cause Criterion (adjust for causes of the exposure or of the outcome) and avoids the biases of selecting based only on the outcome. The final estimate comes from a simple (non-penalized) regression adjusting for the union set; a sketch appears after this list.

(d) Outcome-Adaptive Lasso [@shortreed2017outcome; @balde2023reader]: a variation of LASSO that essentially performs double selection in a single step. It fits a penalized regression (LASSO) for the outcome model, but the penalty for each covariate is adapted (weighted): covariates that are strongly predictive of the exposure receive a smaller penalty, making them more likely to be kept in the final outcome model regardless of their association with the outcome.

(e) Collaborative Targeted Maximum Likelihood Estimation (C-TMLE) [@van2010collaborative]: uses machine learning (often a "Super Learner" that combines many ML algorithms) to build the best possible outcome model, then collaboratively uses information from that model to decide which covariates also need to enter the propensity score model to minimize bias.
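
To make the post-double-selection recipe in (c) concrete, here is a hedged sketch using the `glmnet` package on simulated data; the variable roles, sizes, and tuning choices are ours, and a real analysis would follow @belloni2014inference more carefully.

```r
# Illustrative post-double-selection with LASSO on hypothetical simulated data.
library(glmnet)
set.seed(2025)
n <- 500; p <- 50
X <- matrix(rnorm(n * p), n, p)
A <- rbinom(n, 1, plogis(X[, 1]))                   # X1 causes the exposure
Y <- 1 * A + 2 * X[, 1] + 1.5 * X[, 2] + rnorm(n)   # X1 is a confounder; X2 affects Y only

nonzero <- function(fit)   # indices with nonzero coefficients at the CV-chosen lambda
  which(as.matrix(coef(fit, s = "lambda.min"))[-1, 1] != 0)

s_out <- nonzero(cv.glmnet(X, Y))                       # step 1: predictors of the outcome
s_exp <- nonzero(cv.glmnet(X, A, family = "binomial"))  # step 2: predictors of the exposure
sel   <- union(s_out, s_exp)                            # union set, disjunctive in spirit

# Final step: a plain, non-penalized regression adjusting for the union set
final <- lm(Y ~ A + X[, sel, drop = FALSE])
coef(final)["A"]
```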