
Commit 5d01ccc

committed
missing25
1 parent 584dcc1 commit 5d01ccc

File tree

4 files changed

+146
-73
lines changed


docs/missingdata0.html

Lines changed: 27 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1400,7 +1400,7 @@ <h3 class="anchored" data-anchor-id="critique-of-flawed-ad-hoc-approaches">Criti
 <ul>
 <li><strong>Complete Case Analysis (Listwise Deletion)</strong>: This method, the default in many software packages, involves analyzing only the subset of observations with no missing data on any variable. While simple, it is statistically valid only under a very strict and rare assumption about the missing data mechanism. Its widespread use without proper justification is one of the most common and serious errors in the literature. As a general rule of thumb, some methodologists suggest that complete case analysis could be considered for the primary analysis if the percentage of missing observations across all variables combined is below approximately 5%, but this requires a very strong justification and should not be based solely on a statistical test. Furthermore, if only the outcome variable has missing values, complete case analysis can be more statistically efficient than multiple imputation.</li>
 <li><strong>Single Imputation (e.g., Mean/Median)</strong>: This approach involves “filling in” each missing value with a single number, such as the mean or median of the observed values for that variable. While this creates a complete dataset, it artificially reduces the natural variability of the data. All the imputed values are identical, which shrinks the standard deviation and leads to underestimated standard errors and overly optimistic (i.e., too small) p-values.</li>
-<li><strong>Indicator Method</strong>: Another flawed technique is to create a new “missing” category for a variable and include this indicator in a regression model. This is not a valid statistical approach and can introduce significant bias into the model’s estimates. This method treats the lack of information as if it were a meaningful, substantive category. For example, if income data is missing for lower-income individuals, creating a “Missing” category can mask the true relationship between income and health, potentially causing the model to underestimate the effect of income. The bias can be especially noticeable if the variable with the missing category is an important confounder.</li>
+<li><strong>Indicator Method</strong>: Another flawed technique is to create a new “missing” category for a variable and include this indicator in a regression model <span class="citation" data-cites="greenland1995critical vach1991biased">(<a href="#ref-greenland1995critical" role="doc-biblioref">Greenland and Finkle 1995</a>; <a href="#ref-vach1991biased" role="doc-biblioref">Vach and Blettner 1991</a>)</span>. This is not a valid statistical approach and can introduce significant bias into the model’s estimates. This method treats the lack of information as if it were a meaningful, substantive category. For example, if income data is missing for lower-income individuals, creating a “Missing” category can mask the true relationship between income and health, potentially causing the model to underestimate the effect of income. The bias can be especially noticeable if the variable with the missing category is an important confounder.</li>
 </ul>
 <p>The persistence of these suboptimal methods points to a critical issue beyond mere statistical technique. The failure to explicitly report the extent of missing data or to justify the method used to handle it is a matter of scientific integrity. Principled missing data analysis is not just about getting a more accurate p-value; it is about a commitment to transparency and producing the most robust and honest results possible from the available evidence.</p>
 <hr>
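The variance-shrinking effect of mean imputation described in the bullet above is easy to demonstrate. The following is a toy sketch with simulated data (not drawn from the document): mean-imputing a block of missing values leaves the mean essentially unchanged but visibly deflates the standard deviation, which is exactly why downstream standard errors come out too small.

```python
import random
import statistics

random.seed(1)

# Toy sample: 100 values we observe plus 30 that will go missing.
complete = [random.gauss(50, 10) for _ in range(130)]
observed, missing = complete[:100], complete[100:]

# Mean imputation: every missing value becomes the observed mean.
mean_obs = statistics.mean(observed)
imputed = observed + [mean_obs] * len(missing)

sd_complete = statistics.stdev(complete)
sd_imputed = statistics.stdev(imputed)

# The imputed dataset is always less variable than the complete one:
# identical fill-in values contribute nothing to the spread.
print(f"SD of complete data:      {sd_complete:.2f}")
print(f"SD after mean imputation: {sd_imputed:.2f}")  # smaller
```

The shrinkage grows with the fraction of values imputed, so the understatement of uncertainty worsens as more data are missing.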
@@ -1456,7 +1456,7 @@ <h3 class="anchored" data-anchor-id="table-summary-of-missing-data-mechanisms">T
 </tr>
 </tbody>
 </table>
-<p>The crucial takeaway is that the most powerful and widely used methods for handling missing data, such as Multiple Imputation, operate under the MAR assumption. This leads to a fundamental challenge for the researcher. The slides explicitly state, <strong>“it is not possible to distinguish between MAR and MNAR using observed data</strong>. This creates an apparent paradox: to proceed with the best available methods, one must make an assumption that cannot be statistically proven or disproven with the data at hand.</p>
+<p>The crucial takeaway is that the most powerful and widely used methods for handling missing data, such as Multiple Imputation, operate under the MAR assumption. This leads to a fundamental challenge for the researcher. It is not possible to distinguish between MAR and MNAR using observed data. This creates an apparent paradox: to proceed with the best available methods, one must make an assumption that cannot be statistically proven or disproven with the data at hand.</p>
 <p>The resolution to this paradox lies in shifting the burden of proof from a statistical test to a well-reasoned, subject-matter argument. A researcher cannot simply run a test to “choose” MAR. Instead, they must build a compelling case for why MAR is a <em>plausible</em> assumption in their specific research context. This involves a deep understanding of the data collection process and the substantive area of study. The strength of the final analysis rests not on a p-value from a test, but on the plausibility of this foundational, untestable assumption.</p>
 <hr>
 </section>
@@ -1496,7 +1496,12 @@ <h3 class="anchored" data-anchor-id="single-imputation-a-first-step">Single Impu
 </section>
 <section id="when-single-imputation-may-be-considered" class="level3">
 <h3 class="anchored" data-anchor-id="when-single-imputation-may-be-considered">When Single Imputation May Be Considered</h3>
-<p>While generally discouraged for final inferential analysis, there are specific scenarios where single imputation may be considered a pragmatic choice : * <strong>Clinical Trials</strong>: It is often preferred for imputing missing baseline covariates in randomized clinical trials. * <strong>Missing Outcome with Auxiliary Variables</strong>: If only the outcome variable is missing and strong auxiliary variables (proxies for the outcome) are available, single imputation may be more effective than complete case analysis. * <strong>Prediction Problems</strong>: In machine learning contexts focused on prediction, single imputation methods can be used, though pooling results from multiple imputations is not straightforward.</p>
+<p>While generally discouraged for final inferential analysis, there are specific scenarios where single imputation may be considered a pragmatic choice:</p>
+<ul>
+<li><strong>Clinical Trials</strong>: It is often preferred for imputing missing baseline covariates in randomized clinical trials.</li>
+<li><strong>Missing Outcome with Auxiliary Variables</strong>: If only the outcome variable is missing and strong auxiliary variables (proxies for the outcome) are available, single imputation may be more effective than complete case analysis.</li>
+<li><strong>Prediction Problems</strong>: In machine learning contexts focused on prediction, single imputation methods can be used, though pooling results from multiple imputations is not straightforward <span class="citation" data-cites="hossain2025lasso">(<a href="#ref-hossain2025lasso" role="doc-biblioref">Hossain et al. 2025</a>)</span>.</li>
+</ul>
 </section>
 <section id="the-unifying-flaw-of-single-imputation" class="level3">
 <h3 class="anchored" data-anchor-id="the-unifying-flaw-of-single-imputation">The Unifying Flaw of Single Imputation</h3>
@@ -1525,12 +1530,12 @@ <h3 class="anchored" data-anchor-id="step-1-the-imputation-phase---creating-plau
 </ul></li>
 <li><strong>Practical Considerations for the Imputation Model</strong>:
 <ul>
-<li><strong>Number of Imputations (m)</strong>: A common rule of thumb suggests that the number of imputations, <em>m</em>, should be at least as large as the percentage of subjects with any missing data. Modern recommendations often suggest between 20 and 100 imputations.</li>
+<li><strong>Number of Imputations (m)</strong>: A common rule of thumb suggests that the number of imputations, <em>m</em>, should be at least as large as the percentage of subjects with any missing data <span class="citation" data-cites="austin2021missing">(<a href="#ref-austin2021missing" role="doc-biblioref">Austin et al. 2021</a>)</span>. Modern recommendations often suggest between 20 and 100 imputations.</li>
 <li><strong>Number of Iterations</strong>: MICE is an iterative algorithm. In each cycle, it updates the imputed values based on the progressively improved predictions from the other variables. The algorithm is run for a set number of iterations to allow the imputed values to stabilize, a state known as convergence.</li>
 <li><strong>Handling Non-Normal Data</strong>: For continuous variables that are not normally distributed (e.g., skewed), one approach is to transform the variable before imputation and transform it back afterward. However, this can distort relationships and complicate interpretation. A more robust and often preferred strategy within MICE is to use Predictive Mean Matching (PMM), which is well-suited for non-normal data because it imputes values directly from the observed data, thereby preserving the original distribution.</li>
 </ul></li>
 </ul>
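The PMM idea in the last bullet can be sketched in a few lines. This is a minimal, hypothetical illustration, not the algorithm as implemented in the <code>mice</code> package: predictions come from a simple least-squares line of x on y, and each missing case borrows its value from one of the k observed "donors" with the closest predicted mean. The function name and the toy data are invented for illustration.

```python
import random

def pmm_impute(x_obs, y_obs, y_mis, k=5, rng=random):
    """Predictive mean matching sketch: for each case with missing x,
    find the k observed cases whose predicted x is closest to the
    missing case's predicted x, then draw the imputed value from those
    observed donor x values (so every imputed value is a real one)."""
    n = len(x_obs)
    my = sum(y_obs) / n
    mx = sum(x_obs) / n
    # Least-squares slope and intercept of x regressed on y.
    b = sum((y - my) * (x - mx) for x, y in zip(x_obs, y_obs)) / \
        sum((y - my) ** 2 for y in y_obs)
    a = mx - b * my
    preds_obs = [a + b * y for y in y_obs]
    imputed = []
    for y in y_mis:
        pred = a + b * y
        # Indices of the k donors with the closest predicted means.
        donors = sorted(range(n), key=lambda i: abs(preds_obs[i] - pred))[:k]
        imputed.append(x_obs[rng.choice(donors)])
    return imputed

# Toy usage: two cases have y but are missing x.
rng = random.Random(0)
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [2.1, 3.9, 6.2, 8.1, 9.8]
filled = pmm_impute(x_obs, y_obs, [5.0, 7.0], k=2, rng=rng)
```

Because every imputed value is drawn from the observed values themselves, the original range, skewness, and discreteness of the variable are automatically preserved.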
-<p>A common point of confusion is why the outcome variable should be included as a predictor in the imputation model. This seems circular or like “cheating.” However, this stems from a misunderstanding of the imputation model’s goal. The goal is not merely to predict a missing covariate <span class="math inline">\(X\)</span>, but to impute <span class="math inline">\(X\)</span> in a way that <em>preserves its true relationship with the outcome Y</em>. The outcome <span class="math inline">\(Y\)</span> is often the single best predictor of <span class="math inline">\(X\)</span>. Excluding it from the imputation model would cause the imputed values of <span class="math inline">\(X\)</span> to have a weaker relationship with <span class="math inline">\(Y\)</span> than the observed values of <span class="math inline">\(X\)</span> do, biasing any estimated association between <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> towards zero. The imputation model’s purpose is structural preservation, which enables the subsequent analysis model to accurately test a specific hypothesis.</p>
+<p>A common point of confusion is why the outcome variable should be included as a predictor in the imputation model <span class="citation" data-cites="white2011multiple">(<a href="#ref-white2011multiple" role="doc-biblioref">White, Royston, and Wood 2011</a>)</span>. This seems circular or like “cheating.” However, this stems from a misunderstanding of the imputation model’s goal. The goal is not merely to predict a missing covariate <span class="math inline">\(X\)</span>, but to impute <span class="math inline">\(X\)</span> in a way that <em>preserves its true relationship with the outcome Y</em>. The outcome <span class="math inline">\(Y\)</span> is often the single best predictor of <span class="math inline">\(X\)</span>. Excluding it from the imputation model would cause the imputed values of <span class="math inline">\(X\)</span> to have a weaker relationship with <span class="math inline">\(Y\)</span> than the observed values of <span class="math inline">\(X\)</span> do, biasing any estimated association between <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> towards zero. The imputation model’s purpose is structural preservation, which enables the subsequent analysis model to accurately test a specific hypothesis.</p>
 </section>
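The attenuation toward zero described above can be seen directly in a toy simulation (invented data, ignoring Y during imputation by construction): mean-imputing X without reference to Y drags the estimated X–Y correlation toward zero.

```python
import random
import statistics

random.seed(2)

# Toy data: X and Y strongly associated.
x = [random.gauss(0, 1) for _ in range(500)]
y = [0.8 * xi + random.gauss(0, 0.6) for xi in x]

# Make about 40% of X missing completely at random, then mean-impute X
# without using Y (i.e., the outcome is excluded from imputation).
mask = [random.random() < 0.4 for _ in x]
mx = statistics.mean(xi for xi, m in zip(x, mask) if not m)
x_imp = [mx if m else xi for xi, m in zip(x, mask)]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    da = sum((ai - ma) ** 2 for ai in a) ** 0.5
    db = sum((bi - mb) ** 2 for bi in b) ** 0.5
    return num / (da * db)

# Imputed cases contribute nothing to the covariance, so the
# estimated association is biased toward zero:
print(f"complete-data corr:    {corr(x, y):.2f}")
print(f"after outcome-free impute: {corr(x_imp, y):.2f}")
```

An imputation model that conditions on Y avoids this by giving the imputed X values the same relationship with Y that the observed X values have.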
 <section id="step-2-the-analysis-phase---analyzing-each-reality" class="level3">
 <h3 class="anchored" data-anchor-id="step-2-the-analysis-phase---analyzing-each-reality">Step 2: The Analysis Phase - Analyzing Each Reality</h3>
@@ -1543,8 +1548,8 @@ <h3 class="anchored" data-anchor-id="step-3-the-pooling-phase---synthesizing-the
 <li><strong>The Pooled Estimate</strong>: The final point estimate for any parameter (e.g., a regression coefficient) is simply the average of the <em>m</em> estimates obtained in the analysis phase.</li>
 <li><strong>The Pooled Variance</strong>: This is the key to MI’s success. The total variance of the pooled estimate correctly accounts for all sources of uncertainty and is composed of two parts:
 <ol type="1">
-<li><strong>Within-Imputation Variance (<span class="math inline">\(\bar{U}\)</span>)</strong>: This is the average of the variances from each of the <em>m</em> analyses. It represents the normal sampling uncertainty we would have if our data had been complete from the start.</li>
-<li><strong>Between-Imputation Variance (<span class="math inline">\(B\)</span>)</strong>: This is the variance of the parameter estimates <em>across</em> the <em>m</em> datasets. It directly captures the extra uncertainty that is due to the missing data. If the missing data were not very influential, the estimates from all <em>m</em> datasets would be very similar, and <span class="math inline">\(B\)</span> would be small. If the missing data were very influential, the estimates would vary more, and <span class="math inline">\(B\)</span> would be large.</li>
+<li><strong>Within-Imputation Variance (</strong><span class="math inline">\(\bar{U}\)</span>): This is the average of the variances from each of the <em>m</em> analyses. It represents the normal sampling uncertainty we would have if our data had been complete from the start.</li>
+<li><strong>Between-Imputation Variance (</strong><span class="math inline">\(B\)</span>): This is the variance of the parameter estimates <em>across</em> the <em>m</em> datasets. It directly captures the extra uncertainty that is due to the missing data. If the missing data were not very influential, the estimates from all <em>m</em> datasets would be very similar, and <span class="math inline">\(B\)</span> would be small. If the missing data were very influential, the estimates would vary more, and <span class="math inline">\(B\)</span> would be large.</li>
 </ol></li>
 </ul>
 <p>The formula for the total variance (<span class="math inline">\(T\)</span>) is <span class="math inline">\(T = \bar{U} + B(1 + 1/m)\)</span>. This elegant formula shows how MI correctly inflates the standard error to account for the uncertainty from missing data (<span class="math inline">\(B\)</span>), solving the primary problem of single imputation and yielding valid confidence intervals and p-values. The “fraction of missing information” (FMI) is a useful metric derived from this process, which quantifies the proportion of the total variance that is attributable to the missing data.</p>
@@ -1684,9 +1689,18 @@ <h2 class="anchored" data-anchor-id="references">References</h2>
 
 
 <div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
+<div id="ref-austin2021missing" class="csl-entry" role="listitem">
+Austin, Peter C., Ian R. White, Douglas S. Lee, and Stef van Buuren. 2021. <span>“Missing Data in Clinical Research: A Tutorial on Multiple Imputation.”</span> <em>Canadian Journal of Cardiology</em> 37 (9): 1322–31.
+</div>
 <div id="ref-granger2019avoiding" class="csl-entry" role="listitem">
 Granger, Elizabeth, Jamie C. Sergeant, and Mark Lunt. 2019. <span>“Avoiding Pitfalls When Combining Multiple Imputation and Propensity Scores.”</span> <em>Statistics in Medicine</em> 38 (26): 5120–32.
 </div>
+<div id="ref-greenland1995critical" class="csl-entry" role="listitem">
+Greenland, Sander, and William D. Finkle. 1995. <span>“A Critical Look at Methods for Handling Missing Covariates in Epidemiologic Regression Analyses.”</span> <em>American Journal of Epidemiology</em> 142 (12): 1255–64.
+</div>
+<div id="ref-hossain2025lasso" class="csl-entry" role="listitem">
+Hossain, Md Belal, Mohsen Sadatsafavi, James C. Johnston, Hubert Wong, Victoria J. Cook, and Mohammad Ehsanul Karim. 2025. <span>“LASSO-Based Survival Prediction Modelling with Multiply Imputed Data: A Case Study in Tuberculosis Mortality Prediction.”</span> <em>The American Statistician</em>, no. just-accepted: 1–20.
+</div>
 <div id="ref-hughes2019accounting" class="csl-entry" role="listitem">
 Hughes, Rachael A., Jon Heron, Jonathan A. Sterne, and Kate Tilling. 2019. <span>“Accounting for Missing Data in Statistical Analyses: Multiple Imputation Is Not Always the Answer.”</span> <em>International Journal of Epidemiology</em> 1: 11.
 </div>
@@ -1696,9 +1710,15 @@ <h2 class="anchored" data-anchor-id="references">References</h2>
 <div id="ref-sterne2009multiple" class="csl-entry" role="listitem">
 Sterne, Jonathan A., et al. 2009. <span>“Multiple Imputation for Missing Data in Epidemiological and Clinical Research: Potential and Pitfalls.”</span> <em>BMJ</em> 338: b2393.
 </div>
+<div id="ref-vach1991biased" class="csl-entry" role="listitem">
+Vach, Werner, and Maria Blettner. 1991. <span>“Biased Estimation of the Odds Ratio in Case-Control Studies Due to the Use of Ad Hoc Methods of Correcting for Missing Values for Confounding Variables.”</span> <em>American Journal of Epidemiology</em> 134 (8): 895–907.
+</div>
 <div id="ref-vanbuuren2018flexible" class="csl-entry" role="listitem">
 Van Buuren, Stef. 2018. <em>Flexible Imputation of Missing Data</em>. Chapman &amp; Hall/CRC.
 </div>
+<div id="ref-white2011multiple" class="csl-entry" role="listitem">
+White, Ian R., Patrick Royston, and Angela M. Wood. 2011. <span>“Multiple Imputation Using Chained Equations: Issues and Guidance for Practice.”</span> <em>Statistics in Medicine</em> 30 (4): 377–99.
+</div>
 </div>
 </section>
docs/search.json

Lines changed: 6 additions & 6 deletions
Large diffs are not rendered by default.
