The flexMissing R package implements a weighting-based framework for
identification and estimation of target laws in graphical models of
missing data represented by missing data directed acyclic graphs
(mDAGs). It supports identification and estimation under a broad range
of missingness mechanisms, including missing completely at random
(MCAR), missing at random (MAR), and missing not at random (MNAR). The
package allows the missingness mechanisms to be estimated using either
parametric regression models or flexible machine learning methods
through the SuperLearner package. Based on these estimates,
flexMissing constructs inverse probability weights, which can then be
used to estimate any target parameter defined as a functional of the
target law.
If you find this package useful, please consider citing this paper:
# placeholder
To install, run the following code in R:
# install the devtools package first if it's not yet installed
devtools::install_github("annaguo-bios/flexMissing")
The source code for the flexMissing package is available on GitHub at
flexMissing.
The graph below is an example mDAG.
Some definitions:
- Missing mechanism (or propensity score): the conditional distribution of the missingness indicators given their parents. For example, the missing mechanism (propensity score) for $R_3$ is $p(R_3|X_1,R_4,R_5,R_6)$.
- Full law: the joint distribution of all variables and their missingness indicators, i.e., $p(X_1,X_2,X_3,X_4,X_5,X_6,R_1,R_2,R_3,R_4,R_5,R_6)$.
- Target law: the joint distribution of all variables, i.e., $p(X_1,X_2,X_3,X_4,X_5,X_6)$.
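For this particular graph, where the parents of each $X_i$ are themselves substantive variables, the full law factorizes into the target law times the propensity scores:

```latex
p(X_1,\ldots,X_6,R_1,\ldots,R_6)
  = \underbrace{p(X_1,\ldots,X_6)}_{\text{target law}}
    \; \prod_{k=1}^{6} \underbrace{p\big(R_k \mid \operatorname{pa}(R_k)\big)}_{\text{propensity scores}}
```

Dividing the full law by the propensity scores (equivalently, weighting complete cases by their inverse) therefore recovers the target law, which is the basis of the weighting approach used by the package.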
The target law is identified if and only if all propensity scores evaluated at $R = 1$ are identified.
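To build intuition for the weighting step before turning to the package, here is a minimal self-contained sketch in base R (not using flexMissing; all variable names are illustrative) showing that inverse probability weighting with an estimated propensity score recovers the mean of a variable that is MAR given a fully observed parent:

```r
set.seed(1)
n <- 1e5
Z <- rnorm(n)                      # fully observed parent of R
X <- rnorm(n, mean = Z)            # variable subject to missingness
R <- rbinom(n, 1, plogis(Z))       # MAR: missingness depends on Z only
X_obs <- ifelse(R == 1, X, NA)     # what we actually observe

ps_fit <- glm(R ~ Z, family = binomial)  # estimate the propensity score
w <- 1 / fitted(ps_fit)                  # inverse probability weights

naive <- mean(X_obs, na.rm = TRUE)           # biased complete-case mean
ipw   <- mean(ifelse(R == 1, w * X_obs, 0))  # IPW estimate of E[X]
c(naive = naive, ipw = ipw, truth = mean(X)) # ipw should sit near the truth
```

The complete-case mean over-represents observations with large Z (hence large X), while reweighting each observed case by the inverse of its observation probability removes that selection bias.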
We can create a graph object based on the mDAG above and visualize it using the code below. In fact, the mDAG shown was generated with this code.
library(flexMissing)
G2 <- make.graph(obs_variables = c(),
missing_variables = c('X1','X2','X3','X4','X5','X6'),
missing_indicators = c('R1','R2','R3','R4','R5','R6'),
di_edges = list(c('X1','X2'),c('X2','X3'),c('X3','X4'),c('X4','X5'),c('X5','X6'),
c('R6','R5'),c('R6','R3'),c('R5','R3'),c('R4','R3'),c('R3','R2'),c('R2','R1'),
c('X1','R3'),
c('X3','R4'),c('X3','R5'),c('X3','R6'),
c('X4','R1'),c('X4','R6'),
c('X5','R2'),
c('X6','R5'))
)
# check G2
plot(G2, vertex.size = 15, graph.main='G2')
With the graph object created above, we can check whether the target law
is identified using the f.ID_algorithm function. We can also check
whether the full law is identified by setting the argument
fulllaw = T. Full law ID is still under
development.
ID2 <- f.ID_algorithm(G2, fulllaw = T)
## The target law is identified in the given mDAG.
## Full law identification for R3 failed.
## Full law identification for R5 failed.
## The full law is NOT identified in the given mDAG.
We can visualize the ID result using the plot function.
plot(ID2, vertex.size = 15)
## The edges colored in skyblue are pruned edges for full law ID.
## The nodes colored in skyblue are non-ID nodes for full law.
## The edges colored in tomato are pruned edges for target law ID.
## The nodes colored in tomato are non-ID nodes for target law.
The tree structure provides a summary of how identification of each
propensity score is achieved. For example, R1 does not have any children
in the tree, meaning that its propensity score is identified using an
ordinary independence constraint.
A summary of the identification result can be obtained using the
f.summarize_ID function. We will explain pre-selection, selection,
selection_r, and selection_x shortly in the estimation section.
f.summarize_ID(ID2)
##
## === Target Law ID Result ===
##
##
##
## Missing_Indicators Propensitys ID_Status Intervened_On Preselection Selection Selection_r Selection_x
## ------------------- ----------------------- ---------- -------------- ------------------- ----------- ------------ ------------
## R1 p(R1 | X4, R2) TRUE - R4 R4 - R4
## R2 p(R2 | X5, R3) TRUE - R5 R5 - R5
## R3 p(R3 | X1, R4, R5, R6) TRUE R1, R2 R1, R4, R5 R1, R4, R5 R4, R5 R1
## R5 p(R5 | X3, X6, R6) TRUE R1, R3 R3, R6, R4, R1, R4 R3, R6 R6 R3, R6
## R6 p(R6 | X3, X4) TRUE R1, R3 R3, R4, R4, R1, R4 R3, R4 - R3, R4
## R4 p(R4 | X3) TRUE R2, R3 R3, R5, R1, R5 R3 - R3
##
##
## === Full Law ID Result ===
##
##
##
## Missing_Indicators Propensitys ID_Status Intervened_On Preselection Selection Selection_r Selection_x
## ------------------- ----------------------- ---------- -------------- ------------------- ---------- ------------ ------------
## R1 p(R1 | X4, R2) TRUE - R4 R4 - R4
## R2 p(R2 | X5, R3) TRUE - R5 R5 - R5
## R3 p(R3 | X1, R4, R5, R6) FALSE - R1 R1 - R1
## R5 p(R5 | X3, X6, R6) FALSE R1, R3 R3, R6, R4, R1, R4 R3, R6 R6 R3, R6
## R6 p(R6 | X3, X4) TRUE R1, R3 R3, R4, R4, R1, R4 R3, R4 - R3, R4
## R4 p(R4 | X3) TRUE R2, R3 R3, R5, R1, R5 R3 - R3
Let’s first generate a data set according to the mDAG G2 defined above. We will generate n=5000 samples.
library(dplyr)
G2.dgp <- function(n,
parX1 = c(1,1),
parX2 = c(-1,1,1),
parX3 = c(1,1,1),
parX4 = c(1,-0.5,1),
parX5 = c(1,3,1),
parX6 = c(-1,1,1),
parR1 = c(0,1,1),
parR2 = c(-2,1,1),
parR3 = c(0,1,1,1,1),
parR4 = c(1,1),
parR5 = c(1,1,1,1),
parR6 = c(-2,1,1)
){
# generate missing variables
X1 <- rnorm(n, parX1[1], parX1[2])
X2 <- rnorm(n, parX2[1] + parX2[2]*X1, parX2[3])
X3 <- rnorm(n, parX3[1] + parX3[2]*X2, parX3[3])
X4 <- rnorm(n, parX4[1] + parX4[2]*X3, parX4[3])
X5 <- rnorm(n, parX5[1] + parX5[2]*X4, parX5[3])
X6 <- rnorm(n, parX6[1] + parX6[2]*X5, parX6[3])
# generate missing indicators
R6 <- rbinom(n, 1, plogis(parR6[1] + parR6[2]*X3 + parR6[3]*X4))
R5 <- rbinom(n, 1, plogis(parR5[1] + parR5[2]*X3+ parR5[3]*X6 + parR5[4]*R6))
R4 <- rbinom(n, 1, plogis(parR4[1] + parR4[2]*X3))
R3 <- rbinom(n, 1, plogis(parR3[1] + parR3[2]*X1+ parR3[3]*R5 + parR3[4]*R6+ parR3[5]*R4))
R2 <- rbinom(n, 1, plogis(parR2[1] + parR2[2]*X5 + parR2[3]*R3))
R1 <- rbinom(n, 1, plogis(parR1[1] + parR1[2]*X4 + parR1[3]*R2))
# generate data
data_no_missing <- data.frame(X1=X1, X2=X2, X3=X3, X4=X4, X5=X5,X6=X6, R1=R1, R2=R2, R3=R3, R4=R4, R5=R5, R6=R6)
data <- data_no_missing %>%
mutate(
X1 = if_else(R1 == 0, NA_real_, X1),
X2 = if_else(R2 == 0, NA_real_, X2),
X3 = if_else(R3 == 0, NA_real_, X3),
X4 = if_else(R4 == 0, NA_real_, X4),
X5 = if_else(R5 == 0, NA_real_, X5),
X6 = if_else(R6 == 0, NA_real_, X6)
)
missing_prop <- apply(data, 2, FUN=function(x) sum(is.na(x))/n)
return(list(data = data, data_no_missing = data_no_missing, missing_prop = missing_prop))
}
set.seed(7)
G2.dgp.output <- G2.dgp(5000)
data <- G2.dgp.output$data
data_no_missing <- G2.dgp.output$data_no_missing
Once identification is achieved, we can proceed to estimation of the
propensity scores. We can estimate them with simple logistic
regressions; by default (formula.R = NULL), the regression formula for
each propensity score is constructed from the identification result, and
a custom formula can be supplied through the formula.R argument.
propensitys <- f.propensity(graph=G2,
data=data,
ID=ID2,
law='target', # 'target' or 'full'
link.R="logit",
formula.R=NULL,
vocal=T)
## Estimating propensity for R1.
## Done with propensity estimations.
## Estimating propensity for R2.
## Done with propensity estimations.
## Estimating propensity for R3.
## Done with propensity estimations.
## Estimating propensity for R5.
## Done with propensity estimations.
## Estimating propensity for R6.
## Done with propensity estimations.
## Estimating propensity for R4.
## Done with propensity estimations.
Let's check what we have in propensitys, using R3 as an example. This is
the fitted regression for R3:
summary(propensitys[['R3']]$Rk_fit)
##
## Call:
## glm(formula = as.formula(formula.R), family = binomial(link.R),
## data = data.obs, weights = 1/weights[fit_rows])
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.27397 0.09435 34.699 < 2e-16 ***
## X1 0.78922 0.08198 9.627 < 2e-16 ***
## R4 NA NA NA NA
## R5 NA NA NA NA
## R6 -0.44289 0.15475 -2.862 0.00421 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1860.8 on 2285 degrees of freedom
## Residual deviance: 1760.7 on 2283 degrees of freedom
## (2714 observations deleted due to missingness)
## AIC: 1715.7
##
## Number of Fisher Scoring iterations: 6
Rk_pred contains the predicted propensity scores for each observation in
the data. We cannot make predictions for every observation, so some
predicted values are NA. selection contains the missingness indicators
that define the observations for which predictions can be made. For
example:
head(propensitys[['R3']]$Rk_pred)
## [1] 0.9956163 0.9576566 0.9711185        NA        NA 0.9649595
With the estimated propensity scores, we can compute inverse probability
weights and estimate any parameter defined as a functional of the target
law. For example, we can estimate the means of X3 and X4, along with the
product of the two means, using the f.Mestimation function below.
est_FUN <- function(data, theta){
with(data, c(X3-theta[1], X4-theta[2], theta[3] - theta[1]*theta[2]))
}
variables <- list(c('X3'), c('X4'), NULL) # the missing variables in each estimating equation
est <- f.Mestimation(data=data,
graph=G2,
propensitys=propensitys,
ID=ID2,
law='target',
est_FUN=est_FUN,
variables=variables,
start=rep(0,3),
vocal=T)
# the estimates
cat("Estimates: ", est$results@estimates, "\n")
# the truth
cat("Truth: ", mean(data_no_missing$X3), mean(data_no_missing$X4), mean(data_no_missing$X3)*mean(data_no_missing$X4), "\n")
If we want more flexibility, we can also use the inverse propensity
scores directly to perform any estimation task. For example, we can
estimate the mean of X3. The f.Mestimation function above relies on the
geex package, while the code below directly computes the weighted mean;
the two should be asymptotically equivalent.
library(tidyr)
clique <- find_clique(graph=G2,
data=data,
vocal=F,
propensitys=propensitys,
est_R='R3', # the missingness indicator corresponding to the missing variables in the estimation task
non_ID_nodes=c())
cat("Estimated mean of X3: ", mean(replace_na(clique$magic_weight*data$X3,0)), "\n")
## Estimated mean of X3:  1.039139
cat("Truth: ", mean(data_no_missing$X3), "\n")
## Truth:  1.028805
We can also perform regressions, which become weighted regressions under
missing data. For example, we can regress X6 on X5:
clique <- find_clique(graph=G2,
data=data,
vocal=F,
propensitys=propensitys,
est_R=c('R5','R6'), # the missingness indicator corresponding to the missing variables in the estimation task
non_ID_nodes=c())
fit_rows <- clique$fit_rows # the rows we can use for fitting the regression
magic_weight <- clique$magic_weight # the inverse probability weights
weighted_regression <- lm(X6 ~ X5,
data = data[fit_rows, c('X5','X6')],
weights = magic_weight[fit_rows])
summary(weighted_regression)
##
## Call:
## lm(formula = X6 ~ X5, data = data[fit_rows, c("X5", "X6")], weights = magic_weight[fit_rows])
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -7.7977 -0.8934 0.0002 0.9076 6.8281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.94199 0.02819 -33.41 <2e-16 ***
## X5 0.99433 0.00633 157.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.454 on 1890 degrees of freedom
## Multiple R-squared: 0.9289, Adjusted R-squared: 0.9288
## F-statistic: 2.468e+04 on 1 and 1890 DF, p-value: < 2.2e-16
# Truth
cat("True coefficients are -1 and 1. Quite close.\n")
## True coefficients are -1 and 1. Quite close.
The variances of the estimates output by either summary or the
f.Mestimation function are not correct, as they fail to account for the
uncertainty in estimating the propensity scores. Correct inference
relies on the bootstrap.
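For example, a nonparametric bootstrap re-runs the whole pipeline, including propensity estimation, on each resample. Below is a base-R sketch of the idea using a simple IPW mean; in practice the f.propensity / f.Mestimation calls would play the role of the inner estimator (the data-generating process and names here are illustrative):

```r
set.seed(2)
n <- 2000
Z <- rnorm(n)
X <- rnorm(n, mean = Z)
R <- rbinom(n, 1, plogis(1 + Z))                 # MAR given Z
dat <- data.frame(Z = Z, X = ifelse(R == 1, X, NA), R = R)

# IPW mean of X; the propensity model is re-fit on every call,
# so the bootstrap captures its estimation uncertainty
ipw_mean <- function(d) {
  ps <- glm(R ~ Z, family = binomial, data = d)
  mean(ifelse(d$R == 1, d$X / fitted(ps), 0))
}

B <- 200
boot_est <- replicate(B, ipw_mean(dat[sample(nrow(dat), replace = TRUE), ]))
c(estimate = ipw_mean(dat), boot_se = sd(boot_est))
```

The bootstrap standard deviation of the resampled estimates is a valid standard error because each resample repeats both stages of the estimation, unlike the model-based standard errors that condition on the fitted propensity scores.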

