
An R Package for Target Law Identification and Estimation in Missing Data mDAGs

The flexMissing R package implements a weighting-based framework for identification and estimation of target laws in graphical models of missing data represented by missing data directed acyclic graphs (mDAGs). It supports identification and estimation under a broad range of missingness mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The package allows the missingness mechanisms to be estimated using either parametric regression models or flexible machine learning methods through the SuperLearner package. Based on these estimates, flexMissing constructs inverse probability weights, which can then be used to estimate any target parameter defined as a functional of the target law.

If you find this package useful, please consider citing this paper:

# placeholder

1 Installation

To install, run the following code in R:

# install the devtools package first if it's not yet installed
# install.packages("devtools")
devtools::install_github("annaguo-bios/flexMissing")

The source code for the flexMissing package is available on GitHub at flexMissing.

2 A Brief Introduction to mDAGs

The graph below is an example mDAG, where $X_1$ to $X_6$, collectively denoted by $X=(X_1,X_2,\cdots,X_6)$, are variables subject to missingness, and $R_1$ to $R_6$, collectively denoted by $R=(R_1,R_2,\cdots,R_6)$, are their corresponding missingness indicators, taking the value 1 if the variable is observed and 0 if it is missing. The observed proxies, $X_1^\star$ to $X_6^\star$, equal the corresponding variable when it is observed and are set to NA when the variable is missing.

Some definitions:

  • Missing mechanism (or propensity score): the conditional distribution of the missingness indicators given their parents. For example, the missing mechanism (propensity score) for $R_3$ is $p(R_3|X_1,R_4,R_5,R_6)$.
  • Full law: the joint distribution of all variables and their missingness indicators, i.e., $p(X_1,X_2,X_3,X_4,X_5,X_6,R_1,R_2,R_3,R_4,R_5,R_6)$.
  • Target law: the joint distribution of all variables, i.e., $p(X_1,X_2,X_3,X_4,X_5,X_6)$.

The target law is identified if and only if all propensity scores evaluated at $R=1$ (i.e., all variables are observed) are identified. This is easily seen from Bayes’ rule, where $\mathrm{pa}(R_k)$ denotes the parents of $R_k$ in the mDAG:

$$ p(X) = \frac{p(X, R)}{\prod_{R_k \in R} p\big(R_k \mid \mathrm{pa}(R_k)\big)} \Bigg\rvert_{R=1}. $$
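This identity can be checked numerically. Below is a minimal base-R sketch (independent of flexMissing) for a single variable $X$ with a fully observed covariate $W$ and a MAR mechanism $p(R=1\mid W)$: weighting the observed cases by the inverse propensity recovers the target-law mean, while the complete-case mean is biased.

```r
set.seed(1)
n <- 1e5
W <- rnorm(n)                           # fully observed covariate
X <- rnorm(n, mean = 1 + W)             # target-law mean of X is 1
pR <- plogis(0.5 + W)                   # propensity p(R = 1 | W): a MAR mechanism
R <- rbinom(n, 1, pR)                   # missingness indicator
Xstar <- ifelse(R == 1, X, NA)          # observed proxy

naive <- mean(Xstar, na.rm = TRUE)      # complete-case mean: biased upward here
ipw <- mean(ifelse(R == 1, X / pR, 0))  # inverse-probability-weighted mean: close to 1
```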

We can create a graph object based on the mDAG above and visualize it using the code below. In fact, the mDAG shown was generated with this code.

library(flexMissing)

G2 <- make.graph(obs_variables = c(),
                 missing_variables = c('X1','X2','X3','X4','X5','X6'),
                 missing_indicators = c('R1','R2','R3','R4','R5','R6'),
                 di_edges = list(c('X1','X2'),c('X2','X3'),c('X3','X4'),c('X4','X5'),c('X5','X6'),
                                 c('R6','R5'),c('R6','R3'),c('R5','R3'),c('R4','R3'),c('R3','R2'),c('R2','R1'),
                                 c('X1','R3'),
                                 c('X3','R4'),c('X3','R5'),c('X3','R6'),
                                 c('X4','R1'),c('X4','R6'),
                                 c('X5','R2'),
                                 c('X6','R5'))
)

# check G2
plot(G2, vertex.size = 15, graph.main='G2')

3 Identification

With the graph object created above, we can check whether the target law is identified using the f.ID_algorithm function. We can also check whether the full law is identified by setting the argument fulllaw = T. Full law ID is still under development.

ID2 <- f.ID_algorithm(G2, fulllaw = T)
## The target law is identified in the given mDAG.

## Full law identification for R3 failed.

## Full law identification for R5 failed.

## The full law is NOT identified in the given mDAG.

We can visualize the ID result using the plot function.

plot(ID2, vertex.size = 15)

## The edges colored in skyblue are pruned edges for full law ID.
## The nodes colored in skyblue are non-ID nodes for full law.

## The edges colored in tomato are pruned edges for target law ID.
## The nodes colored in tomato are non-ID nodes for target law.

The tree structure provides a summary of how identification of each propensity score is achieved. For example, R1 has no children in the tree, meaning that its propensity score is identified using an ordinary independence constraint: $p(R_1|X_4,R_2) = p(R_1|X_4,R_2,R_4=1)$. In comparison, R3 has children R2 and R1 in the tree, meaning that its propensity score is identified by intervening on $R_1$ and $R_2$ and relying on the independence constraint in the post-intervention distribution: $p(R_3|X_1,R_4,R_5,R_6) = p(R_3|X_1,R_4,R_5,R_6; do(R_1=1,R_2=1))$.
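To make the ordinary independence constraint concrete, here is a base-R sketch (not the package's internal code, and with a simplified stand-in for the $R_2$ mechanism): since $R_1 \perp R_4 \mid X_4, R_2$, fitting the logistic regression for $R_1$ only on rows with $R_4 = 1$, where $X_4$ is observed, recovers the propensity $p(R_1\mid X_4,R_2)$.

```r
set.seed(7)
n <- 2e4
X3 <- rnorm(n)
X4 <- rnorm(n, 1 - 0.5 * X3)
R4 <- rbinom(n, 1, plogis(1 + X3))              # X4 is observed only when R4 = 1
R2 <- rbinom(n, 1, 0.8)                         # simplified stand-in for R2's mechanism
R1 <- rbinom(n, 1, plogis(0 + 1 * X4 + 1 * R2)) # true mechanism: p(R1 | X4, R2)
d <- data.frame(R1, R2, R4, X4star = ifelse(R4 == 1, X4, NA))

# p(R1 | X4, R2) = p(R1 | X4, R2, R4 = 1), so fit on rows with R4 = 1 only:
fit <- glm(R1 ~ X4star + R2, family = binomial, data = d, subset = R4 == 1)
round(coef(fit), 2)  # approximately the true values (0, 1, 1)
```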

A summary of the identification result can be obtained using the f.summarize_ID function. We will explain pre-selection, selection, selection_r, and selection_x shortly in the estimation section.

f.summarize_ID(ID2)
## 
## === Target Law ID Result ===
## 
## 
## 
## Missing_Indicators   Propensitys              ID_Status   Intervened_On   Preselection         Selection    Selection_r   Selection_x 
## -------------------  -----------------------  ----------  --------------  -------------------  -----------  ------------  ------------
## R1                   p(R1 | X4, R2)           TRUE        -               R4                   R4           -             R4          
## R2                   p(R2 | X5, R3)           TRUE        -               R5                   R5           -             R5          
## R3                   p(R3 | X1, R4, R5, R6)   TRUE        R1, R2          R1, R4, R5           R1, R4, R5   R4, R5        R1          
## R5                   p(R5 | X3, X6, R6)       TRUE        R1, R3          R3, R6, R4, R1, R4   R3, R6       R6            R3, R6      
## R6                   p(R6 | X3, X4)           TRUE        R1, R3          R3, R4, R4, R1, R4   R3, R4       -             R3, R4      
## R4                   p(R4 | X3)               TRUE        R2, R3          R3, R5, R1, R5       R3           -             R3          
## 
## 
## === Full Law ID Result ===
## 
## 
## 
## Missing_Indicators   Propensitys              ID_Status   Intervened_On   Preselection         Selection   Selection_r   Selection_x 
## -------------------  -----------------------  ----------  --------------  -------------------  ----------  ------------  ------------
## R1                   p(R1 | X4, R2)           TRUE        -               R4                   R4          -             R4          
## R2                   p(R2 | X5, R3)           TRUE        -               R5                   R5          -             R5          
## R3                   p(R3 | X1, R4, R5, R6)   FALSE       -               R1                   R1          -             R1          
## R5                   p(R5 | X3, X6, R6)       FALSE       R1, R3          R3, R6, R4, R1, R4   R3, R6      R6            R3, R6      
## R6                   p(R6 | X3, X4)           TRUE        R1, R3          R3, R4, R4, R1, R4   R3, R4      -             R3, R4      
## R4                   p(R4 | X3)               TRUE        R2, R3          R3, R5, R1, R5       R3          -             R3

4 Estimation

4.1 Estimation of propensity scores

Let’s first generate a data set according to the mDAG G2 defined above. We will generate n=5000 samples.

library(dplyr)
G2.dgp <- function(n,
                   parX1 = c(1,1),
                   parX2 = c(-1,1,1),
                   parX3 = c(1,1,1),
                   parX4 = c(1,-0.5,1),
                   parX5 = c(1,3,1),
                   parX6 = c(-1,1,1),
                   parR1 = c(0,1,1),
                   parR2 = c(-2,1,1),
                   parR3 = c(0,1,1,1,1),
                   parR4 = c(1,1),
                   parR5 = c(1,1,1,1),
                   parR6 = c(-2,1,1)
){

  # generate missing variables
  X1 <- rnorm(n, parX1[1], parX1[2])
  X2 <- rnorm(n, parX2[1] + parX2[2]*X1, parX2[3])
  X3 <- rnorm(n, parX3[1] + parX3[2]*X2, parX3[3])
  X4 <- rnorm(n, parX4[1] + parX4[2]*X3, parX4[3])
  X5 <- rnorm(n, parX5[1] + parX5[2]*X4, parX5[3])
  X6 <- rnorm(n, parX6[1] + parX6[2]*X5, parX6[3])

  # generate missing indicators
  R6 <- rbinom(n, 1, plogis(parR6[1] + parR6[2]*X3 + parR6[3]*X4))
  R5 <- rbinom(n, 1, plogis(parR5[1] + parR5[2]*X3+ parR5[3]*X6 + parR5[4]*R6))
  R4 <- rbinom(n, 1, plogis(parR4[1] + parR4[2]*X3))
  R3 <- rbinom(n, 1, plogis(parR3[1] + parR3[2]*X1+ parR3[3]*R5 + parR3[4]*R6+ parR3[5]*R4))
  R2 <- rbinom(n, 1, plogis(parR2[1] + parR2[2]*X5 + parR2[3]*R3))
  R1 <- rbinom(n, 1, plogis(parR1[1] + parR1[2]*X4 + parR1[3]*R2))

  # generate data
  data_no_missing <- data.frame(X1=X1, X2=X2, X3=X3, X4=X4, X5=X5,X6=X6, R1=R1, R2=R2, R3=R3, R4=R4, R5=R5, R6=R6)

  data <- data_no_missing %>%
    mutate(
      X1 = if_else(R1 == 0, NA_real_, X1),
      X2 = if_else(R2 == 0, NA_real_, X2),
      X3 = if_else(R3 == 0, NA_real_, X3),
      X4 = if_else(R4 == 0, NA_real_, X4),
      X5 = if_else(R5 == 0, NA_real_, X5),
      X6 = if_else(R6 == 0, NA_real_, X6)
    )

  missing_prop <- apply(data, 2, FUN=function(x) sum(is.na(x))/n)

  return(list(data = data, data_no_missing = data_no_missing, missing_prop = missing_prop))

}

set.seed(7)

G2.dgp.output <- G2.dgp(5000)
data <- G2.dgp.output$data
data_no_missing <- G2.dgp.output$data_no_missing

Once identification is achieved, we can proceed to estimating the propensity scores. By default they are estimated with simple logistic regressions: the formula for the propensity score of $R_i$ is a main-terms regression on all of its parents in the mDAG. We can also specify a custom formula for each propensity score by providing a list of formulas to the argument formula.R.
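For illustration, such a list might look like the following (the naming convention and format here are assumptions, not taken from the package documentation):

```r
# Hypothetical custom formulas for formula.R; names and format are assumptions
custom_formulas <- list(
  R3 = R3 ~ X1 + R6,  # drop R4 and R5, which selection_r fixes at 1
  R6 = R6 ~ X3 * X4   # allow an X3-X4 interaction
)
```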

propensitys <- f.propensity(graph=G2,
                         data=data,
                         ID=ID2,
                         law='target', # 'target' or 'full'
                         link.R="logit",
                         formula.R=NULL,
                         vocal=T)
## Estimating propensity for R1.

## Done with propensity estimations.

## Estimating propensity for R2.

## Done with propensity estimations.

## Estimating propensity for R3.

## Done with propensity estimations.

## Estimating propensity for R5.

## Done with propensity estimations.

## Estimating propensity for R6.

## Done with propensity estimations.

## Estimating propensity for R4.

## Done with propensity estimations.

Let’s check what we have in propensitys, using $R_3$ as an example.

This is the fitted regression for $R_3$ given its parents. The coefficients of $R_4$ and $R_5$ are NA because selection_r for $R_3$ is $\{R_4, R_5\}$: following the identification procedure depicted by the tree, we can only identify $p(R_3\mid X_1,R_4=1,R_5=1,R_6)$. preselection contains the missingness indicators that define the observations used to fit the regression. For $R_3$, the preselection is $R_1, R_4, R_5$, meaning that only observations with $R_1=1$, $R_4=1$, and $R_5=1$ are used to fit the regression for $R_3$.

summary(propensitys[['R3']]$Rk_fit)
## 
## Call:
## glm(formula = as.formula(formula.R), family = binomial(link.R), 
##     data = data.obs, weights = 1/weights[fit_rows])
## 
## Coefficients: (2 not defined because of singularities)
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.27397    0.09435  34.699  < 2e-16 ***
## X1           0.78922    0.08198   9.627  < 2e-16 ***
## R4                NA         NA      NA       NA    
## R5                NA         NA      NA       NA    
## R6          -0.44289    0.15475  -2.862  0.00421 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1860.8  on 2285  degrees of freedom
## Residual deviance: 1760.7  on 2283  degrees of freedom
##   (2714 observations deleted due to missingness)
## AIC: 1715.7
## 
## Number of Fisher Scoring iterations: 6

The Rk_pred contains the predicted propensity scores for each observation in the data. We cannot make predictions for every observation, so some predicted values are NA. selection contains the missingness indicators that define the observations we can make predictions for. For $R_3$, the selection is $\{R_1, R_4, R_5\}$, meaning that we can only make predictions for observations with $R_1=1$, $R_4=1$, and $R_5=1$.

head(propensitys[['R3']]$Rk_pred)
## [1] 0.9956163 0.9576566 0.9711185        NA        NA 0.9649595

4.2 Estimation of target law parameters

With the estimated propensity scores, we can compute inverse probability weights and estimate any parameter defined as a functional of the target law. For example, we can estimate the means of $X_3$ and $X_4$, and their product, using the f.Mestimation function below.

est_FUN <- function(data, theta){
  with(data, c(X3-theta[1], X4-theta[2], theta[3] - theta[1]*theta[2]))
}

variables <- list(c('X3'), c('X4'), NULL) # the missing variables in each estimating equation

est <- f.Mestimation(data=data,
                      graph=G2,
                      propensitys=propensitys,
                      ID=ID2,
                      law='target',
                      est_FUN=est_FUN,
                      variables=variables,
                      start=rep(0,3),
                      vocal=T)

# the estimates
cat("Estimates: ", est$results@estimates, "\n")

# the truth
cat("Truth: ", mean(data_no_missing$X3), mean(data_no_missing$X4), mean(data_no_missing$X3)*mean(data_no_missing$X4), "\n")

If we want more flexibility, we can also use the inverse probability weights directly to perform any estimation task. For example, we can estimate the mean of $X_3$ as follows. The estimate could differ from the one above due to implementation differences: the one above solves estimating equations via the geex package, while the one below directly computes a weighted mean. The two are asymptotically equivalent.

library(tidyr)
clique <- find_clique(graph=G2,
                      data=data,
                      vocal=F,
                      propensitys=propensitys,
                      est_R='R3', # the missingness indicator corresponding to the missing variables in the estimation task
                      non_ID_nodes=c())

cat("Estimated mean of X3: ", mean(replace_na(clique$magic_weight*data$X3,0)), "\n")
## Estimated mean of X3:  1.039139
cat("Truth: ", mean(data_no_missing$X3), "\n")
## Truth:  1.028805

We can also perform regressions, which become weighted regressions under missing data. For example, we can regress $X_6$ on $X_5$ as follows.

clique <- find_clique(graph=G2,
                      data=data,
                      vocal=F,
                      propensitys=propensitys,
                      est_R=c('R5','R6'), # the missingness indicator corresponding to the missing variables in the estimation task
                      non_ID_nodes=c())

fit_rows <- clique$fit_rows # the rows we can use for fitting the regression
magic_weight <- clique$magic_weight # the inverse probability weights

weighted_regression <- lm(X6 ~ X5,
                          data = data[fit_rows, c('X5','X6')],
                          weights = magic_weight[fit_rows])

summary(weighted_regression)
## 
## Call:
## lm(formula = X6 ~ X5, data = data[fit_rows, c("X5", "X6")], weights = magic_weight[fit_rows])
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7977 -0.8934  0.0002  0.9076  6.8281 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.94199    0.02819  -33.41   <2e-16 ***
## X5           0.99433    0.00633  157.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.454 on 1890 degrees of freedom
## Multiple R-squared:  0.9289, Adjusted R-squared:  0.9288 
## F-statistic: 2.468e+04 on 1 and 1890 DF,  p-value: < 2.2e-16
# Truth
cat("True coefficients are -1 and 1. Quite close.\n")
## True coefficients are -1 and 1. Quite close.

The variance estimates output by either the summary or the f.Mestimation function are not correct, as they fail to account for the uncertainty in estimating the propensity scores. Correct inference relies on the bootstrap.
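A nonparametric bootstrap refits the propensity model inside every resample, so its estimation uncertainty is propagated into the standard error. Below is a minimal base-R sketch for an IPW mean in a toy one-variable setting (the flexMissing workflow would rerun f.propensity and f.Mestimation on each resample in the same way):

```r
set.seed(7)
n <- 2000
W <- rnorm(n)                       # fully observed covariate
R <- rbinom(n, 1, plogis(1 + W))    # missingness indicator
d <- data.frame(W = W, R = R, X = ifelse(R == 1, rnorm(n, 1 + W), NA))

# IPW mean of X, refitting the propensity model on the supplied data
ipw_mean <- function(dat) {
  ps <- predict(glm(R ~ W, family = binomial, data = dat), type = "response")
  mean(ifelse(dat$R == 1, dat$X / ps, 0))
}

# resample rows, re-estimate everything (propensity model included) each time
B <- 200
boot_est <- replicate(B, ipw_mean(d[sample(n, replace = TRUE), ]))
c(estimate = ipw_mean(d), boot_se = sd(boot_est))
```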
