Commit 91bc8f9: add g-computation

9 files changed, +3169 -0 lines changed

---
title: "The Parametric G-Formula"
output: html_document
---

```{r setup}
library(tidyverse)
library(haven)
library(broom)
library(cidata)
```

# Your Turn 1

For the parametric G-formula, we'll use a single model to estimate the causal effect of `qsmk` on `wt82_71`, including all covariates much as we would in an ordinary regression model. However, instead of interpreting the coefficients, we'll calculate the estimate by predicting on cloned data sets.

First, let's fit the model.

1. Use `lm()`. We'll also create an interaction term with `smokeintensity`.
2. Save the model as `standardized_model`.

```{r}
_______ ___ _______(
  wt82_71 ~ _______ + I(_______ * smokeintensity) + smokeintensity +
    I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
    I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
  data = nhefs_complete
)
```
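
One way the blanks might be filled in, as a sketch rather than an official answer key. It assumes the model is saved as `standardized_model`, as the instructions above suggest.

```{r}
# fit the outcome model with qsmk, its interaction with smokeintensity,
# and the remaining covariates, then save it for prediction later
standardized_model <- lm(
  wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity +
    I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
    I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
  data = nhefs_complete
)
```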

# Your Turn 2

Now that we've fit a model, we need to clone our data set. To do this, we'll simply mutate it so that in one set, all participants have `qsmk` set to 0 and in another, all participants have `qsmk` set to 1.

1. Create the cloned data sets, called `kept_smoking` and `quit_smoking`.
2. For both data sets, use `standardized_model` and `augment()` to get the predicted values. Use the `newdata` argument in `augment()` with the relevant cloned data set. Then, select only the fitted value. Rename `.fitted` to either `kept_smoking` or `quit_smoking` (use the pattern `select(new_name = old_name)`).
3. Save the predicted data sets as `predicted_kept_smoking` and `predicted_quit_smoking`.

```{r}
_______ <- nhefs_complete %>%
  _______

_______ <- nhefs_complete %>%
  _______

predicted_kept_smoking <- standardized_model %>%
  _______(newdata = _______) %>%
  _______

predicted_quit_smoking <- standardized_model %>%
  _______(newdata = _______) %>%
  _______
```
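
A possible completion of these blanks, assuming `qsmk` is coded 0/1 with 1 meaning quitting (a sketch, not the only valid answer):

```{r}
# clone the data: everyone kept smoking (qsmk = 0) vs. everyone quit (qsmk = 1)
kept_smoking <- nhefs_complete %>%
  mutate(qsmk = 0)

quit_smoking <- nhefs_complete %>%
  mutate(qsmk = 1)

# predict the outcome on each cloned data set and keep only the fitted values
predicted_kept_smoking <- standardized_model %>%
  augment(newdata = kept_smoking) %>%
  select(kept_smoking = .fitted)

predicted_quit_smoking <- standardized_model %>%
  augment(newdata = quit_smoking) %>%
  select(quit_smoking = .fitted)
```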

# Your Turn 3

Finally, we'll calculate the difference between the mean predicted values.

1. Bind `predicted_kept_smoking` and `predicted_quit_smoking` using `bind_cols()`.
2. Summarize the predicted values to create three new variables: `mean_quit_smoking`, `mean_kept_smoking`, and `difference`. The first two should be the means of `quit_smoking` and `kept_smoking`. `difference` should be `mean_quit_smoking` minus `mean_kept_smoking`.

```{r}
_______ %>%
  _______(
    mean_quit_smoking = _______,
    mean_kept_smoking = _______,
    difference = _______
  )
```
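
A possible completion following the steps above (a sketch; `summarize()` would work equally well):

```{r}
# bind the two sets of predictions and summarize the means and their difference
bind_cols(predicted_kept_smoking, predicted_quit_smoking) %>%
  summarise(
    mean_quit_smoking = mean(quit_smoking),
    mean_kept_smoking = mean(kept_smoking),
    difference = mean_quit_smoking - mean_kept_smoking
  )
```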

That's it! `difference` is our effect estimate. To get confidence intervals, however, we would need to use the bootstrap method. See the link below for a full example.

## Stretch goal: Bootstrapped intervals

As with propensity score-based models, we need to do a little more work to get correct standard errors and confidence intervals. In this stretch goal, use rsample to bootstrap the estimates we got from the G-computation model.

Remember, you need to bootstrap the entire modeling process, including fitting the regression model, cloning the data sets, and calculating the effects.

```{r}
library(rsample)


```
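
If you want a starting point, here is one possible shape for that bootstrap. It is a minimal sketch under our own assumptions: the helper name `fit_gcomp` and the choice of 500 resamples are ours, not the worksheet's.

```{r}
# one bootstrap replicate: refit the model, re-clone the data, re-estimate the effect
fit_gcomp <- function(split, ...) {
  boot_data <- analysis(split)

  model <- lm(
    wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity +
      I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
      I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
    data = boot_data
  )

  tibble(
    term = "difference",
    estimate = mean(predict(model, newdata = mutate(boot_data, qsmk = 1))) -
      mean(predict(model, newdata = mutate(boot_data, qsmk = 0)))
  )
}

boots <- bootstraps(nhefs_complete, times = 500, apparent = TRUE) %>%
  mutate(results = map(splits, fit_gcomp))

# percentile bootstrap confidence interval for the difference
int_pctl(boots, results)
```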

# Your Turn 4

1. Take a look at how many participants in `nhefs` were lost to follow-up, indicated by the `censored` variable in this data set. You don't need to change anything in this code.

```{r}
nhefs_censored <- nhefs %>%
  drop_na(
    qsmk, sex, race, age, school, smokeintensity, smokeyrs, exercise,
    active, wt71
  )

nhefs_censored %>%
  count(censored = as.factor(censored)) %>%
  ggplot(aes(censored, n)) +
  geom_col()
```

2. Create a logistic regression model that predicts whether or not someone is censored.

```{r}
cens_model <- ___(
  ______ ~ qsmk + sex + race + age + I(age^2) + education +
    smokeintensity + I(smokeintensity^2) +
    smokeyrs + I(smokeyrs^2) + exercise + active +
    wt71 + I(wt71^2),
  data = nhefs_censored,
  family = binomial()
)
```
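
A possible completion, assuming `censored` is the 0/1 indicator of loss to follow-up (a sketch):

```{r}
# model the probability of being censored (lost to follow-up)
cens_model <- glm(
  censored ~ qsmk + sex + race + age + I(age^2) + education +
    smokeintensity + I(smokeintensity^2) +
    smokeyrs + I(smokeyrs^2) + exercise + active +
    wt71 + I(wt71^2),
  data = nhefs_censored,
  family = binomial()
)
```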

# Your Turn 5

1. Use the logistic model you just fit to create inverse probability of censoring weights.
2. Calculate the weights using `.fitted`.
3. Join `cens` to `nhefs_censored` so that you have the weights in your data set.
4. Fit a linear regression model of `wt82_71` weighted by `cens_wts`. We'll use this model as the basis for our G-computation.

```{r}
cens <- _______ %>%
  augment(type.predict = "response", data = nhefs_censored) %>%
  mutate(cens_wts = 1 / ifelse(censored == 0, 1 - ______, 1)) %>%
  select(id, cens_wts)

# join all the weights data from above
nhefs_censored_wts <- _______ %>%
  left_join(_____, by = "id")

cens_model <- lm(
  ______ ~ qsmk + I(qsmk * smokeintensity) + smokeintensity +
    I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
    I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
  data = nhefs_censored_wts,
  weights = ______
)
```
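
A possible completion following the steps above (a sketch). Note that, as in the worksheet, the weighted outcome model reuses the name `cens_model`, overwriting the logistic censoring model after its predictions have been used.

```{r}
# inverse probability of censoring weights:
# 1 / P(remaining uncensored) for observed participants, 1 otherwise
cens <- cens_model %>%
  augment(type.predict = "response", data = nhefs_censored) %>%
  mutate(cens_wts = 1 / ifelse(censored == 0, 1 - .fitted, 1)) %>%
  select(id, cens_wts)

# join all the weights data from above
nhefs_censored_wts <- nhefs_censored %>%
  left_join(cens, by = "id")

# outcome model for wt82_71, weighted by the censoring weights
cens_model <- lm(
  wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity +
    I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
    I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
  data = nhefs_censored_wts,
  weights = cens_wts
)
```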

# Your Turn 6

1. Next, we would usually need to clone our data sets, but we can reuse `kept_smoking` and `quit_smoking` from the first section.
2. Use the outcome model, `cens_model`, to make predictions for `kept_smoking` and `quit_smoking`.
3. Calculate the difference between the mean values of `kept_smoking` and `quit_smoking`.

```{r}
predicted_kept_smoking <- _______ %>%
  augment(newdata = _______) %>%
  select(kept_smoking = .fitted)

predicted_quit_smoking <- _______ %>%
  augment(newdata = _______) %>%
  select(quit_smoking = .fitted)

# summarize the mean difference
bind_cols(predicted_kept_smoking, predicted_quit_smoking) %>%
  summarise(

  )
```
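
A possible completion, mirroring Your Turn 3 but with the weighted outcome model (a sketch):

```{r}
# predict from the weighted outcome model on the cloned data sets
predicted_kept_smoking <- cens_model %>%
  augment(newdata = kept_smoking) %>%
  select(kept_smoking = .fitted)

predicted_quit_smoking <- cens_model %>%
  augment(newdata = quit_smoking) %>%
  select(quit_smoking = .fitted)

# summarize the mean difference
bind_cols(predicted_kept_smoking, predicted_quit_smoking) %>%
  summarise(
    mean_quit_smoking = mean(quit_smoking),
    mean_kept_smoking = mean(kept_smoking),
    difference = mean_quit_smoking - mean_kept_smoking
  )
```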

## Stretch goal: Bootstrapped intervals

Finish early? Try bootstrapping the G-computation model with censoring weights.

Remember, you need to bootstrap the entire modeling process, including fitting both regression models, cloning the data sets, and calculating the effects.

```{r}

```
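
One possible shape for this bootstrap, again a sketch under our own assumptions (the helper name `fit_gcomp_cens` and 500 resamples are ours), refitting both models inside each replicate:

```{r}
# one replicate: refit the censoring model, recompute the weights,
# refit the weighted outcome model, and re-estimate the difference
fit_gcomp_cens <- function(split, ...) {
  boot_data <- analysis(split)

  cens_fit <- glm(
    censored ~ qsmk + sex + race + age + I(age^2) + education +
      smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) +
      exercise + active + wt71 + I(wt71^2),
    data = boot_data, family = binomial()
  )

  boot_data <- cens_fit %>%
    augment(type.predict = "response", data = boot_data) %>%
    mutate(cens_wts = 1 / ifelse(censored == 0, 1 - .fitted, 1))

  outcome_fit <- lm(
    wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity +
      I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
      I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
    data = boot_data, weights = cens_wts
  )

  tibble(
    term = "difference",
    estimate = mean(predict(outcome_fit, newdata = mutate(boot_data, qsmk = 1))) -
      mean(predict(outcome_fit, newdata = mutate(boot_data, qsmk = 0)))
  )
}

boots_cens <- bootstraps(nhefs_censored, times = 500, apparent = TRUE) %>%
  mutate(results = map(splits, fit_gcomp_cens))

int_pctl(boots_cens, results)
```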

***

# Takeaways

* To fit the parametric G-formula, fit a standardized model with all covariates. Then, use cloned data sets with values set to each level of the exposure you want to study.
* Use the model to predict the values for each level of the exposure and compute the effect estimate you want.
* If loss to follow-up is potentially related to your study question, inverse probability of censoring weights can help mitigate the bias.
