Commit fab83f8

Merge pull request #4 from malcolmbarrett/add_gcomp
Add g-computation
2 parents 43dd84b + 95ecbbe commit fab83f8

9 files changed: +3168 -0 lines changed
Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
---
title: "The Parametric G-Formula"
output: html_document
---

```{r setup}
library(tidyverse)
library(broom)
library(cidata)
```

# Your Turn 1

For the parametric G-formula, we'll fit a single model for the effect of `qsmk` on `wt82_71` that includes all covariates, much as we would fit an ordinary regression model. However, instead of interpreting the coefficients, we'll calculate the effect estimate by predicting on cloned data sets.

First, let's fit the model.

1. Use `lm()`. We'll also create an interaction term with `smokeintensity`.
2. Save the model as `standardized_model`.

```{r}
_______ ___ _______(
  wt82_71 ~ _______ + I(_______ * smokeintensity) + smokeintensity +
    I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
    I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
  data = nhefs_complete
)
```
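
If you want to check your thinking, here's a minimal sketch of how the blanks could be filled in. It mirrors the formula used in the censoring-weighted model later in this document, so treat it as one possible answer rather than the official solution.

```{r}
# one possible completion of the chunk above (a sketch, not the official answer)
standardized_model <- lm(
  wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity +
    I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
    I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
  data = nhefs_complete
)
```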

# Your Turn 2

Now that we've fit a model, we need to clone our data set. To do this, we'll simply mutate it so that in one data set all participants have `qsmk` set to 0 and in the other all participants have `qsmk` set to 1.

1. Create the cloned data sets, called `kept_smoking` and `quit_smoking`.
2. For both data sets, use `standardized_model` and `augment()` to get the predicted values. Use the `newdata` argument in `augment()` with the relevant cloned data set. Then, select only the fitted value. Rename `.fitted` to either `kept_smoking` or `quit_smoking` (use the pattern `select(new_name = old_name)`).
3. Save the predicted data sets as `predicted_kept_smoking` and `predicted_quit_smoking`.

```{r}
_______ <- nhefs_complete %>%
  _______

_______ <- nhefs_complete %>%
  _______

predicted_kept_smoking <- standardized_model %>%
  _______(newdata = _______) %>%
  _______

predicted_quit_smoking <- standardized_model %>%
  _______(newdata = _______) %>%
  _______
```
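
Here's a sketch of one way to complete the chunk above: clone the data with `mutate()`, then predict on each clone with `augment()`. Treat it as one possible answer.

```{r}
# one possible completion (a sketch): clone the data, then predict on each clone
kept_smoking <- nhefs_complete %>%
  mutate(qsmk = 0)

quit_smoking <- nhefs_complete %>%
  mutate(qsmk = 1)

predicted_kept_smoking <- standardized_model %>%
  augment(newdata = kept_smoking) %>%
  select(kept_smoking = .fitted)

predicted_quit_smoking <- standardized_model %>%
  augment(newdata = quit_smoking) %>%
  select(quit_smoking = .fitted)
```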

# Your Turn 3

Finally, we'll compute the difference between the mean predicted values under each exposure.

1. Bind `predicted_kept_smoking` and `predicted_quit_smoking` using `bind_cols()`.
2. Summarize the predicted values to create three new variables: `mean_quit_smoking`, `mean_kept_smoking`, and `difference`. The first two should be the means of `quit_smoking` and `kept_smoking`. `difference` should be `mean_quit_smoking` minus `mean_kept_smoking`.

```{r}
_______ %>%
  _______(
    mean_quit_smoking = _______,
    mean_kept_smoking = _______,
    difference = _______
  )
```
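
One possible way to fill in the summary, sketched under the assumption that you created the predicted data sets as above:

```{r}
# one possible completion (a sketch): means under each exposure and their difference
bind_cols(predicted_kept_smoking, predicted_quit_smoking) %>%
  summarise(
    mean_quit_smoking = mean(quit_smoking),
    mean_kept_smoking = mean(kept_smoking),
    difference = mean_quit_smoking - mean_kept_smoking
  )
```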

That's it! `difference` is our effect estimate. To get confidence intervals, however, we need to use the bootstrap. See the stretch goal below for a full example.

## Stretch goal: Bootstrapped intervals

As with propensity-based models, we need to do a little more work to get correct standard errors and confidence intervals. In this stretch goal, use rsample to bootstrap the estimates we got from the G-computation model.

Remember, you need to bootstrap the entire modeling process, including fitting the regression model, cloning the data sets, and calculating the effects.

```{r}
library(rsample)


```
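
Here is one way this could look, assuming the earlier chunks were completed as sketched above. `fit_gcomp` and `gcomp_boots` are illustrative names, and the percentile interval from `int_pctl()` is just one of the interval types rsample offers.

```{r}
# a sketch of bootstrapping the entire g-computation pipeline;
# `fit_gcomp` is an illustrative helper name, not part of the workshop code
fit_gcomp <- function(split, ...) {
  .df <- analysis(split)

  # 1. refit the outcome model on the bootstrapped sample
  standardized_model <- lm(
    wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity +
      I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
      I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
    data = .df
  )

  # 2. clone the bootstrapped sample and predict under each exposure level
  mean_kept <- mean(augment(standardized_model, newdata = mutate(.df, qsmk = 0))$.fitted)
  mean_quit <- mean(augment(standardized_model, newdata = mutate(.df, qsmk = 1))$.fitted)

  # 3. return the effect estimate in the format int_pctl() expects
  tibble(term = "difference", estimate = mean_quit - mean_kept)
}

gcomp_boots <- bootstraps(nhefs_complete, times = 1000) %>%
  mutate(results = map(splits, fit_gcomp))

# percentile-based 95% confidence interval
int_pctl(gcomp_boots, results)
```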

# Your Turn 4

1. Take a look at how many participants were lost to follow-up in `nhefs`, called `censored` in this data set. You don't need to change anything in this code.

```{r}
nhefs_censored <- nhefs %>%
  drop_na(
    qsmk, sex, race, age, school, smokeintensity, smokeyrs, exercise,
    active, wt71
  )

nhefs_censored %>%
  count(censored = as.factor(censored)) %>%
  ggplot(aes(censored, n)) +
  geom_col()
```

2. Create a logistic regression model that predicts whether or not someone is censored.

```{r}
cens_model <- ___(
  ______ ~ qsmk + sex + race + age + I(age^2) + education +
    smokeintensity + I(smokeintensity^2) +
    smokeyrs + I(smokeyrs^2) + exercise + active +
    wt71 + I(wt71^2),
  data = nhefs_censored,
  family = binomial()
)
```
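
A sketch of one possible completion, using `glm()` with `censored` as the outcome, as the blanked chunk suggests:

```{r}
# one possible completion (a sketch): model the probability of being censored
cens_model <- glm(
  censored ~ qsmk + sex + race + age + I(age^2) + education +
    smokeintensity + I(smokeintensity^2) +
    smokeyrs + I(smokeyrs^2) + exercise + active +
    wt71 + I(wt71^2),
  data = nhefs_censored,
  family = binomial()
)
```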

# Your Turn 5

1. Use the logistic model you just fit to create inverse probability of censoring weights.
2. Calculate the weights using `.fitted`.
3. Join `cens` to `nhefs_censored` so that you have the weights in your data set.
4. Fit a linear regression model of `wt82_71` weighted by `cens_wts`. We'll use this model as the basis for our G-computation.

```{r}
cens <- _______ %>%
  augment(type.predict = "response", data = nhefs_censored) %>%
  mutate(cens_wts = 1 / ifelse(censored == 0, 1 - ______, 1)) %>%
  select(id, cens_wts)

# join all the weights data from above
nhefs_censored_wts <- _______ %>%
  left_join(_____, by = "id")

cens_model <- lm(
  ______ ~ qsmk + I(qsmk * smokeintensity) + smokeintensity +
    I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
    I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
  data = nhefs_censored_wts,
  weights = ______
)
```
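
Here's one way the blanks might be completed, assuming the censoring model from Your Turn 4 is called `cens_model`. Note that, as in the chunk above, the weighted outcome model reuses the name `cens_model`.

```{r}
# one possible completion (a sketch)
cens <- cens_model %>%
  augment(type.predict = "response", data = nhefs_censored) %>%
  mutate(cens_wts = 1 / ifelse(censored == 0, 1 - .fitted, 1)) %>%
  select(id, cens_wts)

# join the censoring weights to the data
nhefs_censored_wts <- nhefs_censored %>%
  left_join(cens, by = "id")

# weighted outcome model used as the basis for g-computation
cens_model <- lm(
  wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity +
    I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
    I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
  data = nhefs_censored_wts,
  weights = cens_wts
)
```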

# Your Turn 6

1. Next, we usually need to clone our data sets, but here we can reuse `kept_smoking` and `quit_smoking`, which we created in the first section.
2. Use the outcome model, `cens_model`, to make predictions for `kept_smoking` and `quit_smoking`.
3. Calculate the difference between the mean values of `kept_smoking` and `quit_smoking`.

```{r}
predicted_kept_smoking <- _______ %>%
  augment(newdata = _______) %>%
  select(kept_smoking = .fitted)

predicted_quit_smoking <- _______ %>%
  augment(newdata = _______) %>%
  select(quit_smoking = .fitted)

# summarize the mean difference
bind_cols(predicted_kept_smoking, predicted_quit_smoking) %>%
  summarise(

  )
```
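
A sketch of one possible completion, assuming the clones from Your Turn 2 and the weighted model saved as `cens_model` above:

```{r}
# one possible completion (a sketch), reusing the clones from Your Turn 2
predicted_kept_smoking <- cens_model %>%
  augment(newdata = kept_smoking) %>%
  select(kept_smoking = .fitted)

predicted_quit_smoking <- cens_model %>%
  augment(newdata = quit_smoking) %>%
  select(quit_smoking = .fitted)

bind_cols(predicted_kept_smoking, predicted_quit_smoking) %>%
  summarise(
    mean_quit_smoking = mean(quit_smoking),
    mean_kept_smoking = mean(kept_smoking),
    difference = mean_quit_smoking - mean_kept_smoking
  )
```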

## Stretch goal: Bootstrapped intervals

Finish early? Try bootstrapping the G-computation model with censoring weights.

Remember, you need to bootstrap the entire modeling process, including fitting both regression models, cloning the data sets, and calculating the effects.

```{r}

```
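
One way this could look, sketched under the same assumptions as the earlier bootstrap. `fit_gcomp_cens` and `gcomp_cens_boots` are illustrative names; the key difference is that both the censoring model and the weighted outcome model are refit on every resample.

```{r}
# a sketch of a bootstrap that refits both models on each resample;
# `fit_gcomp_cens` is an illustrative helper name
fit_gcomp_cens <- function(split, ...) {
  .df <- analysis(split)

  # refit the censoring model and recompute the weights on the resample
  cens_fit <- glm(
    censored ~ qsmk + sex + race + age + I(age^2) + education +
      smokeintensity + I(smokeintensity^2) +
      smokeyrs + I(smokeyrs^2) + exercise + active +
      wt71 + I(wt71^2),
    data = .df,
    family = binomial()
  )

  .df <- cens_fit %>%
    augment(type.predict = "response", data = .df) %>%
    mutate(cens_wts = 1 / ifelse(censored == 0, 1 - .fitted, 1)) %>%
    # drop broom's diagnostic columns so later augment() calls don't clash
    select(-starts_with("."))

  # refit the weighted outcome model
  outcome_model <- lm(
    wt82_71 ~ qsmk + I(qsmk * smokeintensity) + smokeintensity +
      I(smokeintensity^2) + sex + race + age + I(age^2) + education + smokeyrs +
      I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2),
    data = .df,
    weights = cens_wts
  )

  # clone, predict, and take the difference in means
  mean_kept <- mean(augment(outcome_model, newdata = mutate(.df, qsmk = 0))$.fitted)
  mean_quit <- mean(augment(outcome_model, newdata = mutate(.df, qsmk = 1))$.fitted)

  tibble(term = "difference", estimate = mean_quit - mean_kept)
}

gcomp_cens_boots <- bootstraps(nhefs_censored, times = 1000) %>%
  mutate(results = map(splits, fit_gcomp_cens))

int_pctl(gcomp_cens_boots, results)
```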

***

# Takeaways

* To fit the parametric G-formula, fit a standardized model with all covariates. Then, use cloned data sets with values set to each level of the exposure you want to study.
* Use the model to predict the values for that level of the exposure and compute the effect estimate you want.
* If loss to follow-up is potentially related to your study question, inverse probability of censoring weights can help mitigate the bias.
