
Commit 1e02e44

Merge pull request #56 from r-causal/g-comp-cont
2 parents 8aabf4b + 8ea4410 commit 1e02e44

File tree: 163 files changed (+10442, -588 lines)

File renamed without changes.
Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
---
title: "Continuous exposures and g-computation"
format: html
---

```{r}
#| label: setup
library(tidyverse)
library(broom)
library(touringplans)
library(splines)
```

For this set of exercises, we'll use g-computation to calculate a causal effect for continuous exposures.

In the touringplans data set, we have information about the posted waiting times for rides. We also have a limited amount of data on the observed, actual times. The question we'll consider is this: do posted wait times (`avg_spostmin`) for the Seven Dwarfs Mine Train at 8 am affect actual wait times (`avg_sactmin`) at 9 am? Here's our DAG:

```{r}
#| echo: false
#| message: false
#| warning: false
library(ggdag)

coord_dag <- list(
  x = c(wdw_ticket_season = -1, close = -1, weather_wdwhigh = -2, extra_magic_morning = 0, avg_spostmin = 1, avg_sactmin = 2),
  y = c(wdw_ticket_season = -1, close = 1, weather_wdwhigh = 0.25, extra_magic_morning = 0, avg_spostmin = 0, avg_sactmin = 0)
)

labels <- c(
  avg_sactmin = "Average actual wait",
  avg_spostmin = "Average posted wait",
  extra_magic_morning = "Extra Magic Morning",
  wdw_ticket_season = "Ticket Season",
  weather_wdwhigh = "Historic high temperature",
  close = "Time park closed"
)

wait_time_dag <- dagify(
  avg_sactmin ~ avg_spostmin + close + wdw_ticket_season + weather_wdwhigh + extra_magic_morning,
  avg_spostmin ~ weather_wdwhigh + close + wdw_ticket_season + extra_magic_morning,
  coords = coord_dag,
  labels = labels
)

wait_time_dag |>
  ggdag(use_labels = "label", text = FALSE) +
  theme_void() +
  scale_x_continuous(
    limits = c(-2.25, 2.25),
    breaks = c(-2, -1, 0, 1, 2),
    labels = c("\n(one year ago)", "\n(6 months ago)", "\n(3 months ago)", "8am-9am\n(Today)", "9am-10am\n(Today)")
  ) +
  theme(axis.text.x = element_text())
```

First, let's wrangle our data to address our question: do posted wait times at 8 am affect actual wait times at 9 am? We'll join the baseline data (all covariates and the posted wait time at 8 am) with the outcome (average actual wait time at 9 am). We also have a lot of missingness for `avg_sactmin`, so we'll drop unobserved values for now.

You don't need to update any code here, so just run this.

```{r}
eight <- seven_dwarfs_train_2018 |>
  filter(hour == 8) |>
  select(-avg_sactmin)

nine <- seven_dwarfs_train_2018 |>
  filter(hour == 9) |>
  select(date, avg_sactmin)

wait_times <- eight |>
  left_join(nine, by = "date") |>
  drop_na(avg_sactmin)
```

# Your Turn 1

For the parametric G-formula, we'll use a single model to fit a causal model of Posted Waiting Times (`avg_spostmin`) on Actual Waiting Times (`avg_sactmin`) where we include all covariates, much as we normally fit regression models. However, instead of interpreting the coefficients, we'll calculate the estimate by predicting on cloned data sets.

There are two additional differences in our model: we'll use a natural cubic spline on the exposure, `avg_spostmin`, using `ns()` from the splines package, and we'll include an interaction term between `avg_spostmin` and `extra_magic_morning`. These choices complicate the interpretation of the coefficients in normal regression but have virtually no downside in g-computation (as long as we have a reasonable sample size), because we still get an easily interpretable result.

First, let's fit the model.

1. Use `lm()` to create a model with the outcome, exposure, and confounders identified in the DAG.
2. Save the model as `standardized_model`.

```{r}
_______ ___ _______(
  avg_sactmin ~ ns(_______, df = 5) * extra_magic_morning + _______ + _______ + _______,
  data = wait_times
)
```
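
If you get stuck, here's a minimal sketch of one way the blanks might be filled in. Treat it as one possible answer rather than the official solution; the confounders come from the DAG above, and the spline and interaction match the instructions.

```{r}
# A possible completed model (sketch): natural cubic spline on the exposure,
# an interaction with Extra Magic Morning, and the DAG-identified confounders
standardized_model <- lm(
  avg_sactmin ~ ns(avg_spostmin, df = 5) * extra_magic_morning +
    close + wdw_ticket_season + weather_wdwhigh,
  data = wait_times
)
```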

# Your Turn 2

Now that we've fit a model, we need to clone our data set. To do this, we'll simply mutate it so that in one set, all participants have `avg_spostmin` set to 30 minutes and in another, all participants have `avg_spostmin` set to 60 minutes.

1. Create the cloned data sets, called `thirty` and `sixty`.
2. For both data sets, use `standardized_model` and `augment()` to get the predicted values. Use the `newdata` argument in `augment()` with the relevant cloned data set. Then, select only the fitted value. Rename `.fitted` to either `thirty_posted_minutes` or `sixty_posted_minutes` (use the pattern `select(new_name = old_name)`).
3. Save the predicted data sets as `predicted_thirty` and `predicted_sixty`.

```{r}
_______ <- wait_times |>
  _______

_______ <- wait_times |>
  _______

predicted_thirty <- standardized_model |>
  _______(newdata = _______) |>
  _______

predicted_sixty <- standardized_model |>
  _______(newdata = _______) |>
  _______
```
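
For reference, here is a sketch of how those steps could look once filled in, assuming the model above was fit on `wait_times` (again, one possible answer):

```{r}
# Clone the data at the two exposure levels, then predict with broom::augment()
thirty <- wait_times |>
  mutate(avg_spostmin = 30)

sixty <- wait_times |>
  mutate(avg_spostmin = 60)

predicted_thirty <- standardized_model |>
  augment(newdata = thirty) |>
  select(thirty_posted_minutes = .fitted)

predicted_sixty <- standardized_model |>
  augment(newdata = sixty) |>
  select(sixty_posted_minutes = .fitted)
```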

# Your Turn 3

Finally, we'll get the mean differences between the values.

1. Bind `predicted_thirty` and `predicted_sixty` using `bind_cols()`
2. Summarize the predicted values to create three new variables: `mean_thirty`, `mean_sixty`, and `difference`. The first two should be the means of `thirty_posted_minutes` and `sixty_posted_minutes`. `difference` should be `mean_sixty` minus `mean_thirty`.

```{r}
_______ |>
  _______(
    mean_thirty = _______,
    mean_sixty = _______,
    difference = _______
  )
```
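
A filled-in version might look something like this sketch, which mirrors the summary used in the stretch goal below:

```{r}
# Contrast the two cloned worlds: mean predicted wait at 60 vs. 30 posted minutes
bind_cols(predicted_thirty, predicted_sixty) |>
  summarize(
    mean_thirty = mean(thirty_posted_minutes),
    mean_sixty = mean(sixty_posted_minutes),
    difference = mean_sixty - mean_thirty
  )
```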

That's it! `difference` is our effect estimate, marginalized over the spline terms, interaction effects, and confounders.

## Stretch goal: Bootstrapped intervals

As with propensity-based models, we need to do a little more work to get correct standard errors and confidence intervals. In this stretch goal, use rsample to bootstrap the estimates we got from the g-computation model.

Remember, you need to bootstrap the entire modeling process, including the regression model, cloning the data sets, and calculating the effects.

```{r}
set.seed(1234)
library(rsample)

fit_gcomp <- function(split, ...) {
  .df <- analysis(split)

  # fit outcome model. remember to model using `.df` instead of `wait_times`


  # clone datasets. remember to clone `.df` instead of `wait_times`


  # predict actual wait time for each cloned dataset


  # calculate ATE
  bind_cols(predicted_thirty, predicted_sixty) |>
    summarize(
      mean_thirty = mean(thirty_posted_minutes),
      mean_sixty = mean(sixty_posted_minutes),
      difference = mean_sixty - mean_thirty
    ) |>
    # rsample expects a `term` and `estimate` column
    pivot_longer(everything(), names_to = "term", values_to = "estimate")
}

gcomp_results <- bootstraps(wait_times, 1000, apparent = TRUE) |>
  mutate(results = map(splits, ______))

# using bias-corrected confidence intervals
boot_estimate <- int_bca(_______, results, .fn = fit_gcomp)

boot_estimate
```
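
For reference, here is a sketch of what the completed bootstrap workflow might look like once the blanks are filled in. The `_sketch` names are ours, and the model and cloning steps follow the earlier sketches rather than an official answer key:

```{r}
# Bootstrap the entire g-computation pipeline: model, clone, predict, contrast
fit_gcomp_sketch <- function(split, ...) {
  .df <- analysis(split)

  # fit the outcome model on the resampled data
  boot_model <- lm(
    avg_sactmin ~ ns(avg_spostmin, df = 5) * extra_magic_morning +
      close + wdw_ticket_season + weather_wdwhigh,
    data = .df
  )

  # clone the resampled data at the two exposure levels
  thirty <- .df |> mutate(avg_spostmin = 30)
  sixty <- .df |> mutate(avg_spostmin = 60)

  # predict actual wait time for each cloned dataset
  predicted_thirty <- boot_model |>
    augment(newdata = thirty) |>
    select(thirty_posted_minutes = .fitted)

  predicted_sixty <- boot_model |>
    augment(newdata = sixty) |>
    select(sixty_posted_minutes = .fitted)

  # calculate the ATE and reshape for rsample
  bind_cols(predicted_thirty, predicted_sixty) |>
    summarize(
      mean_thirty = mean(thirty_posted_minutes),
      mean_sixty = mean(sixty_posted_minutes),
      difference = mean_sixty - mean_thirty
    ) |>
    pivot_longer(everything(), names_to = "term", values_to = "estimate")
}

gcomp_results_sketch <- bootstraps(wait_times, 1000, apparent = TRUE) |>
  mutate(results = map(splits, fit_gcomp_sketch))

boot_estimate_sketch <- int_bca(gcomp_results_sketch, results, .fn = fit_gcomp_sketch)
boot_estimate_sketch
```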

***

# Takeaways

* To fit the parametric G-formula, fit a standardized model with all covariates. Then, use cloned data sets with values set to each level of the exposure you want to study.
* Use the model to predict the values for that level of the exposure and compute the effect estimate you want.
File renamed without changes.
Lines changed: 12 additions & 113 deletions
@@ -1,121 +1,18 @@
---
-title: "The Parametric G-Formula"
+title: "Bonus: Selection bias and correcting for loss to follow-up"
format: html
---

```{r}
#| label: setup
library(tidyverse)
library(broom)
-library(touringplans)
library(propensity)
-
-seven_dwarfs <- seven_dwarfs_train_2018 |>
-  filter(hour == 9)
-```
-
-# Your Turn 1
-
-For the parametric G-formula, we'll use a single model to fit a causal model of Extra Magic Hours (`extra_magic_morning`) on Posted Waiting Times (`avg_spostmin`) where we include all covariates, much as we normally fit regression models. However, instead of interpreting the coefficients, we'll calculate the estimate by predicting on cloned data sets.
-
-First, let's fit the model.
-
-1.Use `lm()` to create a model with the outcome, exposure, and confounders.
-2. Save the model as `standardized_model`
-
-```{r}
-_______ ___ _______(
-  avg_spostmin ~ _______ + wdw_ticket_season + close + weather_wdwhigh,
-  data = seven_dwarfs
-)
-```
-
-
-# Your Turn 2
-
-Now that we've fit a model, we need to clone our data set. To do this, we'll simply mutate it so that in one set, all participants have `extra_magic_morning` set to 0 and in another, all participants have `extra_magic_morning` set to 1.
-
-1. Create the cloned data sets, called `yes` and `no`.
-2. For both data sets, use `standardized_model` and `augment()` to get the predicted values. Use the `newdata` argument in `augment()` with the relevant cloned data set. Then, select only the fitted value. Rename `.fitted` to either `yes_extra_hours` or `no_extra_hours` (use the pattern `select(new_name = old_name)`).
-3. Save the predicted data sets as`predicted_yes` and `predicted_no`.
-
-```{r}
-_______ <- seven_dwarfs |>
-  _______
-
-_______ <- seven_dwarfs |>
-  _______
-
-predicted_yes <- standardized_model |>
-  _______(newdata = _______) |>
-  _______
-
-predicted_no <- standardized_model |>
-  _______(newdata = _______) |>
-  _______
```

-# Your Turn 3
-
-Finally, we'll get the mean differences between the values.
-
-1. Bind `predicted_yes` and `predicted_no` using `bind_cols()`
-2. Summarize the predicted values to create three new variables: `mean_yes`, `mean_no`, and `difference`. The first two should be the means of `yes_extra_hours` and `no_extra_hours`. `difference` should be `mean_yes` minus `mean_no`.
-
-```{r}
-_______ |>
-  _______(
-    mean_yes = _______,
-    mean_no = _______,
-    difference = _______
-  )
-```
-
-That's it! `difference` is our effect estimate. To get confidence intervals, however, we would need to use the bootstrap method. See the link below for a full example.
-
-## Stretch goal: Boostrapped intervals
-
-Like propensity-based models, we need to do a little more work to get correct standard errors and confidence intervals. In this stretch goal, use rsample to bootstrap the estimates we got from the G-computation model.
-
-Remember, you need to bootstrap the entire modeling process, including the regression model, cloning the data sets, and calculating the effects.
-
-```{r}
-set.seed(1234)
-library(rsample)
-
-fit_gcomp <- function(split, ...) {
-  .df <- analysis(split)
-
-  # fit outcome model. remember to model using `.df` instead of `seven_dwarfs`
-
-
-  # clone datasets. remember to clone `.df` instead of `seven_dwarfs`
-
-
-  # predict wait time for each cloned dataset
-
-
-  # calculate ATE
-  bind_cols(predicted_yes, predicted_no) |>
-    summarize(
-      mean_yes = mean(yes_extra_hours),
-      mean_no = mean(no_extra_hours),
-      difference = mean_yes - mean_no
-    ) |>
-    # rsample expects a `term` and `estimate` column
-    pivot_longer(everything(), names_to = "term", values_to = "estimate")
-}
-
-gcomp_results <- bootstraps(seven_dwarfs, 1000, apparent = TRUE) |>
-  mutate(results = map(splits, ______))
+In this example, we'll consider loss to follow-up in the NHEFS study. We'll use the binary exposure we used earlier in the workshop: does quitting smoking (`smk`) increase weight (`wt82_71`)? This time, however, we'll adjust for loss to follow-up (people who dropped out of the study between observation periods) using inverse probability of censoring weights.

-# using bias-corrected confidence intervals
-boot_estimate <- int_bca(_______, results, .fn = fit_gcomp)
-
-boot_estimate
-```
-
-# Your Turn 4
+# Your Turn 1

1. Take a look at how many participants were lost to follow up in `nhefs`, called `censored` in this data set. You don't need to change anything in this code.

@@ -145,7 +42,7 @@ cens_model <- ___(
)
```

-# Your Turn 5
+# Your Turn 2

1. Use the logistic model you just fit to create inverse probability of censoring weights
2. Calculate the weights using `.fitted`
@@ -155,7 +52,7 @@ cens_model <- ___(
```{r}
cens <- _______ |>
  augment(type.predict = "response", data = nhefs_censored) |>
-  mutate(cens_wts = 1 / ifelse(censored == 0, 1 - ______, 1)) |>
+  mutate(cens_wts = wt_ate(censored, ______)) |>
  select(id, cens_wts)

# join all the weights data from above
@@ -171,13 +68,16 @@ cens_model <- lm(
)
```

-# Your Turn 6
+# Your Turn 3

-1. Next, we usually need to clone our datasets, but we can use `kept_smoking` and `quit_smoking` that we created in the first section
-2. Use the outcome model, `cens_model`, to make predictions for `kept_smoking` and `quit_smoking`
+1. Create the cloned data sets, called `kept_smoking` and `quit_smoking`, where one data set has the exposure set to 0 (kept smoking) and the other has it set to 1 (quit smoking).
+2. Use the outcome model, `cens_model`, to make predictions for `kept_smoking` and `quit_smoking`
3. Calculate the differences between the mean values of `kept_smoking` and `quit_smoking`

```{r}
+kept_smoking <- ____
+quit_smoking <- ____
+
predicted_kept_smoking <- _______ |>
  augment(newdata = _______) |>
  select(kept_smoking = .fitted)
@@ -239,6 +139,5 @@ boot_estimate_cens

# Take aways

-* To fit the parametric G-formula, fit a standardized model with all covariates. Then, use cloned data sets with values set to each level of the exposure you want to study.
-* Use the model to predict the values for that level of the exposure and compute the effect estimate you want
* If loss to follow-up is potentially related to your study question, inverse probability of censoring weights can help mitigate the bias.
+* You can use them in many types of models. If you're also using propensity score weights, simply multiply the weights together, then include the result as the weights for your outcome model.
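
That last bullet is the key mechanic: censoring weights and propensity score weights combine by multiplication before fitting the outcome model. Here is a minimal sketch; the data frame and column names (`nhefs_weights`, `ps_wts`, `cens_wts`) are illustrative assumptions rather than objects from this commit, and `smk`/`wt82_71` follow the naming in the prose above.

```{r}
# Multiply the propensity score weights by the censoring weights, then use the
# product as the weights for the outcome model (a sketch, not the repo's code)
nhefs_weights <- nhefs_weights |>
  mutate(combined_wts = ps_wts * cens_wts)

outcome_model <- lm(
  wt82_71 ~ smk,
  data = nhefs_weights,
  weights = combined_wts
)
```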

0 commit comments
