exercises/10-continuous-g-computation-exercises.qmd
62 additions & 86 deletions
@@ -8,10 +8,10 @@ format: html
library(tidyverse)
library(broom)
library(touringplans)
-library(propensity)
+library(splines)
```
-For this set of exercises, we'll use propensity scores for continuous exposures.
+For this set of exercises, we'll use g-computation to calculate a causal effect for continuous exposures.
In the touringplans data set, we have information about the posted waiting times for rides. We also have a limited amount of data on the observed, actual times. The question that we will consider is this: Do posted wait times (`avg_spostmin`) for the Seven Dwarfs Mine Train at 8 am affect actual wait times (`avg_sactmin`) at 9 am? Here’s our DAG:
@@ -73,124 +73,102 @@ wait_times <- eight |>
# Your Turn 1
-First, let’s calculate the propensity score model, which will be the denominator in our stabilized weights (more to come on that soon). We’ll fit a model using `lm()` for `avg_spostmin` with our covariates, then use the fitted predictions of `avg_spostmin` (`.fitted`, `.sigma`) to calculate the density using `dnorm()`.
+For the parametric G-formula, we'll use a single model to estimate the causal effect of Posted Waiting Times (`avg_spostmin`) on Actual Waiting Times (`avg_sactmin`), including all covariates, much as we normally fit regression models. However, instead of interpreting the coefficients, we'll calculate the estimate by predicting on cloned data sets.
-1. Fit a model using `lm()` with `avg_spostmin` as the outcome and the confounders identified in the DAG.
-2. Use `augment()` to add model predictions to the data frame
-3. In `dnorm()`, use `.fitted` as the mean and the mean of `.sigma` as the SD to calculate the propensity score for the denominator.
+Two additional differences in our model: we'll use a natural cubic spline on the exposure, `avg_spostmin`, using `ns()` from the splines package, and we'll include an interaction term between `avg_spostmin` and `extra_magic_morning`. These complicate the interpretation of the model's coefficients in ordinary regression but have virtually no downside in g-computation (as long as we have a reasonable sample size), because we still get an easily interpretable result.
+
+First, let's fit the model.
+
+1. Use `lm()` to create a model with the outcome, exposure, and confounders identified in the DAG.
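
The completed model-fitting chunk isn't visible in this hunk, but here is a rough, non-authoritative sketch of what the instruction above describes. The confounder names `confounder_1` and `confounder_2` are placeholders (not variables from the data set), and the spline's degrees of freedom are an assumption; substitute the confounders identified in your DAG.

```r
# a sketch, not the canonical solution: fit the standardized outcome model
# with a natural cubic spline on the exposure and an interaction with
# `extra_magic_morning`; `confounder_1` and `confounder_2` are placeholders
standardized_model <- lm(
  avg_sactmin ~ ns(avg_spostmin, df = 5) * extra_magic_morning +
    confounder_1 + confounder_2,
  data = seven_dwarfs
)
```
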
-As with the example in the slides, we have a lot of extreme values for our weights
-
-```{r}
-denominators |>
-  mutate(wts = 1 / denominator) |>
-  ggplot(aes(wts)) +
-  geom_density(col = "#E69F00", fill = "#E69F0095") +
-  scale_x_log10() +
-  theme_minimal(base_size = 20) +
-  xlab("Weights")
-```
+Now that we've fit a model, we need to clone our data set. To do this, we'll simply mutate it so that in one set, all participants have `avg_spostmin` set to 30 minutes and in another, all participants have `avg_spostmin` set to 60 minutes.
-Let’s now fit the marginal density to use for stabilized weights:
-
-1. Fit an intercept-only model of posted wait times to use as the numerator model
-2. Calculate the numerator weights using `dnorm()` as above.
-3. Finally, calculate the stabilized weights, `swts`, using the `numerator` and `denominator` weights
+1. Create the cloned data sets, called `thirty` and `sixty`.
+2. For both data sets, use `standardized_model` and `augment()` to get the predicted values. Use the `newdata` argument in `augment()` with the relevant cloned data set. Then, select only the fitted value. Rename `.fitted` to either `thirty_posted_minutes` or `sixty_posted_minutes` (use the pattern `select(new_name = old_name)`).
+3. Save the predicted data sets as `predicted_thirty` and `predicted_sixty`.
```{r}
-numerator_model <- lm(
-  ___ ~ ___,
-  data = wait_times
-)
-
-numerators <- numerator_model |>
-  augment(data = wait_times) |>
-  mutate(
-    numerator = dnorm(
-      avg_spostmin,
-      ___,
-      mean(___, na.rm = TRUE)
-    )
-  ) |>
-  select(date, numerator)
-
-wait_times_wts <- wait_times |>
-  left_join(numerators, by = "date") |>
-  left_join(denominators, by = "date") |>
-  mutate(swts = ___)
-```
-
-Take a look at the weights now that we've stabilized them:
+_______ <- seven_dwarfs |>
+  _______

-```{r}
-ggplot(wait_times_wts, aes(swts)) +
-  geom_density(col = "#E69F00", fill = "#E69F0095") +
-  scale_x_log10() +
-  theme_minimal(base_size = 20) +
-  xlab("Stabilized Weights")
-```
+_______ <- seven_dwarfs |>
+  _______

-Note that their mean is now about 1! That means the pseudo-population created by the weights is the same size as the observed population (the number of days we have wait time data, in this case).
+predicted_thirty <- standardized_model |>
+  _______(newdata = _______) |>
+  _______

-```{r}
-round(mean(wait_times_wts$swts), digits = 2)
+predicted_sixty <- standardized_model |>
+  _______(newdata = _______) |>
+  _______
```
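
For reference, here is one way the blanks in the new chunk above might be filled in: a sketch that assumes `standardized_model` was fit as described earlier, not the only valid answer.

```r
# clone the data, setting every day's posted wait time to 30 or 60 minutes
thirty <- seven_dwarfs |>
  mutate(avg_spostmin = 30)

sixty <- seven_dwarfs |>
  mutate(avg_spostmin = 60)

# predict actual wait times in each cloned data set, keeping only the
# fitted values and renaming them as the instructions suggest
predicted_thirty <- standardized_model |>
  augment(newdata = thirty) |>
  select(thirty_posted_minutes = .fitted)

predicted_sixty <- standardized_model |>
  augment(newdata = sixty) |>
  select(sixty_posted_minutes = .fitted)
```
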
# Your Turn 3
-Now, let's fit the outcome model!
+Finally, we'll get the mean differences between the values.
-1. Estimate the relationship between posted wait times and actual wait times using the stabilized weights we just created.
+1. Bind `predicted_thirty` and `predicted_sixty` using `bind_cols()`.
+2. Summarize the predicted values to create three new variables: `mean_thirty`, `mean_sixty`, and `difference`. The first two should be the means of `thirty_posted_minutes` and `sixty_posted_minutes`. `difference` should be `mean_sixty` minus `mean_thirty`.
```{r}
-lm(___ ~ ___, weights = ___, data = wait_times_wts) |>
-  tidy() |>
-  filter(term == "avg_spostmin") |>
-  mutate(estimate = estimate * 10)
+_______ |>
+  _______(
+    mean_thirty = _______,
+    mean_sixty = _______,
+    difference = _______
+  )
```
+That's it! `difference` is our effect estimate, marginalized over the spline terms, interaction effects, and confounders.
## Stretch goal: Bootstrapped intervals
-Bootstrap confidence intervals for our estimate.
+Like propensity-based models, we need to do a little more work to get correct standard errors and confidence intervals. In this stretch goal, use rsample to bootstrap the estimates we got from the G-computation model.
-There's nothing new here. Just remember, you need to bootstrap the entire modeling process!
+Remember, you need to bootstrap the entire modeling process, including the regression model, cloning the data sets, and calculating the effects.
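
As a hedged sketch of the whole bootstrap workflow (again treating `confounder_1` and `confounder_2` as placeholders for the DAG's confounders, and assuming a percentile interval from rsample is acceptable), `fit_gcomp()` and the resampling might be completed like this:

```r
fit_gcomp <- function(split, ...) {
  .df <- analysis(split)

  # refit the outcome model on the bootstrapped data
  standardized_model <- lm(
    avg_sactmin ~ ns(avg_spostmin, df = 5) * extra_magic_morning +
      confounder_1 + confounder_2,
    data = .df
  )

  # clone the bootstrapped data at the two exposure levels
  thirty <- .df |> mutate(avg_spostmin = 30)
  sixty <- .df |> mutate(avg_spostmin = 60)

  # predict actual wait times and compute the marginal difference
  mean_thirty <- mean(predict(standardized_model, newdata = thirty))
  mean_sixty <- mean(predict(standardized_model, newdata = sixty))

  tibble(
    term = "difference",
    estimate = mean_sixty - mean_thirty
  )
}

boots <- bootstraps(seven_dwarfs, times = 1000, apparent = TRUE) |>
  mutate(results = map(splits, fit_gcomp))

# percentile-based bootstrap confidence interval for the difference
int_pctl(boots, results)
```
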
```{r}
set.seed(1234)
library(rsample)
-fit_model <- function(split, ...) {
+fit_gcomp <- function(split, ...) {
  .df <- analysis(split)

-  # fill in the rest!
+  # fit outcome model. remember to model using `.df` instead of `seven_dwarfs`
+
+
+  # clone datasets. remember to clone `.df` instead of `seven_dwarfs`
+
+
+  # predict actual wait time for each cloned dataset
-* We can calculate propensity scores for continuous exposures. Here, we use `dnorm(true_value, predicted_value, mean(estimated_sigma, na.rm = TRUE))` to use the normal density to transform predictions to a propensity-like scale. We can also use other approaches, like quantile binning of the exposure or calculating probability-based propensity scores using categorical regression models.
-* Continuous exposures are prone to misspecification and usually need to be stabilized. A simple stabilization is to multiply the inverted propensity score by a numerator density from an intercept-only model such as `lm(exposure ~ 1)`.
-* Stabilization is useful for any type of exposure where the weights are unbounded. Weights like the ATO are bounded, making them less susceptible to extreme weights.
-* Using propensity scores for continuous exposures in outcome models is identical to using them with binary exposures.
+* To fit the parametric G-formula, fit a standardized model with all covariates. Then, use cloned data sets with values set to each level of the exposure you want to study.
+* Use the model to predict the values for that level of the exposure and compute the effect estimate you want.