Commit 2bd593b

update selection bias exercises
1 parent ec58878 commit 2bd593b

1 file changed: 12 additions, 113 deletions
@@ -1,121 +1,18 @@
 ---
-title: "The Parametric G-Formula"
+title: "Bonus: Selection bias and correcting for loss to follow-up"
 format: html
 ---
 
 ```{r}
 #| label: setup
 library(tidyverse)
 library(broom)
-library(touringplans)
 library(propensity)
-
-seven_dwarfs <- seven_dwarfs_train_2018 |>
-  filter(hour == 9)
-```
-
-# Your Turn 1
-
-For the parametric G-formula, we'll use a single model to fit a causal model of Extra Magic Hours (`extra_magic_morning`) on Posted Waiting Times (`avg_spostmin`) where we include all covariates, much as we normally fit regression models. However, instead of interpreting the coefficients, we'll calculate the estimate by predicting on cloned data sets.
-
-First, let's fit the model.
-
-1. Use `lm()` to create a model with the outcome, exposure, and confounders.
-2. Save the model as `standardized_model`.
-
-```{r}
-_______ ___ _______(
-  avg_spostmin ~ _______ + wdw_ticket_season + close + weather_wdwhigh,
-  data = seven_dwarfs
-)
-```
-
-
-# Your Turn 2
-
-Now that we've fit a model, we need to clone our data set. To do this, we'll simply mutate it so that in one set, all participants have `extra_magic_morning` set to 0 and in another, all participants have `extra_magic_morning` set to 1.
-
-1. Create the cloned data sets, called `yes` and `no`.
-2. For both data sets, use `standardized_model` and `augment()` to get the predicted values. Use the `newdata` argument in `augment()` with the relevant cloned data set. Then, select only the fitted value. Rename `.fitted` to either `yes_extra_hours` or `no_extra_hours` (use the pattern `select(new_name = old_name)`).
-3. Save the predicted data sets as `predicted_yes` and `predicted_no`.
-
-```{r}
-_______ <- seven_dwarfs |>
-  _______
-
-_______ <- seven_dwarfs |>
-  _______
-
-predicted_yes <- standardized_model |>
-  _______(newdata = _______) |>
-  _______
-
-predicted_no <- standardized_model |>
-  _______(newdata = _______) |>
-  _______
 ```
 
-# Your Turn 3
-
-Finally, we'll get the mean differences between the values.
-
-1. Bind `predicted_yes` and `predicted_no` using `bind_cols()`.
-2. Summarize the predicted values to create three new variables: `mean_yes`, `mean_no`, and `difference`. The first two should be the means of `yes_extra_hours` and `no_extra_hours`; `difference` should be `mean_yes` minus `mean_no`.
-
-```{r}
-_______ |>
-  _______(
-    mean_yes = _______,
-    mean_no = _______,
-    difference = _______
-  )
-```
-
-That's it! `difference` is our effect estimate. To get confidence intervals, however, we would need to use the bootstrap method. See the link below for a full example.
-
-## Stretch goal: Bootstrapped intervals
-
-Like propensity-based models, we need to do a little more work to get correct standard errors and confidence intervals. In this stretch goal, use rsample to bootstrap the estimates we got from the G-computation model.
-
-Remember, you need to bootstrap the entire modeling process, including the regression model, cloning the data sets, and calculating the effects.
-
-```{r}
-set.seed(1234)
-library(rsample)
-
-fit_gcomp <- function(split, ...) {
-  .df <- analysis(split)
-
-  # fit outcome model. remember to model using `.df` instead of `seven_dwarfs`
-
-
-  # clone datasets. remember to clone `.df` instead of `seven_dwarfs`
-
-
-  # predict wait time for each cloned dataset
-
-
-  # calculate ATE
-  bind_cols(predicted_yes, predicted_no) |>
-    summarize(
-      mean_yes = mean(yes_extra_hours),
-      mean_no = mean(no_extra_hours),
-      difference = mean_yes - mean_no
-    ) |>
-    # rsample expects a `term` and `estimate` column
-    pivot_longer(everything(), names_to = "term", values_to = "estimate")
-}
-
-gcomp_results <- bootstraps(seven_dwarfs, 1000, apparent = TRUE) |>
-  mutate(results = map(splits, ______))
+In this example, we'll consider loss to follow-up in the NHEFS study. We'll use the binary exposure we used earlier in the workshop: does quitting smoking (`smk`) increase weight (`wt82_71`)? This time, however, we'll adjust for loss to follow-up (people who dropped out of the study between observation periods) using inverse probability of censoring weights.
 
-# using bias-corrected confidence intervals
-boot_estimate <- int_bca(_______, results, .fn = fit_gcomp)
-
-boot_estimate
-```
-
-# Your Turn 4
+# Your Turn 1
 
 1. Take a look at how many participants were lost to follow up in `nhefs`, called `censored` in this data set. You don't need to change anything in this code.
 
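The recipe in the removed exercises — fit one outcome model, clone the data set with the exposure forced to each level, predict, and average the predictions — is compact enough to sketch end to end. A minimal Python sketch (the workshop itself is in R; the data and the stratum-mean "model" standing in for `lm()` are invented for illustration):

```python
from collections import defaultdict

# Toy data, invented for illustration: binary exposure x, binary
# confounder z, outcome y.
data = [
    (1, 1, 10), (1, 1, 12), (1, 0, 4), (1, 0, 6),
    (0, 1, 8), (0, 1, 8), (0, 0, 2), (0, 0, 4), (0, 0, 3), (0, 0, 3),
]

# 1. Fit an outcome model. Here: a saturated model, i.e. the mean of y
#    within each (x, z) stratum, in place of a regression.
totals = defaultdict(lambda: [0.0, 0])
for x, z, y in data:
    totals[(x, z)][0] += y
    totals[(x, z)][1] += 1
predict = {stratum: s / n for stratum, (s, n) in totals.items()}

# 2. "Clone" the data set: predict for every row with the exposure
#    forced to 1, then again with it forced to 0.
pred_yes = [predict[(1, z)] for _, z, _ in data]
pred_no = [predict[(0, z)] for _, z, _ in data]

# 3. Average the predictions and take the difference: the effect estimate.
mean_yes = sum(pred_yes) / len(pred_yes)
mean_no = sum(pred_no) / len(pred_no)
difference = mean_yes - mean_no
print(mean_yes, mean_no, difference)
```

With a fitted regression instead of stratum means, the steps are identical; only the prediction step changes. As the stretch goal notes, interval estimates require bootstrapping this entire pipeline, not just the final difference.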
@@ -145,7 +42,7 @@ cens_model <- ___(
 )
 ```
 
-# Your Turn 5
+# Your Turn 2
 
 1. Use the logistic model you just fit to create inverse probability of censoring weights
 2. Calculate the weights using `.fitted`
@@ -155,7 +52,7 @@ cens_model <- ___(
 ```{r}
 cens <- _______ |>
   augment(type.predict = "response", data = nhefs_censored) |>
-  mutate(cens_wts = 1 / ifelse(censored == 0, 1 - ______, 1)) |>
+  mutate(cens_wts = wt_ate(censored, ______)) |>
   select(id, cens_wts)
 
 # join all the weights data from above
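The removed line in this hunk spells the weight out explicitly: uncensored rows get `1 / (1 - p)`, where `p` is the model's fitted probability of being censored (the replacement, `wt_ate()`, wraps the same idea). The arithmetic is simple enough to compute directly. A Python sketch with invented fitted probabilities:

```python
# Toy rows, invented for illustration: censoring status (1 = lost to
# follow-up) and the fitted probability P(censored = 1 | covariates),
# as the logistic model's augment step would supply.
records = [
    (0, 0.20), (0, 0.50), (1, 0.50), (0, 0.25), (1, 0.80),
]

# Uncensored rows are up-weighted by 1 / P(uncensored | covariates) so
# they also stand in for similar participants who dropped out; censored
# rows never reach the outcome model, so their weight stays at 1.
cens_wts = [1 / (1 - p) if censored == 0 else 1.0 for censored, p in records]
print(cens_wts)
```

A row with a 50% chance of being censored, for example, counts twice in the weighted outcome model, compensating for the similar participant who was lost.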
@@ -171,13 +68,16 @@ cens_model <- lm(
 )
 ```
 
-# Your Turn 6
+# Your Turn 3
 
-1. Next, we usually need to clone our datasets, but we can use `kept_smoking` and `quit_smoking` that we created in the first section
-2. Use the outcome model, `cens_model`, to make predictions for `kept_smoking` and `quit_smoking`
+1. Create the cloned data sets, called `kept_smoking` and `quit_smoking`, where one data set has the exposure set to 1 (quit smoking) and the other has it set to 0 (kept smoking).
+2. Use the outcome model, `cens_model`, to make predictions for `kept_smoking` and `quit_smoking`.
 3. Calculate the differences between the mean values of `kept_smoking` and `quit_smoking`
 
 ```{r}
+kept_smoking <- ____
+quit_smoking <- ____
+
 predicted_kept_smoking <- _______ |>
   augment(newdata = _______) |>
   select(kept_smoking = .fitted)
@@ -239,6 +139,5 @@ boot_estimate_cens
 
 # Take aways
 
-* To fit the parametric G-formula, fit a standardized model with all covariates. Then, use cloned data sets with values set to each level of the exposure you want to study.
-* Use the model to predict the values for that level of the exposure and compute the effect estimate you want.
 * If loss to follow-up is potentially related to your study question, inverse probability of censoring weights can help mitigate the bias.
+* You can use censoring weights in many types of models. If you're also using propensity score weights, simply multiply the weights together, then include the result as the weights for your outcome model.
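The last added take-away — multiply the propensity score weights by the censoring weights and pass the product to the outcome model — can be shown with toy numbers. A Python sketch (all values invented; a weighted difference in means stands in for the workshop's weighted regression):

```python
# Toy rows, invented for illustration: exposure, outcome,
# propensity score weight, censoring weight.
rows = [
    (1, 10.0, 1.5, 1.25),
    (1, 8.0, 2.0, 1.0),
    (0, 4.0, 1.2, 2.0),
    (0, 6.0, 1.0, 1.0),
]

def weighted_mean(pairs):
    """Weighted mean of (value, weight) pairs."""
    return sum(y * w for y, w in pairs) / sum(w for _, w in pairs)

# Multiply the two sets of weights and use the product as the single
# weight for the outcome analysis, so each row is simultaneously
# corrected for confounding and for loss to follow-up.
combined = [(x, y, ps_w * cens_w) for x, y, ps_w, cens_w in rows]
treated = [(y, w) for x, y, w in combined if x == 1]
control = [(y, w) for x, y, w in combined if x == 0]
effect = weighted_mean(treated) - weighted_mean(control)
print(effect)
```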

0 commit comments
