r-causal
diff --git a/‎exercises/05-quartets.qmd‎ renamed to ‎exercises/05-quartets-exercises.qmd‎ b/‎exercises/05-quartets.qmd‎ renamed to ‎exercises/05-quartets-exercises.qmd‎
diff --git a/‎exercises/10-continuous-g-computation-exercises.qmd‎
Lines changed: 181 additions & 0 deletions b/‎exercises/10-continuous-g-computation-exercises.qmd‎
diff --git a/‎exercises/11-tipr.qmd‎ renamed to ‎exercises/11-tipr-exercises.qmd‎ b/‎exercises/11-tipr.qmd‎ renamed to ‎exercises/11-tipr-exercises.qmd‎
diff --git a/‎exercises/13-bonus-selection-bias-exercises.qmd‎
Lines changed: 12 additions & 113 deletions b/‎exercises/13-bonus-selection-bias-exercises.qmd‎
@@ -0,0 +1,181 @@
+---
+title: "Continuous exposures and g-computation"
+format: html
+---
+
+```{r}
+#| label: setup
+library(tidyverse)
+library(broom)
+library(touringplans)
+library(splines)
+```
+
+For this set of exercises, we'll use g-computation to calculate a causal effect for continuous exposures.
+
+In the touringplans data set, we have information about the posted waiting times for rides. We also have a limited amount of data on the observed, actual times. The question that we will consider is this: Do posted wait times (`avg_spostmin`) for the Seven Dwarfs Mine Train at 8 am affect actual wait times (`avg_sactmin`) at 9 am? Here’s our DAG:
+
+```{r}
+#| echo: false
+#| message: false
+#| warning: false
+library(ggdag)
+
+coord_dag <- list(
+  x = c(wdw_ticket_season = -1, close = -1, weather_wdwhigh = -2, extra_magic_morning = 0, avg_spostmin = 1, avg_sactmin = 2),
+  y = c(wdw_ticket_season = -1, close = 1, weather_wdwhigh = 0.25, extra_magic_morning = 0, avg_spostmin = 0, avg_sactmin = 0)
+)
+
+labels <- c(
+  avg_sactmin = "Average actual wait",
+  avg_spostmin = "Average posted wait",
+  extra_magic_morning = "Extra Magic Morning",
+  wdw_ticket_season = "Ticket Season",
+  weather_wdwhigh = "Historic high temperature",
+  close = "Time park closed"
+)
+
+wait_time_dag <- dagify(
+  avg_sactmin ~ avg_spostmin + close + wdw_ticket_season + weather_wdwhigh + extra_magic_morning,
+  avg_spostmin ~ weather_wdwhigh + close + wdw_ticket_season + extra_magic_morning,
+  coords = coord_dag,
+  labels = labels
+)
+
+wait_time_dag |>
+  ggdag(use_labels = "label", text = FALSE) +
+  theme_void() +
+  scale_x_continuous(
+    limits = c(-2.25, 2.25), 
+    breaks = c(-2, -1, 0, 1, 2), 
+    labels = c("\n(one year ago)", "\n(6 months ago)", "\n(3 months ago)", "8am-9am\n(Today)", "9am-10am\n(Today)")
+  ) +
+  theme(axis.text.x = element_text())
+```
+
+First, let’s wrangle our data to address our question: do posted wait times at 8 am affect actual wait times at 9 am? We’ll join the baseline data (all covariates and posted wait time at 8 am) with the outcome (average actual wait time at 9 am). We also have a lot of missingness for `avg_sactmin`, so we’ll drop unobserved values for now.
+
+You don't need to update any code here, so just run this.
+
+```{r}
+eight <- seven_dwarfs_train_2018 |>
+  filter(hour == 8) |>
+  select(-avg_sactmin)
+
+nine <- seven_dwarfs_train_2018 |>
+  filter(hour == 9) |>
+  select(date, avg_sactmin)
+
+wait_times <- eight |>
+  left_join(nine, by = "date") |>
+  drop_na(avg_sactmin)
+```
+
+# Your Turn 1
+
+For the parametric G-formula, we'll use a single model to fit a causal model of Posted Waiting Times (`avg_spostmin`) on Actual Waiting Times (`avg_sactmin`) where we include all covariates, much as we normally fit regression models. However, instead of interpreting the coefficients, we'll calculate the estimate by predicting on cloned data sets.
+
+Two additional differences in our model: we'll use a natural cubic spline on the exposure, `avg_spostmin`, using `ns()` from the splines package, and we'll include an interaction term between `avg_spostmin` and `extra_magic_morning`. These complicate the interpretation of the model coefficients in normal regression but have virtually no downside (as long as we have a reasonable sample size) in g-computation, because we still get an easily interpretable result.
+
+First, let's fit the model. 
+
+1. Use `lm()` to create a model with the outcome, exposure, and confounders identified in the DAG.
+2. Save the model as `standardized_model`
+
+```{r}
+_______ ___ _______(
+  avg_sactmin ~ ns(_______, df = 5)*extra_magic_morning + _______ + _______ + _______, 
+  data = wait_times
+)
+```
+
+# Your Turn 2
+
+Now that we've fit a model, we need to clone our data set. To do this, we'll simply mutate it so that in one set, all participants have `avg_spostmin` set to 30 minutes and in another, all participants have `avg_spostmin` set to 60 minutes. 
+
+1. Create the cloned data sets, called `thirty` and `sixty`.
+2. For both data sets, use `standardized_model` and `augment()` to get the predicted values. Use the `newdata` argument in `augment()` with the relevant cloned data set. Then, select only the fitted value. Rename `.fitted` to either `thirty_posted_minutes` or `sixty_posted_minutes` (use the pattern `select(new_name = old_name)`).
+3. Save the predicted data sets as `predicted_thirty` and `predicted_sixty`.
+
+```{r}
+_______ <- wait_times |>
+  _______
+
+_______ <- wait_times |>
+  _______
+
+predicted_thirty <- standardized_model |>
+  _______(newdata = _______) |>
+  _______
+
+predicted_sixty <- standardized_model |>
+  _______(newdata = _______) |>
+  _______
+```
+
+# Your Turn 3
+
+Finally, we'll get the mean differences between the values. 
+
+1. Bind `predicted_thirty` and `predicted_sixty` using `bind_cols()`
+2. Summarize the predicted values to create three new variables: `mean_thirty`, `mean_sixty`, and `difference`. The first two should be the means of `thirty_posted_minutes` and `sixty_posted_minutes`. `difference` should be `mean_sixty` minus `mean_thirty`.
+
+```{r}
+_______ |>
+  _______(
+    mean_thirty = _______,
+    mean_sixty = _______,
+    difference = _______
+  )
+```
+
+That's it! `difference` is our effect estimate, marginalized over the spline terms, interaction effects, and confounders.
+
+## Stretch goal: Bootstrapped intervals
+
+Like propensity-based models, we need to do a little more work to get correct standard errors and confidence intervals. In this stretch goal, use rsample to bootstrap the estimates we got from the G-computation model.
+
+Remember, you need to bootstrap the entire modeling process, including the regression model, cloning the data sets, and calculating the effects.
+
+```{r}
+set.seed(1234)
+library(rsample)
+
+fit_gcomp <- function(split, ...) { 
+  .df <- analysis(split) 
+  
+  # fit outcome model. remember to model using `.df` instead of `wait_times`
+  
+  
+  # clone datasets. remember to clone `.df` instead of `wait_times`
+  
+  
+  # predict actual wait time for each cloned dataset
+
+  
+  # calculate ATE
+  bind_cols(predicted_thirty, predicted_sixty) |>
+    summarize(
+      mean_thirty = mean(thirty_posted_minutes),
+      mean_sixty = mean(sixty_posted_minutes),
+      difference = mean_sixty - mean_thirty
+    ) |>
+    # rsample expects a `term` and `estimate` column
+    pivot_longer(everything(), names_to = "term", values_to = "estimate")
+}
+
+gcomp_results <- bootstraps(wait_times, 1000, apparent = TRUE) |>
+  mutate(results = map(splits, ______))
+
+# using bias-corrected confidence intervals
+boot_estimate <- int_bca(_______, results, .fn = fit_gcomp)
+
+boot_estimate
+```
+
+***
+
+# Take aways
+
+* To fit the parametric G-formula, fit a standardized model with all covariates. Then, use cloned data sets with values set to each level of the exposure you want to study. 
+* Use the model to predict the values for that level of the exposure and compute the effect estimate you want
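The takeaways above can be sketched end to end on simulated data. This is a minimal illustration under stated assumptions, not the exercise solution: the data frame `sim` and its columns `confounder`, `exposure`, and `outcome` are hypothetical stand-ins, and base R `predict()` is used in place of `augment()` to keep the example dependency-light.

```r
library(splines)

set.seed(1)
n <- 1000

# hypothetical simulated data: one confounder, a continuous exposure, an outcome
# (the true effect of moving the exposure from 30 to 60 is 0.5 * 30 = 15)
confounder <- rnorm(n)
exposure <- 45 + 10 * confounder + rnorm(n, sd = 8)
outcome <- 5 + 0.5 * exposure + 3 * confounder + rnorm(n, sd = 4)
sim <- data.frame(confounder, exposure, outcome)

# 1. fit the outcome model with a natural cubic spline on the exposure
fit <- lm(outcome ~ ns(exposure, df = 3) + confounder, data = sim)

# 2. clone the data, setting the exposure to each level of interest
thirty <- sim
thirty$exposure <- 30
sixty <- sim
sixty$exposure <- 60

# 3. predict the outcome for every row of each clone, then average
mean_thirty <- mean(predict(fit, newdata = thirty))
mean_sixty <- mean(predict(fit, newdata = sixty))

# 4. the contrast of the two means is the effect estimate
difference <- mean_sixty - mean_thirty
difference
```

Because the spline is only used to generate predictions, the estimate stays interpretable as a difference in mean outcomes between the two exposure levels, even though the individual spline coefficients are not.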
@@ -1,121 +1,18 @@
 ---
-title: "The Parametric G-Formula"
+title: "Bonus: Selection bias and correcting for loss to follow-up"
 format: html
 ---
 
 ```{r}
 #| label: setup
 library(tidyverse)
 library(broom)
-library(touringplans)
 library(propensity)
-
-seven_dwarfs <- seven_dwarfs_train_2018 |>
-  filter(hour == 9)
-```
-
-# Your Turn 1
-
-For the parametric G-formula, we'll use a single model to fit a causal model of Extra Magic Hours (`extra_magic_morning`) on Posted Waiting Times (`avg_spostmin`) where we  include all covariates, much as we normally fit regression models. However, instead of interpreting the coefficients, we'll calculate the estimate by predicting on cloned data sets.
-
-First, let's fit the model. 
-
-1.Use `lm()` to create a model with the outcome, exposure, and confounders.
-2. Save the model as `standardized_model`
-
-```{r}
-_______ ___ _______(
-  avg_spostmin ~ _______ + wdw_ticket_season + close + weather_wdwhigh, 
-  data = seven_dwarfs
-)
-```
-
-
-# Your Turn 2
-
-Now that we've fit a model, we need to clone our data set. To do this, we'll simply mutate it so that in one set, all participants have `extra_magic_morning` set to 0 and in another, all participants have `extra_magic_morning` set to 1.
-
-1. Create the cloned data sets, called `yes` and `no`.
-2. For both data sets, use `standardized_model` and `augment()` to get the predicted values. Use the `newdata` argument in `augment()` with the relevant cloned data set. Then, select only the fitted value. Rename `.fitted` to either `yes_extra_hours` or `no_extra_hours` (use the pattern `select(new_name = old_name)`).
-3. Save the predicted data sets as`predicted_yes` and `predicted_no`.
-
-```{r}
-_______ <- seven_dwarfs |>
-  _______
-
-_______ <- seven_dwarfs |>
-  _______
-
-predicted_yes <- standardized_model |>
-  _______(newdata = _______) |>
-  _______
-
-predicted_no <- standardized_model |>
-  _______(newdata = _______) |>
-  _______
 ```
 
-# Your Turn 3
-
-Finally, we'll get the mean differences between the values. 
-
-1. Bind `predicted_yes` and  `predicted_no` using `bind_cols()`
-2. Summarize the predicted values to create three new variables: `mean_yes`, `mean_no`, and `difference`. The first two should be the means of `yes_extra_hours` and `no_extra_hours`. `difference` should be `mean_yes` minus `mean_no`.
-
-```{r}
-_______ |>
-  _______(
-    mean_yes = _______,
-    mean_no = _______,
-    difference = _______
-  )
-```
-
-That's it! `difference` is our effect estimate. To get confidence intervals, however, we would need to use the bootstrap method. See the link below for a full example.
-
-## Stretch goal: Boostrapped intervals
-
-Like propensity-based models, we need to do a little more work to get correct standard errors and confidence intervals. In this stretch goal, use rsample to bootstrap the estimates we got from the G-computation model. 
-
-Remember, you need to bootstrap the entire modeling process, including the regression model, cloning the data sets, and calculating the effects.
-
-```{r}
-set.seed(1234)
-library(rsample)
-
-fit_gcomp <- function(split, ...) { 
-  .df <- analysis(split) 
-  
-  # fit outcome model. remember to model using `.df` instead of `seven_dwarfs`
-  
-  
-  # clone datasets. remember to clone `.df` instead of `seven_dwarfs`
-  
-  
-  # predict wait time for each cloned dataset
-
-  
-  # calculate ATE
-  bind_cols(predicted_yes, predicted_no) |>
-    summarize(
-      mean_yes = mean(yes_extra_hours),
-      mean_no = mean(no_extra_hours),
-      difference = mean_yes - mean_no
-    ) |>
-    # rsample expects a `term` and `estimate` column
-    pivot_longer(everything(), names_to = "term", values_to = "estimate")
-}
-
-gcomp_results <- bootstraps(seven_dwarfs, 1000, apparent = TRUE) |>
-  mutate(results = map(splits, ______))
+In this example, we'll consider loss to follow-up in the NHEFS study. We'll use the binary exposure we used earlier in the workshop: does quitting smoking (`smk`) increase weight (`wt82_71`)? This time, however, we'll adjust for loss to follow-up (people who dropped out of the study between observation periods) using inverse probability of censoring weights.
 
-# using bias-corrected confidence intervals
-boot_estimate <- int_bca(_______, results, .fn = fit_gcomp)
-
-boot_estimate
-```
-
-# Your Turn 4
+# Your Turn 1
 
 1. Take a look at how many participants were lost to follow up in `nhefs`, called `censored` in this data set. You don't need to change anything in this code.
 
@@ -145,7 +42,7 @@ cens_model <- ___(
 )
 ```
 
-# Your Turn 5
+# Your Turn 2
 
 1. Use the logistic model you just fit to create inverse probability of censoring weights
 2. Calculate the weights using `.fitted`
@@ -155,7 +52,7 @@ cens_model <- ___(
 ```{r}
 cens <- _______ |>
   augment(type.predict = "response", data = nhefs_censored) |>
-  mutate(cens_wts = 1 / ifelse(censored == 0, 1 - ______, 1)) |>
+  mutate(cens_wts = wt_ate(censored, ______)) |>
   select(id, cens_wts)
 
 #  join all the weights data from above
@@ -171,13 +68,16 @@ cens_model <- lm(
 )
 ```
 
-# Your Turn 6
+# Your Turn 3
 
-1. Next, we usually need to clone our datasets, but we can use `kept_smoking` and `quit_smoking` that we created in the first section
-2. Use the outcome model, `cens_model`, to make predictions for `kept_smoking` and `quit_smoking`
+1. Create the cloned data sets, called `kept_smoking` and `quit_smoking`, where one data set has the exposure set to 0 (kept smoking) and the other has it set to 1 (quit smoking).
+2. Use the outcome model, `cens_model`, to make predictions for `kept_smoking` and `quit_smoking` 
 3. Calculate the differences between the mean values of `kept_smoking` and `quit_smoking`
 
 ```{r}
+kept_smoking <- ____
+quit_smoking <- ____
+
 predicted_kept_smoking <- _______ |>
   augment(newdata = _______) |>
   select(kept_smoking = .fitted)
@@ -239,6 +139,5 @@ boot_estimate_cens
 
 # Take aways
 
-* To fit the parametric G-formula, fit a standardized model with all covariates. Then, use cloned data sets with values set to each level of the exposure you want to study. 
-* Use the model to predict the values for that level of the exposure and compute the effect estimate you want
 * If loss to follow-up is potentially related to your study question, inverse probability of censoring weights can help mitigate the bias.
+* You can use them in many types of models. If you're also using propensity score weights, simply multiply the weights together, then include the result as the weights for your outcome model.
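The last takeaway, multiplying censoring weights by propensity score weights, can be sketched on simulated data. This is a rough illustration under stated assumptions rather than the workshop's code: the data frame `sim` and all of its columns are hypothetical, and the weights are written out with explicit formulas (the `1 / (1 - p)` form for uncensored rows) rather than with `wt_ate()`.

```r
set.seed(1)
n <- 2000

# hypothetical simulated data; the true effect of the exposure is 2
confounder <- rnorm(n)
exposure <- rbinom(n, 1, plogis(0.5 * confounder))
# loss to follow-up depends on both the exposure and the confounder
censored <- rbinom(n, 1, plogis(-1.5 + 0.5 * exposure + 0.5 * confounder))
outcome <- 2 * exposure + confounder + rnorm(n)
outcome[censored == 1] <- NA
sim <- data.frame(confounder, exposure, censored, outcome)

# propensity score model and ATE weights for the exposure
ps <- fitted(glm(exposure ~ confounder, family = binomial(), data = sim))
sim$ps_wts <- ifelse(sim$exposure == 1, 1 / ps, 1 / (1 - ps))

# censoring model; uncensored rows get 1 / P(uncensored)
p_cens <- fitted(
  glm(censored ~ exposure + confounder, family = binomial(), data = sim)
)
sim$cens_wts <- 1 / (1 - p_cens)

# multiply the two sets of weights together
sim$wts <- sim$ps_wts * sim$cens_wts

# fit the weighted outcome model among participants who were not censored
observed <- subset(sim, censored == 0)
fit <- lm(outcome ~ exposure, data = observed, weights = wts)
coef(fit)[["exposure"]]
```

As with the other weighted models in the workshop, the standard errors reported by `lm()` here are not correct for inference; bootstrapping the whole procedure is one way to get intervals.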