Commit c667fa9

Merge pull request #58 from r-causal/opening_exercises
2 parents b406246 + d7c7a97 commit c667fa9

File tree: 7 files changed, +844 -67 lines
03-ci-with-group-by-and-summarise-exercises.qmd

Lines changed: 119 additions & 0 deletions

@@ -0,0 +1,119 @@
---
title: "Causal Inference with `group_by()` and `summarize()`"
format: html
---

```{r}
#| label: setup
library(tidyverse)
set.seed(1)
```

## Your Turn 1

Run this code to generate the simulated data set

```{r}
n <- 1000
sim <- tibble(
  confounder = rbinom(n, 1, 0.5),
  p_exposure = case_when(
    confounder == 1 ~ 0.75,
    confounder == 0 ~ 0.25
  ),
  exposure = rbinom(n, 1, p_exposure),
  outcome = confounder + rnorm(n)
)
```
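
This simulation has a built-in confounder: `confounder` makes exposure more likely *and* adds to the outcome, while the exposure itself has no effect. As an optional check (not part of the original exercise), an unadjusted comparison, like the `lm()` calls in the accompanying slides, is therefore biased away from the true effect of zero:

```{r}
# optional check: the confounder drives who gets exposed...
sim |>
  count(confounder, exposure)

# ...so the unadjusted comparison overstates the exposure effect,
# which is truly zero in this simulation
lm(outcome ~ exposure, data = sim)
```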

1. Group the dataset by `confounder` and `exposure`
2. Calculate the mean of the `outcome` for the groups

```{r}
sim |>
  group_by(______, ______) |>
  summarise(avg_y = mean(______)) |>
  # pivot the data so we can get the difference
  # between the exposure groups
  pivot_wider(
    names_from = exposure,
    values_from = avg_y,
    names_prefix = "x_"
  ) |>
  summarise(estimate = x_1 - x_0) |>
  summarise(estimate = mean(estimate)) # note, we would need to weight this if the confounder groups were not equal sized
```
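
If you want to check your work, one way to fill in the blanks (matching the completed code shown in the slides) is:

```{r}
sim |>
  # group by the confounder and the exposure
  group_by(confounder, exposure) |>
  # average outcome within each confounder/exposure cell
  summarise(avg_y = mean(outcome)) |>
  pivot_wider(
    names_from = exposure,
    values_from = avg_y,
    names_prefix = "x_"
  ) |>
  # exposed minus unexposed difference within each confounder level
  summarise(estimate = x_1 - x_0) |>
  # average over confounder levels (equal-sized here, so no weighting needed)
  summarise(estimate = mean(estimate))
```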

## Your Turn 2

Run the following code to generate `sim2`

```{r}
n <- 1000
sim2 <- tibble(
  confounder_1 = rbinom(n, 1, 0.5),
  confounder_2 = rbinom(n, 1, 0.5),

  p_exposure = case_when(
    confounder_1 == 1 & confounder_2 == 1 ~ 0.75,
    confounder_1 == 0 & confounder_2 == 1 ~ 0.9,
    confounder_1 == 1 & confounder_2 == 0 ~ 0.2,
    confounder_1 == 0 & confounder_2 == 0 ~ 0.1
  ),
  exposure = rbinom(n, 1, p_exposure),
  outcome = confounder_1 + confounder_2 + rnorm(n)
)
```

1. Group the dataset by the confounders and exposure
2. Calculate the mean of the outcome for the groups

```{r}
sim2 |>
  group_by(_____, _____, _____) |>
  summarise(avg_y = mean(_____)) |>
  pivot_wider(
    names_from = exposure,
    values_from = avg_y,
    names_prefix = "x_"
  ) |>
  summarise(estimate = x_1 - x_0, .groups = "drop") |>
  summarise(estimate = mean(estimate))
```
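
As before, one possible way to fill in the blanks (mirroring the completed slide code) is:

```{r}
sim2 |>
  # group by both confounders and the exposure
  group_by(confounder_1, confounder_2, exposure) |>
  summarise(avg_y = mean(outcome)) |>
  pivot_wider(
    names_from = exposure,
    values_from = avg_y,
    names_prefix = "x_"
  ) |>
  # difference within each combination of the confounders
  summarise(estimate = x_1 - x_0, .groups = "drop") |>
  # then average over the confounder combinations
  summarise(estimate = mean(estimate))
```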

## Your Turn 3

Run the following code to generate `sim3`

```{r}
n <- 10000
sim3 <- tibble(
  confounder = rnorm(n),
  p_exposure = exp(confounder) / (1 + exp(confounder)),
  exposure = rbinom(n, 1, p_exposure),
  outcome = confounder + rnorm(n)
)
```

1. Use `ntile()` from dplyr to calculate a binned version of `confounder` called `confounder_q`. We'll create a variable with 5 bins.
2. Group the dataset by the binned variable you just created and exposure
3. Calculate the mean of the outcome for the groups

```{r}
sim3 |>
  mutate(confounder_q = _____(_____, 5)) |>
  group_by(_____, _____) |>
  summarise(avg_y = mean(_____)) |>
  pivot_wider(
    names_from = exposure,
    values_from = avg_y,
    names_prefix = "x_"
  ) |>
  summarise(estimate = x_1 - x_0)
```
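
One way to complete the chunk (following the completed code in the slides, which also averages the estimates across the bins) is:

```{r}
sim3 |>
  # bin the continuous confounder into quintiles
  mutate(confounder_q = ntile(confounder, 5)) |>
  group_by(confounder_q, exposure) |>
  summarise(avg_y = mean(outcome)) |>
  pivot_wider(
    names_from = exposure,
    values_from = avg_y,
    names_prefix = "x_"
  ) |>
  summarise(estimate = x_1 - x_0) |>
  # average the within-bin estimates
  summarise(estimate = mean(estimate))
```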

# Takeaways

* Sometimes correlation *is* causation!
* In simple cases, grouping by confounding variables can get us the right answer without a statistical model.
* Propensity scores generalize the idea of summarizing exposure effects to any number of confounders. Although we'll use models for this process, the foundations are the same.
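
The last point is worth a minimal sketch. As an illustration only (not part of the exercises, with object and column names invented for this sketch), a propensity score fit by logistic regression, combined with inverse-probability weights, recovers the same kind of adjusted estimate for `sim3` without binning the confounder by hand:

```{r}
# illustration only: a propensity score version of Your Turn 3
# model the probability of exposure given the confounder
propensity_model <- glm(exposure ~ confounder, data = sim3, family = binomial())

sim3 |>
  mutate(
    ps = predict(propensity_model, type = "response"),
    # inverse probability of treatment weights
    wt = ifelse(exposure == 1, 1 / ps, 1 / (1 - ps))
  ) |>
  group_by(exposure) |>
  summarise(avg_y = weighted.mean(outcome, wt)) |>
  pivot_wider(
    names_from = exposure,
    values_from = avg_y,
    names_prefix = "x_"
  ) |>
  summarise(estimate = x_1 - x_0)
```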

Binary file not shown (146 KB)

slides/raw/03-causal-inference-with-group-by-and-summarise.html

Lines changed: 65 additions & 53 deletions
Large diffs are not rendered by default.

slides/raw/03-causal-inference-with-group-by-and-summarise.qmd

Lines changed: 38 additions & 14 deletions

@@ -124,7 +124,14 @@ sim |>
   summarise(estimate = x_1 - x_0)
 ```

-## Simulation
+## *Your Turn 1* (`03-ci-with-group-by-and-summarise-exercises.qmd`)
+
+### Group the dataset by `confounder` and `exposure`
+### Calculate the mean of the `outcome` for the groups
+
+`r countdown::countdown(minutes = 3)`
+
+## *Your Turn 1*

 ```{r}
 #| code-line-numbers: "|2"
@@ -134,7 +141,7 @@ sim |>
   summarise(avg_y = mean(outcome))
 ```

-## Simulation
+## *Your Turn 1*

 ```{r}
 #| code-line-numbers: "|2"
@@ -147,7 +154,8 @@ sim |>
     values_from = avg_y,
     names_prefix = "x_"
   ) |>
-  summarise(estimate = x_1 - x_0)
+  summarise(estimate = x_1 - x_0) |>
+  summarise(estimate = mean(estimate)) # note, we would need to weight this if the confounder groups were not equal sized
 ```

 . . .
@@ -196,7 +204,12 @@ sim2 |>
 lm(outcome ~ exposure, data = sim2)
 ```

-## Simulation
+## *Your Turn 2*
+
+### Group the dataset by the confounders and exposure
+### Calculate the mean of the outcome for the groups
+
+## *Your Turn 2*

 ```{r}
 #| code-line-numbers: "|2"
@@ -209,10 +222,11 @@ sim2 |>
     values_from = avg_y,
     names_prefix = "x_"
   ) |>
-  summarise(estimate = x_1 - x_0)
+  summarise(estimate = x_1 - x_0, .groups = "drop") |>
+  summarise(estimate = mean(estimate))
 ```

----
+`r countdown::countdown(minutes = 2)`

 ## Simulation

@@ -222,7 +236,7 @@ sim2 |>
 ```{r}
 #| code-line-numbers: "|1"
 n <- 100000
-sim2 <- tibble(
+big_sim2 <- tibble(
   confounder_1 = rbinom(n, 1, 0.5),
   confounder_2 = rbinom(n, 1, 0.5),

@@ -241,7 +255,7 @@ sim2 <- tibble(
 ::: {.column width="50%"}
 ```{r}
 #| echo: false
-sim2 |>
+big_sim2 |>
   select(confounder_1, confounder_2, exposure, outcome)
 ```
 :::
@@ -251,21 +265,22 @@ sim2 |>
 ## Simulation

 ```{r}
-lm(outcome ~ exposure, data = sim2)
+lm(outcome ~ exposure, data = big_sim2)
 ```

 ## Simulation

 ```{r}
 #| code-line-numbers: "|2"
 #| output-location: fragment
-sim2 |>
+big_sim2 |>
   group_by(confounder_1, confounder_2, exposure) |>
   summarise(avg_y = mean(outcome)) |>
   pivot_wider(names_from = exposure,
               values_from = avg_y,
               names_prefix = "x_") |>
-  summarise(estimate = x_1 - x_0)
+  summarise(estimate = x_1 - x_0, .groups = "drop") |>
+  summarise(estimate = mean(estimate))
 ```

@@ -305,10 +320,18 @@ sim3 |>
 lm(outcome ~ exposure, data = sim3)
 ```

-## Simulation
+## *Your Turn 3*
+
+### Use `ntile()` from dplyr to calculate a binned version of `confounder` called `confounder_q`. We'll create a variable with 5 bins.
+### Group the dataset by the binned variable you just created and exposure
+### Calculate the mean of the outcome for the groups
+
+`r countdown::countdown(minutes = 3)`
+
+## *Your Turn 3*

 ```{r}
-#| code-line-numbers: "|2"
+#| code-line-numbers: "|2|3-4"
 #| output-location: fragment
 sim3 |>
   mutate(confounder_q = ntile(confounder, 5)) |>
@@ -319,7 +342,8 @@ sim3 |>
     values_from = avg_y,
     names_prefix = "x_"
   ) |>
-  summarise(estimate = x_1 - x_0)
+  summarise(estimate = x_1 - x_0) |>
+  summarise(estimate = mean(estimate))
 ```

 ## {background-color="#23373B" .center .huge}
