MD-PS09-SC/PS09_source.qmd at master · STAT-ATA-ASU/MD-PS09-SC · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
---
title: "Problem Set 09"
author: "Your Name"
date: last-modified
date-format: "[Last modified on] MMMM DD, YYYY HH:mm:ss zzz"
format:
  html: default
  pdf: default
editor: source
---


```{r include = FALSE}
# Do not edit this code block/chunk!
library(knitr)
knitr::opts_chunk$set(echo = TRUE, fig.align = "center", comment = NA, message = FALSE,  warning = FALSE, fig.width = 16/2, fig.height = 9/2)
# Set seed value of random number generator to get "replicable" random numbers.
# Why 76? Because of https://www.youtube.com/watch?v=xjJ7FheCkCU
set.seed(76)
# This only works if you have 12 cores!
library(doMC)
registerDoMC(cores = 12)
library(scales)
library(tidyverse)
library(infer)
```


# Background

First load the necessary packages:

::: {.callout-caution icon="false" title="R Code"}
```{r}
library(tidyverse)
library(infer)
```
:::

For this Problem Set, you will work with some grade-point-average (GPA) data for college freshman. The following will read in the data:

::: {.callout-caution icon="false" title="R Code"}
```{r}
if(!dir.exists("./Data")){
  dir.create("./Data")
  }
url <- "https://rudeboybert.github.io/SDS220/static/PS/sat_gpa.csv"
if(!file.exists("./Data/sat_gpa.csv")){
  download.file(url, destfile = "./Data/sat_gpa.csv")
  }
sat_gpa <- read_csv("./Data/sat_gpa.csv")
dim(sat_gpa)
# Show first 6 rows of sat_gpa
kable(head(sat_gpa))
```
:::

Be sure to take a look at the data in `sat_gpa`. Each row or case in this data frame is a student. The data includes the binary gender (`sex`) of each student; the math (`sat_math`), verbal (`sat_verbal`) and total SAT scores (`sat_total`) for each student; the GPA of each student in high school (`gpa_hs`) categorized as "low" or "high"; and the GPA of each student their first year of college on a numeric scale (`gpa_fy`).

::: {.callout-note}
We will use hypothesis testing to answer the following questions:

*  Is the mean freshmen GPA for females greater than the mean freshmen GPA for males?
*  Is the mean total SAT score for students with a "high" GPA greater than the mean total SAT score for students with a "low" GPA?

Note, if you get stuck as you are working through this, it will be helpful to go back and read [Chapter 9](https://moderndive.com/9-hypothesis-testing.html) in [ModernDive](https://moderndive.com/index.html).
:::


# Gender Differences in First-Year GPA?

## Exploratory Data Analysis

::: {.callout-note icon="false" title="Problem 1"}
Calculate the mean GPA score for each sex, using the `group_by` and `summarize` commands from the `dplyr` package.  Store the result in `avg_gpa_sex`.  What is the difference in sample mean GPA's? Make a guess: is this difference statistically significant?
:::

::: {.callout-important icon="false" collapse="false" title="Problem 1 Answers"}
```{r}
# Type your code and comments inside the code chunk


```

* Delete this and put your text answer here.

* Delete this and put your text answer here.

:::


::: {.callout-note icon="false" title="Problem 2"}
Generate a data visualization that displays the GPAs of the two groups. Be sure to include a title and label your axes.
:::


::: {.callout-important icon="false" collapse="false" title="Problem 2 Answers"}
```{r}
# Type your code and comments inside the code chunk

```
:::

## Stating a Null Hypothesis

We will now test the null hypothesis that there's no difference in population mean GPA between the sexes at the population level versus the alternative hypothesis that the mean GPA for females is greater than the mean GPA for males. We can write this out in mathematical notation

$$\begin{aligned} H_0:&\mu_{\text{female}} = \mu_{\text{male}} \\\ \mbox{vs }H_A:& \mu_{\text{female}} > \mu_{\text{male}} \end{aligned}$$

or expressed differently, that the difference is 0 or not:

$$\begin{aligned} H_0:&\mu_{\text{female}} - \mu_{\text{male}} = 0 \\\ \mbox{vs }H_A:& \mu_{\text{female}} - \mu_{\text{male}} \neq 0 \end{aligned}$$

## Testing the Hypothesis

Here's how we use infer to run this hypothesis test:

### Step 1: Calculate the Observed Difference{-}

Note that the order we choose does not matter here (female then male)...but since we used `order = c("Female", "Male")` here, we should do the same in subsequent calculations!

::: {.callout-caution icon="false" title="R Code"}
```{r}
obs_diff_gpa_sex <- sat_gpa %>%
  specify(gpa_fy ~ sex) %>%
  calculate(stat = "diff in means",
            order = c("Female", "Male")) %>%
  pull()

obs_diff_gpa_sex
# OR
obs_diff <- -diff(tapply(sat_gpa$gpa_fy, sat_gpa$sex, mean))
obs_diff
```
:::

Note that this is the difference in the group means we calculated earlier!

::: {.callout-caution icon="false" title="R Code"}
```{r}
obs_diff_gpa_sex
2.544587 - 2.396066
avf_avm <- obs_diff_gpa_sex
avf_avm
```
:::

### Step 2. Generate the Null Distribution of $\delta$

This step involves generating simulated values *as if* we lived in a world where there's no difference between the two groups. Going back to the idea of permutation, and tactile sampling, this is akin to shuffling the GPA scores between male and female labels (i.e. removing the structure to the data) just as we could have done with index cards.

::: {.callout-caution icon="false" title="R Code"}
```{r}
gpas_in_null_world <- sat_gpa %>%
  specify(gpa_fy ~ sex) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 2000, type = "permute")
kable(head(gpas_in_null_world))
```
:::


::: {.callout-note icon="false" title="Problem 3"}
What was the size of the "shuffled" (permuted) sample in each replicate?
:::

::: {.callout-important icon="false" collapse="false" title="Problem 3 Answers"}

* Delete this and put your text answer here.

:::

::: {.callout-note icon="false" title="Problem 4"}
How many times did we do a different "shuffle" (permute) here to the sample? How many rows are in the `gpas_in_null_world` data frame?
:::

::: {.callout-important icon="false" collapse="false" title="Problem 4 Answers"}
```{r}
# Type your code and comments inside the code chunk

```


* Delete this and put your text answer here.

* Delete this and put your text answer here.
:::


### Step 3. Calculate the Differences Between Male and Females Under the Null Hypothesis

The following calculates the differences in mean GPA for males and females for "shuffled" (permuted) data.

::: {.callout-caution icon="false" title="R Code"}
```{r}
gpa_diff_under_null <- gpas_in_null_world %>%
  calculate(stat = "diff in means", order = c("Female", "Male"))

gpa_diff_under_null %>%
  slice(1:5) %>%  # show first five rows
  kable()
# Done with a for loop
B <- 2000
mean_diff <- numeric(B)
for(i in 1:B){
  mean_diff[i] <- -diff(tapply(sat_gpa$gpa_fy, sample(sat_gpa$sex), mean))
}
hist(mean_diff)
abline(v = obs_diff, lty = "dashed")
```
:::


::: {.callout-note icon="false" title="Problem 5"}
How many rows are in the `gpa_diff_under_null` data frame? Why?
:::

::: {.callout-important icon="false" collapse="false" title="Problem 5 Answers"}

* Delete this and put your text answer here.

:::

### Step 4. Visualize how the Observed Difference Compares to the Null Distribution of $\delta$

The following plots the $\delta$ values we calculated for each of the different "shuffled" replicates. This is the null distribution of $\delta$. The red line shows the observed difference between male and female scores in the data (`r avf_avm`) from Step 1.


::: {.callout-caution icon="false" title="R Code"}
```{r}
gpa_diff_under_null %>%
  visualize() +
  labs(x = "Difference in mean GPA for males and females", y = "Count",
       title = "Null distribution of differences in male and female GPAs",
      subtitle = "Actual difference observed in the data is marked in red") +
  shade_p_value(obs_stat = obs_diff_gpa_sex, direction = "two-sided") +
  theme_bw()
```
:::

::: {.callout-note}
**Note that zero is the center of this null distribution.** The null hypothesis is that there is no difference between males and females in GPA score. In the permutations, zero was the most common value, because all structure was removed from the data.  GPA values were sorted into male and female **at random**. Values as large as 0.1 and -0.1 occurred, but much less frequently, because they are just not as likely when structure is removed from the data.
:::

### Step 5: Calculate a p-value

::: {.callout-caution icon="false" title="R Code"}
```{r}
gpa_diff_under_null %>%
  get_pvalue(obs_stat = obs_diff_gpa_sex, direction = "greater") %>%
  pull() -> pvalue
pvalue
# OR from the for loop - slighlty more conservative approach
p_value <- (sum(mean_diff >= obs_diff) + 1)/(B + 1)
p_value
```
:::

The p-value indicates that there is a `r pvalue` or `r round(p_value,4)` chance (very low even with the conservative p-value computation) that we would see a difference of `r obs_diff_gpa_sex` in GPA scores between males and females (or a bigger difference) if in fact there was truly no difference between the sexes in GPA scores at the population level.


::: {.callout-note icon="false" title="Problem 6"}
Fill in the blanks and select the appropriate words below to write up the results & conclusions for this test:

:::


::: {.callout-important icon="false" collapse="false" title="Problem 6 Answers"}
The mean GPA scores for females in our sample ($\bar{x}_{f}$ = ____) was greater than that of males ($\bar{x}_{m}$ = ____). This difference **was/was not** statistically significant at  $\alpha = 0.05$, (p = ____).  Given this p-value, I **would/would not** reject the Null hypothesis and **find evidence/do not find evidence** that **females** have higher GPAs than **males** at the population level.
:::


### Step 6: Calculate a Confidence Interval for the Difference

The following will allow us to calculate a 95% bootstrap percentile confidence interval for the difference between mean GPA scores for females and males.


::: {.callout-caution icon="false" title="R Code"}
```{r}
ci_diff_gpa_means <- sat_gpa %>%
  specify(gpa_fy ~ sex) %>%
  generate(reps = 2000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("Female", "Male")) %>%
  get_ci(level = 0.95, type = "percentile")
kable(ci_diff_gpa_means)
# Do the same thing with a for loop
sat_gpa %>%
  filter(sex == "Female") %>%
  select(gpa_fy) %>%
  pull() -> fem_gpa
sat_gpa %>%
  filter(sex == "Male") %>%
  select(gpa_fy) %>%
  pull() -> mal_gpa
mean_ds <- numeric(B)
for(i in 1:B){
  bss1 <- sample(fem_gpa, size = sum(!is.na(fem_gpa)), replace = TRUE)
  bss2 <- sample(mal_gpa, size = sum(!is.na(mal_gpa)), replace = TRUE)
  mean_ds[i] <- mean(bss1) - mean(bss2)
}
kable(quantile(mean_ds, probs =c(0.025, 0.975)))
```
:::

## Complete all the Above Tasks with a t-test

Note that all the above steps can be done with one line of code **if a slew of assumptions** like normality and equal variance of the groups are met.

::: {.callout-caution icon="false" title="R Code"}
```{r}
t.test(gpa_fy ~ sex, var.equal = TRUE, data = sat_gpa, alternative = "greater")
```
:::


# Relationship Between High-School & First-Year GPA?

For this analysis `sat_total` is the outcome variable, and `gpa_hs` is the predictor variable, with two levels "low" and "high".

## Exploratory Data Analysis

We can first calculate the mean total SAT score for each group (i.e students with a low and high GPA), using the `group_by` and `summarize` commands from the `dplyr` package.

::: {.callout-caution icon="false" title="R Code"}
```{r}
avg_sat_gpa <- sat_gpa %>%
  group_by(gpa_hs) %>%
  summarize(sat_total = mean(sat_total), n = n())
kable(avg_sat_gpa)
```
:::

We will next generate a data visualization that displays the total SAT scores of the two groups. Be sure to include a title and label your axes.

::: {.callout-caution icon="false" title="R Code"}
```{r, fig.width = 10}
library(patchwork)
p1 <- ggplot(sat_gpa, aes(x = gpa_hs, y = sat_total)) +
  geom_boxplot(fill = "darkgreen") +
  labs(title = "SAT scores based on high school \n GPA scores",
       x = "GPA ranking",
       y = "SAT score") +
  theme_bw()
# Or
p2 <- ggplot(data = sat_gpa,
             aes(x = sat_total, color = gpa_hs)) +
  geom_density() +
  theme_bw() +
  labs(color = "High School GPA",
       title = "Densities of SAT scores based \n on high school GPA scores",
       x = "SAT score")
p1 + p2
```
:::

## Stating a Null Hypothesis


::: {.callout-note icon="false" title="Problem 7"}
State the null and alternative hypotheses that you are testing (using words and symbols).
:::

::: {.callout-important icon="false" collapse="false" title="Problem 7 Answers"}

* Delete this and put your text answer here.

* Delete this and put your text answer here.

:::

## Testing the Null Hypothesis

::: {.callout-note icon="false" title="Problem 8"}
Calculate the observed difference between the mean total SAT scores of the low and high GPA high-school students. Store the result in an object named `obs_diff_sat_hs_gpa`
:::

::: {.callout-important icon="false" collapse="false" title="Problem 8 Answers"}

```{r}
# Type your code and comments inside the code chunk

```
:::

::: {.callout-note icon="false" title="Problem 9"}
Generate the null distribution of $\delta$. Here you need to generate simulated values *as if* we lived in a world where there's no difference in SAT scores between high school students with low and high GPAs.  Use 2000 replications to generate the null distribution.
:::


::: {.callout-important icon="false" collapse="false" title="Problem 9 Answers"}

```{r}
# Type your code and comments inside the code chunk

```

:::


::: {.callout-note icon="false" title="Problem 10"}
Calculate the differences in mean SAT scores between students with high and low GPA scores under the null hypothesis. Note you should use whatever order you chose in 7. Store your results in an object named `sat_diff_under_null`.  Show the first six rows of `sat_diff_under_null`.
:::


::: {.callout-important icon="false" collapse="false" title="Problem 10 Answers"}
```{r}
# Type your code and comments inside the code chunk

```
:::


::: {.callout-note icon="false" title="Problem 11"}
Visualize how the observed difference compares to the null distribution of $\delta$. Generate a histogram of the null distribution, with a vertical red line showing the observed difference in SAT scores between high school students with a high and low GPA.
:::

::: {.callout-important icon="false" collapse="false" title="Problem 11 Answers"}
```{r}
# Type your code and comments inside the code chunk

```
:::


::: {.callout-note icon="false" title="Problem 12"}
Calculate the p-value.
:::

::: {.callout-important icon="false" collapse="false" title="Problem 12 Answers"}
```{r}
# Type your code and comments inside the code chunk

```

* Delete this and put your text answer here.

:::


::: {.callout-note icon="false" title="Problem 13"}
Write up the results & conclusions for this hypothesis test. Note, p-values less than 0.001 are often reported as p < 0.001.
:::

::: {.callout-important icon="false" collapse="false" title="Problem 13 Answers"}

* Delete this and put your text answer here.
:::

::: {.callout-note icon="false" title="Problem 14"}
Compute a 95% bootstrap percentile confidence interval for the difference in total SAT scores for students with high and low high-school GPA scores. Note that you should use whatever order you chose for your null hypothesis. That is either `order = c("low", "high")` or `order = c("high", "low")`.  Provide a basic interpretation of your computed interval.
:::


::: {.callout-important icon="false" collapse="false" title="Problem 14 Answers"}
```{r}
# Type your code and comments inside the code chunk

```

* Delete this and put your text answer here.

:::


::: {.callout-note icon="false" title="Problem 15"}
Use a t-test to test the null hypothesis that the mean total SAT scores do not differ between students with high and low high school GPA scores versus the mean total SAT scores is greater for students with high versus low high school GPA scores.  State you English conclusion and verify the conclusion using the bootstrap T distribution.
:::

::: {.callout-important icon="false" collapse="false" title="Problem 15 Answers"}
```{r}
# Type your code and comments inside the code chunk

# Use the bootstrap T distribution to test

```

* English Conclusion here.

:::


# Turning in Your Work

You will need to make sure you commit and push all of your changes to the github education repository where you obtained the lab.

::: callout-tip
-   Make sure you **render a final copy with all your changes** and work.
-   Look at your final html file to make sure it contains the work you expect and is formatted properly.
:::

# Logging out of the Server

There are many statistics classes and students using the Server. To keep the server running as fast as possible, it is best to sign out when you are done. To do so, follow all the same steps for closing Quarto document:

::: callout-tip
-   Save all your work.
-   Click on the orange button in the far right corner of the screen to quit `R`
-   Choose **don't save** for the **Workspace image**
-   When the browser refreshes, you can click on the sign out next to your name in the top right.
-   You are signed out.
:::

```{r}
sessionInfo()
```