R_workshop/day2.Rmd at main · MVesuviusC/R_workshop · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
---
title: "Intro to R 2025 day 2"
author: "Matt Cannon"
date: '2025-04-01'
output:
  html_document:
    code_folding: hide
    toc: true
    toc_depth: 2
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Day 2

## General stuff

## How to make an Rmd file
- Open RStudio
- File -> New File -> R Markdown
- Give it a title and author
- Click OK
- You'll get a template with some example text and code
- Write your text and code in the chunks


## Packages
- Packages are collections of functions that do specific things. You're generally going to end up using packages specific to whatever you want to do in R.
- You can install packages with `install.packages("package_name")` and load them with `library(package_name)`.
    - The functions inside a package are only available to use when loaded with `library()`, even if you've installed it!
- Some packages need to be installed using `BiocManager::install("package_name")`.
- If the package is from GitHub, you'll use `devtools::install_github("package_name")`.
- The tidyverse package is actually a bundle of other packages and has a ton of very useful functions.
- I almost consider this package to be a requirement for using R.
```{r}
# install.packages("tidyverse", Ncpus = 4)
library(tidyverse)

install.packages("praise")
library(praise)
praise()
```

### Data.frames
- A data.frame is a variable type.
- It is a table of data with columns and rows. Most of your data will be in data frames.
    - When we loaded in the western data yesterday it was a data.frame.
- You can subset a data.frame with square brackets `[ ]`.
    - You'll put two numbers (or two groups of numbers) in between the brackets with a comma between them.
    - The first number is the row(s) you want and the second number is the column(s) you want.
    - You can remember rows / columns by remembering "Rats eat Cheese".
- You can also select names of data.frames using the column's name or a row's name.
- mtcars is a test data.frame that comes with R.

data_frame[row(s), column(s)]

```{r}
# typing a variable will print it out
mtcars

# first row, first column (just one element)
mtcars[1, 1]

# First row, all columns Note that this uses the row's name
mtcars["Mazda RX4", ]

# Everything
mtcars[, ]

# First column, all rows
mtcars[, 1]
mtcars$mpg
mtcars[, "mpg"]

# First ten rows, all columns
mtcars[1:10, ]
```

### Vectors
- A vector is a group of things. It has to all be the *same data type*.
    - All numbers, words, etc.
- You can create a vector with the `c()` function.
    - `c()` stands for "combine".
- Vectors have only one dimension.
- You can subset a vector with square brackets `[ ]`.
    - But you only need one number or group of numbers since the data are 1D.
    - Can also use boolean (TRUE/FALSE) values to subset data

```{r}
# A vector
my_vector <- c(1, 2, 3, 4, 5)

# First element
my_vector[1]
# First three elements
my_vector[1:3]

my_vector[c(TRUE, TRUE, FALSE, FALSE, TRUE)]
```

## Functions for today's activity

### mean()
This function calculates the average of a set of numbers. It takes in a vector of numbers and returns a single number.
```{r mean}
one_big_number <- c(1, 2, 4, 5, 10000000000)
big_mean <- mean(one_big_number)
big_mean

small_mean <- mean(c(2, 3, 4, 10))
small_mean
```

Many of these functions don't work if there is missing data (`NA`).

Include na.rm = TRUE as an argument to ignore missing data,

#### This doesn't return what we want
```{r}
mean(c(1, 3, 4, NA, 34))
```

#### This works
```{r}
mean(c(1, 3, 4, NA, 34), na.rm = TRUE)
```

### median()
This function calculates the median of a set of numbers. It takes in a vector of numbers and returns a single number.

Note here, we're using a variable that we created in an earlier chunk. Variables persist across chunks.
```{r median}
median(one_big_number)

mtcars <- mtcars
median(mtcars$mpg)
```

### min()/max()
These functions calculate the minimum and maximum of a set of numbers. They take in a vector of numbers and return a single number.
```{r minmax}
min(mtcars$mpg)

max(one_big_number)
```

### sum()
This function calculates the sum of a set of numbers. It takes in a vector of numbers and returns a single number.
```{r some}
sum(one_big_number)
sum(c(1, 1, 1, 1, 1, 1))
```

### standard deviation
This function calculates the standard deviation of a set of numbers. It takes in a vector of numbers and returns a single number.
```{r sd}
sd(one_big_number)
mpg_sd <- sd(mtcars$mpg)
mpg_sd
```

### filter()
- Filter is a function inside the dplyr package (which is part of tidyverse) that works on data.frames or tibbles (fancy dataframes).
- It filters rows based on some logical (TRUE/FALSE) condition.
- It keeps rows where the condition evaluates to TRUE.
- You can use `&` for "and" and `|` for "or".
    - The `|` character is just above the enter key and unless you code you've likely never used it.
- If you need something to match exactly, you need to use two equal signs `==`.
    - One equal sign is used for assignment.
- You can use `!=` for "not equal".
- You can use `>=` for "greater than or equal to" and `<=` for "less than or equal to".
- You can use `>` for "greater than" and `<` for "less than".
- Note that when you use many of the tidyverse functions you can specify column names inside the function without needing to use the `$` operator.
    - This is because the functions are designed to work with data.frames and tibbles.
```{r filter}
library(tidyverse)
mtcars <- mtcars

"bob" == "bob"
"bob" != "bob"

mtcars$mpg > 30

# We don't need to use the $ operator here like this: mtcars$mpg
filter(mtcars, mpg > 30)

# and
filter(mtcars, mpg > 30 & wt < 2)

# or
filter(mtcars, mpg > 30 | wt < 2)

filter(mtcars, carb == 2)
```

### select
Pick (or exclude) columns from a dataframe or tibble

### Pipes
- `|>` is a pipe - it passes data from one function to the next
- `%>%` is the same thing and you'll see this more, but it requires the magrittr package or tidyverse
    - That way you don't have to nest functions (((((((())))))))
- When using pipes, the data passed from a previous function is assumed to go to the first argument of the next function
    - You can specify a different argument by using a period `.`

`head()` returns the fist six elements of your data
```{r select}
select(mtcars, mpg, disp)

head(select(mtcars, -mpg, -disp))

select(mtcars, -mpg, -disp) |>
  head()

select(mtcars, !mpg, !disp) |>
  head(., n = 4)

4 %>%
  head(mtcars, n = .)

select(mtcars, !mpg, !disp) %>%
  head(n = 4)

# Nested functions, the inner-most function is executed first
head(select(mtcars, mpg, disp))
```

### hist() - histogram
`rnorm()` samples numbers from a normal distribution
```{r hist}
hist(rnorm(n = 1000, mean = 200, sd = 20), n = 100)

rnorm(n = 1000, mean = 200, sd = 20) |>
  hist(n = 100)
```

### aov()
This does an analysis of variance test. It takes in a formula and a data.frame and returns an object that you can use to get more information.

The output of this function doesn't contain everything we usually want, so we can pass the output to `summary()` to get more information.
```{r anova}
aov(mpg ~ disp + wt, data = mtcars)

anova_summary <- summary(aov(mpg ~ disp + wt, data = mtcars))
anova_summary
```

### table()
Count frequencies or combinations
```{r countStuff}
table(
  c(
    "is red",
    "is blue",
    "is red",
    "is not red",
    "is not red",
    "is blue"
  ),
  c(
    "red",
    "blue",
    "red",
    "blue",
    "blue",
    "blue"
  )
)

select(mtcars, cyl, vs)

table(mtcars$cyl, mtcars$vs)
```

## Activity
- Use titanic.csv
    - This file is located in the `materials` folder.
    - This is real data on titanic passengers
    - Survived: 1 = yes, 0 = no
- Explore the data
- Make variables that contain average, median, max and min values for age and fare
    - Do the same, but split the data by male/female
- Plot a histogram of age for all the data
    - Plot only the survivors as a second plot
- If you're super fast:
    - Help someone around you who may be struggling
    - Use table() to compare how many men/women survived
    - Use aov() to see which factors influenced survival most

### Read in and check out the Titanic data
```{r}

```

### Make variables that contain the Mean/median/min/max of age/fare
```{r}

```

### Use filter to make two new dataframe variables with just male and female only and then calculate the mean/median values for age and fare
Basically, calculate the mean/median age and fare for men vs women
```{r}

```

### Make a histogram of the age all passengers
```{r}

```

### Make another histogram for only passengers who survived

## If you're super fast

### Use `aov()` to test which factors influenced who survived the titanic disaster
```{r}

```

### Use table() to compare how many women vs men survived
```{r}

```