---
title: "Intro to R 2025 day 3"
author: "Matt Cannon"
date: '2025-04-02'
output:
  html_document:
    code_folding: hide
    toc: true
    toc_depth: 2
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Load libraries
```{r}
library(tidyverse)
```

# Day 3

## General stuff

### Common error messages
```{r, eval=FALSE}
mean(bob)
# Error in mean(bob) : object 'bob' not found
```
R is looking for a variable named `bob`, but it doesn't exist:

-   Do you need to make it?
-   Check spelling/capitalization
-   Did you forget to put quotes around a character string perhaps?

```{r, eval=FALSE}
arbleGarble(mtcars)
# Error in arbleGarble(mtcars) : could not find function "arbleGarble"
```
R is trying to use the function `arbleGarble()`, but it doesn't exist:

-   Either load in the library that has it
    -   (or use `coolLib::arbleGarble()`)
-   Check spelling/capitalization
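
As a small sketch of the `package::function()` form, the chunk below calls `filter()` from dplyr by its full name, so it works even if `library(dplyr)` was never run. It assumes dplyr is installed; `coolLib` above is only a placeholder name, and `efficient` is a made-up variable name.

```{r}
# Call a function by its full name, package::function(), without library()
efficient <- dplyr::filter(mtcars, mpg > 30)
nrow(efficient)
```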

```{r, eval=FALSE}
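# The closing parenthesis is missing, so R keeps waiting for more input
hist(mtcars$mpg, n = 20
# The console prompt changes from > to +
# Rainbow brackets color each matching pair, so a missing one is easy to spot
((((((1))))))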
```
R thinks you have unfinished business here:

-   Hit escape to cancel the command
-   Most likely you're missing a parenthesis or a quotation mark somewhere
-   Rainbow brackets for the win!

### Googling error messages

### Googling "R how do I…"

### ChatGPT

### Be careful using code from the web/AI!
- Make sure you understand what it does
- You'll learn a lot faster if you read the documentation and try to understand what's happening
- You can still use code from the web, but try to understand it first!

---

## Functions for today's activity

### cbind(), rbind()
- Attach two data.frames either side by side or top to bottom
- `rbind()` binds rows together and matches columns by name
    - Stacks data on top of each other
- `cbind()` binds columns together without checking that the rows are in the same order
    - Puts data side by side
    - Generally a bad idea to `cbind()` unless you're very careful!

```{r}
efficient_cars <-
  mtcars %>%
  filter(mpg > 30)

gas_guzzlers <-
  mtcars %>%
  filter(mpg < 14.5)

rbind(efficient_cars, gas_guzzlers)

# This is total nonsense, but R does it anyways
cbind(efficient_cars, gas_guzzlers)
```
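
To make the difference concrete, here is a minimal sketch using small made-up data.frames (`first_batch`, `second_batch` and `ages` are illustrative names, not part of the workshop data). It shows `rbind()` lining columns up by name even when their order differs, while `cbind()` pairs rows purely by position.

```{r}
# Two small made-up data.frames with the same columns in a different order
first_batch <- data.frame(id = c("a", "b"), score = c(1, 2))
second_batch <- data.frame(score = c(3, 4), id = c("c", "d"))

# rbind() matches columns by name, so "id" and "score" line up correctly
stacked <- rbind(first_batch, second_batch)
stacked

# cbind() pairs rows by position; nothing checks that the ids agree
ages <- data.frame(id = c("b", "a"), age = c(40, 30))
side_by_side <- cbind(first_batch, ages)
side_by_side
```

In `side_by_side`, the first row pairs id "a" with the age recorded for id "b", which is the kind of silent mismatch that `merge()` avoids.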

### merge()
- merge() will attach two data.frames side to side
- It uses shared columns to match data into the right rows
- By default, does an inner join
    - Keeps only rows that have matching data in both data.frames
- Use the arguments `all`, `all.x` or `all.y` to specify other join types

https://www.datasciencemadesimple.com/wp-content/uploads/2017/06/merge-in-R-2.jpg?ezimgfmt=rs:535x142/rscb1/ng:webp/ngcb1

`band_members` and `band_instruments` are two pre-loaded data.frames that come with dplyr (part of the tidyverse)
```{r}
band_members
band_instruments
merge(band_members, band_instruments)
merge(band_members, band_instruments, all = TRUE)
```
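
The `all.x` and `all.y` arguments mentioned above can be sketched the same way. This example assumes dplyr is installed (it supplies the two data.frames); `left_joined` and `right_joined` are made-up names, and `by = "name"` just spells out the shared column that `merge()` would find on its own.

```{r}
library(dplyr) # provides band_members and band_instruments

# Keep every row of the first (x) data.frame: a left join
left_joined <- merge(band_members, band_instruments, by = "name", all.x = TRUE)
left_joined

# Keep every row of the second (y) data.frame: a right join
right_joined <- merge(band_members, band_instruments, by = "name", all.y = TRUE)
right_joined
```

Rows without a match in the other data.frame are kept, with `NA` in the columns that came from the other side.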

### mutate()
- Add a new column or change an existing one
- Takes in a data.frame and a new column name and value
- Can use other columns in the data.frame to calculate the new column

```{r}
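km_per_mi <- 1.60934

# Add a new column, kpg (kilometers per gallon)
mtcars %>%
  mutate(kpg = mpg * km_per_mi) %>%
  head()

# Reusing an existing column name overwrites that column
mtcars %>%
  mutate(mpg = mpg * km_per_mi)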
```

### summarize() and group_by()
- `group_by()` silently divides up rows by categories
- `summarize()` summarizes data within groups in a data.frame
- Makes a new column and drops all columns not in group_by() or created by summarize()
- `ungroup()` removes the grouping

`starwars` is another pre-loaded practice data.frame
```{r}
mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg)) %>%
  ungroup()

# height and mass contain missing values (NA), so some groups return NA
starwars %>%
  group_by(homeworld, species) %>%
  summarize(
    mean_height = mean(height),
    max_mass = max(mass)
  )
```
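
Because `group_by()` works silently, it can help to check what it did. The sketch below, with made-up names `grouped_cars` and `cars_per_cyl`, uses dplyr's `group_vars()` to show the grouping and `n()` to count rows per group.

```{r}
library(dplyr)

grouped_cars <- group_by(mtcars, cyl)

# The rows and columns are unchanged; only the grouping information is new
dim(grouped_cars)
group_vars(grouped_cars)

# n() counts the rows in each group
cars_per_cyl <- summarize(grouped_cars, n_cars = n())
cars_per_cyl

# After ungroup(), no grouping variables remain
group_vars(ungroup(grouped_cars))
```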

### t()
- Transpose a data.frame (or matrix): rows become columns and columns become rows
- A matrix is a 2D array of data similar to a data.frame
    - All the data inside has to be the same type
- `t()` also turns the data into a matrix, so we convert it back using `as.data.frame()`
    - You can often convert variable types using `as.the_thing_I_want()`

```{r}
rotated_cars <-
  t(mtcars) %>%
  as.data.frame()
```
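
The "same type" rule matters when a data.frame mixes numbers and text. Here is a minimal sketch with a made-up data.frame (`mixed` and its contents are purely illustrative) showing that `t()` turns everything into character in that case.

```{r}
# A small made-up data.frame with one numeric and one character column
mixed <- data.frame(
  weight = c(1.5, 2.5),
  label = c("low", "high"),
  row.names = c("sample1", "sample2")
)

# t() returns a matrix; with mixed columns every value becomes character
flipped <- t(mixed)
is.matrix(flipped)
typeof(flipped)

# Converting back gives a data.frame whose columns are the old row names
flipped_df <- as.data.frame(flipped)
colnames(flipped_df)
```

`mtcars` avoids this problem because all of its columns are numeric.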

### pivot_longer()
- Wide data vs long data
    - Wide data has multiple columns that hold the same type of data
        - For instance, a column for each year
    - Long data has a single column for each data category and other columns for metadata
        - For instance, a column for year and a column for value
- A lot of tidyverse functions work best with long data
- `pivot_longer()` takes wide data and makes it long
- "Tidy" data: https://www.youtube.com/watch?v=K-ss_ag2k9E (explanation starts about 6 minutes in)
- https://epirhandbook.com/en/images/pivoting/pivot_longer_new.png
- https://fromthebottomoftheheap.net/2019/10/25/pivoting-tidily/

- For `pivot_longer()`
    - The `cols` argument is the columns that are in wide format
        - In the example below, that is everything except "religion": all the income columns
    - The `names_to` argument is the name of the column that will hold the column names
        - We want our new column to be named "income"
    - The `values_to` argument is the name of the column that will hold the values
        - We want our new column to be named "count"
- For `pivot_wider()`
    - The `names_from` argument is the name of the column that holds the column names
        - We want to use the "income" column
        - The values in this column will become the new column names
    - The `values_from` argument is the name of the column that holds the values
        - We want to use the "count" column
        - These values will be put into the new columns beneath the corresponding column name

```{r}
# wide data
relig_income

# long data
long_data <-
  relig_income %>%
  pivot_longer(
    cols = !religion,
    names_to = "income",
    values_to = "count"
  )

# Make it wider
wide_again <-
  long_data %>%
  pivot_wider(
    names_from = "income",
    values_from = "count"
  )
```
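
The "column for each year" case mentioned earlier can be sketched with a tiny made-up table. The store names and sales numbers below are invented purely for illustration, and `wide_sales`, `long_sales` and `back_to_wide` are hypothetical names.

```{r}
library(tidyr)

# Made-up wide data: one column per year
wide_sales <- data.frame(
  store = c("north", "south"),
  `2023` = c(10, 20),
  `2024` = c(15, 25),
  check.names = FALSE
)

# One row per store and year; the old column names go into "year"
long_sales <- pivot_longer(
  wide_sales,
  cols = !store,
  names_to = "year",
  values_to = "sales"
)
long_sales

# Pivoting wider again recovers the original layout
back_to_wide <- pivot_wider(
  long_sales,
  names_from = "year",
  values_from = "sales"
)
back_to_wide
```

Note that the year ends up as a character column in `long_sales`, because it was built from column names.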

### Write a table out to a text file
```{r}
write.table(
  wide_again,
  file = "exampleOutput.txt",
  quote = FALSE,
  sep = "\t",
  row.names = FALSE
)

write_tsv(
  wide_again,
  file = "exampleOutputAlso.txt"
)
```
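
A quick way to check a file you just wrote is to read it back in. This is a sketch only: it writes to a temporary file (so nothing in the working directory is touched), and `out_file` and `read_back` are made-up names.

```{r}
library(readr)

# Write a small built-in data.frame to a temporary file
out_file <- tempfile(fileext = ".txt")
write_tsv(head(mtcars), file = out_file)

# Read the tab-delimited file back in and check its shape
read_back <- read.delim(out_file)
dim(read_back)
```

The car names are missing from `read_back` because `write_tsv()` does not write row names.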

## Activity
- Use patientGroups.txt and exercise.txt from the materials folder
    - patientGroups is patient # and who received treatment
    - exercise is how many minutes each patient exercised across five days
- Combine the datasets into a single data.frame
- Pivot the data to long form
    - Columns: patient, day, exercise_min, glucose, trt_group
- Save the pivoted data.frame to a text file with write.table()
- Make a new column by multiplying glucose by 1000
- Calculate the average daily minutes of exercise per patient
- If you're super fast:
    - Help someone who is struggling
    - Plot glucose levels for each group (treated/control)
    - Test if they're statistically different
    - Plot daily exercise minutes per group
    - Test if exercise minutes were statistically different between groups

### Read in patientGroups.txt and exercise.txt
patientGroups.txt is:

-   patient #
-   treatment groups
-   final blood glucose measurements

exercise.txt is:

-   patient #
-   how many minutes each patient exercised across five days

```{r}

```

### Combine the datasets into a single data.frame
```{r}

```

### Pivot the data to long form
Columns should be patient, day, exercise_min, glucose, trt_group
```{r}

```

### Save the pivoted data.frame to a text file with write.table() or write_tsv()
```{r}

```

### Make a new column where glucose is multiplied by 1000
```{r}

```

### Calculate the average daily minutes of exercise per patient
```{r}

```

## If you're super fast

### Plot glucose levels for each group (treated/control)
```{r}

```

### Test if glucose levels are statistically different between groups
```{r}

```

### Plot daily exercise minutes per group
```{r}

```

### Test if exercise minutes are statistically different between groups
```{r}

```