---
title: "Tutorial: Data Analysis with R"
author: Linda T. Betz, M.Sc.
date: January 18, 2021
output:
  beamer_presentation:
    theme: "Boadilla"
    colortheme: "seahorse"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
def.chunk.hook  <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
  x <- def.chunk.hook(x, options)
  ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
})


library(tidyverse)

```
# Structure of the tutorial

- January 18 (today): introduction to basic R syntax and principles, data cleaning and wrangling
- January 25: Machine Learning in R (part I)
- February 1: Machine Learning in R (part II), working on your reports


# Why R?

- Optimized for data analysis and visualization
- Open source → access to source code: everyone can check what is going on inside a function (in contrast to, e.g. SPSS)
- Many add-on packages for advanced statistical analysis
- Active, inclusive community → support/feedback, sharing of code

# Today's tutorial ...
... is based on selected chapters from *R for Data Science* (https://r4ds.had.co.nz/).
\
\
![](./imgs/r4ds.png){ width=50% height=50% }


# PART I: the basics
![](./imgs/start.jpg){ width=50% height=50% }

# RStudio
![](./imgs/console.png)

# The basics
R as a "calculator":
```{r, include=TRUE}
1 / 200 * 30
(59 + 73 + 2) / 3
sin(pi / 2)
```
# The basics
You can create new objects with `<-`:
```{r, include=TRUE}
x <- 3 * 4
```
- All R statements that create objects (assignment statements) have the same form:
```{r eval = FALSE}
object_name <- value
```
- "object gets value"
- RStudio keyboard shortcut for `<-`: Alt + - (Option + - on macOS)

# The basics
Object names

- … must start with a letter
- … can only contain letters, numbers, _ and .
- good practice: “snake_case”

```{r, eval = FALSE}
i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention
```

# The basics
You can inspect an object by typing its name:
```{r, include=TRUE}
x
```

# The basics
Object names can be long:
```{r, include=TRUE}
this_is_a_really_long_name <- 3.5
```
If we made a mistake and, for instance, wanted to set the value to 2.5, we can simply overwrite the old value.
```{r, include=TRUE}
this_is_a_really_long_name <- 2.5
```

# The basics
Make yet another assignment:
```{r, include=TRUE, size="tiny"}
r_rocks <- 2^3
```
Let’s try to inspect it:
```{r, include=TRUE, error=TRUE, size="tiny"}
r_rock
R_rocks
```
Typos matter. Case matters.

# The basics
R has a large collection of built-in functions. They are called like this:
```{r, include=TRUE, eval=FALSE}
function_name(arg1 = val1, arg2 = val2, ...)
```
```{r}
seq(1, 10)
```

To get help, type:
```{r, eval = FALSE}
?seq
```
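Arguments can be matched by position or by name. A small sketch using `seq()`; the argument names `from`, `to` and `by` are the ones listed on the help page `?seq`:

```{r, size="tiny"}
# positional matching: the first argument is `from`, the second is `to`
seq_positional <- seq(1, 10)

# named matching: the same call, spelled out
seq_named <- seq(from = 1, to = 10)
identical(seq_positional, seq_named)

# named arguments make optional settings explicit
(seq_by_two <- seq(from = 1, to = 10, by = 2))
```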

# The basics
![](./imgs/seq.png){ width=90% height=90% }


# The basics
If you make an assignment, you don't get to see the value. You're then tempted to immediately double-check the result:

```{r}
y <- seq(1, 10, length.out = 5)
y
```

This common action can be shortened by surrounding the assignment with parentheses, which causes assignment and "print to screen" to happen.

```{r}
(y <- seq(1, 10, length.out = 5))
```

# The basics
Quotation marks and parentheses must always come in a pair. RStudio does its best to help you, but it's still possible to mess up and end up with a mismatch. If this happens, R will show you the continuation character "+":

```
> x <- "hello
+
```
The `+` tells you that R is waiting for more input; it doesn't think you're done yet. Usually that means you've forgotten either a `"` or a `)`. Either add the missing pair, or press ESCAPE to abort the expression and try again.


# The basics
In the environment in the upper right pane, we can see all of the objects that we’ve created.
\
\
![](./imgs/environment.PNG){ width=85% height=85% }

# The basics
Important:

- 99% of problems/error messages can be solved using Google

Strategies:

- Copy error messages (works best if you search the English error message, i.e., use R in English)
- Search in English
- Good starting point: Stack Overflow

![](./imgs/lmgtfy.jpg){ width=35% height=35% }


# Create projects
It is common practice to keep all the files associated with a project together — input data, R scripts, analytical results, figures.
To create a project: Click File → New Project, then:
\
\

![](./imgs/project1.png){ width=30% height=30% }
![](./imgs/project2.png){ width=30% height=30% }
![](./imgs/project3.png){ width=30% height=30% }

Think carefully about which subdirectory you put the project in.

- Please open RStudio and create a project called "R_tutorial_2021" now.

# Create projects
Check that the "home" directory of your project is the current working directory:
```{r}
getwd()
```
Whenever you refer to a file with a relative path, R will look for it here.\
Some good practices:

- Create an R project for each data analysis project.
- Keep data files there.
- Keep scripts there; edit them, run them in bits or as a whole.
- Save your outputs (plots and cleaned data) there.
- Only ever use relative paths, not absolute paths.


# Install packages
Many useful functions are not built into R itself, but have to be imported from external extensions, so-called "packages". We first need to install them. Here, we install the tidyverse, a collection of R packages for data wrangling, analysis and visualization. This piece of code should look familiar ;)
```{r, eval=FALSE}
install.packages("tidyverse")
```
![](./imgs/tidyverse.png){ width=50% height=50% }


# PART II: data wrangling
![](./imgs/cartoon.PNG){ width=70% height=70% }


# Why data wrangling?
Data often does not come in a tidy, ready-to-use format.

- Uninformative variable names
- Missing values coded as numbers (e.g. "-999")
- Participants with too many missing values, who need to be excluded

![](./imgs/weirddata.png){ width=60% height=60% }
![](./imgs/weirddata2.png){ width=30% height=30% }


# Workflow
![](./imgs/tidyworkflow.png){ width=65% height=65% }

# Import data
We can import files into R using the tidyverse package. First, we load the package via `library()`. It is good practice to load all libraries you use in an analysis at the beginning of the script. Generally, packages from the tidyverse should be loaded last.
```{r, message=TRUE}
library(tidyverse)
```
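Check out the printout you get when you run the code. It lists the attached packages and the conflicts, i.e., functions from other packages that are now masked by tidyverse functions of the same name. The masked functions have to be called explicitly with their package name (e.g. `stats::filter()`).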

# Import data
Next, we can use `read_csv()` from the tidyverse (it is part of the readr package) to import the data file "data_cobre.csv". The first argument to `read_csv()` is the most important: it's the path to the file to read. If you have created an R project and have placed "data_cobre.csv" in the corresponding project folder, you can use a relative rather than an absolute path to load your file:

```{r, message = TRUE, size="tiny"}
data_cobre <- read_csv("data_cobre.csv")
```
When you run `read_csv()`, it prints out a column specification with the name and type of each column. `read_csv()` uses the first line of the data for the column names, which is a common convention.
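
How `read_csv()` treats the header line and guesses column types can be tried out without the real data. The sketch below writes a tiny made-up CSV to a temporary file and reads it back; the columns `id`, `age` and `gender` and their values are invented for illustration and are not taken from "data_cobre.csv":

```{r, message = TRUE, size="tiny"}
library(readr)

# write a small, made-up CSV file to a temporary location
toy_csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,gender", "A01,23,female", "A02,41,male"), toy_csv_path)

# the first line becomes the column names; the type is guessed per column
toy_csv <- read_csv(toy_csv_path)
toy_csv
```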


# The dataset
- Open dataset from the Center for Biomedical Research Excellence (https://coins.trendscenter.org)
- Contains clinical participants with various diagnoses from the schizophrenia spectrum, and healthy controls
- Every row represents one participant, every column is a variable
- We'll use neuropsychological data later for Machine Learning
```{r, message = TRUE, size="tiny"}
data_cobre
```

# Variable types
Letter abbreviations under the column names describe the type of each variable:

- `dbl` stands for doubles, or real numbers (0.25, 1.00, -5).
- `chr` stands for character vectors, or strings ("abc").
- `fct` stands for factors, which R uses to represent categorical variables with fixed possible values ("female"/"male").
- `lgl` stands for logical, vectors that contain only TRUE or FALSE.

There are other common types of variables that aren’t used in this dataset:

- `int` stands for integers (1, 2, ...).
- `date` stands for dates (31.12.2020).
- `dttm` stands for date-times (a date + a time; 31.12.2020 21:00).
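
The types behind these abbreviations can be checked for single values with `typeof()` and `class()`. A small sketch with made-up values:

```{r, size="tiny"}
typeof(0.25)                        # "double"
typeof("abc")                       # "character"
typeof(TRUE)                        # "logical"
typeof(1L)                          # "integer" (the L suffix marks an integer)
class(factor(c("female", "male")))  # "factor"
class(as.Date("2020-12-31"))        # "Date"
```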

# Vectors
A vector is a basic data structure in R. It contains *elements of the same type*. Vectors are generally created using the `c()` function.
```{r}
(x <- c(1, 5, 4, 9, 0))
(y <- c("male", "female", "female", "male"))

```
`c()` will try and coerce elements to the *same type* if they are different.
```{r}
(x <- c(1, 5.4, TRUE, "hello"))
```
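Coercion follows a fixed order: logical, then integer, then double, then character. A short sketch with made-up values that checks the resulting type at each step:

```{r, size="tiny"}
typeof(c(TRUE, 1L))      # logical + integer  -> "integer"
typeof(c(1L, 2.5))       # integer + double   -> "double"
typeof(c(2.5, "hello"))  # double + character -> "character"

# logical values count as 1 (TRUE) and 0 (FALSE) in arithmetic
sum(c(TRUE, FALSE, TRUE))  # 2
```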

# Getting started
After loading the data, it is helpful to get an overview of what is going on. There are several options:
```{r, size="tiny"}
head(data_cobre) # show the first 6 rows of a data set
```

# Getting started
```{r, size="tiny"}
summary(data_cobre) # gives descriptive stats for every variable
```

# Getting started
```{r, size="tiny"}
data_cobre$age # shows one specific variable

table(data_cobre$gender) # shows frequency of values for one variable
```

# Data transformation
First, we are going to look at four key dplyr functions that allow you to solve the vast majority of data manipulation challenges:

- Pick observations by their values (`filter()`).
- Pick variables by their names (`select()`).
- Create new variables with functions of existing variables (`mutate()`).
- Collapse many values down to a single summary (`summarize()`).

All 'verbs' work similarly:

- The first argument is a data frame.
- The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
- The result is a new data frame.
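
The shared pattern can be sketched on a tiny made-up data frame (`toy_people` and its values are invented for illustration and are not part of `data_cobre`):

```{r, size="tiny"}
library(dplyr)

# made-up data for illustration only
toy_people <- tibble(
  id = c("A01", "A02", "A03"),
  age = c(23, 41, 65)
)

toy_filtered <- filter(toy_people, age > 30)                # keeps rows
toy_selected <- select(toy_people, id)                      # keeps columns
toy_mutated <- mutate(toy_people, age_months = age * 12)    # adds a column
toy_summary <- summarize(toy_people, age_mean = mean(age))  # collapses to one row

# the input is left untouched: still 3 rows and 2 columns
dim(toy_people)
```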

# Filter rows with `filter()`
`filter()` allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all participants who are 65 years old.
```{r, size="tiny"}
filter(data_cobre, age == 65)
```
When you run that line of code, dplyr executes the filtering operation and returns a new data frame. dplyr functions *never modify their inputs*, so if you want to save the result, you’ll need to use the assignment operator, <-:
```{r, size="tiny"}
old_participants <- filter(data_cobre, age == 65)
```

# Comparisons
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal). When you’re starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality. When this happens you’ll get an informative error:
```{r, error=TRUE}
old_participants <- filter(data_cobre, age = 65)
```

# Logical operations
Multiple arguments to `filter()` are combined with “and”: every expression must be true in order for a row to be included in the output. For other types of combinations, you’ll need to use Boolean operators:

- `&` is “and”
- `|` is “or”
- `!` is “not”
\
\
\
![](./imgs/booleans.png){ width=50% height=50% }

# Logical operations
The following code finds all participants whose study group is either "bipolar" or "schizoaffective":

```{r, size="tiny"}
filter(data_cobre, study_group == "bipolar" | study_group == "schizoaffective")
```

# Logical operations
A useful short-hand for this problem is `x %in% y`. This will select every row where x is one of the values in y. We could use it to rewrite the code on the previous slide:
```{r, size="tiny"}
filter(data_cobre, study_group %in% c("bipolar", "schizoaffective"))
```
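To see what `%in%` does on its own, and that both ways of writing the filter keep the same rows, here is a minimal sketch on made-up data (`toy_groups` is invented for illustration):

```{r, size="tiny"}
library(dplyr)

# made-up study groups for illustration only
toy_groups <- tibble(
  id = c("A01", "A02", "A03", "A04"),
  study_group = c("control", "bipolar", "schizoaffective", "schizophrenia")
)

# %in% returns one TRUE/FALSE per element on the left-hand side
toy_groups$study_group %in% c("bipolar", "schizoaffective")

# both versions keep the same participants
with_or <- filter(toy_groups, study_group == "bipolar" | study_group == "schizoaffective")
with_in <- filter(toy_groups, study_group %in% c("bipolar", "schizoaffective"))
identical(with_or$id, with_in$id)
```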


# Missing values
One important feature of R that can make comparisons tricky is missing values, or NAs (“not availables”). `NA` represents an unknown value: almost any operation involving an unknown value will also be unknown.
```{r}
10 == NA
NA + 5
NA / 9
```
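The same logic applies when two missing values are compared with each other. A small sketch with two hypothetical participants whose ages were not recorded:

```{r, size="tiny"}
# two participants whose ages are unknown
age_participant_a <- NA
age_participant_b <- NA

# are they the same age? We cannot know, so the answer is NA as well
age_participant_a == age_participant_b
```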

# Missing values
If you want to determine if a value is missing, use `is.na()`:
```{r, size="tiny"}
is.na(x)
```
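On a vector that actually contains missing values, `is.na()` returns `TRUE` at the missing positions, and `sum()` counts them. A sketch with made-up values:

```{r, size="tiny"}
toy_onset_age <- c(18, NA, 25, NA)

is.na(toy_onset_age)       # FALSE TRUE FALSE TRUE
sum(is.na(toy_onset_age))  # number of missing values: 2
```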

`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. If you want to preserve missing values, ask for them explicitly:
```{r, size="tiny"}
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
```

# Break
![](./imgs/break.jpg){width=50% height=50%}


# Exercises (I)
Before getting started with the exercises, put the file `data_cobre.csv` in your project folder. In RStudio, click File → New File → R Script. Use this script to type in and save your solutions for the exercises (e.g. as "exercises.R"). These should be the first lines of your script:
```{r, eval=FALSE}
library(tidyverse) # load the package
data_cobre <- read_csv("data_cobre.csv") # load data
```
You can run blocks of code by selecting them and pressing Ctrl+Enter (or the "Run" button in the upper right corner of the script pane).

# Exercises (I)
- Find all participants who are 19 years and younger. How many are there?
- Find all healthy control participants who are 19 years and younger. How many are there?
- Find all women who have been diagnosed with either bipolar disorder or schizophrenia. How many are there?
- How many patients have a missing value in the variable `age_onset_psychiatric_illness`?

# Solutions
Find all participants who are 19 years and younger. How many are there?
```{r, size="tiny"}
filter(data_cobre, age <= 19)
```
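To answer the "How many are there?" part directly, the filtered data frame can be passed to `nrow()`. A minimal sketch on made-up data (`toy_ages` and its values are invented for illustration):

```{r, size="tiny"}
library(dplyr)

toy_ages <- tibble(id = c("A01", "A02", "A03", "A04"), age = c(18, 19, 23, 41))

toy_young <- filter(toy_ages, age <= 19)
nrow(toy_young)  # number of rows that passed the filter
```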

# Solutions
Find all healthy control participants who are 19 years and younger. How many are there?
```{r, size="tiny"}
filter(data_cobre, study_group == "control" & age <= 19)
```

# Solutions
Find all women who have been diagnosed with either bipolar disorder or schizophrenia. How many are there?
```{r, size="tiny"}
filter(data_cobre, gender == "female" & study_group %in% c("bipolar", "schizophrenia"))
```

# Solutions
How many patients have a missing value in the variable `age_onset_psychiatric_illness`?
```{r, size="tiny"}
filter(data_cobre, study_group != "control" & is.na(age_onset_psychiatric_illness))
```

# Select columns with `select()`
It’s not uncommon to get datasets with hundreds or even thousands of variables. `select()` allows you to select a useful subset using operations based on the names of the variables.

`select()` is not super useful with our dataset because we only have `r ncol(data_cobre)` variables, but you can still get the general idea:
```{r, size="tiny"}
select(data_cobre,  id,  age, gender)
```

# Select columns with `select()`
There are a number of helper functions you can use within `select()`:

- `starts_with("abc")`: matches names that begin with “abc”.
- `ends_with("xyz")`: matches names that end with “xyz”.
- `contains("ijk")`: matches names that contain “ijk”.
- `matches("(.)\\1")`: selects variables that match a regular expression.
- `num_range("x", 1:3)`: matches x1, x2 and x3.

```{r, size="tiny"}
select(data_cobre, contains("age"))
```
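The other helpers work the same way. A sketch on a made-up one-row data frame (`toy_wide` and its column names are invented for illustration):

```{r, size="tiny"}
library(dplyr)

toy_wide <- tibble(
  id = "A01",
  age = 23,
  age_onset = 19,
  score_1 = 10,
  score_2 = 12,
  score_3 = 9
)

names(select(toy_wide, starts_with("age")))        # "age" "age_onset"
names(select(toy_wide, ends_with("onset")))        # "age_onset"
names(select(toy_wide, num_range("score_", 1:2)))  # "score_1" "score_2"
names(select(toy_wide, matches("^score_[13]$")))   # "score_1" "score_3"
```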

# Rename variables
```{r, size="tiny"}
rename(data_cobre, participant_id = id) # logic: new_name = old_name
```

# Exercises (II)
- Rename all variables such that they contain upper case letters only.
- Select all variables which contain the words "study" or "age".

Remember to use Google if necessary :)

# Solutions
Rename all variables such that they contain upper case letters only.
```{r, size="tiny"}
rename_with(data_cobre, toupper)
```

# Solutions
Select all variables that contain the words "study" or "age".
```{r, size="tiny"}
select(data_cobre, matches("study|age"))
```

# Add new variables with `mutate()`
`mutate()` adds new columns that are functions of existing columns. `mutate()` always adds new columns at the end of your dataset.
```{r, size="tiny"}
mutate(data_cobre, age_months = age*12) # age in months
```
If you only want to keep the new variables, use `transmute()`:
```{r, size="tiny"}
transmute(data_cobre, age_months = age*12) # age in months
```

# Useful creation functions
There are many functions for creating new variables that you can use with `mutate()`. The function must have a key property: it must take a vector of values as input and return a vector with the same number of values as output. Some examples:

- Arithmetic operators: `+`, `-`, `*`, `/`, `^`.
- Modular arithmetic: `%/%` (integer division) and `%%` (remainder).
- Logs: `log()`, `log2()`, `log10()`.
- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`.
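
A few of these operators in action, sketched on made-up durations in months (`toy_durations` is invented for illustration):

```{r, size="tiny"}
library(dplyr)

toy_durations <- tibble(duration_months = c(7, 12, 30))

toy_durations_new <- mutate(
  toy_durations,
  years = duration_months %/% 12,         # integer division: full years
  months_left = duration_months %% 12,    # remainder: months beyond full years
  log2_duration = log2(duration_months),  # logarithm to base 2
  over_one_year = duration_months > 12    # logical comparison
)
toy_durations_new
```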

# Exercises (III)
- Create a new variable `patient` that indicates if a participant has been diagnosed with any disorder or not.
- Z-transform the variable `age`, and store the result in a new variable called `age_z_score`.

Remember to use Google if necessary :)

# Solutions
Create a new variable `patient` that indicates if a participant has been diagnosed with any disorder or not.
```{r, size="tiny"}
mutate(data_cobre, patient = if_else(study_group != "control", true = "yes", false = "no"))
```

# Solutions
Z-transform the variable `age`, and store the result in a new variable called `age_z_score`.
```{r, size="tiny"}
mutate(data_cobre, age_z_score = (age - mean(age)) / sd(age))

```
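```{r, size="tiny", eval=FALSE}
# alternative: scale() returns a one-column matrix, so as.numeric() converts it to a vector
mutate(data_cobre, age_z_score = as.numeric(scale(age)))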

```
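The z-transformation subtracts the mean and divides by the standard deviation, $z_i = (x_i - \bar{x}) / s$. A quick check on three made-up ages that the manual formula and `scale()` agree:

```{r, size="tiny"}
toy_age <- c(20, 30, 40)               # mean = 30, sd = 10

z_manual <- (toy_age - mean(toy_age)) / sd(toy_age)
z_scale <- as.numeric(scale(toy_age))  # scale() returns a matrix, so convert it

z_manual                               # -1 0 1
all.equal(z_manual, z_scale)           # TRUE
```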

# PART III: summarizing
![](./imgs/summary.jpg){width=50% height=50%}


# Grouped summaries with `summarize()`

The last key data operation we'll look at is `summarize()`. It collapses a data frame to a single row:
```{r, size="tiny"}
summarize(data_cobre, age_mean = mean(age))
```
`summarize()` is not terribly useful unless we pair it with `group_by()`. When you then use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”.

```{r, size="tiny"}
by_group <- group_by(data_cobre, study_group) # save grouped data frame here
summarize(by_group, age = mean(age), .groups = "keep") # mean age per group
```
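The same two steps on a small made-up data frame, where the result is easy to verify by hand (`toy_study` is invented for illustration; `n()` counts the rows per group):

```{r, size="tiny"}
library(dplyr)

toy_study <- tibble(
  study_group = c("control", "control", "bipolar", "bipolar", "bipolar"),
  age = c(20, 30, 40, 50, 60)
)

toy_by_group <- group_by(toy_study, study_group)
toy_group_summary <- summarize(
  toy_by_group,
  n = n(),               # participants per group
  age_mean = mean(age),  # mean age per group
  .groups = "drop"
)
toy_group_summary
```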

# Grouped summaries with `summarize()`
![](./imgs/group_summarize.png){ width=90% height=90% }


# Useful summary functions
![](./imgs/summarize_funs.PNG){ width=80% height=80% }

# Combine multiple operations with the pipe (`%>%`)
```{r, size="tiny"}
by_group <- group_by(data_cobre, study_group) # save grouped data frame here
mean_age_groups <- summarize(by_group, age = mean(age), .groups = "drop") # mean age per group
filtered_age <- filter(mean_age_groups, age > 39)
```

A less frustrating way to tackle this problem is the pipe operator (`%>%`). It avoids naming every intermediate result and focuses on the transformations rather than on what is being transformed, which makes the code easier to read. You can read it as a series of imperative statements: group, then summarize, then filter.

```{r, size="tiny"}
data_cobre %>%
  group_by(study_group) %>% # group by study group
  summarize(age = mean(age), .groups = "drop") %>% # mean age per group
  filter(age > 39)
```
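In general, `x %>% f(y)` is another way of writing `f(x, y)`. A minimal sketch on made-up data (`toy_pipe` is invented for illustration) showing that the nested and the piped version give the same result:

```{r, size="tiny"}
library(dplyr)

toy_pipe <- tibble(age = c(23, 41, 65))

# nested calls: read from the inside out
nested <- summarize(filter(toy_pipe, age > 30), age_mean = mean(age))

# piped calls: read from top to bottom
piped <- toy_pipe %>%
  filter(age > 30) %>%
  summarize(age_mean = mean(age))

identical(nested$age_mean, piped$age_mean)
```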

# Group by multiple variables
When you group by multiple variables, each summary peels off one level of the grouping by default. Here, `.groups = "drop"` removes the grouping entirely:

```{r, size="tiny"}
data_cobre %>%
  group_by(study_group, first_degree_relative_psychosis) %>% # group by both variables
  summarize(age_mean = mean(age), age_sd = sd(age), .groups = "drop") # mean/SD age per group
```
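What remains of the grouping after `summarize()` can be inspected with `group_vars()`. A sketch on made-up data (`toy_two_groups` and the column `relative` are invented for illustration), comparing `.groups = "drop_last"` with `.groups = "drop"`:

```{r, size="tiny"}
library(dplyr)

toy_two_groups <- tibble(
  study_group = c("control", "control", "bipolar", "bipolar"),
  relative = c("yes", "no", "yes", "no"),
  age = c(20, 30, 40, 50)
) %>%
  group_by(study_group, relative)

peeled <- summarize(toy_two_groups, age_mean = mean(age), .groups = "drop_last")
group_vars(peeled)   # "study_group": only the last grouping level was removed

dropped <- summarize(toy_two_groups, age_mean = mean(age), .groups = "drop")
group_vars(dropped)  # character(0): no grouping left
```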

# Missing values (again)
Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
```{r, size="tiny"}
data_cobre %>%
  summarize(age_onset_mean = mean(age_onset_psychiatric_illness))
```
All aggregation functions have an `na.rm` argument which removes the missing values prior to computation:
```{r, size="tiny"}
data_cobre %>%
  summarize(age_onset_mean = mean(age_onset_psychiatric_illness, na.rm = TRUE))
```
This mean onset age, however, is far too high. Something else is wrong here...

# Problematic coding of missing values
Let's look at the variable more closely:
```{r, size="tiny"}
table(data_cobre$age_onset_psychiatric_illness)
```
Looks like missing values are coded as '9999'. Let's change that:
```{r, size="tiny"}
data_cobre %>%
  mutate(
    age_onset_psychiatric_illness = if_else(
      age_onset_psychiatric_illness == 9999,
      true = NA_real_,
      false = age_onset_psychiatric_illness
    )
  ) %>%
  summarize(mean = mean(age_onset_psychiatric_illness, na.rm = TRUE))
```
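As an alternative to `if_else()` (not the approach used on this slide), dplyr's `na_if()` replaces one specific value with `NA`. A sketch on made-up onset ages (`toy_coded` is invented for illustration):

```{r, size="tiny"}
library(dplyr)

toy_coded <- tibble(age_onset = c(18, 9999, 22))

# turn the placeholder value 9999 into a proper missing value
toy_recoded <- mutate(toy_coded, age_onset = na_if(age_onset, 9999))

toy_recoded$age_onset                      # 18 NA 22
mean(toy_recoded$age_onset, na.rm = TRUE)  # 20
```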

# Exercises (IV) (use `%>%`)
- Compute the mean (sd) age for patients who have a first degree relative with a psychotic disorder and those who do not.
- Recode the values in the variables `age_onset_psychiatric_illness` and `age_first_hospitalization` such that they can be analyzed properly. Overwrite the existing variables in `data_cobre`.
- Create a new variable that gives the time since the onset of psychiatric illness in years. Then compute the mean time since onset of psychiatric illness for men and women.
- Compute the median age of onset of psychiatric illness by diagnosis.

# Solutions
Compute the mean (sd) age for patients who have a first degree relative with a psychotic disorder and those who do not.
```{r, size="tiny"}
data_cobre %>%
  filter(study_group != "control" &
           !is.na(first_degree_relative_psychosis)) %>%
  group_by(first_degree_relative_psychosis) %>%
  summarize(age_mean = mean(age),
            age_sd = sd(age),
            .groups = "keep")
```

# Solutions
Recode the values in the variables `age_onset_psychiatric_illness` and `age_first_hospitalization` such that they can be analyzed properly. Overwrite the existing variables in `data_cobre`.
```{r, size="tiny"}
table(data_cobre$age_first_hospitalization)

data_cobre <- data_cobre %>%
  mutate(
    age_onset_psychiatric_illness = if_else(
      age_onset_psychiatric_illness == 9999,
      true = NA_real_,
      false = age_onset_psychiatric_illness
    ),
    age_first_hospitalization = if_else(
      age_first_hospitalization %in% c("9999", "0", "05"),
      true = NA_real_,
      false = as.numeric(age_first_hospitalization)
    )
  )

```

# Solutions
Create a new variable that gives the time since the onset of psychiatric illness in years. Then compute the mean time since onset of psychiatric illness for men and women.
```{r, size="tiny"}
data_cobre %>%
  mutate(illness_duration = age - age_onset_psychiatric_illness) %>%
  group_by(gender) %>%
  summarize(mean = mean(illness_duration, na.rm = TRUE),
            .groups = "keep")
```

# Solutions
Compute the median age of onset of psychiatric illness by diagnosis.
```{r, size="tiny"}
data_cobre %>%
  filter(study_group != "control") %>%
  group_by(study_group) %>%
  summarize(age_onset_median = median(age_onset_psychiatric_illness, na.rm = TRUE),
            .groups = "keep")
```


# Thank you for your attention!
If you have any questions on today's content, please feel free to contact me.

- linda.betz@uk-koeln.de

Next week, Julian and Pedro will walk you through a Machine Learning analysis using neuropsychological data.

# References
- R for Data Science: https://r4ds.had.co.nz/
- Open dataset: https://coins.trendscenter.org/ (create an account to download data)
- Photos: https://unsplash.com/
- Comic: https://www.kdnuggets.com/2017/09/cartoon-machine-learning-class.html