---
title: "R Advanced Libraries"
author: "Pietro Franceschi"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Introduction

This .Rmd file is a concise summary of a set of advanced R libraries and ideas which can be used for the analysis of large-scale datasets.

In particular:

* Pipes (`%>%`)
* Tabular data
  * `data.frame`, `tibble`, `data.table`
  * data carpentry (`tidyverse`, `data.table`)
  * *long* and *wide* tables
  * modeling (`broom`)
* Writing R functions (maybe ...)
* Vectorizing operations (`purrr`)


```{r}
library(tidyverse)     ## the full tidyverse ecosystem for seamless working with tables
library(broom)         ## a broom to tidy the outcomes of modelling
library(data.table)    ## a less flexible (but faster) approach to the manipulation of tabular data
```


# Piping

The overall idea behind piping is to make a chain of function calls easy to read (and to write). The pipe `%>%` was introduced in the `magrittr` package and is now a core tool of the _tidyverse_. Since R 4.1.0, a native pipe `|>` is also included in base R. Here we will stick to the `tidyverse` flavour.

```{r}
## log10 of the sequence going from 1 to the square root of 100
one_old <- log10(seq(1, sqrt(100)))

## The old-style code has to be read in an "onion" fashion, from the inside out

one_pipe <- sqrt(100) %>%  ## calculate the square root of 100
  seq(1,.) %>%  ## make the sequence from 1 to that value
  log10(.)      ## take the log

## one_old and one_pipe are exactly equivalent, but the second is far easier to read
## in addition, when used in data manipulation tasks, pipes avoid the need to save intermediate objects

```

When putting a function in a pipe, you refer to "what is coming from the pipe" with a dot `.`

```{r}
## Pipes can be used also to produce plots!

data(iris) ## iris dataset

## note the dot, which refers to the iris data.frame reaching the pairs function from the pipe
iris %>%
  pairs(., col = factor(iris$Species), pch = 19)
```
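
As a point of comparison, here is a minimal sketch of the same chain written with the native pipe `|>` (assuming R 4.1.0 or later). The native pipe has no `.` placeholder, so the piped value lands in the first free argument; naming `from` is one way to steer it into `to`. The object names are made up for this illustration.

```{r}
library(magrittr)   ## provides %>% (also attached by library(tidyverse))

## magrittr flavour: the dot marks where the piped value goes
pipe_magrittr <- sqrt(100) %>%
  seq(1, .) %>%
  log10()

## native flavour (R >= 4.1.0): the piped value fills the first free argument, here `to`
pipe_native <- sqrt(100) |>
  seq(from = 1) |>
  log10()

## both chains should give the same vector
identical(pipe_magrittr, pipe_native)
```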

### Advantages

* Clear writing
* No need to clutter the workspace with intermediate objects


### Disadvantages

* No big disadvantages, although I normally rely on pipes when I use the console, while in programming tasks I prefer the "old" onion approach.


# Tabular Data

In R, tabular data are commonly handled with three different classes of objects:

* `data.frame`: old, faithful and the father of almost everything else. A data frame is a list.
* `data.table`: basically still a data.frame, which has been optimized for efficiency.
* `tibble`: the tidyverse form of the data.frame, less efficient than data.table but more flexible, since it is integrated in the tidy environment.

Both data.tables and tibbles retain the characteristics of data.frames (indexing with square brackets, possible use of `$` to get the columns). Importantly, in both cases the *row.names* attribute has been removed, and its use is discouraged even if it is sometimes useful ;-)

These three boxes of code benchmark the efficiency of the three solutions in reading a relatively big dataset (35 MB).

```{r}
## Base R
system.time(read.csv("athlete_events.csv"))
```

```{r}
# data.table
system.time(fread("athlete_events.csv"))
```

```{r}
# tidyverse
system.time(read_csv("athlete_events.csv"))
```

As can be seen, data.table is almost ten times faster than base R, while the tidyverse sits somewhere in the middle.
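
If `athlete_events.csv` is not at hand, the same comparison can be sketched on a synthetic file. Everything below (the toy table, its size, the temporary file) is an assumption made for illustration, and the timings will differ from machine to machine, so no output is reported here.

```{r}
library(data.table)
library(readr)

## build a hypothetical table and write it to a temporary csv file
set.seed(1)
toy_table <- data.frame(id = seq_len(2e5),
                        value = rnorm(2e5),
                        group = sample(letters, 2e5, replace = TRUE))
toy_csv <- tempfile(fileext = ".csv")
write.csv(toy_table, toy_csv, row.names = FALSE)

## elapsed time (in seconds) of the three readers on the same file
toy_timings <- c(
  base       = system.time(toy_base <- read.csv(toy_csv))[["elapsed"]],
  data.table = system.time(toy_dt <- fread(toy_csv))[["elapsed"]],
  readr      = system.time(toy_tb <- read_csv(toy_csv, show_col_types = FALSE))[["elapsed"]]
)
toy_timings
```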

tibbles and data.tables also have an improved print method, which gives a more compact and readable view of the content of the table.

```{r}
## read the three tables

baseR <- read.csv("athlete_events.csv")
datat <- fread("athlete_events.csv")
tidyv <- read_csv("athlete_events.csv")
```


```{r}
## this is the data.table printout
datat
```

```{r}
## this is the tidyverse printout
tidyv
```

# Data carpentry in `tidyverse` and `data.table`

Data tables and tibbles are very useful when you are manipulating tabular data, in what is called **data carpentry**.

## Filtering rows on condition

It is helpful to summarize the data.table slicing approach:

`DT[i, j, by]`
“Take DT, subset rows using i, then calculate j grouped by *by*”
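
A minimal sketch of a call which fills all three slots at once, using the iris data (the object names are chosen only for this example):

```{r}
library(data.table)

slot_dt <- as.data.table(iris)

## i:  keep the rows which are not setosa
## j:  compute the mean sepal length
## by: do it separately for each remaining species
slot_demo <- slot_dt[Species != "setosa",
                     list(mean_length = mean(Sepal.Length)),
                     by = Species]
slot_demo
```

The result has one row per group found in the filtered rows. The sections below go through the three slots one at a time.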

```{r}
## filtering with data.table
iris %>%
  data.table(.) %>%  ## this is needed to transform the iris data.frame to a data.table
  .[Species %in% c("setosa","versicolor"),]

## Note: here I'm fitting a data.table slicing into a tidyverse-like piping style. The . before the square bracket refers to the
## data.table coming from the pipe

```

The syntax of tidyverse is slightly more verbose but extremely easy to read


```{r}
iris %>%
  tibble() %>%         ## here I'm casting my data frame into a tibble
  filter(Species == "setosa") ## the function filter selects rows on condition

## Note: the tibble() call is not strictly necessary, since dplyr verbs such as filter()
## also accept a plain data.frame (which is then returned as a data.frame, not as a tibble)
```


In the large majority of situations data.tables and tibbles can be used interchangeably, but remember that everything in data.table has been optimized for speed. This is true for calculations, selections and sorting.
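
As a small check of this interchangeability, the sketch below applies the same two conditions with both syntaxes and compares the rows that survive. The threshold of 5.5 is arbitrary and the object names are made up for the example.

```{r}
library(data.table)
library(dplyr)

## data.table: conditions are combined with & in the i slot
rows_dt <- as.data.table(iris)[Species == "setosa" & Sepal.Length > 5.5]

## tidyverse: conditions separated by commas in filter() are combined with "and"
rows_tb <- iris %>%
  filter(Species == "setosa", Sepal.Length > 5.5)

## the same rows are selected by the two approaches
identical(rows_dt$Sepal.Length, rows_tb$Sepal.Length)
```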

## Selecting columns

If you consider the previous general syntax, you see that selecting columns in dt is fast. Remember the comma!
```{r}
## Extracting several columns as a data.table
iris %>%
  data.table(.) %>%
  .[,c("Sepal.Width","Species")]

## Extracting can also be done by using a list of columns
iris %>%
  data.table(.) %>%
  .[,list(Sepal.Width,Species)]


## extracting one column as a vector. Beware, if you use the previous syntax
## you will get a data.table
iris %>%
  data.table(.) %>%
  .[,Species]
```

Obviously, the selection of many columns can also be performed by using another dot: in data.table, `.()` is an alias for `list()` ... just to make life clearer ...


```{r}
## Extracting several columns as a data.table
iris %>%
  data.table(.) %>%
  .[,.(Species,Sepal.Width)]
```

Let's call this, *dot deluge* ... ;-)

In tidyverse the extraction of a single column as a vector is performed by the `pull` function, while the selection of one or more columns resulting in a smaller tibble is performed by the `select` function.


```{r}
## pull one column as vector
iris %>%
  pull(Species)


## select two columns and return a tibble
iris %>%
  tibble(.) %>%
  select(Sepal.Length,Species)

```
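
A common variation is to select columns whose names are stored in a character vector. The sketch below shows one way to do it in each syntax; the vector `wanted` is a hypothetical name introduced only for this example.

```{r}
library(data.table)
library(dplyr)
library(tibble)

wanted <- c("Sepal.Width", "Species")   ## hypothetical vector of column names

## data.table: the .. prefix tells DT to look for `wanted` outside the table
sel_dt <- as.data.table(iris)[, ..wanted]

## tidyverse: all_of() plays the same role inside select()
sel_tb <- iris %>%
  as_tibble() %>%
  select(all_of(wanted))

identical(names(sel_dt), names(sel_tb))
```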

An interesting and useful characteristic of `select` is the possibility of using a series of selection helpers to identify columns on the basis of their properties. See the help of `select` for a more detailed description.

```{r}
## extract all the columns with a name starting with Sepal
iris %>%
  tibble() %>%
  select(starts_with("Sepal"))

## interesting! getting only numeric columns
iris %>%
  tibble() %>%
  select(where(~is.numeric(.x)))


## The syntax ~ something is a shorthand for function(.x) ....
iris %>%
  tibble() %>%
  select(where(function(c) is.numeric(c)))

```

Unfortunately we have another dot ...

Note: the notation `~is.numeric(.x)` may look weird. It is a special shortcut to construct anonymous functions: tidyverse functions transform formulas starting with `~` into functions.

There are shorthands to refer to the arguments. For functions with one argument you can use the dot (`.`) or `.x`; for functions with two arguments, `.x` and `.y`; for an arbitrary number of arguments, `..1`, `..2`, `..3`, etc.

So in our case, the following four constructs are equivalent:

```{r}
iris %>%
  select(where(function(c) is.numeric(c)))

iris %>%
  select(where(~is.numeric(.x)))

iris %>%
  select(where(~is.numeric(.)))

iris %>%
  select(where(~is.numeric(..1)))


```
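
The placeholders for two or more arguments do not show up with `where`, which calls its function with one column at a time. A minimal sketch with the `purrr` mappers (introduced later in this document) illustrates them; the three small vectors are made up for the example.

```{r}
library(purrr)

lam_a <- c(1, 2, 3)
lam_b <- c(4, 5, 6)
lam_d <- c(7, 8, 9)

## one argument: .x (or simply .)
one_arg <- map_dbl(lam_a, ~ .x^2)

## two arguments: .x and .y
two_args <- map2_dbl(lam_a, lam_b, ~ .x + .y)

## any number of arguments: ..1, ..2, ..3, ...
many_args <- pmap_dbl(list(lam_a, lam_b, lam_d), ~ ..1 + ..2 * ..3)

one_arg    ## 1 4 9
two_args   ## 5 7 9
many_args  ## 29 42 57
```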

## Creating new columns or mutating existing ones

Creating new columns on the basis of the ones present in our dataset is one of the most useful and common tasks of data carpentry.
Looking at it in the abstract, mutating the content of an existing column also fits the previous reasoning: I'm creating a new column with a name which is identical to the old one ...

In DT, new columns are created by using the `:=` operator in the second "placeholder" (`j`) of the call
```{r}
## Create a column with the ratio between sepal length and sepal width
new_dt_col <- iris %>%
  data.table(.) %>%
  .[,myratio := Sepal.Length/Sepal.Width]


## Create multiple columns

new_dt_col <- iris %>%
  data.table(.) %>%
  .[,c("myratio","myratio1") := list(Sepal.Length/Sepal.Width, Petal.Length/Petal.Width)]

```
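
One point worth keeping in mind is that `:=` modifies the data.table in place, without making a copy. In the chunk above this is harmless, because `data.table(.)` already builds a new object from `iris`. The sketch below, with object names invented for the illustration, shows what happens when the same table is reachable under two names, and how `copy()` avoids it.

```{r}
library(data.table)

ref_dt    <- as.data.table(iris)
ref_alias <- ref_dt         ## NOT a copy: both names point to the same table
ref_copy  <- copy(ref_dt)   ## an explicit, independent copy

## add a column by reference
ref_dt[, myratio := Sepal.Length / Sepal.Width]

"myratio" %in% names(ref_alias)  ## TRUE: the alias sees the new column
"myratio" %in% names(ref_copy)   ## FALSE: the copy is untouched
```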


In TB there is a specific function, `mutate`, which can be piped to create or manipulate the columns


```{r}
## here the creation!
new_tb_col <- iris %>%
 # tibble(.) %>%       ## optional step: it would cast the data.frame into a tibble
  mutate(myratio = Sepal.Length/Sepal.Width,
         myratio2 = Petal.Length/Petal.Width)

## multiple columns can be created inside the same mutate call by using commas

```

As usual the DT syntax is more compact, while the TB syntax is easier to read. But DT is by far more efficient!

In tidyverse, the combination of selectors and `mutate` can be used to apply a transformation to a whole set of columns. To do that, `mutate` has to be combined with `across`. The following example shows the idea:

```{r}
## Suppose I want to calculate the logarithm of all the numeric columns in the iris dataset ...

iris %>%
  mutate(across(where(~is.numeric(.x)),~log10(.x), .names = "log_{.col}"))

## Here:
## where is used to select the columns which are numeric
## across is used to mutate on all these columns
## and there is a beautiful ~ and . deluge ;-)
## the .names argument allows you to specify a set of new names ... {.col} refers to the old names ...

```

```{r}
iris %>%
  mutate(across(where(~is.numeric(.x)),~log10(.x)))
```
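
For comparison, a possible data.table counterpart of the `across` example is sketched below. It relies on `.SD` (the subset of data for the current call) restricted by `.SDcols`; the helper vectors `num_cols` and `log_cols` are names introduced only for this illustration.

```{r}
library(data.table)

sd_dt <- as.data.table(iris)

## names of the numeric columns and of the new columns to be created
num_cols <- names(sd_dt)[sapply(sd_dt, is.numeric)]
log_cols <- paste0("log_", num_cols)

## .SD holds the columns listed in .SDcols; lapply() transforms them one by one
sd_dt[, (log_cols) := lapply(.SD, log10), .SDcols = num_cols]

names(sd_dt)
```

The parentheses around `log_cols` tell data.table to use the content of the vector as column names, rather than creating a single column called `log_cols`.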


## Perform operations on subgroups of samples (lines)

The last type of operations I want to touch on in this flyby are the ones meant to calculate some quantity from groups of samples (rows).
This is handy when you want to calculate summary statistics over a large table of samples.


In DT this operation is performed by combining what we have done before with the `by` argument

```{r}
## Calculate the average sepal length for each of the three species

## Summarising the output as a data.table
iris_mean_dt <- iris %>%
  data.table(.) %>%
  .[,list(myavg = mean(Sepal.Length)), by = Species]


## Creating a new column with the group averages "recycled", i.e. the column of averages has full length
iris_mean_newcol <- iris %>%
  data.table(.) %>%
  .[,myavg := mean(Sepal.Length), by = Species]

```


In the case of TB, "by group" operations are performed by using the `group_by` function, often combined with `summarise`

```{r}
## the same kind of grouped summary as above (here on Sepal.Width) ...
iris %>%
  group_by(Species) %>%
  summarise(mymean = mean(Sepal.Width), sd = sd(Sepal.Width))

## note that I have here two summary functions
```
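
`summarise` can also be combined with `across`, in the same way as `mutate`. The following sketch computes the mean of every numeric column for each species and adds the group size with `n()`; the object name is arbitrary.

```{r}
library(dplyr)

iris_group_means <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean),   ## one mean per numeric column
            nsamples = n())                    ## n() counts the rows of each group

iris_group_means
```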

`group_by` can also be combined with `mutate()` to mirror the "recycling" behavior of DT

```{r}
## here, for example, I'm adding a column with the number of samples for each group.

iris %>%
  group_by(Species) %>%
  mutate(nsamples = length(Species))
```


## Setting the Stage: the Olympic Games dataset

The dataset we will be using contains information on the participants in the Olympic Games.

```{r}
## read the data
dat <- read_csv("athlete_events.csv")

## these are the variables
head(dat)
```

* the name of the athlete
* some characteristics
* the team he/she was competing for
* the code of the team
* information about the games
* the sport
* the actual race
* did he/she get a medal?


## Practical #1

Get the Olympic Games dataset and:

* focus on athletics
* find the athlete(s) participating in the largest number of events
* create a new dataset which contains the body mass index of the athletes
* find the top athletes in terms of number of medals in the history of the Olympic Games, both overall and per year


## Correction of Practical 1

```{r}
## read the data
dat <- read_csv("athlete_events.csv")
```

```{r}
## Athletes participating in the largest number of events in athletics
dat %>%
  filter(Sport == "Athletics") %>%  ## here I wanted to focus only on people doing athletics
  group_by(Name) %>%       ## group by athlete: summarise will return one row per name
  summarise(nevents = length(NOC)) %>%
  arrange(desc(nevents))  %>%        ## arrange orders the output, ascending by default (or descending with desc)
  slice(1:10)
```
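
The `group_by` + `summarise` + `arrange` pattern used above is common enough that `dplyr` offers `count()` as a shortcut. The sketch below shows it on a tiny invented table (the names and rows are hypothetical, not taken from the Olympic dataset).

```{r}
library(dplyr)
library(tibble)

## hypothetical toy data: one row per participation
toy_events <- tibble(
  Name  = c("Ann", "Ann", "Ann", "Bob", "Bob", "Cleo"),
  Sport = c("Athletics", "Athletics", "Athletics", "Athletics", "Swimming", "Athletics")
)

## count() groups, counts the rows and (with sort = TRUE) orders the result in one step
toy_counts <- toy_events %>%
  filter(Sport == "Athletics") %>%
  count(Name, sort = TRUE, name = "nevents")

toy_counts
```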

```{r}
## Calculate the BMI
dat <- dat %>%
  mutate(bmi = Weight/((Height/100)^2))
```
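
The formula above is the usual body mass index, weight divided by the square of the height in metres. Dividing `Height` by 100 assumes that heights are recorded in centimetres and weights in kilograms. A quick sanity check on two invented athletes:

```{r}
library(dplyr)
library(tibble)

## hypothetical athletes, used only to check the arithmetic
toy_athletes <- tibble(
  Name   = c("athlete_a", "athlete_b"),
  Height = c(175, 190),   ## assumed to be in cm
  Weight = c(70, 90)      ## assumed to be in kg
) %>%
  mutate(bmi = Weight / ((Height / 100)^2))

toy_athletes
## for athlete_a: 70 / 1.75^2 = 70 / 3.0625, roughly 22.9
```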


```{r}
# find the top athletes in terms of absolute number of medals in the history of the Olympic Games
dat %>%
  filter(!is.na(Medal)) %>%      ## keeping only rows where Medal is not NA
  filter(Sport == "Athletics") %>%
  group_by(Name) %>%
  summarise(nevents = length(NOC)) %>%   ## every remaining row is a medal, so this counts medals
  ungroup() %>%
  arrange(desc(nevents)) %>%
  slice_head(n = 10)

## per year
dat %>%
  filter(!is.na(Medal)) %>%
  filter(Sport == "Athletics") %>%
  group_by(Name, Year) %>%
  summarise(nevents = length(NOC)) %>%
  ungroup() %>%
  arrange(desc(nevents)) %>%
  slice_head(n = 10)
```

We were discussing whether there are athletes who got medals in different sports ...


```{r}
dat %>%
  filter(!is.na(Medal)) %>%
  group_by(Name) %>%
  summarise(n_sport = length(unique(Sport)),
            what_sport = paste(unique(Sport), collapse = ";")) %>%
  filter(n_sport > 1)  %>%
  arrange(desc(n_sport))
```


## Brooming models (`broom`)

`broom` is a package installed with the tidyverse (but loaded separately, as done at the top of this file) which reformats the output of a model (actually of many classes of models, tests, etc.), making it tidier and easier to use.

```{r}
## this is an ignorant model of the iris dataset
ing_model <- lm(Sepal.Width ~ Sepal.Length, data = iris)
```

The standard way to look at this model is by using the `summary` function
```{r}
summary(ing_model)
```

There is nothing particularly bad with this output, but it is textual ... to get the "numbers" out one has to dig into the model (either by hand or using methods like `coefficients`), and the structure of different models is often not consistent ...

`broom` cleans up this output, squeezing it into a tibble

```{r}
tidy(ing_model)
```
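
Besides `tidy`, `broom` provides two other verbs which look at the same model from different angles: `glance` and `augment`. A minimal sketch on the same kind of model (the object names are chosen for this example only):

```{r}
library(broom)

width_model <- lm(Sepal.Width ~ Sepal.Length, data = iris)

coef_tbl <- tidy(width_model)      ## one row per coefficient
fit_tbl  <- glance(width_model)    ## one row per model (R squared, AIC, ...)
obs_tbl  <- augment(width_model)   ## one row per observation (fitted values, residuals, ...)

dim(coef_tbl)
dim(fit_tbl)
dim(obs_tbl)
```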

Up to this point the advantage is not completely clear, but be patient ...


## Vectorizing operations (`purrr`)

Those of you with some experience with "old" R know that _for loops_ should be avoided like the plague when you want efficient code (nowadays mainly for historical reasons).

The base way to skip them is to use _functionals_ like `apply`, `lapply`, `sapply`, `vapply`, `mapply`, ...

These functions (yes, they are functions!) apply a function over the elements of an iterable object, often a list.

To show how this works in a rather "modern" setting I will introduce a new tidyverse trick: the possibility of constructing a tibble with a column containing other tibbles.

```{r}
## I'm nesting a tibble with three categories in a `nested` tibble
iris_nest <- iris %>%
  nest(data = !c(Species))  ## this line says: please nest all columns different from `Species` in a new column called data

iris_nest
```

The fact that you can nest lists is not surprising: since tibbles are lists, here we are simply saying that we constructed a list of lists.

```{r}
## the first sub-table of my data list (which is an element of my iris_nest list ...)
iris_nest$data[[1]]
```

The overall structure of the object clarifies the nested nature of the data ...

```{r}
str(iris_nest)
```

Ok, now let's suppose we would like to calculate the dimensions of the three sub-tables by using `lapply`

```{r}
## lapply applies a function to each element of a list and returns a list
lapply(iris_nest$data, function(e) dim(e))
```

Voila! Fast, efficient ... but not very clear to read

To clean up all that, in the tidyverse world you can rely on the functions of the `purrr` package. What I will show here is the basic idea (so the advantage in readability in comparison to base R is not enormous), but `purrr` contains a bunch of functions which can really improve your R experience.

The analogue of `lapply` in `purrr` is `map`

```{r}
map(iris_nest$data, ~dim(.x))

## Note here that we can use the compact syntax of functions
```

In the package you will find a plethora of mappers (`map`, `map2`, `pmap`, `walk`, `map_dbl`, ...)
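
The typed variants return a plain vector of a given type instead of a list, which is often what you want when each element produces a single value. A short sketch on a nested version of iris (rebuilt here under a different name so that the chunk stands on its own):

```{r}
library(dplyr)
library(tidyr)
library(purrr)

nested_demo <- iris %>%
  nest(data = !c(Species))

## map_int returns an integer vector: number of rows of each sub-table
n_rows <- map_int(nested_demo$data, nrow)

## map_dbl returns a double vector: mean sepal length of each sub-table
mean_len <- map_dbl(nested_demo$data, ~ mean(.x$Sepal.Length))

## map2_chr walks along two inputs in parallel and returns a character vector
group_labels <- map2_chr(as.character(nested_demo$Species), n_rows,
                         ~ paste0(.x, ": ", .y, " rows"))

n_rows
mean_len
group_labels
```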


Now let's try to combine tibbles and `purrr`. The objective is to organize the output in a tidy way

```{r}
## the syntax here is self explanatory ...
iris_nest <- iris_nest %>%
  mutate(mydim = map(data,~dim(.x)))
```

But this type of approach becomes really handy if you combine it with modelling and brooming.

Suppose we want to make an ignorant model for each of the three species and get the coefficients out in a clean and readable way ...

```{r}
iris_nest <- iris_nest %>%
  mutate(models = map(data, ~lm(Sepal.Length ~ Sepal.Width, data = .x))) %>%     ## here I'm mapping the modelling
  mutate(coefficients = map(models, ~tidy(.x))) %>%                              ## here I'm mapping the tidying
  unnest(coefficients)                                                          ## here I'm unnesting the coefficients

## note that after unnesting, the data and the models are repeated on every coefficient row ... so normally I would create a new tibble
## with the coefficients, discarding the redundant columns

```


```{r}
iris %>%
  nest(data = !c(Species)) %>%
  mutate(models = map(data, ~lm(Sepal.Length ~ Sepal.Width, data = .x))) %>%     ## here I'm mapping the modelling
  mutate(coefficients = map(models, ~tidy(.x))) %>%                              ## here I'm mapping the tidying
  unnest(coefficients) %>%
  select(-c("data","models")) %>%
  write_csv(file = "test.csv")


```
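
The same machinery works with the other `broom` verbs. As a sketch, mapping `glance` instead of `tidy` gives one row of fit statistics per species; the object name is arbitrary, and only two of the columns returned by `glance` are kept.

```{r}
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

species_fit <- iris %>%
  nest(data = !c(Species)) %>%
  mutate(models  = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)),  ## one model per species
         quality = map(models, glance)) %>%                                  ## one-row summary per model
  unnest(quality) %>%
  select(Species, r.squared, p.value)

species_fit
```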


## Wide to long

In the previous step we have been modeling sample groups ... could I use the same approach to model the separate variables?

To do that I first have to reshape my data in "long" format. This is done in tidyverse by the `pivot_longer` function

```{r}

## make iris long
iris_long <- iris %>%
  pivot_longer(!(Species), names_to = "mynames", values_to = "myvalues")


## The previous command pivots all columns which are not `Species` ... Remember that here I can happily use all the selectors
## we introduced when using select ...

iris_long
```
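
The opposite reshaping, from long back to wide, is done by `pivot_wider`. To make the round trip possible each flower needs an identifier, because otherwise there is nothing telling which four measurements belong to the same row. The `id` column below is an addition made only for this sketch.

```{r}
library(dplyr)
library(tidyr)

## long format, keeping a row identifier (hypothetical `id` column)
iris_long_id <- iris %>%
  mutate(id = row_number()) %>%
  pivot_longer(!c(Species, id), names_to = "mynames", values_to = "myvalues")

## back to wide: one column per variable name, one row per id
iris_wide_again <- iris_long_id %>%
  pivot_wider(names_from = mynames, values_from = myvalues)

dim(iris_long_id)      ## 600 rows: 150 flowers x 4 variables
dim(iris_wide_again)   ## 150 rows again
```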


But now we have a column with the variable names! ...

So I can nest the long DF and switch on my modeling machinery ;-)

Suppose, for example, that you want to use `t.test` to compare all the properties of virginica and versicolor flowers


```{r}
iris %>%
  filter(Species != "setosa") %>%
  droplevels() %>%
  pivot_longer(!(Species), names_to = "mynames", values_to = "myvalues") %>%
  nest(data = !mynames) %>%
  mutate(mytest = map(data, function(t) t.test( x =t %>% filter(Species == "versicolor") %>% pull(myvalues),
                                                y =t %>%  filter(Species == "virginica") %>% pull(myvalues))
                      )
         ) %>%
  mutate(coeff = map(mytest,~tidy(.x))) %>%
  unnest(coeff) %>%
  mutate(is_sig = ifelse(p.value < 0.05,"sig","non_sig")) %>%
  ggplot() +
  geom_col(aes(x = mynames, y = estimate1, fill = is_sig)) +
  theme_light()

```
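
To read the plot it helps to look at the table behind it. The sketch below runs the same tests, written with the formula interface of `t.test` instead of the two `filter` calls (an alternative chosen here for brevity), and keeps a few of the columns returned by `tidy`: `estimate1` and `estimate2` are the means of the two groups, taken in the order of the factor levels (versicolor, then virginica), and `estimate` is their difference.

```{r}
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

ttest_table <- iris %>%
  filter(Species != "setosa") %>%
  droplevels() %>%
  pivot_longer(!Species, names_to = "mynames", values_to = "myvalues") %>%
  nest(data = !mynames) %>%
  mutate(mytest = map(data, ~ t.test(myvalues ~ Species, data = .x)),  ## one test per variable
         coeff  = map(mytest, tidy)) %>%
  unnest(coeff) %>%
  select(mynames, estimate, estimate1, estimate2, p.value)

ttest_table
```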