---
title: "final_exercise"
author: "Filippo Biscarini"
date: "7/9/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library("knitr")
library("ROCit")
library("glmnet")
library("ggplot2")
library("corrplot")
library("tidyverse")
library("tidymodels")
library("data.table")
library("randomForest")
```
## Wrap-up exercise
We now use the metabolomics data on raspberries (genus *Rubus*) for a wrap-up exercise on the use of machine learning algorithms for prediction. Some code is already provided as a hint; however, you will mostly have to write your own code (and make your own choices).
### Reading the data
This is a classification problem: we want to use metabolomics data to classify samples into the color/pigmentation class they belong to.
```{r data}
rubus <- fread("../data/rubusTable.txt")
rubus %>%
group_by(color, variety) %>%
summarise(N=n())
```
### Data preprocessing
We see that we have **three colors** (`R`, `Y`, `P`), and 15 varieties nested within color. Color `P` has only one observation, and we can either remove it or assign it to the pigmented samples (`R`).
```{r rubus, echo=FALSE}
rubus <- na.omit(rubus) ## remove observations with missing values (there are none in this dataset)
rubus[rubus$color == "P","color"] <- "R"
metabolites <- rubus[,-c(1,2)]
corx <- cor(metabolites)
corx[1:10,1:10]
```
The number of pairwise correlations between the $n = 10\,986$ variables is ${n \choose 2}$ = ${10\,986 \choose 2}$ = $60\,340\,605$ (a lot!). Let's see how many are highly correlated:
```{r}
diag(corx) <- 0
num_correlated_vars <- sum(abs(corx)>0.99)/2 ## correlation matrix is symmetric
print(num_correlated_vars)
```
In total, `r num_correlated_vars` pairs of variables have a linear correlation coefficient $> 0.99$. We can consider removing one variable from each pair.
First, however, we split the data into training and test sets.
#### Organisation of the exercise
We will go through this wrap-up exercise in a step-wise manner: the exercise is split into consecutive steps, and you will be asked to complete it **one step at a time**.
After each step, we will review your progress together.
### 1. Training and test sets: do the split!
```{r}
## write here your code
```
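As a hint, one possible way to do the split, sketched with `rsample::initial_split()` (loaded with tidymodels); the 80/20 proportion, the seed, and the object names (`train_set`, `test_set`) are arbitrary choices:

```{r}
set.seed(123)  ## for reproducibility
rubus_split <- initial_split(rubus, prop = 0.8, strata = color)  ## stratify on the class
train_set <- training(rubus_split)
test_set  <- testing(rubus_split)
nrow(train_set); nrow(test_set)
```

Stratifying on `color` keeps the class proportions similar in the two sets, which matters here because the classes are unbalanced.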
### 2. Data preprocessing
First, we remove highly correlated variables (collinearity) from the training set (remember that any preprocessing should be done after splitting the data into training and test sets!).
```{r}
## write here your code
```
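One possible sketch of the filtering, assuming the training and test sets from step 1 are called `train_set` and `test_set` (hypothetical names); the 0.99 threshold mirrors the preprocessing section above:

```{r}
train_mets <- train_set %>% select(-color, -variety)
corx_train <- cor(train_mets)
corx_train[upper.tri(corx_train, diag = TRUE)] <- 0  ## keep each pair only once
to_drop <- colnames(corx_train)[apply(abs(corx_train) > 0.99, 2, any)]
length(to_drop)
train_filtered <- train_set %>% select(-all_of(to_drop))
test_filtered  <- test_set %>% select(-all_of(to_drop))  ## same filter applied to the test set
```

Note that the variables to drop are chosen on the training set only; the same columns are then removed from the test set.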
### 3. Training the classifier
#### Lasso model
```{r}
## write here your code
```
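A sketch with `cv.glmnet()`, assuming the filtered training set from step 2 is `train_filtered` (with `color` and `variety` still among its columns): `alpha = 1` selects the lasso penalty, and since after merging `P` into `R` there are two classes we can use `family = "binomial"`:

```{r}
x_train <- as.matrix(train_filtered %>% select(-color, -variety))
y_train <- as.factor(train_filtered$color)
cv_lasso <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1, nfolds = 10)
plot(cv_lasso)
cv_lasso$lambda.min  ## the value of lambda that minimises the CV error
```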
##### Fit the final model on the training data
We can now fit the final model (tuned with cross-validation) to the training set:
```{r}
## write here your code
```
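For example (assuming `x_train`, `y_train` and the cross-validation object `cv_lasso` from the previous step):

```{r}
lasso_fit <- glmnet(x_train, y_train, family = "binomial", alpha = 1,
                    lambda = cv_lasso$lambda.min)
sum(coef(lasso_fit) != 0)  ## number of non-zero coefficients (metabolites retained + intercept)
```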
##### Make predictions
```{r}
## write here your code
```
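A sketch, assuming the filtered test set from step 2 is `test_filtered` and the fitted model is `lasso_fit`; note that for a binomial glmnet model `type = "response"` returns the probability of the *second* factor level:

```{r}
x_test <- as.matrix(test_filtered %>% select(-color, -variety))
lasso_probs <- predict(lasso_fit, newx = x_test, type = "response")[, 1]  ## P(second level)
lasso_pred  <- ifelse(lasso_probs > 0.5, levels(y_train)[2], levels(y_train)[1])
table(predicted = lasso_pred, observed = test_filtered$color)
```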
#### Random Forest
##### Tuning the model
```{r}
## write here your code
```
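With `randomForest` the main hyperparameter is `mtry`, the number of variables sampled at each split; `tuneRF()` does a simple step-wise search over it. A sketch, assuming `x_train` and `y_train` as defined for the lasso:

```{r}
rf_tune <- tuneRF(x = x_train, y = y_train, ntreeTry = 500,
                  stepFactor = 1.5, improve = 0.01, trace = TRUE, plot = TRUE)
rf_tune  ## matrix of tried mtry values and their OOB errors
```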
##### Fitting the RF model
```{r}
## write here your code
```
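For example, fitting with the `mtry` value that minimised the out-of-bag (OOB) error in the tuning step (assuming the tuning results are in `rf_tune`):

```{r}
best_mtry <- rf_tune[which.min(rf_tune[, "OOBError"]), "mtry"]
rf_fit <- randomForest(x = x_train, y = y_train, mtry = best_mtry,
                       ntree = 500, importance = TRUE)
rf_fit  ## prints the OOB confusion matrix
```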
##### Variable importance
```{r varimp}
## write here your code
```
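For instance, with the built-in importance plot (assuming the fitted forest is `rf_fit`; `importance(rf_fit)` returns the underlying matrix):

```{r}
varImpPlot(rf_fit, n.var = 20, main = "Top 20 metabolites")
```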
##### Predictions
```{r}
## write here your code
```
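A sketch of class predictions and class probabilities on the test set, assuming `rf_fit`, `x_test` and `test_filtered` as above; column 2 of the probability matrix corresponds to the second factor level, as for the lasso:

```{r}
rf_pred  <- predict(rf_fit, newdata = x_test, type = "response")
rf_probs <- predict(rf_fit, newdata = x_test, type = "prob")[, 2]
table(predicted = rf_pred, observed = test_filtered$color)
```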
### 4. Compare models: ROC curves (or anything else)
```{r, label = 'roc'}
## write here your code
```
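A sketch with the ROCit package, assuming the predicted probabilities `lasso_probs` and `rf_probs` from the previous steps and the observed test-set labels in `test_filtered$color`; `rocit()` takes a numeric score and the true class (check that its positive class matches the level the probabilities refer to):

```{r}
roc_lasso <- rocit(score = lasso_probs, class = test_filtered$color)
roc_rf    <- rocit(score = rf_probs,    class = test_filtered$color)
plot(roc_lasso, col = c("red", "grey"), legend = FALSE, YIndex = FALSE)
lines(roc_rf$TPR ~ roc_rf$FPR, col = "blue", lwd = 2)
legend("bottomright", legend = c("lasso", "random forest"),
       col = c("red", "blue"), lwd = 2)
c(AUC_lasso = roc_lasso$AUC, AUC_rf = roc_rf$AUC)
```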