Physalia_ML/scripts/7.lasso_with_tidymodels.Rmd at main · JustinePa/Physalia_ML · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
---
title: "Lasso regularization"
author: "Filippo Biscarini"
date: "7/8/2020"
output: html_document
jupyter:
  jupytext:
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.10.3
  kernelspec:
    display_name: R
    language: R
    name: ir
---

```{r}
library("knitr")
library("glmnet")
library("ggplot2")
library("tidyverse")
library("tidymodels")
library("data.table")
```

## Lasso-penalised logistic regression using tidymodels

For this illustration, we will use the metabolomics dataset on NMR spectroscopy metabolite abundances in diabetes patients (with controls):

```{r}
mtbsl1 <- fread("../data/MTBSL1.tsv")
nrow(mtbsl1)
```

```{r}
head(mtbsl1)
```

```{r}
names(mtbsl1)[c(4:ncol(mtbsl1))] <- paste("mtbl",seq(1,ncol(mtbsl1)-3), sep = "_")
print(paste("N. of columns;", ncol(mtbsl1)))
```

We have **132 records** and **189 variables** (191 - 2 (`ID` and `Metabolic_syndrome`)): still, $p > n$. We see now the distribution of classes (control, diabetes) per sex (not dramatically unbalanced):

```{r}
table(mtbsl1$Metabolic_syndrome,mtbsl1$Gender)
```

We see that the metabolites abundances have largely different scales/magnitudes, and it could be a good idea to normalise them before running the model (remember that Lasso constraints the size of the coefficients, and these depend on the magnitude of each variable $\rightarrow$ same $\lambda$ applied to all variables). This is the range between maximum values across metabolites:

```{r}
mm_diab <- mtbsl1 %>%
  gather(key = "metabolite", value = "abundance", -c(`Primary ID`,Gender,Metabolic_syndrome))

group_by(mm_diab, metabolite) %>% summarise(max_each=max(abundance)) %>% summarise(min = min(max_each), max = max(max_each))
```

Below the boxplots of the distributions of values for all metabolites:

```{r}
library("repr")
options(repr.plot.width=14, repr.plot.height=8)

mm_diab %>%
  ggplot(aes(metabolite, abundance, fill = as.factor(metabolite))) +
  geom_boxplot(show.legend = FALSE)
```

### Training and test sets

We now split the data in the training and test sets using `tidymodels` functions: we now split in a stratified way, because we want to keep a similar proportion of cases and controls in both the training and test sets.

```{r}
diab_dt <- select(mtbsl1, -c(`Primary ID`, Gender)) ## remove gender for the moment, and keep only numeric features for convinience
mtbsl1_split <- initial_split(diab_dt, strata = Metabolic_syndrome)
mtbsl1_train <- training(mtbsl1_split)
mtbsl1_test <- testing(mtbsl1_split)
```

### Preprocessing

We remove variables with zero variance (no variability, not informative) and normalise all numeric variables: to do so, we build a preprocessing "recipe", where we specify:

-   the model equation: diabetes/control as a function of all metabolites
-   remove all predictors (non-outcome variables) that don't have variance
-   normalize all numeric predictors (standard deviation of one and a mean of zero)

```{r}
## build a recipe for preprocessing
mtbsl1_rec <- recipe(Metabolic_syndrome ~ ., data = mtbsl1_train) %>%
  # update_role(`Primary ID`, new_role = "ID") %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())

print(mtbsl1_rec)
```

Now we use the `tidymodels` functions `prep()` and `juice()` to obtain the preprocessed training set:

```{r}
mtbsl1_prep <- mtbsl1_rec %>%
  prep(strings_as_factors = FALSE)

print(mtbsl1_prep)
mtbsl1_train <- juice(mtbsl1_prep)
```

```{r}
options(repr.plot.width=14, repr.plot.height=8)
mm_train <- mtbsl1_train %>%
  gather(key = "metabolite", value = "abundance", -c(Metabolic_syndrome))

group_by(mm_train, metabolite) %>% summarise(mean(abundance),sd(abundance))
mm_train %>%
  ggplot(aes(metabolite, abundance, fill = as.factor(metabolite))) +
  geom_boxplot(show.legend = FALSE)
```

## The Lasso model

We start by specifying our Lasso model:

-   it's logistic regression for a classification problem
-   we set the $\lambda$ parameter (penalty) to the arbitrary value of 0.1
-   mixture: amount of L1 regularization, when 1 it's Lasso (0 is Ridge regression)
-   the engine is set to `glmnet`

Then we add everything to a workflow object, piecewise, and fit the Lasso model:

```{r}
lasso_spec <- logistic_reg(mode = "classification", penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")
print(lasso_spec)
```

```{r}
wf <- workflow() %>%
  add_recipe(mtbsl1_rec) %>%
  add_model(lasso_spec)

print(wf)
```

```{r}
lasso_fit <- wf %>%
  fit(data = mtbsl1_train)
```

```{r}
lasso_fit %>%
  pull_workflow_fit() %>%
  tidy()
```

```{r}
lasso_fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  filter(estimate > 0 | estimate < 0)
```

## Tuning the hyperparameters

We use k-fold cross-validation to tune the hyperparameters ($\lambda$ penalty in this case) in the training set:

-   `vfold_cv`: to specify n. of folds (stratified) and replicates
-   `logistic_reg`: to specify that we want a logistic regression model for classification, we want to use Lasso penalization, and we are fine-tuning the penalty parameter
-   `grid_regular`: defines the range of penalty parameter values to try

```{r}
diab_cv <- vfold_cv(mtbsl1_train, v=5, repeats = 10, strata = Metabolic_syndrome)

tune_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lambda_grid <- grid_regular(penalty(), levels = 50, filter = penalty <= .05)
print(lambda_grid)
```

We build a **new workflow** object:

```{r}
wf1 <- workflow() %>%
  add_recipe(mtbsl1_rec) %>%
  add_model(tune_spec) ## remember: the model equation was specified in the recipe (top of this document)
```

We are now ready to **fine-tune the model**!!

```{r}
doParallel::registerDoParallel()

lasso_grid <- tune_grid(
  wf1,
  resamples = diab_cv,
  grid = lambda_grid
)
```

Here we can see the results for each value of the penalty parameter that was tried in the fine-tuning process:

```{r}
lasso_grid %>%
  collect_metrics()
```

Plotting the results will help us see what happened during fine-tuning of the model, and then select the best value for $\lambda$ (the penalty parameter) based on the maximum AUC (binary classification problem):

```{r}
lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err
  ),
  alpha = 0.5
  ) +
  geom_line(size = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme(legend.position = "none")
```

```{r}
best_roc <- lasso_grid %>%
  select_best("roc_auc")

print(best_roc)
```

With the selected penalty parameter from fine-tuning, we can (finally!) finalize our workflow:

```{r}
final_lasso <- finalize_workflow(
  wf1,
  best_roc
)

print(final_lasso)
```

## Testing the model

We are now ready to test our fine-tuned Lasso model on the test partition:

```{r}
lr_res <- last_fit(
  final_lasso,
  mtbsl1_split
)

lr_res %>%
  collect_metrics()
```

```{r}
lr_res %>% collect_predictions() %>%
  group_by(.pred_class,Metabolic_syndrome) %>%
  summarise(N=n()) %>%
  spread(key = ".pred_class", value = N)
```

we can also do it step-by-step:

1.  preprocess the test data
2.  fit the final Lasso model to the training data
3.  make predictions on the test data

```{r}
## 1 preprocess
mtbsl1_testing <- mtbsl1_prep %>% bake(testing(mtbsl1_split)) ## preprocess test data

## 2 model fit
final_lasso_fit <- fit(final_lasso, data = mtbsl1_train) ## fit final model on the training set

## 3 make predictions
final_lasso_fit %>%
  predict(mtbsl1_testing, type = "class") %>%
  bind_cols(mtbsl1_testing) %>%
  # metrics(truth = Metabolic_syndrome, estimate = .pred_class)
  group_by(.pred_class,Metabolic_syndrome) %>%
  summarise(N=n()) %>%
  spread(key = ".pred_class", value = N)
```

```{r}
lr_auc <-
  lr_res %>%
  collect_predictions() %>%
  roc_curve(Metabolic_syndrome,`.pred_Control Group`) %>%
  mutate(model = "Logistic Regression")

autoplot(lr_auc)
```

### Variable importance

Lasso models have a nice side feature: they naturally select variables, based on the shrinking of some coefficients exactly to zero. Based on this, Lasso models can return the importance of the variables used to fit the model:

```{r}
library("vip")

final_lasso %>%
  fit(mtbsl1_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = best_roc$penalty) %>%
  mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  ) %>%
  filter(Importance > 0) %>%
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col() +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y = NULL)
```

Variable importance from `vi`:

`method`: - "model" (the default): for model-specific VI scores (see vi_model for details) - "firm": for variance-based VI scores (see vi_firm for details) - "permute": for permutation-based VI scores (see vi_permute for details) - "shap": for Shapley-based VI scores (see vi_shap for details)

`glmnet`: Similar to (generalized) linear models, the absolute value of the coefficients are returned for a specific model. It is important that the features (and hence, the estimated coefficients) be standardized prior to fitting the model. You can specify which coefficients to return by passing the specific value of the penalty parameter via the lambda argument (this is equivalent to the s argument in coef.glmnet). By default, lambda = NULL and the coefficients corresponding to the final penalty value in the sequence are returned; in other words, you should ALWAYS SPECIFY lambda! For cv.glmnet objects, the largest value of lambda such that the error is within one standard error of the minimum is used by default. For a multinomial response, the coefficients corresponding to the first class are used; that is, the first component of coef.glmnet.

## Add gender (factor) to the Lasso model

To add factors to the Lasso model (engine `glmnet`) we need to use [one hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)

```{r}
diab_dt <- model.matrix(~ ., mtbsl1[,"Gender"]) %>% as_tibble() %>% select(GenderMale) %>% rename(gender = GenderMale) %>%
bind_cols(diab_dt)

head(diab_dt)
```

#### Interactive exercise

We'll now rerun a Lasso-penalised logistic regression model including Gender as feature, going together through the steps involved in Lasso models with `tidymodels`:

1.  splitting the data
2.  preprocessing
3.  hyperparameters tuning
4.  fitting the final model
5.  predictions on the test set

```{r}
## code here

```

```{r}
## code here
```