---
title: "Model complexity"
author: "Filippo Biscarini"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library("knitr")
library("dplyr")
library("ggplot2")
library("flextable")
library("data.table")
```
## Linear relationships?
From [here](https://www.geo.fu-berlin.de/en/v/soga-r/Basics-of-statistics/Linear-Regression/Polynomial-Regression/Polynomial-Regression---An-example/index.html):
```{r poly_data}
poly_data <- fread("../data/poly_data.txt")
```
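A quick look at the size of the dataset and at the first few records (just a sanity check; the column names are taken directly from the file):
```{r}
dim(poly_data)   ## number of rows and columns
head(poly_data)  ## first few records
```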
We see that the relationship between $y$ and $x$ is not necessarily linear: fitting a straight line might be suboptimal in such cases.
```{r plot_data, echo=FALSE}
ggplot(poly_data, aes(x = x, y = y)) + geom_point()
```
Fortunately, we have (many) tools to deal with non-linearity in the data. One such tool is **polynomial regression**: we add higher-order terms of $x$, e.g. a **quadratic term** or a **cubic term**. This is done in the following way:
$$
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \ldots + \beta_n x^n + e
$$
#### Complexity hyperparameter
This is still a linear model (a linear combination of the parameters), but the polynomial terms add the flexibility needed to capture the non-linearity in the data. To control this, we introduce a **complexity hyperparameter**, the polynomial degree, which determines how far the model departs from simple linear regression (where the degree has its minimum value of 1).
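In R, the `poly()` function builds these higher-order terms for us, and its `degree` argument is exactly this complexity hyperparameter. A minimal sketch on a few made-up values of $x$, just to show what the resulting design matrix looks like:
```{r}
x_demo <- 1:5
## with raw = TRUE, poly() returns the plain powers x, x^2, x^3
poly(x_demo, degree = 3, raw = TRUE)
```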
------------------------------------------------------------------------
Let's see how this works!
We start by splitting the data in a **training set** and a **test set**:
```{r label='split'}
set.seed(13) ## 13; 137; 210
vec <- sample(nrow(poly_data), 4)  ## randomly pick 4 records for the test set
test <- poly_data[vec, -1]         ## the sampled rows (first column dropped)
train <- poly_data[!vec, -1]       ## all remaining rows (data.table "not these rows" selection)
```
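A quick check of the split (4 records in the test set, the rest in the training set):
```{r}
nrow(train)
nrow(test)
```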
### Simple linear regression model
First, we try the usual linear regression model (**degree of complexity = 1**):
```{r}
fit <- lm(y ~ x, data = train)
summary(fit)
```
```{r}
predictions <- predict(fit, test, interval="none", type = "response", na.action=na.pass)
preds = cbind(test, predictions)
print(preds)
```
```{r}
temp <- preds |>
  mutate(squared_error = (y - predictions)^2) |>
  summarise(
    rmse = sqrt(mean(squared_error)),
    r = round(cor(y, predictions, method = "pearson"), 3),
    pct_error = 100 * rmse / abs(mean(train$y))
  )
temp |>
regulartable() |>
autofit()
```
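For reference, the metrics computed above are the root mean squared error over the $n$ test records, the Pearson correlation between observed and predicted values, and the RMSE expressed as a percentage of the (absolute) mean of $y$ in the training set:
$$
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \text{pct\_error} = 100 \cdot \frac{RMSE}{\left|\bar{y}_{\text{train}}\right|}
$$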
We see that the **RMSE is 0.6865** (**230.2%** of the mean of the target variable $y$).
Let's visualize the training data, the fitted regression curve and the predictions vs test values:
```{r}
train$label = "train"
test$label = "test"
temp <- preds |> select(-y) |> rename(y = predictions) |> mutate(label = "prediction")
df <- bind_rows(train,test, temp)
```
```{r}
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(aes(color = label)) +
  scale_color_manual(values = c("red", "blue", "gray"))
p <- p + theme_bw()
p + stat_smooth(data = train, method = "lm", formula = y ~ x, size = 0.5, se = FALSE)
```
### Quadratic model
We now try a quadratic model, i.e. a model where we add the term $x^2$ (**degree of complexity = 2**).
```{r}
fit2 <- lm(y ~ poly(x, 2, raw = TRUE), data = train)
summary(fit2)
```
```{r}
predictions <- predict(fit2, test, interval="none", type = "response", na.action=na.pass)
preds = cbind(test, predictions)
print(preds)
```
```{r}
temp <- preds |>
  mutate(squared_error = (y - predictions)^2) |>
  summarise(
    rmse = sqrt(mean(squared_error)),
    r = round(cor(y, predictions, method = "pearson"), 3),
    pct_error = 100 * rmse / abs(mean(train$y))
  )
temp |>
regulartable() |>
autofit()
```
```{r}
temp <- preds |> select(-y) |> rename(y = predictions) |> mutate(label = "prediction")
df <- bind_rows(train,test, temp)
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(aes(color = label)) +
  scale_color_manual(values = c("red", "blue", "gray"))
p <- p + theme_bw()
p + stat_smooth(data = train, method = "lm", formula = y ~ poly(x, 2, raw=TRUE), size = 0.5, se = FALSE)
```
The fit seems slightly better, but still rather far from the actual data: **RMSE is 0.6454** (**216.4%** of the mean of the target variable $y$).
### Cubic model
We now move on to a model with **degree of complexity = 3**, where we add the cubic term.
::: {style="color:red"}
**Question for you: what does the model look like now?**
:::
```{r}
fit3 <- lm(y ~ poly(x, 3, raw = TRUE), data = train)
summary(fit3)
```
```{r}
predictions <- predict(fit3, test, interval="none", type = "response", na.action=na.pass)
preds = cbind(test, predictions)
print(preds)
```
```{r}
temp <- preds |>
  mutate(squared_error = (y - predictions)^2) |>
  summarise(
    rmse = sqrt(mean(squared_error)),
    r = round(cor(y, predictions, method = "pearson"), 3),
    pct_error = 100 * rmse / abs(mean(train$y))
  )
temp |>
regulartable() |>
autofit()
```
```{r}
temp <- preds |> select(-y) |> rename(y = predictions) |> mutate(label = "prediction")
df <- bind_rows(train,test, temp)
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(aes(color = label)) +
  scale_color_manual(values = c("red", "blue", "gray"))
p <- p + theme_bw()
p + stat_smooth(data = train, method = "lm", formula = y ~ poly(x, 3, raw=TRUE), size = 0.5, se = FALSE)
```
The fit looks much better, and this is reflected in the predictions: **RMSE is 0.2243** (**75.2%** of the mean of the target variable $y$).
### What if we overdo it?
We can keep increasing the complexity of the model: what if we try degree = 10?
```{r}
fitx <- lm(y ~ poly(x, 10, raw = TRUE), data = train)
summary(fitx)
```
```{r}
predictions <- predict(fitx, test, interval="none", type = "response", na.action=na.pass)
preds = cbind(test, predictions)
print(preds)
```
```{r}
temp <- preds |>
  mutate(squared_error = (y - predictions)^2) |>
  summarise(
    rmse = sqrt(mean(squared_error)),
    r = round(cor(y, predictions, method = "pearson"), 3),
    pct_error = 100 * rmse / abs(mean(train$y))
  )
temp |>
regulartable() |>
autofit()
```
```{r}
temp <- preds |> select(-y) |> rename(y = predictions) |> mutate(label = "prediction")
df <- bind_rows(train,test, temp)
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(aes(color = label)) +
  scale_color_manual(values = c("red", "blue", "gray"))
p <- p + theme_bw()
p + stat_smooth(data = train, method = "lm", formula = y ~ poly(x, 10, raw=TRUE), size = 0.5, se = FALSE)
```
The fit went rather astray: **RMSE is 1.6359** (**548.5%** of the mean of the target variable $y$).
::: {style="color:red"}
**Question for you: why do you think this happened?**
:::
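One way to look at the whole picture is to repeat the fit over a range of degrees and compare training and test RMSE. Below is a minimal sketch that reuses the `train` and `test` objects defined above; keep in mind that with only 4 test records the test RMSE is quite noisy, and R may warn about rank-deficient fits at the highest degrees:
```{r}
degrees <- 1:10
rmse_by_degree <- sapply(degrees, function(d) {
  fit_d <- lm(y ~ poly(x, d, raw = TRUE), data = train)
  c(
    train_rmse = sqrt(mean(residuals(fit_d)^2)),                 ## error on the data used for fitting
    test_rmse  = sqrt(mean((test$y - predict(fit_d, test))^2))   ## error on the held-out data
  )
})
data.frame(degree = degrees, t(rmse_by_degree))
```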
------------------------------------------------------------------------
::: {style="color:red"}
**Question for you: how do we choose the best value for the complexity (hyper)parameter?**
:::