Commit 1ec4a52

final edits to match everything up
1 parent d075cd3 commit 1ec4a52

14 files changed · +1286 −769 lines changed

book-en/02-introduction.Rmd

Lines changed: 12 additions & 12 deletions
@@ -20,20 +20,21 @@ major assumptions:
A linear model can sometimes accommodate certain types of non-linear responses (e.g. $x^2$), but this approach strongly relies on decisions that can be either arbitrary or well-informed, and is much less flexible than using an additive model. For example, this linear model with multiple predictors can handle a non-linear response, but quickly becomes difficult to interpret and manage:

$$y_i = \beta_0 + \beta_1x_{1,i} + \beta_2x_{2,i} + \beta_3x_{3,i} + \ldots + \beta_kx_{k,i} + \epsilon_i$$
+
Linear models work very well in certain specific cases where all these criteria are met:

```{r, fig.align = 'center', out.width = '70%', echo = FALSE, purl = FALSE}
knitr::include_graphics("images/linreg.png")
```

-In reality, we often cannot meet these criteria. This means that in many cases, linear models are inappropriate:
+In reality, we often cannot meet these criteria. In many cases, linear models are inappropriate:

```{r, fig.align = 'center', out.width = '100%', echo = FALSE, purl = FALSE}
knitr::include_graphics("images/linreg_bad.png")
```

So, how can we fit a better model? To answer this question, we must first consider what the regression model is trying to do. The linear model is trying to fit the best __straight line__ that passes through the middle of the data, _without __overfitting___ the data, which is what would happen if we simply drew a line between each point and its neighbours.

In the same way, additive models fit a curve through the data, while controlling the ___wiggliness___ of this curve to avoid overfitting. This means additive models like GAMs can capture non-linear relationships by fitting a smooth function through the data, rather than a straight line. We will come back to the concept of ___wiggliness___ later!

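To make the point above concrete, here is a minimal sketch (simulated data; all variable names are hypothetical) of how polynomial terms must be hand-picked one by one in a linear model:

```r
# Simulated example: each non-linear feature needs its own term,
# and the choice of powers is an arbitrary modelling decision
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x - 0.08 * x^2 + rnorm(100, sd = 0.5)

poly_model <- lm(y ~ x + I(x^2))  # add I(x^3), I(x^4), ... as needed
summary(poly_model)
```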
@@ -61,13 +62,13 @@ linear_model <- gam(Sources ~ SampleDepth, data = isit2)
summary(linear_model)
```

-The linear model is explaining quite a bit of variance in our dataset ($R_{adj}$ = 0.588), which means it's a pretty good model, right? Well, let's take a look at how our model fits the data:
+The linear model is explaining quite a bit of variance in our dataset ( $R_{adj}$ = 0.588), which means it's a pretty good model, right? Well, let's take a look at how our model fits the data:

```{r}
data_plot <- ggplot(data = isit2, aes(y = Sources, x = SampleDepth)) +
  geom_point() +
  geom_line(aes(y = fitted(linear_model)),
            colour = "red", size = 1.2) +
  theme_bw()
data_plot
```
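As a side note (a sketch, assuming `mgcv` and `isit2` are loaded as in this chapter): a `gam()` call with no `s()` terms is just a linear model, so it reproduces the `lm()` fit exactly:

```r
lm_fit  <- lm(Sources ~ SampleDepth, data = isit2)
gam_fit <- gam(Sources ~ SampleDepth, data = isit2)

coef(lm_fit)
coef(gam_fit)  # same intercept and slope as lm()
```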
@@ -94,7 +95,7 @@ where $y_i$ is the response variable, $x_i$ is the predictor variable, and $f$ is
Importantly, given that the smooth function $f(x_i)$ is non-linear and local, the magnitude of the effect of the explanatory variable can vary over its range, depending on the relationship between the variable and the response.

That is, as opposed to one fixed coefficient $\beta$, the function $f$ can continually change over the range of $x_i$.
The degree of smoothness (or wiggliness) of $f$ is controlled using penalized regression determined automatically in `mgcv` using a generalized cross-validation (GCV) routine [@wood_2006].
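For readers who want to see the penalty at work, the estimated smoothing parameter can be inspected on any fitted `gam` object (a sketch, assuming `isit2` as above; the chapter's own model objects may be named differently):

```r
smooth_fit <- gam(Sources ~ s(SampleDepth), data = isit2)
smooth_fit$sp       # estimated smoothing parameter(s) lambda
summary(smooth_fit) # the EDF of s(SampleDepth) reflects this penalty
```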
@@ -122,7 +123,7 @@ Recall: As opposed to one fixed coefficient $\beta$ in a linear model, the function $f$

:::

-The `mgcv` package also includes a default plot to look at the smooths:
+The `mgcv` package also includes a default `plot()` function to look at the smooths:

```{r}
plot(gam_model)
```
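The default plot can be dressed up with a few optional arguments of `plot.gam()` (a sketch; see `?plot.gam` for the full list):

```r
plot(gam_model,
     residuals = TRUE,  # overlay partial residuals
     pch = 1,           # plotting symbol for the residuals
     shade = TRUE,      # shaded confidence band instead of dashed lines
     seWithMean = TRUE) # include uncertainty about the overall mean
```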
@@ -144,7 +145,7 @@ AIC(linear_model, smooth_model)
Here, the AIC of the smooth GAM is lower, which indicates that adding a smoothing function improves model performance. Linearity is therefore not supported by our data.

:::explanation
As a brief explanation, the Akaike Information Criterion (AIC) is a comparative metric of model performance, where lower scores indicate that a model is performing "better" compared to other considered models.
:::

## Challenge 1
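Because AIC is only meaningful in comparison, it can be easier to read as differences from the best model (a sketch using the objects above; the ~2-unit threshold is a common rule of thumb, not a hard cutoff):

```r
aic_tab <- AIC(linear_model, smooth_model)
aic_tab$delta <- aic_tab$AIC - min(aic_tab$AIC)  # delta-AIC per model
aic_tab  # a delta greater than ~2 is usually read as meaningful support
```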
@@ -165,7 +166,7 @@ We have not discussed effective degrees of freedom (**EDF**) yet, but these are

```{r, echo = FALSE, include = FALSE}
# Challenge 1 ----
#
# 1. Fit a linear and smoothed GAM model to the relation between `SampleDepth` and `Sources`.
# 2. Determine if linearity is justified for this data.
# 3. How many effective degrees of freedom does the smoothed term have?
@@ -196,13 +197,13 @@ ggplot(isit1, aes(x = SampleDepth, y = Sources)) +
  theme_bw()
```

We can supplement this with a quantitative comparison of model performance using `AIC()`.

```{r}
AIC(linear_model_s1, smooth_model_s1)
```

The lower AIC score indicates that the smooth model is performing better than the linear model, which confirms that linearity is not appropriate for our dataset.

__3.__ How many effective degrees of freedom does the smoothed term have?
@@ -213,4 +214,3 @@ smooth_model_s1
```

The effective degrees of freedom (EDF) are >> 1. Keep this in mind, because we will be coming back to EDF [later](#edf)!
-
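The EDF themselves can be pulled out of the summary table directly (a sketch using the solution objects above):

```r
summary(smooth_model_s1)$s.table[, "edf"]  # EDF of s(SampleDepth)
```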
book-en/04-smooth-terms.Rmd

Lines changed: 21 additions & 20 deletions
@@ -1,9 +1,9 @@
# GAM with multiple smooth terms

## GAM with linear and smooth terms

GAMs make it easy to include both smooth and linear terms, multiple smoothed terms, and smoothed interactions.

For this section, we will use the `ISIT` dataset again. We will try to model the response `Sources` using the predictors `Season` and `SampleDepth` simultaneously.

@@ -25,7 +25,7 @@ basic_model <- gam(Sources ~ Season + s(SampleDepth), data = isit, method = "REML")
basic_summary <- summary(basic_model)
```

The `p.table` provides information on the linear effects:

```{r}
basic_summary$p.table
```
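If needed, the parametric coefficients behind the `p.table` can also be extracted directly (a sketch; the smooth term's basis coefficients appear further down the vector):

```r
head(coef(basic_model))  # intercept and Season effect(s) come first
```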
@@ -70,9 +70,10 @@ In linear regression, the *model* degrees of freedom are equivalent to the number

Because the number of free parameters in GAMs is difficult to define, the **EDF** are instead related to the smoothing parameter $\lambda$, such that the greater the penalty, the smaller the **EDF**.

An upper bound on the **EDF** is determined by the basis dimension $k$ for each smooth function, meaning the **EDF** cannot exceed $k-1$.
+
+In practice, the exact choice of $k$ is arbitrary, but it should be **large enough** to accommodate a sufficiently complex smooth function. We will talk about choosing $k$ in upcoming sections.

-In practice, the exact choice of $k$ is arbitrary, but it should be **large enough** to accommodate a sufficiently complex smooth function. We will talk about choosing $k$ in [Chapter 6](#model-checking).

:::explanation
Higher EDF imply more complex, wiggly splines.
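As a preview of those sections (a sketch; the exact `k` value here is arbitrary): `k` can be set explicitly inside `s()`, and `gam.check()` reports a diagnostic of whether the basis dimension looks too small:

```r
mod_k <- gam(Sources ~ Season + s(SampleDepth, k = 20),
             data = isit, method = "REML")
gam.check(mod_k)  # inspect k', EDF, k-index and p-value in the output
```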
@@ -85,7 +86,7 @@ When a term has an EDF value that is close to 1, it is close to being a linear term
We can add a second term (`RelativeDepth`) to our basic model, but specify a linear relationship with `Sources`.

```{r}
two_term_model <- gam(Sources ~ Season + s(SampleDepth) + RelativeDepth,
                      data = isit, method = "REML")
two_term_summary <- summary(two_term_model)
```
@@ -116,7 +117,7 @@ If we want to know whether the relationship between `Sources` and `RelativeDepth` is
non-linear, we can model `RelativeDepth` as a smooth term instead. In this model, we would have two smooth terms:

```{r}
two_smooth_model <- gam(Sources ~ Season + s(SampleDepth) + s(RelativeDepth),
                        data = isit, method = "REML")
two_smooth_summary <- summary(two_smooth_model)
```
@@ -133,7 +134,7 @@ In the `s.table`, we will now find two non-linear smoothers, `s(SampleDepth)` and `s(RelativeDepth)`
two_smooth_summary$s.table
```

Let us take a look at the relationships between the linear and non-linear predictors and our response variable.

```{r, fig.height = 8}
par(mfrow=c(2,2))
@@ -154,16 +155,16 @@ We can see that `two_smooth_model` has the lowest AIC value. The best fit model

## Challenge 2

For our second challenge, we will be building onto our model by adding variables which we think might be ecologically significant predictors to explain bioluminescence.

1. Create two new models: Add `Latitude` to `two_smooth_model`, first as a linear term, then as a smoothed term.
2. Is `Latitude` an important term to include? Does `Latitude` have a linear or additive effect? Use plots, coefficient tables, and the `AIC()` function to help you answer this question.

```{r, echo = FALSE, include = FALSE}
# Challenge 2 ----
#
# For our second challenge, we will be building onto our model by adding variables which we think might be ecologically significant predictors to explain bioluminescence.
#
#
# 1. Create two new models: Add `Latitude` to `two_smooth_model`, first as a linear term, then as a smoothed term.
# 2. Is `Latitude` an important term to include? Does `Latitude` have a linear or additive effect? Use plots, coefficient tables, and the `AIC()` function to help you answer this question.
@@ -177,17 +178,17 @@ __1.__ Create two new models: Add `Latitude` to `two_smooth_model`, first as a linear term, then as a smoothed term.

```{r}
# Add Latitude as a linear term
three_term_model <- gam(Sources ~
                          Season + s(SampleDepth) + s(RelativeDepth) +
                          Latitude,
                        data = isit, method = "REML")
(three_term_summary <- summary(three_term_model))
```

```{r}
# Add Latitude as a smooth term
three_smooth_model <- gam(Sources ~
                            Season + s(SampleDepth) + s(RelativeDepth) +
                            s(Latitude),
                          data = isit, method = "REML")
(three_smooth_summary <- summary(three_smooth_model))
@@ -221,14 +222,14 @@ Before deciding which model is "best", we should test whether the effect of `Latitude`
AIC(three_smooth_model, three_term_model)
```

Our model including `Latitude` as a _smooth_ term has a lower AIC score, meaning it performs better than our model including `Latitude` as a _linear_ term.

But, does adding `Latitude` as a smooth predictor actually improve on our last "best" model (`two_smooth_model`)?

```{r}
AIC(two_smooth_model, three_smooth_model)
```

Our `three_smooth_model`, which includes `SampleDepth`, `RelativeDepth`, and `Latitude` as _smooth_ terms, and `Season` as a linear term, has a lower AIC score than our previous best model, which did not include `Latitude`.

This implies that `Latitude` is indeed an informative non-linear predictor of bioluminescence.
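All of the candidate models can also be ranked in a single call (a sketch using the objects above):

```r
AIC(two_smooth_model, three_term_model, three_smooth_model)
```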

book-en/05-interactions.Rmd

Lines changed: 13 additions & 10 deletions
@@ -14,9 +14,9 @@ There are two ways to include interactions between variables:
We will examine interaction effects to determine whether the non-linear smoother `s(SampleDepth)` varies across different levels of `Season`.

```{r}
factor_interact <- gam(Sources ~ Season +
                         s(SampleDepth, by=Season) +
                         s(RelativeDepth),
                       data = isit, method = "REML")

summary(factor_interact)$s.table
```
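A related option, not used in this chapter (a sketch, assuming `Season` is coded as a factor): the factor-smooth basis `bs = "fs"` also fits one smooth per level, but shares a single smoothing parameter across levels:

```r
factor_smooth <- gam(Sources ~ s(SampleDepth, Season, bs = "fs") +
                       s(RelativeDepth),
                     data = isit, method = "REML")
summary(factor_smooth)$s.table
```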
@@ -38,8 +38,10 @@ We can also plot the interaction effect in 3D on a single plot, using `vis.gam()`
```{r}
vis.gam(factor_interact, theta = 120, n.grid = 50, lwd = .4)
```
-> This plot can be rotated by changing the value of the `theta` argument.

+:::explanation
+This plot can be rotated by changing the value of the `theta` argument.
+:::

To test our idea that this interaction is important, we will perform a model comparison using AIC to determine whether the interaction term improves our model's performance.

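For instance, a couple of viewing angles can be drawn side by side (a sketch; the angles are arbitrary):

```r
par(mfrow = c(1, 2))
vis.gam(factor_interact, theta = 40,  n.grid = 50, lwd = .4)
vis.gam(factor_interact, theta = 200, n.grid = 50, lwd = .4)
par(mfrow = c(1, 1))  # reset the plotting layout
```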
@@ -55,12 +57,12 @@ The AIC of our model with a factor interaction between the `SampleDepth` smooth
Next, we'll look at the interactions between two smoothed terms, `SampleDepth` and `RelativeDepth`.

```{r}
smooth_interact <- gam(Sources ~ Season + s(SampleDepth, RelativeDepth),
                       data = isit, method = "REML")
summary(smooth_interact)$s.table
```

In the previous section, we were able to visualise an interaction effect between a smooth and a factor term by plotting a different smooth function of `SampleDepth` for each level of `Season`.

In this model, we have two smoothed terms, which means that the effect of `SampleDepth` varies smoothly with `RelativeDepth`, and vice-versa. When we visualise this interaction, we instead get a gradient between two continuous smooth functions:

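A caveat worth noting (a sketch, not part of this commit): `s(SampleDepth, RelativeDepth)` is an isotropic smooth, which assumes both variables are on comparable scales; when they are not, a tensor product smooth `te()` is the usual alternative:

```r
tensor_interact <- gam(Sources ~ Season + te(SampleDepth, RelativeDepth),
                       data = isit, method = "REML")
summary(tensor_interact)$s.table
```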
@@ -71,13 +73,14 @@ plot(smooth_interact, page = 1, scheme = 2)
We can also plot this interaction on a 3D surface:

```{r}
vis.gam(smooth_interact,
        view = c("SampleDepth", "RelativeDepth"),
        theta = 50, n.grid = 50, lwd = .4)
```
-> Remember, this plot can be rotated by changing the value of the `theta` argument.

:::explanation
+Remember, this plot can be rotated by changing the value of the `theta` argument.
+
You can change the colour of the 3D plot using the `color` argument. Try specifying `color = "cm"` in `vis.gam()` above, and check `?vis.gam` for more color options.
:::

@@ -89,4 +92,4 @@ So, there does seem to be an interaction effect between these smooth terms. Does
8992
AIC(two_smooth_model, smooth_interact)
9093
```
9194

92-
The model with the interaction between `s(SampleDepth)` and `s(RelativeDepth)` has a lower AIC, which means including this interaction improves our model's performance, and our ability to understand the drivers of bioluminescence.
95+
The model with the interaction between `s(SampleDepth)` and `s(RelativeDepth)` has a lower AIC, which means including this interaction improves our model's performance, and our ability to understand the drivers of bioluminescence.
