Commit 1ec4a52

final edits to match everything up
1 parent d075cd3 commit 1ec4a52

14 files changed · +1286 −769 lines changed

book-en/02-introduction.Rmd

Lines changed: 12 additions & 12 deletions
@@ -20,20 +20,21 @@ major assumptions:
A linear model can sometimes accommodate certain types of non-linear responses (e.g. $x^2$), but this approach strongly relies on decisions that can be either arbitrary or well-informed, and is much less flexible than using an additive model. For example, this linear model with multiple predictors can handle a non-linear response, but quickly becomes difficult to interpret and manage:

$$y_i = \beta_0 + \beta_1x_{1,i} + \beta_2x_{2,i} + \beta_3x_{3,i} + \ldots + \beta_kx_{k,i} + \epsilon_i$$
+
Linear models work very well in certain specific cases where all these criteria are met:

```{r, fig.align = 'center', out.width = '70%', echo = FALSE, purl = FALSE}
knitr::include_graphics("images/linreg.png")
```

-In reality, we often cannot meet these criteria. This means that in many cases, linear models are inappropriate:
+In reality, we often cannot meet these criteria. In many cases, linear models are inappropriate:

```{r, fig.align = 'center', out.width = '100%', echo = FALSE, purl = FALSE}
knitr::include_graphics("images/linreg_bad.png")
```

So, how can we fit a better model? To answer this question, we must first consider what the regression model is trying to do. The linear model is trying to fit the best __straight line__ that passes through the middle of the data, _without __overfitting___ the data, which is what would happen if we simply drew a line between each point and its neighbours.

In the same way, additive models fit a curve through the data, while controlling the ___wiggliness___ of this curve to avoid overfitting. This means additive models like GAMs can capture non-linear relationships by fitting a smooth function through the data, rather than a straight line. We will come back to the concept of ___wiggliness___ later!

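To make the point above concrete, here is a minimal sketch (simulated data; all variable names are hypothetical) of how polynomial terms must be hand-picked one by one in a linear model:

```r
# Simulated example: each non-linear feature needs its own term,
# and the choice of powers is an arbitrary modelling decision
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x - 0.08 * x^2 + rnorm(100, sd = 0.5)

poly_model <- lm(y ~ x + I(x^2))  # add I(x^3), I(x^4), ... as needed
summary(poly_model)
```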
@@ -61,13 +62,13 @@ linear_model <- gam(Sources ~ SampleDepth, data = isit2)
summary(linear_model)
```

-The linear model is explaining quite a bit of variance in our dataset ($R_{adj}$ = 0.588), which means it's a pretty good model, right? Well, let's take a look at how our model fits the data:
+The linear model is explaining quite a bit of variance in our dataset ( $R_{adj}$ = 0.588), which means it's a pretty good model, right? Well, let's take a look at how our model fits the data:

```{r}
data_plot <- ggplot(data = isit2, aes(y = Sources, x = SampleDepth)) +
  geom_point() +
  geom_line(aes(y = fitted(linear_model)),
            colour = "red", size = 1.2) +
  theme_bw()
data_plot
```
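As a side note (a sketch, assuming `mgcv` and `isit2` are loaded as in this chapter): a `gam()` call with no `s()` terms is just a linear model, so it reproduces the `lm()` fit exactly:

```r
lm_fit  <- lm(Sources ~ SampleDepth, data = isit2)
gam_fit <- gam(Sources ~ SampleDepth, data = isit2)

coef(lm_fit)
coef(gam_fit)  # same intercept and slope as lm()
```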
@@ -94,7 +95,7 @@ where $y_i$ is the response variable, $x_i$ is the predictor variable, and $f$ is
Importantly, given that the smooth function $f(x_i)$ is non-linear and local, the magnitude of the effect of the explanatory variable can vary over its range, depending on the relationship between the variable and the response.

That is, as opposed to one fixed coefficient $\beta$, the function $f$ can continually change over the range of $x_i$.
The degree of smoothness (or wiggliness) of $f$ is controlled using penalized regression determined automatically in `mgcv` using a generalized cross-validation (GCV) routine [@wood_2006].
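For readers who want to see the penalty at work, the estimated smoothing parameter can be inspected on any fitted `gam` object (a sketch, assuming `isit2` as above; the chapter's own model objects may be named differently):

```r
smooth_fit <- gam(Sources ~ s(SampleDepth), data = isit2)
smooth_fit$sp       # estimated smoothing parameter(s) lambda
summary(smooth_fit) # the EDF of s(SampleDepth) reflects this penalty
```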
@@ -122,7 +123,7 @@ Recall: As opposed to one fixed coefficient $\beta$ in a linear model, the function $f$

:::

-The `mgcv` package also includes a default plot to look at the smooths:
+The `mgcv` package also includes a default `plot()` function to look at the smooths:

```{r}
plot(gam_model)
```
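The default plot can be dressed up with a few optional arguments of `plot.gam()` (a sketch; see `?plot.gam` for the full list):

```r
plot(gam_model,
     residuals = TRUE,  # overlay partial residuals
     pch = 1,           # plotting symbol for the residuals
     shade = TRUE,      # shaded confidence band instead of dashed lines
     seWithMean = TRUE) # include uncertainty about the overall mean
```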
@@ -144,7 +145,7 @@ AIC(linear_model, smooth_model)
Here, the AIC of the smooth GAM is lower, which indicates that adding a smoothing function improves model performance. Linearity is therefore not supported by our data.

:::explanation
As a brief explanation, the Akaike Information Criterion (AIC) is a comparative metric of model performance, where lower scores indicate that a model is performing "better" compared to other considered models.
:::

## Challenge 1
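Because AIC is only meaningful in comparison, it can be easier to read as differences from the best model (a sketch using the objects above; the ~2-unit threshold is a common rule of thumb, not a hard cutoff):

```r
aic_tab <- AIC(linear_model, smooth_model)
aic_tab$delta <- aic_tab$AIC - min(aic_tab$AIC)  # delta-AIC per model
aic_tab  # a delta greater than ~2 is usually read as meaningful support
```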
@@ -165,7 +166,7 @@ We have not discussed effective degrees of freedom (**EDF**) yet, but these are

```{r, echo = FALSE, include = FALSE}
# Challenge 1 ----
#
# 1. Fit a linear and smoothed GAM model to the relation between `SampleDepth` and `Sources`.
# 2. Determine if linearity is justified for this data.
# 3. How many effective degrees of freedom does the smoothed term have?
@@ -196,13 +197,13 @@ ggplot(isit1, aes(x = SampleDepth, y = Sources)) +
  theme_bw()
```

We can supplement this with a quantitative comparison of model performance using `AIC()`.

```{r}
AIC(linear_model_s1, smooth_model_s1)
```

The lower AIC score indicates that the smooth model is performing better than the linear model, which confirms that linearity is not appropriate for our dataset.

__3.__ How many effective degrees of freedom does the smoothed term have?
@@ -213,4 +214,3 @@ smooth_model_s1
```

The effective degrees of freedom (EDF) are >> 1. Keep this in mind, because we will be coming back to EDF [later](#edf)!
-
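The EDF themselves can be pulled out of the summary table directly (a sketch using the solution objects above):

```r
summary(smooth_model_s1)$s.table[, "edf"]  # EDF of s(SampleDepth)
```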
book-en/04-smooth-terms.Rmd

Lines changed: 21 additions & 20 deletions
@@ -1,9 +1,9 @@
# GAM with multiple smooth terms

## GAM with linear and smooth terms

GAMs make it easy to include both smooth and linear terms, multiple smoothed terms, and smoothed interactions.

For this section, we will use the `ISIT` dataset again. We will try to model the response `Sources` using the predictors `Season` and `SampleDepth` simultaneously.

@@ -25,7 +25,7 @@ basic_model <- gam(Sources ~ Season + s(SampleDepth), data = isit, method = "REML")
basic_summary <- summary(basic_model)
```

The `p.table` provides information on the linear effects:

```{r}
basic_summary$p.table
```
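If needed, the parametric coefficients behind the `p.table` can also be extracted directly (a sketch; the smooth term's basis coefficients appear further down the vector):

```r
head(coef(basic_model))  # intercept and Season effect(s) come first
```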
@@ -70,9 +70,10 @@ In linear regression, the *model* degrees of freedom are equivalent to the number

Because the number of free parameters in GAMs is difficult to define, the **EDF** are instead related to the smoothing parameter $\lambda$, such that the greater the penalty, the smaller the **EDF**.

An upper bound on the **EDF** is determined by the basis dimension $k$ for each smooth function, meaning the **EDF** cannot exceed $k-1$.
+
+In practice, the exact choice of $k$ is arbitrary, but it should be **large enough** to accommodate a sufficiently complex smooth function. We will talk about choosing $k$ in upcoming sections.

-In practice, the exact choice of $k$ is arbitrary, but it should be **large enough** to accommodate a sufficiently complex smooth function. We will talk about choosing $k$ in [Chapter 6](#model-checking).

:::explanation
Higher EDF imply more complex, wiggly splines.
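As a preview of those sections (a sketch; the exact `k` value here is arbitrary): `k` can be set explicitly inside `s()`, and `gam.check()` reports a diagnostic of whether the basis dimension looks too small:

```r
mod_k <- gam(Sources ~ Season + s(SampleDepth, k = 20),
             data = isit, method = "REML")
gam.check(mod_k)  # inspect k', EDF, k-index and p-value in the output
```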
@@ -85,7 +86,7 @@ When a term has an EDF value that is close to 1, it is close to being a linear term
We can add a second term (`RelativeDepth`) to our basic model, but specify a linear relationship with `Sources`.

```{r}
two_term_model <- gam(Sources ~ Season + s(SampleDepth) + RelativeDepth,
                      data = isit, method = "REML")
two_term_summary <- summary(two_term_model)
```
@@ -116,7 +117,7 @@ If we want to know whether the relationship between `Sources` and `RelativeDepth` is
non-linear, we can model `RelativeDepth` as a smooth term instead. In this model, we would have two smooth terms:

```{r}
two_smooth_model <- gam(Sources ~ Season + s(SampleDepth) + s(RelativeDepth),
                        data = isit, method = "REML")
two_smooth_summary <- summary(two_smooth_model)
```
@@ -133,7 +134,7 @@ In the `s.table`, we will now find two non-linear smoothers, `s(SampleDepth)` and `s(RelativeDepth)`
two_smooth_summary$s.table
```

Let us take a look at the relationships between the linear and non-linear predictors and our response variable.

```{r, fig.height = 8}
par(mfrow=c(2,2))
@@ -154,16 +155,16 @@ We can see that `two_smooth_model` has the lowest AIC value. The best fit model

## Challenge 2

For our second challenge, we will be building onto our model by adding variables which we think might be ecologically significant predictors to explain bioluminescence.

1. Create two new models: Add `Latitude` to `two_smooth_model`, first as a linear term, then as a smoothed term.
2. Is `Latitude` an important term to include? Does `Latitude` have a linear or additive effect? Use plots, coefficient tables, and the `AIC()` function to help you answer this question.

```{r, echo = FALSE, include = FALSE}
# Challenge 2 ----
#
# For our second challenge, we will be building onto our model by adding variables which we think might be ecologically significant predictors to explain bioluminescence.
#
#
# 1. Create two new models: Add `Latitude` to `two_smooth_model`, first as a linear term, then as a smoothed term.
# 2. Is `Latitude` an important term to include? Does `Latitude` have a linear or additive effect? Use plots, coefficient tables, and the `AIC()` function to help you answer this question.
@@ -177,17 +178,17 @@ __1.__ Create two new models: Add `Latitude` to `two_smooth_model`, first as a linear term, then as a smoothed term.

```{r}
# Add Latitude as a linear term
three_term_model <- gam(Sources ~
                          Season + s(SampleDepth) + s(RelativeDepth) +
                          Latitude,
                        data = isit, method = "REML")
(three_term_summary <- summary(three_term_model))
```

```{r}
# Add Latitude as a smooth term
three_smooth_model <- gam(Sources ~
                            Season + s(SampleDepth) + s(RelativeDepth) +
                            s(Latitude),
                          data = isit, method = "REML")
(three_smooth_summary <- summary(three_smooth_model))
@@ -221,14 +222,14 @@ Before deciding which model is "best", we should test whether the effect of `Latitude`
AIC(three_smooth_model, three_term_model)
```

Our model including `Latitude` as a _smooth_ term has a lower AIC score, meaning it performs better than our model including `Latitude` as a _linear_ term.

But, does adding `Latitude` as a smooth predictor actually improve on our last "best" model (`two_smooth_model`)?

```{r}
AIC(two_smooth_model, three_smooth_model)
```

Our `three_smooth_model`, which includes `SampleDepth`, `RelativeDepth`, and `Latitude` as _smooth_ terms, and `Season` as a linear term, has a lower AIC score than our previous best model, which did not include `Latitude`.

This implies that `Latitude` is indeed an informative non-linear predictor of bioluminescence.
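All of the candidate models can also be ranked in a single call (a sketch using the objects above):

```r
AIC(two_smooth_model, three_term_model, three_smooth_model)
```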

book-en/05-interactions.Rmd

Lines changed: 13 additions & 10 deletions
@@ -14,9 +14,9 @@ There are two ways to include interactions between variables:
We will examine interaction effects to determine whether the non-linear smoother `s(SampleDepth)` varies across different levels of `Season`.

```{r}
factor_interact <- gam(Sources ~ Season +
                         s(SampleDepth, by=Season) +
                         s(RelativeDepth),
                       data = isit, method = "REML")

summary(factor_interact)$s.table
```
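A related option, not used in this chapter (a sketch, assuming `Season` is coded as a factor): the factor-smooth basis `bs = "fs"` also fits one smooth per level, but shares a single smoothing parameter across levels:

```r
factor_smooth <- gam(Sources ~ s(SampleDepth, Season, bs = "fs") +
                       s(RelativeDepth),
                     data = isit, method = "REML")
summary(factor_smooth)$s.table
```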
@@ -38,8 +38,10 @@ We can also plot the interaction effect in 3D on a single plot, using `vis.gam()`
```{r}
vis.gam(factor_interact, theta = 120, n.grid = 50, lwd = .4)
```
-> This plot can be rotated by changing the value of the `theta` argument.

+:::explanation
+This plot can be rotated by changing the value of the `theta` argument.
+:::

To test our idea that this interaction is important, we will perform a model comparison using AIC to determine whether the interaction term improves our model's performance.

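For instance, a couple of viewing angles can be drawn side by side (a sketch; the angles are arbitrary):

```r
par(mfrow = c(1, 2))
vis.gam(factor_interact, theta = 40,  n.grid = 50, lwd = .4)
vis.gam(factor_interact, theta = 200, n.grid = 50, lwd = .4)
par(mfrow = c(1, 1))  # reset the plotting layout
```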
@@ -55,12 +57,12 @@ The AIC of our model with a factor interaction between the `SampleDepth` smooth
Next, we'll look at the interactions between two smoothed terms, `SampleDepth` and `RelativeDepth`.

```{r}
smooth_interact <- gam(Sources ~ Season + s(SampleDepth, RelativeDepth),
                       data = isit, method = "REML")
summary(smooth_interact)$s.table
```

In the previous section, we were able to visualise an interaction effect between a smooth and a factor term by plotting a different smooth function of `SampleDepth` for each level of `Season`.

In this model, we have two smoothed terms, which means that the effect of `SampleDepth` varies smoothly with `RelativeDepth`, and vice-versa. When we visualise this interaction, we instead get a gradient between two continuous smooth functions:

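A caveat worth noting (a sketch, not part of this commit): `s(SampleDepth, RelativeDepth)` is an isotropic smooth, which assumes both variables are on comparable scales; when they are not, a tensor product smooth `te()` is the usual alternative:

```r
tensor_interact <- gam(Sources ~ Season + te(SampleDepth, RelativeDepth),
                       data = isit, method = "REML")
summary(tensor_interact)$s.table
```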
@@ -71,13 +73,14 @@ plot(smooth_interact, page = 1, scheme = 2)
We can also plot this interaction on a 3D surface:

```{r}
vis.gam(smooth_interact,
        view = c("SampleDepth", "RelativeDepth"),
        theta = 50, n.grid = 50, lwd = .4)
```
-> Remember, this plot can be rotated by changing the value of the `theta` argument.

:::explanation
+Remember, this plot can be rotated by changing the value of the `theta` argument.
+
You can change the colour of the 3D plot using the `color` argument. Try specifying `color = "cm"` in `vis.gam()` above, and check `?vis.gam` for more color options.
:::

@@ -89,4 +92,4 @@ So, there does seem to be an interaction effect between these smooth terms. Does
8992
AIC(two_smooth_model, smooth_interact)
9093
```
9194

92-
The model with the interaction between `s(SampleDepth)` and `s(RelativeDepth)` has a lower AIC, which means including this interaction improves our model's performance, and our ability to understand the drivers of bioluminescence.
95+
The model with the interaction between `s(SampleDepth)` and `s(RelativeDepth)` has a lower AIC, which means including this interaction improves our model's performance, and our ability to understand the drivers of bioluminescence.
