
Commit 05cc32b

final major book updates (EN)
1 parent 645be99 commit 05cc32b

File tree: 6 files changed (+475, -447 lines)


book-en/06-generalization.Rmd

Lines changed: 3 additions & 3 deletions
@@ -75,15 +75,15 @@ Have we optimized the tradeoff between smoothness ($\lambda$) and _wiggliness_ (
 
 ### Is our model wiggly enough?
 
-Although we have not yet specified a $k$ value in our model, we know that the default for smooth interactions in `gam()` is `k = 30`.
+We have not yet specified a $k$ value in our model, but `gam()` sets a default $k$ that depends on the number of variables on which the smooth is built.
 
-Is $k$ large enough?
+Is this $k$ large enough?
 
 ```{r}
 k.check(smooth_interact)
 ```
 
-The **EDF are very close to** `k`. This means the _wiggliness_ of the model is being overly constrained by the default `k = 30`, and could fit the data better with greater wiggliness. In other words, the tradeoff between smoothness and wiggliness is not balanced.
+The **EDF are very close to** `k`. This means the _wiggliness_ of the model is overly constrained by the default `k`, and the model could fit the data better with greater wiggliness. In other words, the tradeoff between smoothness and wiggliness is not balanced.
 
 We can refit the model with a larger `k`:
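To sketch what such a refit can look like: the chapter's actual refit falls outside this hunk, so the formula below is a hypothetical placeholder (response `y`, predictors `x1` and `x2`, data frame `dat`), not the workshop's own model.

```{r}
# A minimal sketch, assuming hypothetical names: substitute the terms
# of your own smooth_interact formula. A larger basis dimension k
# gives the smooth more basis functions to work with.
library(mgcv)
smooth_interact2 <- gam(y ~ s(x1, x2, k = 60),
                        data = dat, method = "REML")
k.check(smooth_interact2)  # EDF should now sit comfortably below k'
```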

book-en/07-changing-basis.Rmd

Lines changed: 90 additions & 78 deletions
@@ -1,89 +1,101 @@
-# Changing the basis function
-
-We won't go too much further into this in this workshop, but you should
-know you can expand on the basic model we've looked at today with:
-
-- more complicated smooths, by changing the basis,
-- other distributions: anything you can do with a GLM using the family
-argument,
-- mixed effect models, using gamm or the gamm4 package.
-
-We will first have a look at changing the basis, followed by a quick
-intro to using other distribution and GAMMs (Generalized Additive Mixed
-Models). Let's look at one place where knowing how to change the basis
-can help: cyclical data. That is, data where you want the predictor to
-match at the ends. Imagine you have a time series of climate data,
-broken down by monthly measurement, and you want to see if there's a
-time-trend in yearly temperature. We'll use the Nottingham temperature
-time series for this:
-
-```{r, echo = TRUE, eval = FALSE}
+# Changing the basis
+
+To model a non-linear smooth variable or surface, smooth functions can be built in different ways:
+
+The smooth function `s()` models a 1-dimensional smooth, or an interaction among variables measured in the same unit and on the same scale. This is the smooth function we have been using throughout this workshop.
+
+There are two other smooth functions: `te()` and `ti()`, which can both be used to model 2- or n-dimensional interaction surfaces of variables. The function `te()` is useful when variables are not on the same scale, and when interactions _include_ main effects. The function `ti()` is best for modelling interaction surfaces that _do not_ include the main effects.
+
+The smooth functions have several parameters that can be set to change their behaviour. The most common parameters are:
+
+`k`: basis dimension
+- Determines the maximum number of basis functions used to build the curve.
+- Sets the wiggliness of a smooth, in a trade-off with the smoothing parameter.
+- $k$ should always be less than the number of unique data points.
+- The complexity (i.e. non-linearity) of a smooth function in a fitted model is reflected by its effective degrees of freedom (**EDF**).
+
+`bs`: the type of basis functions.
+- The default for `s()` is `tp` (thin plate regression spline).
+- The default for `te()` and `ti()` is `cr` (cubic regression spline).
+
+When using the `te()` and `ti()` smooth functions, we also need to set the parameter `d`, which specifies whether predictors in the interaction are on the same scale or dimension.
+- For example, `te(Time, width, height, d = c(1, 2))` indicates that `width` and `height` are on the same scale, but `Time` is not.
+
+## Example: Cyclical data
+
+When modelling cyclical data, you generally want the predictor to match at both ends. To achieve this, we need to change the basis function.
+
+Let's use a time series of climate data, with monthly measurements, to find a temporal trend in yearly temperature. We'll use the Nottingham temperature time series for this, which is included in `R`:
+
+```{r}
+# Nottingham temperature time series
 data(nottem)
-n_years <- length(nottem)/12
-nottem_month <- rep(1:12, times=n_years)
-nottem_year <- rep(1920:(1920+n_years-1),each=12)
-nottem_plot <- qplot(nottem_month,nottem,
-colour=factor(nottem_year),
-geom="line") + theme_bw()
-print(nottem_plot)
 ```
 
-Using the nottem data, we have created three new vectors; `n_years`
-corresponds to the number of years of data (20 years), `nottem_month` is
-a categorical variable coding for the 12 months of the year, for every
-year sampled (sequence 1 to 12 repeated 20 times), and `nottem_year` is
-a variable where the year corresponding to each month is provided.
+:::explanation
+See `?nottem` for a more complete description of this dataset.
+:::
 
-Monthly data from the years 1920 through to 1940:
+Let us begin by plotting the monthly temperature fluctuations for every year in the `nottem` dataset:
 
-![](images//graphic4.1.jpg){width="500"}
+```{r, warning = FALSE, message = FALSE}
+# the number of years of data (20 years)
+n_years <- length(nottem)/12
+
+# categorical variable coding for the 12 months of the year, for every
+# year sampled (so, a sequence 1 to 12 repeated for 20 years)
+nottem_month <- rep(1:12, times = n_years)
 
-To model this, we need to use what's called a cyclical cubic spline, or
-`"cc"`, to model month effects as well as a term for year.
+# the year corresponding to each month in nottem_month
+nottem_year <- rep(1920:(1920 + n_years - 1), each = 12)
 
-```{r, echo = TRUE, eval = FALSE}
-year_gam <- gam(nottem~s(nottem_year)+s(nottem_month, bs="cc"))
+# Plot the time series
+qplot(x = nottem_month, y = nottem,
+      colour = factor(nottem_year),
+      geom = "line") +
+  theme_bw()
+```
+
+We can model both the cyclic change of temperature across months and the non-linear trend through years, using a cyclical cubic spline, or `cc`, for the month variable and a regular smooth for the year variable.
+
+```{r}
+year_gam <- gam(nottem ~ s(nottem_year) + s(nottem_month, bs = "cc"), method = "REML")
 summary(year_gam)$s.table
-plot(year_gam,page=1, scale=0)
 ```
 
-![](images//graphic4.2.jpg){width="700"}
-
-There is about 1-1.5 degree rise in temperature over the period, but
-within a given year there is about 20 degrees variation in temperature,
-on average. The actual data vary around these values and that is the
-unexplained variance. Here we can see one of the very interesting
-bonuses of using GAMs. We can either plot the response surface (fitted
-values) or the terms (contribution of each covariate) as shown here. You
-can imagine these as plots of the changing regression coefficients, and
-how their contribution (or effect size) varies over time. In the first
-plot, we see that positive contributions of temperature occurred
-post-1930.
-
-Over longer time scales, for example using paleolimnological data,
-others ([Simpson & Anderson 2009; Fig.
-3c](http://www.aslo.info/lo/toc/vol_54/issue_6_part_2/2529.pdf)) have
-used GAMs to plot the contribution (effect) of temperature on algal
-assemblages in lakes, to illustrate how significant contributions only
-occurred during two extreme cold events (that is, the contribution is
-significant when the confidence intervals do not overlap zero, at around
-300 and 100 B.C.). This allowed the authors to not only determine how
-much variance was explained by temperature over the last few centuries,
-but also pinpoint when in time this effect was significant. If of
-interest to you, the code to plot either the response surface
-(`type = "response"`) or the terms (`type = "terms"`) is given below.
-When terms is selected, you obtain the same figure as above.
-
-Contribution plot:
-```{r, echo = TRUE, eval = FALSE}
-pred<-predict(year_gam, type = "terms", se = TRUE)
-I<-order(nottem_year)
-plusCI<-I(pred$fit[,1] + 1.96*pred$se[,1])
-minusCI<-I(pred$fit[,1] - 1.96*pred$se[,1])
-xx <- c(nottem_year[I],rev(nottem_year[I]))
-yy <- c(plusCI[I],rev(minusCI[I]))
-plot(xx,yy,type="n",cex.axis=1.2)
-polygon(xx,yy,col="light grey",border="light grey")
-lines(nottem_year[I], pred$fit[,1][I],lty=1)
-abline(h=0)
+```{r}
+plot(year_gam, page = 1, scale = 0)
+```
+
+There is about a 1-1.5 degree rise in temperature over the period (Panel 1), but within a given year there is about 20 degrees of variation in temperature, on average (Panel 2). The actual data vary around these values, and this is the unexplained variance.
+
+Here we can see one of the very interesting bonuses of using GAMs. We can either plot the response surface (fitted values) or the terms (contribution of each covariate) as shown here. You can imagine these as plots of the changing regression coefficients, and how their contribution (or effect size) varies over time. In the first plot, we see that positive contributions of temperature occurred post-1930.
+
+:::explanation
+__Visualising variable contributions__
+
+@simpson-2009 modelled paleolimnological data with GAMs (see Fig. 3c), and plotted the contribution (effect) of temperature on algal assemblages in lakes to illustrate how significant contributions only occurred during two extreme cold events.
+
+That is, the contribution is significant when the confidence intervals do not overlap zero, which occurs at around 300 and 100 B.C. in @simpson-2009.
+
+This allowed the authors not only to determine how much variance was explained by temperature over the last few centuries, but also to pinpoint when in time this effect was significant.
+
+If of interest to you, the code to plot either the response surface (`type = "response"`) or the terms (`type = "terms"`) is given below. When `type = "terms"` is selected, you obtain the same figure as above.
+
+```{r, purl = FALSE}
+pred <- predict(year_gam, type = "terms", se = TRUE)
+I <- order(nottem_year)
+plusCI <- I(pred$fit[,1] + 1.96*pred$se[,1])
+minusCI <- I(pred$fit[,1] - 1.96*pred$se[,1])
+xx <- c(nottem_year[I], rev(nottem_year[I]))
+yy <- c(plusCI[I], rev(minusCI[I]))
+
+plot(xx, yy, type = "n", cex.axis = 1.2, xlab = "Year", ylab = "Temperature")
+polygon(xx, yy, col = "light blue", border = "light blue")
+lines(nottem_year[I], pred$fit[,1][I], lty = 1, lwd = 2)
+abline(h = 0, lty = 2)
 ```
+
+:::
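The `s()`/`te()`/`ti()` overview in the hunk above can be sketched in `mgcv` syntax as follows. This is a minimal sketch, not code from the chapter: the response `y`, the predictors, and the data frame `dat` are hypothetical placeholders.

```{r}
library(mgcv)

# 1-D smooth, plus an isotropic 2-D smooth for two predictors measured
# on the same scale (e.g. spatial coordinates):
m1 <- gam(y ~ s(x1) + s(lat, long), data = dat, method = "REML")

# Tensor product smooth with te() when the predictors are on different
# scales; the interaction surface *includes* the main effects:
m2 <- gam(y ~ te(time, depth), data = dat, method = "REML")

# Main effects fit separately, with ti() modelling only the pure
# interaction surface:
m3 <- gam(y ~ s(time) + s(depth) + ti(time, depth),
          data = dat, method = "REML")

# d groups the marginals by dimension: time is 1-D, and
# (width, height) form a single 2-D marginal on a shared scale:
m4 <- gam(y ~ te(time, width, height, d = c(1, 2)),
          data = dat, method = "REML")
```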

book-en/07-other-distributions.Rmd

Lines changed: 0 additions & 151 deletions
This file was deleted.
