Commit d10e86a

Merge pull request #420 from UBC-DSCI/dev
transfer dev to master
2 parents 4aa2fc0 + 062be9c commit d10e86a

File tree: 162 files changed (+1388 / -695 lines)


build_pdf.sh

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,7 @@
 # Copy files
 cp references.bib pdf/
 cp authors.Rmd pdf/
+cp foreword-text.Rmd pdf/
 cp preface-text.Rmd pdf/
 cp acknowledgements.Rmd pdf/
 cp intro.Rmd pdf/

@@ -29,6 +30,7 @@ docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsc
 # clean files in pdf dir
 rm -rf pdf/references.bib
 rm -rf pdf/authors.Rmd
+rm -rf pdf/foreword-text.Rmd
 rm -rf pdf/preface-text.Rmd
 rm -rf pdf/acknowledgements.Rmd
 rm -rf pdf/intro.Rmd

classification1.Rmd

Lines changed: 3 additions & 3 deletions
@@ -455,7 +455,7 @@ You will see in the `mutate` \index{mutate} step below, we compute the straight-
 distance using the formula above: we square the differences between the two observations' perimeter
 and concavity coordinates, add the squared differences, and then take the square root.

-```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav <- bind_rows(cancer,
                           tibble(Perimeter = new_point[1],
                                  Concavity = new_point[2],

@@ -1096,7 +1096,7 @@ The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced).
 set.seed(3)
 ```

-```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data."}
+```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Imbalanced data."}
 rare_cancer <- bind_rows(
   filter(cancer, Class == "B"),
   cancer |> filter(Class == "M") |> slice_head(n = 3)

@@ -1255,7 +1255,7 @@ classifier would make. We can see that the decision is more reasonable; when the
 to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.

-```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
+```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
 knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
   set_engine("kknn") |>
   set_mode("classification")
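The recurring change across these chapters adds `fig.pos = "H"` and `out.extra=""` to figure chunks, presumably to keep each figure pinned near its surrounding text in the PDF (bookdown/LaTeX) build. A minimal sketch of the pattern (the chunk label, caption, and plot below are illustrative, not taken from the book): `fig.pos` passes a placement specifier to the LaTeX figure environment, `"H"` ("put it exactly here") relies on the LaTeX float package being loaded in the PDF preamble, and supplying a non-NULL `out.extra` such as the empty string is a commonly used way to make sure knitr actually emits the placement option.

```{r example-figure, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra = "", fig.cap = "An illustrative figure pinned in place in the PDF output."}
library(ggplot2)

# mpg ships with ggplot2, so this chunk is self-contained
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
```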

classification2.Rmd

Lines changed: 20 additions & 20 deletions
@@ -643,7 +643,7 @@ Here, $C=5$ different chunks of the data set are used,
 resulting in 5 different choices for the **validation set**; we call this
 *5-fold* cross-validation.

-```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.retina = 2, out.width = "100%"}
+```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.pos = "H", out.extra="", fig.retina = 2, out.width = "100%"}
 knitr::include_graphics("img/cv.png")
 ```

@@ -863,24 +863,7 @@ regardless of what the new observation looks like. In general, if the model
 *isn't influenced enough* by the training data, it is said to **underfit** the
 data.

-**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
-individual data point has a stronger and stronger vote regarding nearby points.
-Since the data themselves are noisy, this causes a more "jagged" boundary
-corresponding to a *less simple* model. If you take this case to the extreme,
-setting $K = 1$, then the classifier is essentially just matching each new
-observation to its closest neighbor in the training data set. This is just as
-problematic as the large $K$ case, because the classifier becomes unreliable on
-new data: if we had a different training set, the predictions would be
-completely different. In general, if the model *is influenced too much* by the
-training data, it is said to **overfit** the data.
-
-Both overfitting and underfitting are problematic and will lead to a model
-that does not generalize well to new data. When fitting a model, we need to strike
-a balance between the two. You can see these two effects in Figure
-\@ref(fig:06-decision-grid-K), which shows how the classifier changes as
-we set the number of neighbors $K$ to 1, 7, 20, and 300.
-
-```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.cap = "Effect of K in overfitting and underfitting."}
+```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.cap = "Effect of K in overfitting and underfitting."}
 ks <- c(1, 7, 20, 300)
 plots <- list()

@@ -935,6 +918,23 @@ p_grid <- plot_grid(plotlist = p_no_legend, ncol = 2)
 plot_grid(p_grid, legend, ncol = 1, rel_heights = c(1, 0.2))
 ```

+**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
+individual data point has a stronger and stronger vote regarding nearby points.
+Since the data themselves are noisy, this causes a more "jagged" boundary
+corresponding to a *less simple* model. If you take this case to the extreme,
+setting $K = 1$, then the classifier is essentially just matching each new
+observation to its closest neighbor in the training data set. This is just as
+problematic as the large $K$ case, because the classifier becomes unreliable on
+new data: if we had a different training set, the predictions would be
+completely different. In general, if the model *is influenced too much* by the
+training data, it is said to **overfit** the data.
+
+Both overfitting and underfitting are problematic and will lead to a model
+that does not generalize well to new data. When fitting a model, we need to strike
+a balance between the two. You can see these two effects in Figure
+\@ref(fig:06-decision-grid-K), which shows how the classifier changes as
+we set the number of neighbors $K$ to 1, 7, 20, and 300.
+
 ## Summary

 Classification algorithms use one or more quantitative variables to predict the

@@ -948,7 +948,7 @@ can tune the classifier (e.g., select the number of neighbors $K$ in $K$-NN)
 by maximizing estimated accuracy via cross-validation. The overall
 process is summarized in Figure \@ref(fig:06-overview).

-```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of KNN classification.", fig.retina = 2, out.width = "100%"}
+```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Overview of KNN classification.", fig.retina = 2, out.width = "100%"}
 knitr::include_graphics("img/train-test-overview.jpeg")
 ```
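The overfitting/underfitting paragraph moved by this diff is easy to reproduce in miniature. A minimal sketch (assumed object names, not the chapter's own code, which draws the full decision-boundary grid): the same tidymodels specification used in these chapters, created at the two extreme neighbor counts from the decision-grid figure (K = 1 and K = 300).

```{r}
library(tidymodels)

# K = 1: each prediction copies its single nearest training point (overfitting)
knn_k1_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>
  set_engine("kknn") |>
  set_mode("classification")

# K = 300: predictions are dominated by the overall majority class (underfitting)
knn_k300_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 300) |>
  set_engine("kknn") |>
  set_mode("classification")
```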

clustering.Rmd

Lines changed: 18 additions & 13 deletions
@@ -91,6 +91,8 @@ principal component analysis, multidimensional scaling, and more;
 see the additional resources section at the end of this chapter
 for where to begin learning more about these other methods.

+\newpage
+
 > **Note:** There are also so-called *semisupervised* tasks, \index{semisupervised}
 > where only some of the data come with response variable labels/values,
 > but the vast majority don't.

@@ -164,11 +166,12 @@ penguin_data <- read_csv("data/penguins_standardized.csv")
 penguin_data
 ```

-
 Next, we can create a scatter plot using this data set
 to see if we can detect subtypes or groups in our data set.

-```{r 10-toy-example-plot, warning = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
+\newpage
+
+```{r 10-toy-example-plot, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
 ggplot(data, aes(x = flipper_length_standardized,
                  y = bill_length_standardized)) +
   geom_point() +

@@ -203,7 +206,7 @@ This procedure will separate the data into groups;
 Figure \@ref(fig:10-toy-example-clustering) shows these groups
 denoted by colored scatter points.

-```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
+```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
 ggplot(data, aes(y = bill_length_standardized,
                  x = flipper_length_standardized, color = cluster)) +
   geom_point() +

@@ -261,7 +264,7 @@ in Figure \@ref(fig:10-toy-example-clus1-center).

 (ref:10-toy-example-clus1-center) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red.

-```{r 10-toy-example-clus1-center, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-center)"}
+```{r 10-toy-example-clus1-center, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-center)"}
 base <- ggplot(data, aes(x = flipper_length_standardized, y = bill_length_standardized)) +
   geom_point() +
   xlab("Flipper Length (standardized)") +

@@ -308,7 +311,7 @@ These distances are denoted by lines in Figure \@ref(fig:10-toy-example-clus1-di

 (ref:10-toy-example-clus1-dists) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines.

-```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-dists)"}
+```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-dists)"}
 base <- ggplot(clus1) +
   geom_point(aes(y = bill_length_standardized,
                  x = flipper_length_standardized),

@@ -347,7 +350,7 @@ Figure \@ref(fig:10-toy-example-all-clus-dists).

 (ref:10-toy-example-all-clus-dists) All clusters from the `penguin_data` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines.

-```{r 10-toy-example-all-clus-dists, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "(ref:10-toy-example-all-clus-dists)"}
+```{r 10-toy-example-all-clus-dists, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.cap = "(ref:10-toy-example-all-clus-dists)"}

 all_clusters_base <- data |>

@@ -406,6 +409,8 @@ all_clusters_base <- all_clusters_base +
 all_clusters_base
 ```

+\newpage
+
 ### The clustering algorithm

 We begin the K-means \index{K-means!algorithm} algorithm by picking K,

@@ -597,7 +602,7 @@ These, however, are beyond the scope of this book.
 Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
 For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.

-```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "Random initialization of labels."}
+```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.25, fig.width = 3.75, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Random initialization of labels."}
 penguin_data <- penguin_data |>
   mutate(label = as_factor(c(3L, 3L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
                              1L, 3L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)))

@@ -618,7 +623,7 @@ Figure \@ref(fig:10-toy-kmeans-bad-iter) shows what the iterations of K-means wo

 (ref:10-toy-kmeans-bad-iter) First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.

-```{r 10-toy-kmeans-bad-iter, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 6.75, fig.width = 8, fig.align = "center", fig.cap = "(ref:10-toy-kmeans-bad-iter)"}
+```{r 10-toy-kmeans-bad-iter, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 6.75, fig.width = 8, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "(ref:10-toy-kmeans-bad-iter)"}
 list_plot_cntrs <- vector(mode = "list", length = 5)
 list_plot_lbls <- vector(mode = "list", length = 5)

@@ -776,7 +781,7 @@ Figure \@ref(fig:10-toy-kmeans-vary-k) illustrates the impact of K
 on K-means clustering of our penguin flipper and bill length data
 by showing the different clusterings for K's ranging from 1 to 9.

-```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
+```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.pos = "H", out.extra="", fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
 set.seed(3)

 kclusts <- tibble(k = 1:9) |>

@@ -840,7 +845,7 @@ decrease the total WSSD, but by only a *diminishing amount*. If we plot the tota
 clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
 the right number of clusters (Figure \@ref(fig:10-toy-kmeans-elbow)).

-```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.5, fig.width = 4.5, fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
+```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
 p2 <- ggplot(clusterings, aes(x = k, y = tot.withinss)) +
   geom_point(size = 2) +
   geom_line() +

@@ -931,7 +936,7 @@ clustered_data
 Now that we have this information in a tidy data frame, we can make a visualization
 of the cluster assignments for each point, as shown in Figure \@ref(fig:10-plot-clusters-2).

-```{r 10-plot-clusters-2, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "The data colored by the cluster assignments returned by K-means."}
+```{r 10-plot-clusters-2, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "The data colored by the cluster assignments returned by K-means."}
 cluster_plot <- ggplot(clustered_data,
                        aes(x = flipper_length_mm,
                            y = bill_length_mm,

@@ -1040,7 +1045,7 @@ clustering_statistics
 Now that we have `tot.withinss` and `k` as columns in a data frame, we can make a line plot
 (Figure \@ref(fig:10-plot-choose-k)) and search for the "elbow" to find which value of K to use.

-```{r 10-plot-choose-k, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters."}
+```{r 10-plot-choose-k, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "A plot showing the total WSSD versus the number of clusters."}
 elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
   geom_point() +
   geom_line() +

@@ -1075,7 +1080,7 @@ but there is a trade-off that doing many clusterings
 could take a long time.
 So this is something that needs to be balanced.

-```{r 10-choose-k-nstart, fig.height = 3.5, fig.width = 4.5, message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
+```{r 10-choose-k-nstart, fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
 penguin_clust_ks <- tibble(k = 1:9) |>
   rowwise() |>
   mutate(penguin_clusts = list(kmeans(standardized_data, nstart = 10, k)),
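The truncated `mutate()` call above comes from the chapter's workflow that fits K-means for K = 1 through 9 with `nstart = 10` restarts and then plots the total WSSD. A stripped-down sketch of the same idea (the seed and `centers` value are illustrative; `standardized_data` is the object named in the diff): run a single K-means fit with multiple random restarts and read off its total within-cluster sum of squared distances.

```{r}
library(broom)

set.seed(2023)  # illustrative seed; any fixed seed makes the restarts reproducible

# nstart = 10 reruns K-means from 10 random initializations and keeps the best fit,
# guarding against the "stuck in a bad solution" problem discussed above
penguin_clust <- kmeans(standardized_data, centers = 3, nstart = 10)

glance(penguin_clust)$tot.withinss  # total WSSD for this choice of K
```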
