
Commit b1c6e2b

Merge pull request #410 from UBC-DSCI/dev
dev to master
2 parents b30d8ef + 06f1e8c commit b1c6e2b

File tree

152 files changed: +2516 additions, -10385 deletions


acknowledgements.Rmd

Lines changed: 2 additions & 2 deletions
@@ -7,12 +7,12 @@ for DSCI 100, a new introductory data science course
 at the University of British Columbia (UBC).
 Several faculty members in the UBC Department of Statistics
 were pivotal in shaping the direction of that course,
-and as such contributed greatly to the broad structure and
+and as such, contributed greatly to the broad structure and
 list of topics in this book. We would especially like to thank Matías
 Salibían-Barrera for his mentorship during the initial development and roll-out
 of both DSCI 100 and this book. His door was always open when
 we needed to chat about how to
-best introduce and teach data science our first year students.
+best introduce and teach data science to our first-year students.
 
 We also owe a debt of gratitude to all of the students of DSCI 100 over the past
 few years. They provided invaluable feedback on the book and worksheets;

authors.Rmd

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
 # About the authors {-}
 
-Tiffany Timbers is an Assistant Professor of Teaching in the Department of
+**Tiffany Timbers** is an Assistant Professor of Teaching in the Department of
 Statistics and Co-Director for the Master of Data Science program (Vancouver
 Option) at the University of British Columbia. In these roles she teaches and
 develops curriculum around the responsible application of Data Science to solve
@@ -9,7 +9,7 @@ course on collaborative software development, which focuses on teaching how to
 create R and Python packages using modern tools and workflows.
 
 
-Trevor Campbell is an Assistant Professor in the Department of Statistics at
+**Trevor Campbell** is an Assistant Professor in the Department of Statistics at
 the University of British Columbia. His research focuses on automated, scalable
 Bayesian inference algorithms, Bayesian nonparametrics, streaming data, and
 Bayesian theory. He was previously a postdoctoral associate advised by Tamara
@@ -20,7 +20,7 @@ Systems (LIDS) at MIT, and before that he was in the Engineering Science
 program at the University of Toronto.
 
 
-Melissa Lee is an Assistant Professor of Teaching in the Department of
+**Melissa Lee** is an Assistant Professor of Teaching in the Department of
 Statistics at the University of British Columbia. She teaches and develops
 curriculum for undergraduate statistics and data science courses. Her work
 focuses on student-centered approaches to teaching, developing and assessing

classification1.Rmd

Lines changed: 53 additions & 45 deletions
Large diffs are not rendered by default.

classification2.Rmd

Lines changed: 95 additions & 69 deletions
Large diffs are not rendered by default.

clustering.Rmd

Lines changed: 52 additions & 36 deletions
@@ -16,6 +16,8 @@ knitr::opts_chunk$set(warning = FALSE, fig.align = "default")
 # some graphs with the code shown to students are hard coded
 cbbPalette <- c(brewer.pal(9, "Paired"))
 cbpalette <- c("darkorange3", "dodgerblue3", "goldenrod1")
+
+theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
 ```
 
 ## Overview
@@ -28,7 +30,7 @@ using the K-means algorithm,
 including techniques to choose the number of clusters.
 
 ## Chapter learning objectives
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:
 
 * Describe a case where clustering is appropriate,
 and what insight it might extract from the data.
@@ -83,7 +85,7 @@ courses.
 As in the case of classification,
 there are many possible methods that we could use to cluster our observations
 to look for subgroups.
-In this book, we will focus on the widely used K-means \index{K-means} algorithm.
+In this book, we will focus on the widely used K-means \index{K-means} algorithm [@kmeans].
 In your future studies, you might encounter hierarchical clustering,
 principal component analysis, multidimensional scaling, and more;
 see the additional resources section at the end of this chapter
@@ -101,13 +103,13 @@ for where to begin learning more about these other methods.
 
 **An illustrative example**
 
-Here we will present an illustrative example using a data set \index{Palmer penguins} from the
-[{palmerpenguins} R data package](https://allisonhorst.github.io/palmerpenguins/). This data set was
-collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and
-the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/) and includes
-measurements for adult penguins found near there [@palmerpenguins]. We have
+Here we will present an illustrative example using a data set \index{Palmer penguins} from
+[the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/) [@palmerpenguins]. This
+data set was collected by Dr. Kristen Gorman and
+the Palmer Station, Antarctica Long Term Ecological Research Site, and includes
+measurements for adult penguins found near there [@penguinpaper]. We have
 modified the data set for use in this chapter. Here we will focus on using two
-variables---penguin bill and flipper length, both in millimeters---to determine whether
+variables&mdash;penguin bill and flipper length, both in millimeters&mdash;to determine whether
 there are distinct types of penguins in our data.
 Understanding this might help us with species discovery and classification in a data-driven
 way.
@@ -171,7 +173,8 @@ ggplot(data, aes(x = flipper_length_standardized,
 y = bill_length_standardized)) +
 geom_point() +
 xlab("Flipper Length (standardized)") +
-ylab("Bill Length (standardized)")
+ylab("Bill Length (standardized)") +
+theme(text = element_text(size = 12))
 ```
 
 Based \index{ggplot}\index{ggplot!geom\_point} on the visualization
@@ -184,7 +187,7 @@ including:
 2. a small flipper length, but large bill length group, and
 3. a large flipper and bill length group.
 
-Data visualization is a great tool to give us a rough sense for such patterns
+Data visualization is a great tool to give us a rough sense of such patterns
 when we have a small number of variables.
 But if we are to group data&mdash;and select the number of groups&mdash;as part of
 a reproducible analysis, we need something a bit more automated.
@@ -193,7 +196,7 @@ as we increase the number of variables we consider when clustering.
 The way to rigorously separate the data into groups
 is to use a clustering algorithm.
 In this chapter, we will focus on the *K-means* algorithm,
-\index{K-means} a widely-used and often very effective clustering method,
+\index{K-means} a widely used and often very effective clustering method,
 combined with the *elbow method* \index{elbow method}
 for selecting the number of clusters.
 This procedure will separate the data into groups;
@@ -332,7 +335,7 @@ base <- base +
 base
 ```
 
-The larger the value of $S^2$, the more spread-out the cluster is, since large $S^2$ means that points are far from the cluster center.
+The larger the value of $S^2$, the more spread out the cluster is, since large $S^2$ means that points are far from the cluster center.
 Note, however, that "large" is relative to *both* the scale of the variables for clustering *and* the number of points in the cluster. A cluster where points are very close to the center might still have a large $S^2$ if there are many data points in the cluster.
 
 After we have calculated the WSSD for all the clusters,
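
To make the $S^2$ (WSSD) idea in this hunk concrete, here is a minimal sketch of computing each cluster's within-cluster sum of squared distances by hand for the two standardized penguin variables; `standardized_data` and its `cluster` column are placeholder names, not code from the book.

```r
# Sketch only: each cluster's WSSD is the sum of squared distances from its
# points to the cluster center (placeholder data frame and column names).
library(dplyr)

wssd_by_cluster <- standardized_data %>%
  group_by(cluster) %>%
  summarize(
    wssd = sum((flipper_length_standardized - mean(flipper_length_standardized))^2 +
                 (bill_length_standardized - mean(bill_length_standardized))^2)
  )

# total WSSD across clusters, the quantity K-means tries to make small
total_wssd <- sum(wssd_by_cluster$wssd)
```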
@@ -464,13 +467,13 @@ for (i in 1:4) {
 aes(y = bill_length_standardized,
 x = flipper_length_standardized,
 fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
 fill = cbpalette) +
 annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5)+
-theme(text = element_text(size = 14))
+theme(text = element_text(size = 14), axis.title=element_text(size=14))
 
 if (i == 1 | i == 2) {
 plt_ctr <- plt_ctr +
@@ -498,13 +501,13 @@ for (i in 1:4) {
 geom_point(data = centers,
 aes(y = bill_length_standardized,
 x = flipper_length_standardized, fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
 fill = cbpalette) +
-annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5)+
-theme(text = element_text(size = 14))
+annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5) +
+theme(text = element_text(size = 14), axis.title=element_text(size=14))
 
 if (i == 1 | i ==2) {
 plt_lbl <- plt_lbl +
@@ -591,7 +594,7 @@ These, however, are beyond the scope of this book.
 
 ### Random restarts
 
-Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart,nstart} can get "stuck" in a bad solution.
+Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
 For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
 
 ```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "Random initialization of labels."}
@@ -637,13 +640,13 @@ for (i in 1:5) {
 geom_point(data = centers, aes(y = bill_length_standardized,
 x = flipper_length_standardized,
 fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
 fill = cbpalette) +
 annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5) +
-theme(text = element_text(size = 14))
+theme(text = element_text(size = 14), axis.title=element_text(size=14))
 
 if (i == 1 | i == 2) {
 plt_ctr <- plt_ctr +
@@ -670,13 +673,13 @@ for (i in 1:5) {
 geom_point(data = centers, aes(y = bill_length_standardized,
 x = flipper_length_standardized,
 fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
 fill = cbpalette) +
 annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5) +
-theme(text = element_text(size = 14))
+theme(text = element_text(size = 14), axis.title=element_text(size=14))
 
 if (i == 1 | i == 2) {
 plt_lbl <- plt_lbl +
@@ -726,15 +729,15 @@ ggarrange(iter_plot_list[[1]] +
 theme(axis.text.x = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.x = element_blank(),
-plot.margin = margin(r = 2, t = 2, b = 1)),
+plot.margin = margin(r = 2, t = 2, b = 2)),
 iter_plot_list[[6]] +
 theme(axis.text.y = element_blank(),
 axis.ticks.y = element_blank(),
 axis.title.y = element_blank(),
 axis.text.x = element_blank(),
 axis.ticks.x = element_blank(),
 axis.title.x = element_blank(),
-plot.margin = margin(r = 2, l = 2, t = 2, b = 1) ),
+plot.margin = margin(r = 2, l = 2, t = 2, b = 2) ),
 iter_plot_list[[7]] +
 theme(axis.text.y = element_blank(),
 axis.ticks.y = element_blank(),
@@ -811,7 +814,7 @@ levels(clusters$k) <- clusters_levels
 
 p1 <- ggplot(assignments, aes(flipper_length_standardized,
 bill_length_standardized)) +
-geom_point(aes(color = .cluster, size = 1)) +
+geom_point(aes(color = .cluster, size = I(2))) +
 facet_wrap(~k) + scale_color_manual(values = cbbPalette) +
 labs(x = "Flipper Length (standardized)",
 y = "Bill Length (standardized)",
@@ -820,10 +823,12 @@ p1 <- ggplot(assignments, aes(flipper_length_standardized,
 geom_point(data = clusters,
 aes(fill = cluster),
 color = "black",
-size = 5,
+size = 4,
 shape = 21,
 stroke = 1) +
-scale_fill_manual(values = cbbPalette)
+scale_fill_manual(values = cbbPalette) +
+theme(text = element_text(size = 12), axis.title=element_text(size=12))
+
 
 p1
 ```
@@ -859,7 +864,7 @@ each other. Therefore, the *scale* of each of the variables in the data
 will influence which cluster data points end up being assigned.
 Variables with a large scale will have a much larger
 effect on deciding cluster assignment than variables with a small scale.
-To address this problem, we typically standardize \index{standardization!K-means}\index{K-means!stanardization} our data before clustering,
+To address this problem, we typically standardize \index{standardization!K-means}\index{K-means!standardization} our data before clustering,
 which ensures that each variable has a mean of 0 and standard deviation of 1.
 The `scale` function in R can be used to do this.
 We show an example of how to use this function
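
As a rough sketch of the standardization step this hunk describes (the book shows its own example in the chapter), something like the following could work; `penguin_data` and the column names are placeholders, not from the diff.

```r
# Sketch only: standardize the clustering variables so each has
# mean 0 and standard deviation 1 (placeholder data/column names).
library(dplyr)

standardized_data <- penguin_data %>%
  select(bill_length_mm, flipper_length_mm) %>%
  mutate(across(everything(), scale))
```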
@@ -911,7 +916,7 @@ As you can see above, the clustering object returned by `kmeans` has a lot of in
 that can be used to visualize the clusters, pick K, and evaluate the total WSSD.
 To obtain this information in a tidy format, we will call in help
 from the `broom` package. \index{broom} Let's start by visualizing the clustering
-as a colored scatter plot. To do that
+as a colored scatter plot. To do that,
 we use the `augment` function, \index{K-means!augment} \index{augment} which takes in the model and the original data
 frame, and returns a data frame with the data and the cluster assignments for
 each point:
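
A hedged sketch of the `kmeans` + `augment` pattern described in this hunk; `standardized_data` is an assumed name for the scaled data and `centers = 3` is an illustrative choice.

```r
# Sketch only: fit K-means, then attach each point's cluster assignment
# with broom::augment (adds a .cluster column to the data frame).
library(broom)

penguin_clust <- kmeans(standardized_data, centers = 3)
clustered_data <- augment(penguin_clust, standardized_data)
```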
@@ -938,7 +943,8 @@ cluster_plot <- ggplot(clustered_data,
 color = "Cluster") +
 scale_color_manual(values = c("dodgerblue3",
 "darkorange3",
-"goldenrod1"))
+"goldenrod1")) +
+theme(text = element_text(size = 12))
 
 cluster_plot
 ```
@@ -965,7 +971,7 @@ Then we use `rowwise` \index{rowwise} + `mutate` to apply the `kmeans` function
 within each row to each K.
 However, given that the `kmeans` function
 returns a model object to us (not a vector),
-we will need to store the results as a list columm.
+we will need to store the results as a list column.
 This works because both vectors and lists are legitimate
 data structures for data frame columns.
 To make this work,
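
The `rowwise` + `mutate` list-column idea in this hunk might look roughly like the sketch below; all names are placeholders and the book's own code follows later in the chapter.

```r
# Sketch only: fit one K-means model per value of K, store each model in a
# list column, then tidy the per-model summaries with broom::glance.
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

penguin_clust_ks <- tibble(k = 1:9) %>%
  rowwise() %>%
  mutate(poss_clusters = list(kmeans(standardized_data, centers = k, nstart = 10))) %>%
  ungroup()

clustering_statistics <- penguin_clust_ks %>%
  mutate(glanced = map(poss_clusters, glance)) %>%
  unnest(glanced)   # exposes tot.withinss for the elbow plot
```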
@@ -1040,7 +1046,8 @@ elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
 geom_line() +
 xlab("K") +
 ylab("Total within-cluster sum of squares") +
-scale_x_continuous(breaks = 1:9)
+scale_x_continuous(breaks = 1:9) +
+theme(text = element_text(size = 12))
 
 elbow_plot
 ```
@@ -1050,7 +1057,7 @@ But why is there a "bump" in the total WSSD plot here?
 Shouldn't total WSSD always decrease as we add more clusters?
 Technically yes, but remember: K-means can get "stuck" in a bad solution.
 Unfortunately, for K = 8 we had an unlucky initialization
-and found a bad clustering! \index{K-means!restart,nstart}
+and found a bad clustering! \index{K-means!restart, nstart}
 We can help prevent finding a bad clustering
 by trying a few different random initializations
 via the `nstart` argument (Figure \@ref(fig:10-choose-k-nstart)
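
A minimal sketch of the `nstart` idea mentioned here, reusing the placeholder `standardized_data`: `kmeans` runs the algorithm `nstart` times from different random initializations and keeps the best result.

```r
# Sketch only: 10 random restarts; kmeans() returns the run with the
# lowest total within-cluster sum of squares.
penguin_clust <- kmeans(standardized_data, centers = 8, nstart = 10)
penguin_clust$tot.withinss
```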
@@ -1082,20 +1089,29 @@ elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
 geom_line() +
 xlab("K") +
 ylab("Total within-cluster sum of squares") +
-scale_x_continuous(breaks = 1:9)
+scale_x_continuous(breaks = 1:9) +
+theme(text = element_text(size = 12))
 
 elbow_plot
 ```
 
 ## Exercises
 
 Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_10/worksheet_10.ipynb).
+can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_clustering/worksheet_clustering.ipynb).
 The worksheet tries to provide automated feedback
 and help guide you through the problems.
 To make sure this functionality works as intended,
 please follow the instructions for computer setup needed to run the worksheets
 found in Chapter \@ref(move-to-your-own-machine).
 
 ## Additional resources
-- Chapter 10 of [An Introduction to Statistical Learning](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc. in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique in scientific applications for reducing the number of predictors in a dataset.
+- Chapter 10 of *An Introduction to Statistical
+Learning* [@james2013introduction] provides a
+great next stop in the process of learning about clustering and unsupervised
+learning in general. In the realm of clustering specifically, it provides a
+great companion introduction to K-means, but also covers *hierarchical*
+clustering for when you expect there to be subgroups, and then subgroups within
+subgroups, etc., in your data. In the realm of more general unsupervised
+learning, it covers *principal components analysis (PCA)*, which is a very
+popular technique for reducing the number of predictors in a dataset.
