Based \index{ggplot}\index{ggplot!geom\_point} on the visualization
@@ -184,7 +187,7 @@ including:
2. a small flipper length, but large bill length group, and
3. a large flipper and bill length group.
-Data visualization is a great tool to give us a rough sense for such patterns
+Data visualization is a great tool to give us a rough sense of such patterns
when we have a small number of variables.
But if we are to group data—and select the number of groups—as part of
a reproducible analysis, we need something a bit more automated.
@@ -193,7 +196,7 @@ as we increase the number of variables we consider when clustering.
The way to rigorously separate the data into groups
is to use a clustering algorithm.
In this chapter, we will focus on the *K-means* algorithm,
-\index{K-means} a widely-used and often very effective clustering method,
+\index{K-means} a widely used and often very effective clustering method,
combined with the *elbow method* \index{elbow method}
for selecting the number of clusters.
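For readers who want a concrete picture of how these two pieces fit together, here is a rough sketch in R (not the chapter's own code): it fits `kmeans` for a range of values of K and plots the total within-cluster sum of squared distances so the elbow can be spotted. The data frame name `standardized_data` and the use of the `palmerpenguins` package are assumptions made only for illustration.

```r
library(tidyverse)
library(broom)
library(palmerpenguins)  # assumed source of the penguin measurements

# standardize the two variables used for clustering in this chapter
standardized_data <- penguins |>
  drop_na(flipper_length_mm, bill_length_mm) |>
  transmute(
    flipper_length_standardized = as.numeric(scale(flipper_length_mm)),
    bill_length_standardized = as.numeric(scale(bill_length_mm))
  )

# fit K-means for each candidate K and record the total WSSD
elbow_stats <- tibble(k = 1:9) |>
  mutate(
    model = map(k, ~ kmeans(standardized_data, centers = .x, nstart = 10)),
    total_WSSD = map_dbl(model, ~ glance(.x)$tot.withinss)
  )

# plot total WSSD against K; the bend ("elbow") suggests a number of clusters
ggplot(elbow_stats, aes(x = k, y = total_WSSD)) +
  geom_point() +
  geom_line() +
  labs(x = "K", y = "Total within-cluster sum of squared distances")
```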
This procedure will separate the data into groups;
@@ -332,7 +335,7 @@ base <- base +
base
```
-The larger the value of $S^2$, the more spread-out the cluster is, since large $S^2$ means that points are far from the cluster center.
+The larger the value of $S^2$, the more spread out the cluster is, since large $S^2$ means that points are far from the cluster center.
Note, however, that "large" is relative to *both* the scale of the variables for clustering *and* the number of points in the cluster. A cluster where points are very close to the center might still have a large $S^2$ if there are many data points in the cluster.
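As a quick illustration of what this quantity measures, the following sketch (again using the hypothetical `standardized_data` from the earlier sketch, not the chapter's own code) computes $S^2$ for each cluster by hand and compares it with what `kmeans` reports.

```r
library(tidyverse)
library(broom)

# cluster the (hypothetical) standardized data into three groups
fit <- kmeans(standardized_data, centers = 3, nstart = 10)

# S^2 for a cluster: sum of squared distances from its points to its center
wssd_by_cluster <- augment(fit, standardized_data) |>  # adds a .cluster column
  group_by(.cluster) |>
  summarize(
    S2 = sum((flipper_length_standardized - mean(flipper_length_standardized))^2 +
             (bill_length_standardized - mean(bill_length_standardized))^2)
  )

# these hand-computed values should agree with kmeans()'s per-cluster withinss
wssd_by_cluster$S2
fit$withinss
```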
After we have calculated the WSSD for all the clusters,
@@ -464,13 +467,13 @@ for (i in 1:4) {
aes(y = bill_length_standardized,
    x = flipper_length_standardized,
    fill = label),
-size = 3,
+size = 4,
shape = 21,
stroke = 1,
color = "black",
fill = cbpalette) +
annotate("text", x = -0.5, y = 1.5, label = paste0("Iteration ", i), size = 5)+
@@ -591,7 +594,7 @@ These, however, are beyond the scope of this book.
### Random restarts
-Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart,nstart} can get "stuck" in a bad solution.
+Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart,nstart} can get "stuck" in a bad solution.
For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
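As a sketch of the usual remedy (assuming the same hypothetical `standardized_data` as in the earlier sketches), the `nstart` argument of `kmeans` asks for several random initializations and keeps the clustering with the lowest total WSSD, which guards against an unlucky start.

```r
set.seed(1)  # for a reproducible random initialization

# try 10 random initializations and keep the best (lowest total WSSD) result
penguin_clust <- kmeans(standardized_data, centers = 3, nstart = 10)
penguin_clust$tot.withinss
```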
Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_10/worksheet_10.ipynb).
+can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_clustering/worksheet_clustering.ipynb).
The worksheet tries to provide automated feedback
and help guide you through the problems.
To make sure this functionality works as intended,
please follow the instructions for computer setup needed to run the worksheets
found in Chapter \@ref(move-to-your-own-machine).
## Additional resources
-- Chapter 10 of [An Introduction to Statistical Learning](https://www.statlearning.com/)[-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc. in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique in scientific applications for reducing the number of predictors in a dataset.
+- Chapter 10 of *An Introduction to Statistical
+  Learning*[@james2013introduction] provides a
+  great next stop in the process of learning about clustering and unsupervised
+  learning in general. In the realm of clustering specifically, it provides a
+  great companion introduction to K-means, but also covers *hierarchical*
+  clustering for when you expect there to be subgroups, and then subgroups within
+  subgroups, etc., in your data. In the realm of more general unsupervised
+  learning, it covers *principal components analysis (PCA)*, which is a very
+  popular technique for reducing the number of predictors in a dataset.