@@ -194,8 +194,9 @@ In this chapter, we will focus on the *K-means* algorithm,
\index{K-means} a widely-used and often very effective clustering method,
combined with the *elbow method* \index{elbow method}
for selecting the number of clusters.
- This procedure will separate the data into the following groups
- denoted by color:
+ This procedure will separate the data into groups;
+ Figure \@ref(fig:10-toy-example-clustering) shows these groups
+ denoted by colored scatter points.

```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
ggplot(data, aes(y = bill_length_standardized,
@@ -399,15 +400,7 @@ all_clusters_base
We begin the K-means \index{K-means!algorithm} algorithm by picking K,
and randomly assigning a roughly equal number of observations
to each of the K clusters.
- Then K-means consists of two major steps that attempt to minimize the
- sum of WSSDs over all the clusters, i.e., the \index{WSSD!total} *total WSSD*:
-
- 1. **Center update:** Compute the center of each cluster.
- 2. **Label update:** Reassign each data point to the cluster with the nearest center.
-
- These two steps are repeated until the cluster assignments no longer change.
- For example, in the penguin data example,
- our random initialization might look like this:
+ An example random initialization is shown in Figure \@ref(fig:10-toy-kmeans-init).

```{r 10-toy-kmeans-init, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Random initialization of labels."}
set.seed(14)
@@ -425,7 +418,14 @@ plt_lbl <- ggplot(penguin_data, aes(y = bill_length_standardized,
plt_lbl
```

- And we show what the first four iterations of K-means would look like in
+ Then K-means consists of two major steps that attempt to minimize the
+ sum of WSSDs over all the clusters, i.e., the \index{WSSD!total} *total WSSD*:
+
+ 1. **Center update:** Compute the center of each cluster.
+ 2. **Label update:** Reassign each data point to the cluster with the nearest center.
+
+ These two steps are repeated until the cluster assignments no longer change.
+ We show what the first four iterations of K-means would look like in
Figure \@ref(fig:10-toy-kmeans-iter).
There each row corresponds to an iteration,
where the left column depicts the center update,
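To make the two-step procedure above concrete, here is a minimal standalone sketch of the algorithm written directly in R. It is an illustration only, not the implementation behind R's `kmeans` function, and the `standardized_data` object below is a made-up stand-in for the standardized penguin measurements used in this chapter.

```{r}
library(tidyverse)

set.seed(14)
k <- 3

# Made-up stand-in for the standardized penguin measurements used in this
# chapter; swap in the real data frame of standardized columns.
standardized_data <- tibble(
  flipper_length_standardized = rnorm(18),
  bill_length_standardized    = rnorm(18)
)
X <- as.matrix(standardized_data)

# Random initialization: assign a roughly equal number of points to each of
# the K clusters, in random order.
labels <- sample(rep(1:k, length.out = nrow(X)))

repeat {
  # Center update: compute the center (mean) of each cluster.
  # (For simplicity, this sketch assumes no cluster ever becomes empty.)
  centers <- sapply(1:k, function(j) colMeans(X[labels == j, , drop = FALSE]))

  # Label update: reassign each point to the cluster with the nearest center.
  new_labels <- apply(X, 1, function(point) {
    which.min(colSums((centers - point)^2))
  })

  # The two steps repeat until the cluster assignments no longer change.
  if (all(new_labels == labels)) break
  labels <- new_labels
}

labels
```

Because the center update minimizes each cluster's WSSD given the current labels, and the label update minimizes it given the current centers, the total WSSD never increases from one iteration to the next.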
@@ -775,9 +775,9 @@ clustered_data
```

Now that we have this information in a tidy data frame, we can make a visualization
- of the cluster assignments for each point:
+ of the cluster assignments for each point, as shown in Figure \@ref(fig:10-plot-clusters-2).

- ```{r 10-plot-clusters-2, fig.height = 4, fig.width = 4.35}
+ ```{r 10-plot-clusters-2, fig.height = 4, fig.width = 4.35, fig.align = "center"}
cluster_plot <- ggplot(clustered_data,
                       aes(x = flipper_length_mm,
                           y = bill_length_mm,
@@ -843,8 +843,9 @@ we could use a familiar friend: `pull`.
`pull` will return to us a data frame column as a simpler data structure,
here that would be a list.
And then to extract the first item of the list,
- we can use the `pluck` function;
- passing it the index for the element we would like to extract (here 1).
+ we can use the `pluck` function. We pass
+ it the index for the element we would like to extract
+ (here, `1`).

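As a small self-contained sketch of this `pull`-then-`pluck` pattern, consider a toy tibble with a list column; the object and column names below are invented for illustration only.

```{r}
library(tidyverse)

# A toy tibble whose `model` column is a list column holding arbitrary objects.
toy <- tibble(
  k = 1:3,
  model = list("first object", "second object", "third object")
)

# `pull` extracts the column as a plain list ...
models <- toy |> pull(model)

# ... and `pluck` extracts the element at a given index (here, 1).
models |> pluck(1)
#> [1] "first object"
```

The chunk that follows applies this same pattern to the clustering results themselves.
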
```{r}
penguin_clust_ks |>
@@ -882,10 +883,9 @@ clustering_statistics
```

Now that we have `tot.withinss` and `k` as columns in a data frame, we can make a line plot
- and search for the "elbow" to find which value of K to use.
-
+ (Figure \@ref(fig:10-plot-choose-k)) and search for the "elbow" to find which value of K to use.

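As a side note on what `tot.withinss` contains, the hedged sketch below computes the total WSSD for a range of K values using base R's `kmeans` function directly, rather than the workflow built up earlier in the chapter; `standardized_data` is a made-up stand-in for the standardized measurements, and `clustering_statistics_sketch` is not the chapter's own object.

```{r}
library(tidyverse)

set.seed(1)

# Made-up stand-in for the standardized data used in this chapter.
standardized_data <- tibble(
  flipper_length_standardized = rnorm(50),
  bill_length_standardized    = rnorm(50)
)

# For each candidate K, fit K-means and record the total within-cluster
# sum of squared distances (the total WSSD).
clustering_statistics_sketch <- tibble(k = 1:9) |>
  rowwise() |>
  mutate(tot.withinss = kmeans(standardized_data, centers = k)$tot.withinss) |>
  ungroup()

clustering_statistics_sketch
```

The chapter's own `clustering_statistics` data frame plays the same role in the plot that follows.
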
- ```{r 10-plot-choose-k, fig.height = 4, fig.width = 4.35}
+ ```{r 10-plot-choose-k, fig.height = 4, fig.width = 4.35, fig.align = "center"}
elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
  geom_point() +
  geom_line() +
904
904
and found a bad clustering! \index{K-means!restart,nstart}
905
905
We can help prevent finding a bad clustering
906
906
by trying a few different random initializations
907
- via the ` nstart ` argument (here we use 10 restarts).
907
+ via the ` nstart ` argument (Figure \@ ref(fig:10-choose-k-nstart)
908
+ shows a setup where we use 10 restarts).
908
909
When we do this, K-means clustering will be performed
909
910
the number of times specified by the ` nstart ` argument,
910
911
and R will return to us the best clustering from this.
911
912
The more times we perform K-means clustering,
912
913
the more likely we are to find a good clustering (if one exists).
913
- What value should you choose for ` nstart ` ? The answer is it depends.
914
- It depends on the size of your data set,
915
- and the speed and size of your computer.
914
+ What value should you choose for ` nstart ` ? The answer is that it depends
915
+ on many factors: the size and characteristics of your data set,
916
+ as well as the speed and size of your computer.
916
917
The larger the ` nstart ` value the better from an analysis perspective,
917
918
but there is a trade-off that doing many clusterings
918
919
could take a long time.
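As a rough sketch of the restart idea, the calls below use the `nstart` argument of base R's `kmeans` function, which re-runs the algorithm from several random initializations and keeps the result with the smallest total WSSD; the `standardized_data` object is again an invented stand-in, and the exact interface used elsewhere in the chapter may differ.

```{r}
library(tidyverse)

set.seed(2)

# Made-up stand-in for the standardized data used in this chapter.
standardized_data <- tibble(
  flipper_length_standardized = rnorm(50),
  bill_length_standardized    = rnorm(50)
)

# A single random initialization can get stuck in a poor clustering ...
single_start <- kmeans(standardized_data, centers = 8, nstart = 1)

# ... whereas nstart = 10 runs K-means ten times from different random
# initializations and keeps the clustering with the smallest total WSSD.
ten_starts <- kmeans(standardized_data, centers = 8, nstart = 10)

single_start$tot.withinss
ten_starts$tot.withinss
```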