Unlike the classification and regression models we studied in previous chapters, K-means can get "stuck" in a bad solution.
For example, {numref}`toy-kmeans-bad-init-1` illustrates an unlucky random initialization by K-means.
:::{glue:figure} toy-kmeans-bad-init-1
:figwidth: 700px
:name: toy-kmeans-bad-init-1

An unlucky random initialization of labels by K-means.
:::

This looks like a relatively bad clustering of the data, but K-means cannot improve it.
To solve this problem when clustering data using K-means, we should randomly re-initialize the labels a few times, run K-means for each initialization,
and pick the clustering that has the lowest final total WSSD.
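
In practice we rarely need to write this restart loop ourselves. For example, scikit-learn's `KMeans` implementation performs restarts automatically via its `n_init` argument: it fits the model `n_init` times from different random initializations and keeps the fit with the lowest total WSSD. Below is a minimal sketch of this idea; the data frame name `penguins_standardized` is a placeholder for the standardized flipper and bill length measurements.

```python
from sklearn.cluster import KMeans

# A single random initialization (n_init=1) can get stuck in a bad
# solution like the one shown in the figure above.
unlucky_model = KMeans(n_clusters=3, init="random", n_init=1, random_state=8)

# Fitting from 10 different random initializations and keeping the fit
# with the lowest total WSSD is exactly what n_init=10 does internally.
restarted_model = KMeans(n_clusters=3, init="random", n_init=10, random_state=8)

# `penguins_standardized` is a placeholder for the standardized data.
restarted_model.fit(penguins_standardized)
restarted_model.inertia_  # total WSSD of the best of the 10 runs
```

Note that scikit-learn's default initialization, `init="k-means++"`, spreads the initial centers out across the data, which also helps avoid unlucky starting points; we use `init="random"` here only to mirror the purely random initialization described above.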
### Choosing K
In order to cluster data using K-means,
we also have to pick the number of clusters, K.
But unlike in classification, we have no response variable
and cannot perform cross-validation with some measure of model prediction error.
Further, if K is chosen too small, then multiple clusters get grouped together;
if K is too large, then clusters get subdivided.
In both cases, we will potentially miss interesting structure in the data.
{numref}`toy-kmeans-vary-k-1` illustrates the impact of K
on K-means clustering of our penguin flipper and bill length data
by showing the different clusterings for K's ranging from 1 to 9.
```{code-cell} ipython3
:tags: [remove-cell]
# NOTE: the code that builds the per-K clustering charts is elided in
# this excerpt of the source.
vary_k = alt.layer(
    # ... individual charts for each K elided ...
)
glue('toy-kmeans-vary-k-1', vary_k, display=True)
```
:::{glue:figure} toy-kmeans-vary-k-1
:figwidth: 700px
:name: toy-kmeans-vary-k-1

Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black.
:::
```{index} elbow method
```
If we set K less than 3, then the clustering merges separate groups of data; this causes a large
total WSSD, since the cluster center (denoted by large shapes with black outlines) is not close to any of the data in the cluster. On
the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") when we reach roughly
the right number of clusters ({numref}`toy-kmeans-elbow`).
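
Computing an elbow plot is straightforward with scikit-learn, since a fitted `KMeans` object stores the total WSSD in its `inertia_` attribute. Here is a minimal sketch, again using the placeholder data frame name `penguins_standardized` for the standardized measurements:

```python
import pandas as pd
import altair as alt
from sklearn.cluster import KMeans

# Fit K-means for K = 1, ..., 9 and record each total WSSD (inertia_).
ks = range(1, 10)
wssds = [
    KMeans(n_clusters=k, n_init=10, random_state=1234)
    .fit(penguins_standardized)
    .inertia_
    for k in ks
]

# Plot total WSSD versus K; for this data the "elbow" should appear
# near K = 3.
elbow_df = pd.DataFrame({"k": list(ks), "total_wssd": wssds})
elbow_plot = alt.Chart(elbow_df).mark_line(point=True).encode(
    x=alt.X("k:Q", title="Number of clusters"),
    y=alt.Y("total_wssd:Q", title="Total within-cluster sum of squared distances"),
)
elbow_plot
```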