@@ -362,7 +372,7 @@ in {numref}`toy-example-clus1-center`
 :figwidth: 700px
 :name: toy-example-clus1-center
 
-Cluster 0 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in orange.
+Cluster 0 from the `penguins_standardized` data set example. Observations are in blue, with the cluster center highlighted in orange.
 :::
 
 ```{code-cell} ipython3
@@ -406,30 +416,30 @@ These distances are denoted by lines in {numref}`toy-example-clus1-dists` for the
 :figwidth: 700px
 :name: toy-example-clus1-dists
 
-Cluster 0 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in orange. The distances from the observations to the cluster center are represented as black lines.
+Cluster 0 from the `penguins_standardized` data set example. Observations are in blue, with the cluster center highlighted in orange. The distances from the observations to the cluster center are represented as black lines.
@@ -445,20 +455,29 @@
-The larger the value of $S^2$, the more spread out the cluster is, since large $S^2$ means that points are far from the cluster center.
-Note, however, that "large" is relative to *both* the scale of the variables for clustering *and* the number of points in the cluster. A cluster where points are very close to the center might still have a large $S^2$ if there are many data points in the cluster.
+The larger the value of $S^2$, the more spread out the cluster is, since large $S^2$ means
+that points are far from the cluster center. Note, however, that "large" is relative to *both* the
+scale of the variables for clustering *and* the number of points in the cluster. A cluster where points
+are very close to the center might still have a large $S^2$ if there are many data points in the cluster.
 
 After we have calculated the WSSD for all the clusters,
-we sum them together to get the *total WSSD*.
-For our example,
+we sum them together to get the *total WSSD*. For our example,
 this means adding up all the squared distances for the 18 observations.
 These distances are denoted by black lines in
-{numref}`toy-example-all-clus-dists`
+{numref}`toy-example-all-clus-dists`.
 
 :::{glue:figure} toy-example-all-clus-dists
 :figwidth: 700px
 :name: toy-example-all-clus-dists
 
-All clusters from the `penguin_data` data set example. Observations are in blue, orange, and red with the cluster center highlighted in orange. The distances from the observations to each of the respective cluster centers are represented as black lines.
+All clusters from the `penguins_standardized` data set example. Observations are in blue, orange, and red with the cluster center highlighted in orange. The distances from the observations to each of the respective cluster centers are represented as black lines.
 :::
 
+Since K-means uses the straight-line distance to measure the quality of a clustering,
+it is limited to clustering based on quantitative variables.
+However, note that there are variants of the K-means algorithm,
+as well as other clustering algorithms entirely,
+that use other distance metrics
+to allow for non-quantitative data to be clustered.
+These are beyond the scope of this book.
+
 +++
 
 ### The clustering algorithm
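As a minimal sketch of the WSSD computation described in the revised passage above, assuming hypothetical standardized data and cluster assignments rather than the chapter's `penguins_standardized` objects:

```python
import numpy as np

# Hypothetical standardized measurements: rows are observations, columns are variables
X = np.array([[-1.2, 0.8], [-1.0, 1.1], [-0.9, 0.7],
              [0.2, -0.3], [0.4, -0.5], [0.1, -0.2]])
labels = np.array([0, 0, 0, 1, 1, 1])  # hypothetical cluster assignments

total_wssd = 0.0
for k in np.unique(labels):
    cluster = X[labels == k]
    center = cluster.mean(axis=0)           # cluster center: mean of each variable
    wssd = ((cluster - center) ** 2).sum()  # sum of squared straight-line distances to the center
    total_wssd += wssd
print(total_wssd)  # total WSSD: the sum of the per-cluster WSSDs
```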
@@ -574,17 +593,15 @@ sum of WSSDs over all the clusters, i.e., the *total WSSD*:
 
 These two steps are repeated until the cluster assignments no longer change.
 We show what the first three iterations of K-means would look like in
-{numref}`toy-kmeans-iter-1`
-There each row corresponds to an iteration,
+{numref}`toy-kmeans-iter-1`. Each row corresponds to an iteration,
 where the left column depicts the center update,
-and the right column depicts the reassignment of data to clusters.
-
+and the right column depicts the label update (i.e., the reassignment of data to clusters).
 
 :::{glue:figure} toy-kmeans-iter-1
 :figwidth: 700px
 :name: toy-kmeans-iter-1
 
-First three iterations of K-means clustering on the `penguin_data` example data set. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
+First three iterations of K-means clustering on the `penguins_standardized` example data set. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
 :::
 
 +++
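The two-step loop described above (center update, then label update, repeated until the assignments no longer change) might be sketched as follows; the data, the choice of K = 3, and the random initialization are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
X = rng.normal(size=(18, 2))              # hypothetical standardized data
labels = rng.integers(0, 3, size=len(X))  # random initial assignment to K = 3 clusters

while True:
    # Center update: each center is the mean position of the points assigned to it
    # (for simplicity, this sketch assumes no cluster ever becomes empty)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    # Label update: reassign each observation to its nearest center
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    new_labels = sq_dists.argmin(axis=1)
    if np.array_equal(new_labels, labels):  # assignments unchanged, so terminate
        break
    labels = new_labels
```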
@@ -604,17 +621,6 @@ ways to assign the data to clusters. So at some point, the total WSSD must stop
 are changing, and the algorithm terminates.
 ```
 
-What kind of data is suitable for K-means clustering?
-In the simplest version of K-means clustering that we have presented here,
-the straight-line distance is used to measure the
-distance between observations and cluster centers.
-This means that only quantitative data should be used with this algorithm.
-There are variants on the K-means algorithm,
-as well as other clustering algorithms entirely,
-that use other distance metrics
-to allow for non-quantitative data to be clustered.
-These, however, are beyond the scope of this book.
@@ -666,4 +672,4 @@
-First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
+First four iterations of K-means clustering on the `penguins_standardized` example data set with a poor random initialization. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
 :::
 
 This looks like a relatively bad clustering of the data, but K-means cannot improve it.
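A common safeguard against a poor random initialization like this one is to rerun the algorithm from several random starts and keep the best result. A sketch assuming scikit-learn's `KMeans` and hypothetical data; `n_init` controls the number of restarts:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=2)
X = rng.normal(size=(18, 2))  # hypothetical standardized data

# Run K-means from 10 random initializations and keep the clustering
# with the lowest total WSSD (exposed by scikit-learn as inertia_)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.inertia_)  # total WSSD of the best of the 10 runs
```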
@@ -790,23 +796,9 @@ Total WSSD for K clusters ranging from 1 to 9.
 ```
 
 We can perform K-means in Python using a workflow similar to those
-in the earlier classification and regression chapters. We will begin
-by reading the original (i.e., unstandardized) subset of 18 observations