
Commit e3bd759

cleaner code, minor improvement on bill/flipper column ordering
1 parent 3b8a249 commit e3bd759

File tree

1 file changed: +35 −33 lines changed


source/clustering.md

Lines changed: 35 additions & 33 deletions
@@ -182,8 +182,8 @@ in the clustering pipeline.
 ```{code-cell} ipython3
 :tags: [remove-cell]
 penguins_standardized = penguins.assign(
-    flipper_length_standardized = (penguins["flipper_length_mm"] - penguins["flipper_length_mm"].mean())/penguins["flipper_length_mm"].std(),
-    bill_length_standardized = (penguins["bill_length_mm"] - penguins["bill_length_mm"].mean())/penguins["bill_length_mm"].std()
+    bill_length_standardized = (penguins["bill_length_mm"] - penguins["bill_length_mm"].mean())/penguins["bill_length_mm"].std(),
+    flipper_length_standardized = (penguins["flipper_length_mm"] - penguins["flipper_length_mm"].mean())/penguins["flipper_length_mm"].std()
 ).drop(
     columns = ["bill_length_mm", "flipper_length_mm"]
 )
@@ -537,7 +537,7 @@ def plot_kmean_iterations(iterations, data, centroid_init):
         data['iteration'] = f'Iteration {i}'
         data['update_type'] = 'Label Update'
-        cluster_columns = ['flipper_length_standardized', 'bill_length_standardized']
+        cluster_columns = ['bill_length_standardized', 'flipper_length_standardized']
         data['label'] = np.argmin(euclidean_distances(data[cluster_columns], centroid_init), axis=1)
         data['flipper_centroid'] = data['label'].map(centroid_init['flipper_length_standardized'])
         data['bill_centroid'] = data['label'].map(centroid_init['bill_length_standardized'])
@@ -620,6 +620,14 @@ ways to assign the data to clusters. So at some point, the total WSSD must stop
 are changing, and the algorithm terminates.
 ```
 
+### Random restarts
+
+```{index} K-means; init argument
+```
+
+Unlike the classification and regression models we studied in previous chapters, K-means can get "stuck" in a bad solution.
+For example, {numref}`toy-kmeans-bad-init-1` illustrates an unlucky random initialization by K-means.
+
 ```{code-cell} ipython3
 :tags: [remove-cell]
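
The `init` entry indexed in the added text refers to scikit-learn's `KMeans`. As a minimal sketch of the random-restart idea (not part of this commit), reusing the `penguins_standardized` data frame built in the first hunk:

```python
# Sketch: random restarts with scikit-learn's KMeans.
# Assumes the chapter's `penguins_standardized` data frame;
# the argument values here are illustrative.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, init="random", n_init=10)  # 10 random restarts
kmeans.fit(penguins_standardized)
print(kmeans.inertia_)  # total WSSD of the best of the 10 runs
```

Each of the `n_init` runs starts from a different random initialization, and scikit-learn keeps the run with the lowest total WSSD (`inertia_`), matching the "pick the clustering that has the lowest final total WSSD" rule described later in the section.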
@@ -638,15 +646,6 @@ points_kmeans_init = alt.Chart(penguins_standardized).mark_point(size=75, filled
 glue('toy-kmeans-bad-init-1', points_kmeans_init, display=True)
 ```
 
-### Random restarts
-
-```{index} K-means; init argument
-```
-
-Unlike the classification and regression models we studied in previous chapters, K-means can get "stuck" in a bad solution.
-For example, {numref}`toy-kmeans-bad-init-1` illustrates an unlucky random initialization by K-means.
-
-
 :::{glue:figure} toy-kmeans-bad-init-1
 :figwidth: 700px
 :name: toy-kmeans-bad-init-1
@@ -674,6 +673,19 @@ This looks like a relatively bad clustering of the data, but K-means cannot impr
 To solve this problem when clustering data using K-means, we should randomly re-initialize the labels a few times, run K-means for each initialization,
 and pick the clustering that has the lowest final total WSSD.
 
+### Choosing K
+
+In order to cluster data using K-means,
+we also have to pick the number of clusters, K.
+But unlike in classification, we have no response variable
+and cannot perform cross-validation with some measure of model prediction error.
+Further, if K is chosen too small, then multiple clusters get grouped together;
+if K is too large, then clusters get subdivided.
+In both cases, we will potentially miss interesting structure in the data.
+{numref}`toy-kmeans-vary-k-1` illustrates the impact of K
+on K-means clustering of our penguin flipper and bill length data
+by showing the different clusterings for K's ranging from 1 to 9.
+
 ```{code-cell} ipython3
 :tags: [remove-cell]
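
A rough sketch (not from this commit) of producing the clusterings that {numref}`toy-kmeans-vary-k-1` visualizes, one K-means fit per value of K; scikit-learn's `KMeans` and the chapter's `penguins_standardized` data frame are assumed:

```python
# Sketch: fit K-means once for each K = 1, ..., 9 and keep each labeling.
# Assumes the chapter's `penguins_standardized` data frame.
from sklearn.cluster import KMeans

clusterings = {}
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10).fit(penguins_standardized)
    clusterings[k] = model.labels_  # one cluster label per penguin
```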
@@ -722,18 +734,7 @@ vary_k = alt.layer(
 glue('toy-kmeans-vary-k-1', vary_k, display=True)
 ```
 
-### Choosing K
 
-In order to cluster data using K-means,
-we also have to pick the number of clusters, K.
-But unlike in classification, we have no response variable
-and cannot perform cross-validation with some measure of model prediction error.
-Further, if K is chosen too small, then multiple clusters get grouped together;
-if K is too large, then clusters get subdivided.
-In both cases, we will potentially miss interesting structure in the data.
-{numref}`toy-kmeans-vary-k-1` illustrates the impact of K
-on K-means clustering of our penguin flipper and bill length data
-by showing the different clusterings for K's ranging from 1 to 9.
 
 :::{glue:figure} toy-kmeans-vary-k-1
 :figwidth: 700px
@@ -742,6 +743,17 @@ by showing the different clusterings for K's ranging from 1 to 9.
 Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black.
 :::
 
+
+```{index} elbow method
+```
+
+If we set K less than 3, then the clustering merges separate groups of data; this causes a large
+total WSSD, since the cluster center (denoted by large shapes with black outlines) is not close to any of the data in the cluster. On
+the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
+decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
+clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") when we reach roughly
+the right number of clusters ({numref}`toy-kmeans-elbow`).
+
 ```{code-cell} ipython3
 :tags: [remove-cell]
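
A hedged sketch of the elbow computation the added text describes: total WSSD (exposed by scikit-learn as `inertia_`) for each K, charted with Altair as elsewhere in the chapter. The variable names are illustrative, and the chapter's `penguins_standardized` data frame is assumed:

```python
# Sketch: total WSSD versus number of clusters K, for spotting the "elbow".
# Assumes the chapter's `penguins_standardized` data frame.
import altair as alt
import pandas as pd
from sklearn.cluster import KMeans

wssds = pd.DataFrame({
    "k": range(1, 10),
    "total_wssd": [
        KMeans(n_clusters=k, n_init=10).fit(penguins_standardized).inertia_
        for k in range(1, 10)
    ],
})
elbow = alt.Chart(wssds).mark_line(point=True).encode(
    x=alt.X("k:O", title="Number of clusters K"),
    y=alt.Y("total_wssd:Q", title="Total WSSD"),
)
```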
@@ -770,16 +782,6 @@ elbow_plot = alt.layer(
 glue('toy-kmeans-elbow', elbow_plot, display=True)
 ```
 
-```{index} elbow method
-```
-
-If we set K less than 3, then the clustering merges separate groups of data; this causes a large
-total WSSD, since the cluster center (denoted by large shapes with black outlines) is not close to any of the data in the cluster. On
-the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
-decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
-clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") when we reach roughly
-the right number of clusters ({numref}`toy-kmeans-elbow`).
-
 :::{glue:figure} toy-kmeans-elbow
 :figwidth: 700px
 :name: toy-kmeans-elbow
