Commit 3b8a249

bugfixes
1 parent aec3320 commit 3b8a249

1 file changed (+10, -13)

source/clustering.md

Lines changed: 10 additions & 13 deletions
@@ -488,14 +488,13 @@ These are beyond the scope of this book.
 ```{code-cell} ipython3
 :tags: [remove-cell]
 
-penguin_data = pd.read_csv("data/penguins_standardized.csv")
 # Set up the initial "random" label assignment the same as in the R book
-penguin_data['label'] = [
+penguins_standardized['label'] = [
     2, 2, 1, 1, 0, 0, 0, 1,
     2, 2, 1, 2, 1, 2,
     0, 1, 2, 2
 ]
-points_kmeans_init = alt.Chart(penguin_data).mark_point(size=75, filled=True, opacity=1).encode(
+points_kmeans_init = alt.Chart(penguins_standardized).mark_point(size=75, filled=True, opacity=1).encode(
     alt.X("flipper_length_standardized").title("Flipper Length (standardized)"),
     alt.Y("bill_length_standardized").title("Bill Length (standardized)"),
     alt.Color('label:N').legend(None),
@@ -577,9 +576,9 @@ def plot_kmean_iterations(iterations, data, centroid_init):
 ```{code-cell} ipython3
 :tags: [remove-cell]
 
-centroid_init = penguin_data.groupby('label').mean()
+centroid_init = penguins_standardized.groupby('label').mean()
 
-glue('toy-kmeans-iter-1', plot_kmean_iterations(3, penguin_data.copy(), centroid_init.copy()), display=True)
+glue('toy-kmeans-iter-1', plot_kmean_iterations(3, penguins_standardized.copy(), centroid_init.copy()), display=True)
 ```
 
 ```{index} WSSD; total
@@ -624,12 +623,11 @@ are changing, and the algorithm terminates.
 ```{code-cell} ipython3
 :tags: [remove-cell]
 
-penguin_data = pd.read_csv("data/penguins_standardized.csv")
 # Set up the initial "random" label assignment the same as in the R book
-penguin_data['label'] = [1, 1, 2, 2, 0, 2, 0, 2, 2, 2, 1, 2, 0, 0, 0, 1, 1, 1]
-centroid_init = penguin_data.groupby('label').mean()
+penguins_standardized['label'] = [1, 1, 2, 2, 0, 2, 0, 2, 2, 2, 1, 2, 0, 0, 0, 1, 1, 1]
+centroid_init = penguins_standardized.groupby('label').mean()
 
-points_kmeans_init = alt.Chart(penguin_data).mark_point(size=75, filled=True, opacity=1).encode(
+points_kmeans_init = alt.Chart(penguins_standardized).mark_point(size=75, filled=True, opacity=1).encode(
     alt.X("flipper_length_standardized").title("Flipper Length (standardized)"),
     alt.Y("bill_length_standardized").title("Bill Length (standardized)"),
     alt.Color('label:N').legend(None),
@@ -659,7 +657,7 @@ Random initialization of labels.
 ```{code-cell} ipython3
 :tags: [remove-cell]
 
-glue('toy-kmeans-bad-iter-1', plot_kmean_iterations(4, penguin_data.copy(), centroid_init.copy()), display=True)
+glue('toy-kmeans-bad-iter-1', plot_kmean_iterations(4, penguins_standardized.copy(), centroid_init.copy()), display=True)
 ```
 
 {numref}`toy-kmeans-bad-iter-1` shows what the iterations of K-means would look like with the unlucky random initialization shown in {numref}`toy-kmeans-bad-init-1`
@@ -681,13 +679,12 @@ and pick the clustering that has the lowest final total WSSD.
 
 from sklearn.cluster import KMeans
 
-
-penguin_data = pd.read_csv("data/penguins_standardized.csv")
+penguins_standardized = penguins_standardized.drop(columns=["label"])
 
 dfs = []
 inertias = []
 for i in range(1, 10):
-    data = penguin_data.copy()
+    data = penguins_standardized.copy()
     knn = KMeans(n_clusters=i, n_init='auto')
     knn.fit(data)
     data['n_clusters'] = f'{i} Cluster' + ('' if i == 1 else 's')
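
Note on the last hunk: the earlier cells now add a `label` column to the shared `penguins_standardized` frame rather than to a freshly loaded `penguin_data`, so the column has to be dropped before fitting. The sketch below illustrates the effect on a made-up six-row frame. The `toy` frame, its values, and the fixed `n_init=10` and `random_state=0` settings are assumptions for illustration, not the book's data or settings.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for penguins_standardized; the values are invented.
toy = pd.DataFrame({
    "flipper_length_standardized": [-1.2, -1.0, -0.9, 0.9, 1.0, 1.2],
    "bill_length_standardized": [-1.1, -0.8, -1.0, 1.0, 0.8, 1.1],
})

# An earlier cell mutates the shared frame by adding a label column,
# then derives the initial centroids from it.
toy["label"] = [0, 1, 0, 1, 0, 1]
centroid_init = toy.groupby("label").mean()

# Without the drop, KMeans treats "label" as a third feature.
with_label = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Dropping the column first restores the two intended features.
features = toy.drop(columns=["label"])
without_label = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

print(with_label.n_features_in_, without_label.n_features_in_)
```

Reassigning the result of `drop` (as the commit does) instead of re-reading the CSV keeps a single source of data for the whole chapter, at the cost of making later cells depend on the mutations made by earlier ones.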
