
Commit 1d4aef6

all figures referenced and centered in clustering
1 parent 47942c6 commit 1d4aef6

File tree

1 file changed: +24 −23 lines


clustering.Rmd

Lines changed: 24 additions & 23 deletions
@@ -194,8 +194,9 @@ In this chapter, we will focus on the *K-means* algorithm,
 \index{K-means} a widely-used and often very effective clustering method,
 combined with the *elbow method* \index{elbow method}
 for selecting the number of clusters.
-This procedure will separate the data into the following groups
-denoted by color:
+This procedure will separate the data into groups;
+Figure \@ref(fig:10-toy-example-clustering) shows these groups
+denoted by colored scatter points.
 
 ```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
 ggplot(data, aes(y = bill_length_standardized,
@@ -399,15 +400,7 @@ all_clusters_base
 We begin the K-means \index{K-means!algorithm} algorithm by picking K,
 and randomly assigning a roughly equal number of observations
 to each of the K clusters.
-Then K-means consists of two major steps that attempt to minimize the
-sum of WSSDs over all the clusters, i.e., the \index{WSSD!total} *total WSSD*:
-
-1. **Center update:** Compute the center of each cluster.
-2. **Label update:** Reassign each data point to the cluster with the nearest center.
-
-These two steps are repeated until the cluster assignments no longer change.
-For example, in the penguin data example,
-our random initialization might look like this:
+An example random initialization is shown in Figure \@ref(fig:10-toy-kmeans-init).
 
 ```{r 10-toy-kmeans-init, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Random initialization of labels."}
 set.seed(14)
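
For reference, a minimal sketch of the random initialization this hunk refers to, assuming the `penguin_data` data frame from the figure code and an assumed choice of K = 3; the book's own initialization code is not shown in this diff.

```r
library(dplyr)

set.seed(14)
k <- 3  # assumed number of clusters for the toy example

# Assign each observation to one of the K clusters, roughly equally,
# by shuffling a repeated 1..K label vector.
penguin_data_init <- penguin_data |>
  mutate(label = sample(rep(1:k, length.out = n())))
```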
@@ -425,7 +418,14 @@ plt_lbl <- ggplot(penguin_data, aes(y = bill_length_standardized,
 plt_lbl
 ```
 
-And we show what the first four iterations of K-means would look like in
+Then K-means consists of two major steps that attempt to minimize the
+sum of WSSDs over all the clusters, i.e., the \index{WSSD!total} *total WSSD*:
+
+1. **Center update:** Compute the center of each cluster.
+2. **Label update:** Reassign each data point to the cluster with the nearest center.
+
+These two steps are repeated until the cluster assignments no longer change.
+We show what the first four iterations of K-means would look like in
 Figure \@ref(fig:10-toy-kmeans-iter).
 There each row corresponds to an iteration,
 where the left column depicts the center update,
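
The center-update/label-update loop added in this hunk can be written out directly. Below is an illustrative base-R sketch of that iteration, not the book's implementation; the `penguin_data` object and the standardized column names are assumptions based on the surrounding figure code.

```r
# Matrix of standardized features (column names assumed).
X <- as.matrix(penguin_data[, c("flipper_length_standardized",
                                "bill_length_standardized")])
k <- 3
set.seed(14)
labels <- sample(rep(1:k, length.out = nrow(X)))  # random initialization

repeat {
  # Center update: compute the center (mean) of each cluster.
  centers <- sapply(1:k, function(j) colMeans(X[labels == j, , drop = FALSE]))

  # Label update: reassign each point to the cluster with the nearest center.
  sq_dists <- sapply(1:k, function(j) colSums((t(X) - centers[, j])^2))
  new_labels <- max.col(-sq_dists)  # argmin over clusters for each point

  # Stop once the cluster assignments no longer change.
  if (all(new_labels == labels)) break
  labels <- new_labels
}
```

A full implementation would also guard against clusters that become empty between iterations; the sketch omits that for brevity.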
@@ -775,9 +775,9 @@ clustered_data
 ```
 
 Now that we have this information in a tidy data frame, we can make a visualization
-of the cluster assignments for each point:
+of the cluster assignments for each point, as shown in Figure \@ref(fig:10-plot-clusters-2).
 
-```{r 10-plot-clusters-2, fig.height = 4, fig.width = 4.35}
+```{r 10-plot-clusters-2, fig.height = 4, fig.width = 4.35, fig.align = "center"}
 cluster_plot <- ggplot(clustered_data,
 aes(x = flipper_length_mm,
 y = bill_length_mm,
@@ -843,8 +843,9 @@ we could use a familiar friend: `pull`.
 `pull` will return to us a data frame column as a simpler data structure,
 here that would be a list.
 And then to extract the first item of the list,
-we can use the `pluck` function;
-passing it the index for the element we would like to extract (here 1).
+we can use the `pluck` function. We pass
+it the index for the element we would like to extract
+(here, `1`).
 
 ```{r}
 penguin_clust_ks |>
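
As a self-contained illustration of the `pull`-then-`pluck` pattern described in this hunk (the toy tibble below is made up for the example; only the two function names come from the text):

```r
library(tidyverse)

# A toy list column standing in for a column of fitted model objects.
toy <- tibble(k = 1:3,
              fit = list("fit for k = 1", "fit for k = 2", "fit for k = 3"))

toy |>
  pull(fit) |>  # extract the list column as a plain list
  pluck(1)      # take the first element of that list
#> [1] "fit for k = 1"
```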
@@ -882,10 +883,9 @@ clustering_statistics
 ```
 
 Now that we have `tot.withinss` and `k` as columns in a data frame, we can make a line plot
-and search for the "elbow" to find which value of K to use.
-
+(Figure \@ref(fig:10-plot-choose-k)) and search for the "elbow" to find which value of K to use.
 
-```{r 10-plot-choose-k, fig.height = 4, fig.width = 4.35}
+```{r 10-plot-choose-k, fig.height = 4, fig.width = 4.35, fig.align = "center"}
 elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
 geom_point() +
 geom_line() +
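
One hedged sketch of how a data frame like `clustering_statistics` (with `k` and `tot.withinss` columns) could be assembled using `purrr::map` and `broom::glance`; this is not necessarily the chapter's code, and `standardized_data` is an assumed name for the data frame of standardized features.

```r
library(tidyverse)
library(broom)

# Fit K-means for a range of K values and collect the total WSSD of each fit.
penguin_clust_ks <- tibble(k = 1:9) |>
  mutate(penguin_clusts = map(k, ~ kmeans(standardized_data, centers = .x)),
         glanced = map(penguin_clusts, glance))

clustering_statistics <- penguin_clust_ks |>
  unnest(glanced) |>
  select(k, tot.withinss)
```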
@@ -904,15 +904,16 @@ Unfortunately, for K = 8 we had an unlucky initialization
 and found a bad clustering! \index{K-means!restart,nstart}
 We can help prevent finding a bad clustering
 by trying a few different random initializations
-via the `nstart` argument (here we use 10 restarts).
+via the `nstart` argument (Figure \@ref(fig:10-choose-k-nstart)
+shows a setup where we use 10 restarts).
 When we do this, K-means clustering will be performed
 the number of times specified by the `nstart` argument,
 and R will return to us the best clustering from this.
 The more times we perform K-means clustering,
 the more likely we are to find a good clustering (if one exists).
-What value should you choose for `nstart`? The answer is it depends.
-It depends on the size of your data set,
-and the speed and size of your computer.
+What value should you choose for `nstart`? The answer is that it depends
+on many factors: the size and characteristics of your data set,
+as well as the speed and size of your computer.
 The larger the `nstart` value the better from an analysis perspective,
 but there is a trade-off that doing many clusterings
 could take a long time.
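
For completeness, a minimal sketch of the `nstart` usage this hunk describes, assuming `standardized_data` is the data frame of standardized features and using the K = 8 case mentioned above:

```r
set.seed(1)

# Run K-means from 10 different random initializations; R keeps the run
# with the smallest total within-cluster sum of squares (total WSSD).
penguin_clust_k8 <- kmeans(standardized_data, centers = 8, nstart = 10)

penguin_clust_k8$tot.withinss  # total WSSD of the best of the 10 runs
```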
