Next, we can create a scatter plot using this data set
to see if we can detect subtypes or groups in our data set.

-```{r 10-toy-example-plot, warning = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
+\newpage
+
+```{r 10-toy-example-plot, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
ggplot(data, aes(x = flipper_length_standardized,
                 y = bill_length_standardized)) +
  geom_point() +
@@ -203,7 +206,7 @@ This procedure will separate the data into groups;
Figure \@ref(fig:10-toy-example-clustering) shows these groups
denoted by colored scatter points.

-```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
+```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
ggplot(data, aes(y = bill_length_standardized,
                 x = flipper_length_standardized, color = cluster)) +
  geom_point() +
@@ -261,7 +264,7 @@ in Figure \@ref(fig:10-toy-example-clus1-center).

(ref:10-toy-example-clus1-center) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red.

base <- ggplot(data, aes(x = flipper_length_standardized, y = bill_length_standardized)) +
  geom_point() +
  xlab("Flipper Length (standardized)") +
@@ -308,7 +311,7 @@ These distances are denoted by lines in Figure \@ref(fig:10-toy-example-clus1-dists).

(ref:10-toy-example-clus1-dists) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines.

(ref:10-toy-example-all-clus-dists) All clusters from the `penguin_data` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines.
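For reference, here is a minimal sketch of the quantity these figures depict, assuming a hypothetical data frame `clus1` containing only the cluster 1 observations with the two standardized columns used above (the chapter's actual object names may differ):

```r
library(tidyverse)

# the cluster center is the mean of each variable over the cluster's points
center <- clus1 |>
  summarize(flipper_center = mean(flipper_length_standardized),
            bill_center = mean(bill_length_standardized))

# squared straight-line distance from each observation to that center;
# summing these gives the within-cluster sum of squared distances (WSSD)
clus1 |>
  mutate(dist_sq = (flipper_length_standardized - center$flipper_center)^2 +
                   (bill_length_standardized - center$bill_center)^2) |>
  summarize(wssd = sum(dist_sq))
```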
@@ -599,7 +602,7 @@ These, however, are beyond the scope of this book.
Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
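A common safeguard against such an unlucky start is to run the algorithm from several random initializations and keep the best result. A minimal sketch, assuming `standardized_data` is a data frame holding just the two standardized columns (a placeholder name, not the chapter's own object):

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible

# a single random start can settle into a poor local optimum
single_start <- kmeans(standardized_data, centers = 3, nstart = 1)

# nstart = 10 runs K-means from 10 random starts and keeps the run with
# the smallest total within-cluster sum of squared distances
many_starts <- kmeans(standardized_data, centers = 3, nstart = 10)

c(single_start$tot.withinss, many_starts$tot.withinss)
```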
@@ -620,7 +623,7 @@ Figure \@ref(fig:10-toy-kmeans-bad-iter) shows what the iterations of K-means would look like in this case.

(ref:10-toy-kmeans-bad-iter) First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
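To make the center update and reassignment steps concrete, here is a rough sketch of one K-means iteration written out by hand; `points` and `centers` are hypothetical matrices with one row per observation (or center) and the two standardized variables as columns:

```r
# reassignment: assign each observation to its closest center
assignments <- apply(points, 1, function(p) {
  which.min(colSums((t(centers) - p)^2))
})

# center update: each center becomes the mean of its assigned observations
# (for simplicity, this sketch assumes every cluster keeps at least one point)
centers <- t(sapply(seq_len(nrow(centers)), function(k) {
  colMeans(points[assignments == k, , drop = FALSE])
}))

# K-means alternates these two steps until the assignments stop changing
```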
@@ -778,7 +781,7 @@ Figure \@ref(fig:10-toy-kmeans-vary-k) illustrates the impact of K
on K-means clustering of our penguin flipper and bill length data
by showing the different clusterings for K's ranging from 1 to 9.

-```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
+```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.pos = "H", out.extra="", fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
set.seed(3)

kclusts <- tibble(k = 1:9) |>
@@ -842,7 +845,7 @@ decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
the right number of clusters (Figure \@ref(fig:10-toy-kmeans-elbow)).

-```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.5, fig.width = 4.5, fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
+```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
p2 <- ggplot(clusterings, aes(x = k, y = tot.withinss)) +
  geom_point(size = 2) +
  geom_line() +
@@ -933,7 +936,7 @@ clustered_data
Now that we have this information in a tidy data frame, we can make a visualization
of the cluster assignments for each point, as shown in Figure \@ref(fig:10-plot-clusters-2).

-```{r 10-plot-clusters-2, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "The data colored by the cluster assignments returned by K-means."}
+```{r 10-plot-clusters-2, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "The data colored by the cluster assignments returned by K-means."}
cluster_plot <- ggplot(clustered_data,
                       aes(x = flipper_length_mm,
                           y = bill_length_mm,
@@ -1042,7 +1045,7 @@ clustering_statistics
Now that we have `tot.withinss` and `k` as columns in a data frame, we can make a line plot
(Figure \@ref(fig:10-plot-choose-k)) and search for the "elbow" to find which value of K to use.

-```{r 10-plot-choose-k, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters."}
+```{r 10-plot-choose-k, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "A plot showing the total WSSD versus the number of clusters."}
elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
  geom_point() +
  geom_line() +
@@ -1077,7 +1080,7 @@ but there is a trade-off that doing many clusterings
could take a long time.
So this is something that needs to be balanced.

-```{r 10-choose-k-nstart, fig.height = 3.5, fig.width = 4.5, message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
+```{r 10-choose-k-nstart, fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
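The cost of extra restarts is roughly linear in `nstart`: each restart is a full run of the algorithm, so a large `nstart` combined with a sweep over many values of K adds up quickly. A small illustration of the trade-off, again using the placeholder `standardized_data`:

```r
# ten restarts do roughly ten times the work of a single restart
system.time(kmeans(standardized_data, centers = 3, nstart = 1))
system.time(kmeans(standardized_data, centers = 3, nstart = 10))
```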