classification1.Rmd (+3 −3: 3 additions & 3 deletions)
@@ -455,7 +455,7 @@ You will see in the `mutate` \index{mutate} step below, we compute the straight-
 distance using the formula above: we square the differences between the two observations' perimeter
 and concavity coordinates, add the squared differences, and then take the square root.
 
-```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav <- bind_rows(cancer,
                           tibble(Perimeter = new_point[1],
                                  Concavity = new_point[2],
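
A note on the distance computation this hunk's context describes: the following is a minimal sketch, assuming the `cancer` data frame, the `Perimeter` and `Concavity` columns, and the `new_point` vector that appear in the surrounding diff. The pipeline itself is illustrative, not the file's actual code.

```r
library(dplyr)

# Straight-line (Euclidean) distance from each observation to the new point:
# square the coordinate differences, add them, then take the square root.
cancer |>
  select(Perimeter, Concavity) |>
  mutate(dist_from_new = sqrt((Perimeter - new_point[1])^2 +
                              (Concavity - new_point[2])^2)) |>
  arrange(dist_from_new)
```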
@@ -1096,7 +1096,7 @@ The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced).
 cancer |> filter(Class == "M") |> slice_head(n = 3)
@@ -1255,7 +1255,7 @@ classifier would make. We can see that the decision is more reasonable; when the
 to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.
 
-```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
+```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
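
The upsampling that produces the data plotted above happens outside this hunk. As a hedged sketch of the idea only, not the book's implementation (which may use a recipes-based upsampling step): the minority class can be resampled with replacement until the class counts match.

```r
library(dplyr)
set.seed(1)

# Rebalance by resampling the malignant (minority) class with replacement
# until it matches the benign class count. Object names follow the diff
# context; the exact rebalancing code is an assumption.
benign    <- cancer |> filter(Class == "B")
malignant <- cancer |> filter(Class == "M")
upsampled_cancer <- bind_rows(
  benign,
  slice_sample(malignant, n = nrow(benign), replace = TRUE)
)
upsampled_cancer |> count(Class)
```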
@@ -168,7 +169,9 @@
 Next, we can create a scatter plot using this data set
 to see if we can detect subtypes or groups in our data set.
 
-```{r 10-toy-example-plot, warning = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
+\newpage
+
+```{r 10-toy-example-plot, warning = FALSE, fig.height = 3.25, fig.width = 3.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
 ggplot(data, aes(x = flipper_length_standardized,
                  y = bill_length_standardized)) +
   geom_point() +
@@ -203,7 +206,7 @@ This procedure will separate the data into groups;
 Figure \@ref(fig:10-toy-example-clustering) shows these groups
 denoted by colored scatter points.
 
-```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
+```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
 ggplot(data, aes(y = bill_length_standardized,
                  x = flipper_length_standardized, color = cluster)) +
   geom_point() +
@@ -261,7 +264,7 @@ in Figure \@ref(fig:10-toy-example-clus1-center).
 
 (ref:10-toy-example-clus1-center) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red.
 base <- ggplot(data, aes(x = flipper_length_standardized, y = bill_length_standardized)) +
   geom_point() +
   xlab("Flipper Length (standardized)") +
@@ -308,7 +311,7 @@ These distances are denoted by lines in Figure \@ref(fig:10-toy-example-clus1-dists).
 
 (ref:10-toy-example-clus1-dists) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines.
 (ref:10-toy-example-all-clus-dists) All clusters from the `penguin_data` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines.
 We begin the K-means \index{K-means!algorithm} algorithm by picking K,
@@ -597,7 +602,7 @@ These, however, are beyond the scope of this book.
 Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
 For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
@@ -618,7 +623,7 @@ Figure \@ref(fig:10-toy-kmeans-bad-iter) shows what the iterations of K-means would
 
 (ref:10-toy-kmeans-bad-iter) First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each pair of plots corresponds to an iteration. Within the pair, the first plot depicts the center update, and the second plot depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
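
The caption's two-step structure mirrors the K-means iteration itself: update the centers, then reassign points. A bare-bones sketch of those two steps, with hypothetical inputs `X` (a numeric matrix of observations) and `centers` (a K-row matrix of current centers):

```r
# Step 1 of an iteration: assign each point to its nearest center.
assign_clusters <- function(X, centers) {
  # d[i, c] = squared distance from point i to center c
  d <- apply(centers, 1, function(ctr) colSums((t(X) - ctr)^2))
  max.col(-d)  # index of the nearest center for each point
}

# Step 2: recompute each center as the mean of its assigned points.
update_centers <- function(X, labels, k) {
  t(sapply(1:k, function(c) colMeans(X[labels == c, , drop = FALSE])))
}
```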
@@ -776,7 +781,7 @@ Figure \@ref(fig:10-toy-kmeans-vary-k) illustrates the impact of K
 on K-means clustering of our penguin flipper and bill length data
 by showing the different clusterings for K's ranging from 1 to 9.
 
-```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
+```{r 10-toy-kmeans-vary-k, echo = FALSE, warning = FALSE, fig.height = 6.25, fig.width = 6, fig.pos = "H", out.extra="", fig.cap = "Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black."}
 set.seed(3)
 
 kclusts <- tibble(k = 1:9) |>
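
The `kclusts <- tibble(k = 1:9) |>` line that closes this hunk is the start of the standard map-over-K pattern. One hedged way such a table might be completed, assuming a numeric `standardized_data` frame; the object and column names here are guesses, not the file's code:

```r
library(tidyverse)
library(broom)

set.seed(3)
# Fit K-means once per candidate K and pull out model-level summaries;
# glance() on a kmeans fit exposes tot.withinss for the elbow plot below.
kclusts <- tibble(k = 1:9) |>
  mutate(kclust = map(k, ~ kmeans(standardized_data, centers = .x)),
         glanced = map(kclust, glance)) |>
  unnest(glanced)
```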
@@ -840,7 +845,7 @@ decrease the total WSSD, but by only a *diminishing amount*. If we plot the total
 clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
 the right number of clusters (Figure \@ref(fig:10-toy-kmeans-elbow)).
 
-```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.5, fig.width = 4.5, fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
+```{r 10-toy-kmeans-elbow, echo = FALSE, warning = FALSE, fig.align = 'center', fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", fig.cap = "Total WSSD for K clusters ranging from 1 to 9."}
 p2 <- ggplot(clusterings, aes(x = k, y = tot.withinss)) +
   geom_point(size = 2) +
   geom_line() +
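
On the quantity being plotted here: total WSSD is the sum of squared distances from each observation to its assigned cluster center. A small sketch verifying this against base R's `kmeans`, under the same hypothetical `standardized_data` assumption as above:

```r
fit <- kmeans(standardized_data, centers = 3)
X <- as.matrix(standardized_data)

# Expand each point's assigned center to a row, subtract, square, and sum.
wssd <- sum((X - fit$centers[fit$cluster, ])^2)
all.equal(wssd, fit$tot.withinss)  # should be TRUE
```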
@@ -931,7 +936,7 @@ clustered_data
 Now that we have this information in a tidy data frame, we can make a visualization
 of the cluster assignments for each point, as shown in Figure \@ref(fig:10-plot-clusters-2).
 
-```{r 10-plot-clusters-2, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "The data colored by the cluster assignments returned by K-means."}
+```{r 10-plot-clusters-2, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "The data colored by the cluster assignments returned by K-means."}
 cluster_plot <- ggplot(clustered_data,
                        aes(x = flipper_length_mm,
                            y = bill_length_mm,
@@ -1040,7 +1045,7 @@ clustering_statistics
 Now that we have `tot.withinss` and `k` as columns in a data frame, we can make a line plot
 (Figure \@ref(fig:10-plot-choose-k)) and search for the "elbow" to find which value of K to use.
 
-```{r 10-plot-choose-k, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters."}
+```{r 10-plot-choose-k, fig.height = 3.25, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "A plot showing the total WSSD versus the number of clusters."}
 elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
   geom_point() +
   geom_line() +
@@ -1075,7 +1080,7 @@ but there is a trade-off that doing many clusterings
 could take a long time.
 So this is something that needs to be balanced.
 
-```{r 10-choose-k-nstart, fig.height = 3.5, fig.width = 4.5, message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
+```{r 10-choose-k-nstart, fig.height = 3.25, fig.width = 4.25, fig.pos = "H", out.extra="", message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
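
Context for the trade-off discussed above: base R's `kmeans()` accepts an `nstart` argument that reruns the random initialization that many times and keeps the solution with the lowest total WSSD, at a roughly proportional compute cost. For example, with hypothetical data and K:

```r
# 10 random restarts; kmeans() keeps the run with the smallest tot.withinss.
penguin_clust <- kmeans(standardized_data, centers = 4, nstart = 10)
penguin_clust$tot.withinss
```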