source/classification1.Rmd: 15 additions & 15 deletions
@@ -243,7 +243,7 @@ we select our own colorblind-friendly colors—`"darkorange"`
 for orange and `"steelblue"` for blue—and
 pass them as the `values` argument to the `scale_color_manual` function.
 
-```{r 05-scatter, fig.height = 3.5, fig.width = 4.5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
+```{r 05-scatter, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
 perim_concav <- cancer |>
   ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
   geom_point(alpha = 0.6) +
@@ -320,7 +320,7 @@ new observation, with standardized perimeter of `r new_point[1]` and standardize
 diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in
 Figure \@ref(fig:05-knn-1).
 
-```{r 05-knn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-knn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav_with_new_point <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
          Concavity = new_point[2],
@@ -349,7 +349,7 @@ then the perimeter and concavity values are similar, and so we may expect that
 they would have the same diagnosis.
 
 
-```{r 05-knn-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
+```{r 05-knn-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
 perim_concav_with_new_point +
   geom_segment(aes(
     x = new_point[1],
@@ -374,7 +374,7 @@ Does this seem like the right prediction to make for this observation? Probably
 not, if you consider the other nearby points.
 
 
-```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
+```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
 
 perim_concav_with_new_point2 <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
@@ -411,7 +411,7 @@ see that the diagnoses of 2 of the 3 nearest neighbors to our new observation
 are malignant. Therefore we take majority vote and classify our new red, diamond
 observation as malignant.
 
-```{r 05-knn-5, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors."}
+```{r 05-knn-5, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors."}
 perim_concav_with_new_point2 +
   geom_segment(aes(
     x = new_point[1], y = new_point[2],
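The context of the hunk above is the book's majority-vote step: 2 of the 3 nearest neighbors are malignant, so the prediction is malignant. As a language-neutral illustration of that vote (plain Python with made-up labels, not the book's R code), the decision is just a most-common count:

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common class label among a list of neighbor labels."""
    return Counter(labels).most_common(1)[0][0]

# Labels of the 3 nearest neighbors in the walkthrough above.
neighbor_labels = ["Malignant", "Malignant", "Benign"]
print(majority_vote(neighbor_labels))  # Malignant
```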
@@ -462,7 +462,7 @@ distance using the formula above: we square the differences between the two observations' perimeter
 and concavity coordinates, add the squared differences, and then take the square root.
 In order to find the $K=5$ nearest neighbors, we will use the `slice_min` function. \index{slice\_min}
 
-```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
          Concavity = new_point[2],
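This hunk's context describes the straight-line distance computation (square the coordinate differences, sum them, take the square root) and using `slice_min` to keep the $K=5$ smallest distances. A rough Python equivalent with invented standardized coordinates, purely for illustration and not the book's R code:

```python
import math

def euclidean(a, b):
    """Square the coordinate differences, sum them, and take the square root."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical standardized (perimeter, concavity) training points.
training = [(0.2, 0.5), (1.1, 0.9), (-0.3, 0.1), (0.4, 0.6), (2.0, 1.5), (0.0, 0.0)]
new_point = (0.5, 0.5)

# Analogue of slice_min: keep the 5 rows with the smallest distance.
nearest5 = sorted(training, key=lambda p: euclidean(p, new_point))[:5]
```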
@@ -540,7 +540,7 @@ The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are
 malignant; since this is the majority, we classify our new observation as malignant.
 These 5 neighbors are circled in Figure \@ref(fig:05-multiknn-3).
 
-```{r 05-multiknn-3, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
+```{r 05-multiknn-3, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
 perim_concav + annotate("path",
   x = new_point[1] + 1.4 * cos(seq(0, 2 * pi,
     length.out = 100
@@ -598,7 +598,7 @@ the new observation as malignant since 4 out of 5 of the nearest neighbors are f
 Figure \@ref(fig:05-more) shows what the data look like when we visualize them
 as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors.
 
-```{r 05-more, echo = FALSE, message = FALSE, fig.cap = "3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes.", fig.retina=2, out.width="100%"}
+```{r 05-more, echo = FALSE, message = FALSE, fig.align = "center", fig.cap = "3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes.", fig.retina=2, out.width="100%"}
 attrs <- c("Perimeter", "Concavity", "Symmetry")
 
 # create new scaled obs and get NNs
@@ -945,7 +945,7 @@ Standardizing your data should be a part of the preprocessing you do
 before predictive modeling and you should always think carefully about your problem domain and
 whether you need to standardize your data.
 
-```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data."}
+```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.align = "center", fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data."}
 cancer |> filter(Class == "Malignant") |> slice_head(n = 3)
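The context here stresses standardizing predictors before K-nearest neighbors, so that no variable dominates the distance purely because of its scale. Standardization (z-scoring) subtracts each variable's mean and divides by its standard deviation; a minimal Python sketch with made-up values, not the book's R preprocessing code:

```python
import statistics

def standardize(values):
    """Center to mean 0 and scale to standard deviation 1 (z-scores)."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical raw perimeter values on their original scale.
perimeter = [80.0, 100.0, 120.0]
print(standardize(perimeter))  # [-1.0, 0.0, 1.0]
```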
@@ -1130,7 +1130,7 @@ benign, and the benign vote will always win. For example, Figure \@ref(fig:05-upsample)
 shows what happens for a new tumor observation that is quite close to three observations
 in the training data that were tagged as malignant.
 
-```{r 05-upsample, echo=FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data with 7 nearest neighbors to a new observation highlighted."}
+```{r 05-upsample, echo=FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Imbalanced data with 7 nearest neighbors to a new observation highlighted."}
@@ -1179,7 +1179,7 @@ each area of the plot to the predictions the K-nearest neighbors
 classifier would make. We can see that the decision is
 always "benign," corresponding to the blue color.
 
-```{r 05-upsample-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data."}
+```{r 05-upsample-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data."}
@@ -1259,7 +1259,7 @@ classifier would make. We can see that the decision is more reasonable; when the new observations are closer
 to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.
 
-```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
+```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
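The imbalance hunks above describe oversampling the rare class so that the malignant vote can win when the new point sits near malignant neighbors. One common form of this is resampling the minority class with replacement until the classes balance; a hypothetical Python sketch of that idea, not the book's R code:

```python
import random

def upsample(rows, labels, minority):
    """Resample minority-class rows with replacement until classes balance."""
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    majority_rows = [r for r, l in zip(rows, labels) if l != minority]
    extra = [random.choice(minority_rows)
             for _ in range(len(majority_rows) - len(minority_rows))]
    return rows + extra, labels + [minority] * len(extra)

random.seed(1)
rows = [1, 2, 3, 4, 5, 6]                      # stand-ins for observations
labels = ["B", "B", "B", "B", "M", "M"]        # 4 benign, 2 malignant
new_rows, new_labels = upsample(rows, labels, "M")
```

After upsampling, both classes contribute four rows, so a nearby malignant cluster can now win a 7-nearest-neighbor vote.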
@@ -1455,7 +1455,7 @@ predict the label of each, and visualize the predictions with a colored scatter
 > textbook. It is included for those readers who would like to use similar
 > visualizations in their own data analyses.
 
-```{r 05-workflow-plot-show, fig.height = 3.5, fig.width = 4.6, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
+```{r 05-workflow-plot-show, fig.height = 3.5, fig.width = 4.6, fig.align = "center", fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
 # create the grid of area/smoothness vals, and arrange in a data frame
source/classification2.Rmd: 12 additions & 12 deletions
@@ -94,7 +94,7 @@ labels for new observations without known class labels.
 > is. Imagine how bad it would be to overestimate your classifier's accuracy
 > when predicting whether a patient's tumor is malignant or benign!
 
-```{r 06-training-test, echo = FALSE, warning = FALSE, fig.cap = "Splitting the data into training and testing sets.", fig.retina = 2, out.width = "100%"}
+```{r 06-training-test, echo = FALSE, warning = FALSE, fig.align = "center", fig.cap = "Splitting the data into training and testing sets.", fig.retina = 2, out.width = "100%"}
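The figure this hunk modifies illustrates splitting the data into training and testing sets. Conceptually, the split is a shuffled partition of the rows; a minimal Python sketch under that assumption (the function name and fraction here are invented for illustration, not the book's API):

```python
import random

def train_test_split(rows, train_frac=0.75, seed=2):
    """Shuffle the rows, then partition into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# 75% of 100 row indices go to training, the remaining 25% to testing.
train, test = train_test_split(range(100))
```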
@@ -989,7 +989,7 @@
 We can decide which number of neighbors is best by plotting the accuracy versus $K$,
 as shown in Figure \@ref(fig:06-find-k).
 
-```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
+```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.align = "center", fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
 accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
   geom_point() +
   geom_line() +
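The plot built in the hunk above shows estimated accuracy versus the number of neighbors; choosing the best $K$ then amounts to taking the maximizer of that curve. A toy Python sketch with invented accuracy estimates, not the book's tidymodels output:

```python
# Hypothetical (neighbors, estimated accuracy) pairs from cross-validation.
accuracies = [(1, 0.85), (3, 0.88), (5, 0.90), (7, 0.89), (9, 0.86)]

# The best K maximizes the estimated accuracy.
best_k, best_acc = max(accuracies, key=lambda pair: pair[1])
print(best_k)  # 5
```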
@@ -1049,7 +1049,7 @@ we vary $K$ from 1 to almost the number of observations in the training set.
 set.seed(1)
 ```
 
-```{r 06-lots-of-ks, message = FALSE, fig.height = 3.5, fig.width = 4, fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
+```{r 06-lots-of-ks, message = FALSE, fig.height = 3.5, fig.width = 4, fig.align = "center", fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
 k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10))
 
 knn_results <- workflow() |>
@@ -1093,7 +1093,7 @@ new data: if we had a different training set, the predictions would be
 completely different. In general, if the model *is influenced too much* by the
 training data, it is said to **overfit** the data.
 
-```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.cap = "Effect of K in overfitting and underfitting."}
+```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Effect of K in overfitting and underfitting."}
 ks <- c(1, 7, 20, 300)
 plots <- list()
 
@@ -1256,7 +1256,7 @@ by maximizing estimated accuracy via cross-validation. After we have tuned the
 model we can use the test set to estimate its accuracy.
 The overall process is summarized in Figure \@ref(fig:06-overview).
@@ -1431,4 +1431,4 @@
-```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
+```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.align = "center", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
 res_tmp <- res %>% pivot_longer(cols=c("accs", "fixedaccs"),
                                 names_to="Type",
                                 values_to="accuracy")
@@ -1657,7 +1657,7 @@ predictors from the model! It is always worth remembering, however, that what cross-validation gives you
 is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide
 where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.
 
-```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection.", fig.pos = "H"}
+```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.align = "center", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection.", fig.pos = "H"}
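This final hunk's context describes forward selection: repeatedly add the predictor that most improves estimated accuracy, then judge where the elbow occurs. A skeletal Python sketch of that greedy loop with a stand-in scoring function (the names and scores below are invented for illustration, not the book's code):

```python
def forward_select(candidates, score):
    """Greedily add the predictor that most improves the score at each step."""
    selected, history = [], []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda p: score(selected + [p]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(selected)))
    return history

# Toy score: pretend only "Perimeter" and "Concavity" carry signal.
def toy_score(preds):
    return 0.6 + 0.2 * ("Perimeter" in preds) + 0.1 * ("Concavity" in preds)

steps = forward_select(["Symmetry", "Perimeter", "Concavity"], toy_score)
```

In real use the score would be a cross-validated accuracy estimate, and you would stop at the elbow rather than exhausting all candidates.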