
Commit 7f7db50

centering all figs; new figs in wrangling (just names; not committing new img files yet)
1 parent a42ed8f commit 7f7db50

13 files changed: +201 -212 lines
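
The change repeated throughout both diffs below is the addition of the knitr chunk option `fig.align = "center"` to each figure-producing chunk header. As a reference, a hypothetical chunk using this pattern might look like the following (the chunk name is illustrative, and it assumes the chapter's `cancer` data frame and the tidyverse are already loaded, as in the book's earlier chunks):

```{r example-centered-fig, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Example centered figure."}
# fig.align = "center" centers the rendered figure on the page;
# per-chunk options like this override any global knitr defaults
ggplot(cancer, aes(x = Perimeter, y = Concavity, color = Class)) +
  geom_point(alpha = 0.6)
```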

source/classification1.Rmd

Lines changed: 15 additions & 15 deletions
@@ -243,7 +243,7 @@ we select our own colorblind-friendly colors—`"darkorange"`
 for orange and `"steelblue"` for blue—and
 pass them as the `values` argument to the `scale_color_manual` function.

-```{r 05-scatter, fig.height = 3.5, fig.width = 4.5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
+```{r 05-scatter, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
 perim_concav <- cancer |>
 ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
 geom_point(alpha = 0.6) +

@@ -320,7 +320,7 @@ new observation, with standardized perimeter of `r new_point[1]` and standardize
 diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in
 Figure \@ref(fig:05-knn-1).

-```{r 05-knn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-knn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav_with_new_point <- bind_rows(cancer,
 tibble(Perimeter = new_point[1],
 Concavity = new_point[2],

@@ -349,7 +349,7 @@ then the perimeter and concavity values are similar, and so we may expect that
 they would have the same diagnosis.


-```{r 05-knn-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
+```{r 05-knn-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
 perim_concav_with_new_point +
 geom_segment(aes(
 x = new_point[1],

@@ -374,7 +374,7 @@ Does this seem like the right prediction to make for this observation? Probably
 not, if you consider the other nearby points.


-```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
+```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}

 perim_concav_with_new_point2 <- bind_rows(cancer,
 tibble(Perimeter = new_point[1],

@@ -411,7 +411,7 @@ see that the diagnoses of 2 of the 3 nearest neighbors to our new observation
 are malignant. Therefore we take majority vote and classify our new red, diamond
 observation as malignant.

-```{r 05-knn-5, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors."}
+```{r 05-knn-5, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors."}
 perim_concav_with_new_point2 +
 geom_segment(aes(
 x = new_point[1], y = new_point[2],

@@ -462,7 +462,7 @@ distance using the formula above: we square the differences between the two obse
 and concavity coordinates, add the squared differences, and then take the square root.
 In order to find the $K=5$ nearest neighbors, we will use the `slice_min` function. \index{slice\_min}

-```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav <- bind_rows(cancer,
 tibble(Perimeter = new_point[1],
 Concavity = new_point[2],

@@ -540,7 +540,7 @@ The result of this computation shows that 3 of the 5 nearest neighbors to our ne
 malignant; since this is the majority, we classify our new observation as malignant.
 These 5 neighbors are circled in Figure \@ref(fig:05-multiknn-3).

-```{r 05-multiknn-3, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
+```{r 05-multiknn-3, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
 perim_concav + annotate("path",
 x = new_point[1] + 1.4 * cos(seq(0, 2 * pi,
 length.out = 100

@@ -598,7 +598,7 @@ the new observation as malignant since 4 out of 5 of the nearest neighbors are f
 Figure \@ref(fig:05-more) shows what the data look like when we visualize them
 as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors.

-```{r 05-more, echo = FALSE, message = FALSE, fig.cap = "3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes.", fig.retina=2, out.width="100%"}
+```{r 05-more, echo = FALSE, message = FALSE, fig.align = "center", fig.cap = "3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes.", fig.retina=2, out.width="100%"}
 attrs <- c("Perimeter", "Concavity", "Symmetry")

 # create new scaled obs and get NNs

@@ -945,7 +945,7 @@ Standardizing your data should be a part of the preprocessing you do
 before predictive modeling and you should always think carefully about your problem domain and
 whether you need to standardize your data.

-```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data."}
+```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.align = "center", fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data."}

 attrs <- c("Area", "Smoothness")


@@ -1034,7 +1034,7 @@ ggarrange(unscaled, scaled, ncol = 2, common.legend = TRUE, legend = "bottom")

 ```

-```{r 05-scaling-plt-zoomed, fig.height = 4.5, fig.width = 9, echo = FALSE, fig.cap = "Close-up of three nearest neighbors for unstandardized data."}
+```{r 05-scaling-plt-zoomed, fig.height = 4.5, fig.width = 9, echo = FALSE, fig.align = "center", fig.cap = "Close-up of three nearest neighbors for unstandardized data."}
 library(ggforce)
 ggplot(unscaled_cancer, aes(x = Area,
 y = Smoothness,

@@ -1102,7 +1102,7 @@ The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced).
 set.seed(3)
 ```

-```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Imbalanced data."}
+```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Imbalanced data."}
 rare_cancer <- bind_rows(
 filter(cancer, Class == "Benign"),
 cancer |> filter(Class == "Malignant") |> slice_head(n = 3)

@@ -1130,7 +1130,7 @@ benign, and the benign vote will always win. For example, Figure \@ref(fig:05-up
 shows what happens for a new tumor observation that is quite close to three observations
 in the training data that were tagged as malignant.

-```{r 05-upsample, echo=FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data with 7 nearest neighbors to a new observation highlighted."}
+```{r 05-upsample, echo=FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Imbalanced data with 7 nearest neighbors to a new observation highlighted."}
 new_point <- c(2, 2)
 attrs <- c("Perimeter", "Concavity")
 my_distances <- table_with_distances(rare_cancer[, attrs], new_point)

@@ -1179,7 +1179,7 @@ each area of the plot to the predictions the K-nearest neighbors
 classifier would make. We can see that the decision is
 always "benign," corresponding to the blue color.

-```{r 05-upsample-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data."}
+```{r 05-upsample-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data."}

 knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
 set_engine("kknn") |>

@@ -1259,7 +1259,7 @@ classifier would make. We can see that the decision is more reasonable; when the
 to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.

-```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
+```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
 knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
 set_engine("kknn") |>
 set_mode("classification")

@@ -1455,7 +1455,7 @@ predict the label of each, and visualize the predictions with a colored scatter
 > textbook. It is included for those readers who would like to use similar
 > visualizations in their own data analyses.

-```{r 05-workflow-plot-show, fig.height = 3.5, fig.width = 4.6, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
+```{r 05-workflow-plot-show, fig.height = 3.5, fig.width = 4.6, fig.align = "center", fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
 # create the grid of area/smoothness vals, and arrange in a data frame
 are_grid <- seq(min(unscaled_cancer$Area),
 max(unscaled_cancer$Area),

source/classification2.Rmd

Lines changed: 12 additions & 12 deletions
@@ -94,7 +94,7 @@ labels for new observations without known class labels.
 > is. Imagine how bad it would be to overestimate your classifier's accuracy
 > when predicting whether a patient's tumor is malignant or benign!

-```{r 06-training-test, echo = FALSE, warning = FALSE, fig.cap = "Splitting the data into training and testing sets.", fig.retina = 2, out.width = "100%"}
+```{r 06-training-test, echo = FALSE, warning = FALSE, fig.align = "center", fig.cap = "Splitting the data into training and testing sets.", fig.retina = 2, out.width = "100%"}
 knitr::include_graphics("img/classification2/training_test.png")
 ```


@@ -108,7 +108,7 @@ test set is illustrated in Figure \@ref(fig:06-ML-paradigm-test).

 $$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}}$$

-```{r 06-ML-paradigm-test, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Process for splitting the data and finding the prediction accuracy.", fig.retina = 2, out.width = "100%"}
+```{r 06-ML-paradigm-test, echo = FALSE, message = FALSE, warning = FALSE, fig.align = "center", fig.cap = "Process for splitting the data and finding the prediction accuracy.", fig.retina = 2, out.width = "100%"}
 knitr::include_graphics("img/classification2/ML-paradigm-test.png")
 ```


@@ -322,7 +322,7 @@ tumor cell concavity versus smoothness colored by diagnosis in Figure \@ref(fig:
 You will also notice that we set the random seed here at the beginning of the analysis
 using the `set.seed` function, as described in Section \@ref(randomseeds).

-```{r 06-precode, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label.", message = F, warning = F}
+```{r 06-precode, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label.", message = F, warning = F}
 # load packages
 library(tidyverse)
 library(tidymodels)

@@ -793,7 +793,7 @@ Here, $C=5$ different chunks of the data set are used,
 resulting in 5 different choices for the **validation set**; we call this
 *5-fold* cross-validation.

-```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.pos = "H", out.extra="", fig.retina = 2, out.width = "100%"}
+```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.align = "center", fig.cap = "5-fold cross-validation.", fig.pos = "H", out.extra="", fig.retina = 2, out.width = "100%"}
 knitr::include_graphics("img/classification2/cv.png")
 ```


@@ -989,7 +989,7 @@ accuracies
 We can decide which number of neighbors is best by plotting the accuracy versus $K$,
 as shown in Figure \@ref(fig:06-find-k).

-```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
+```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.align = "center", fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
 accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
 geom_point() +
 geom_line() +

@@ -1049,7 +1049,7 @@ we vary $K$ from 1 to almost the number of observations in the training set.
 set.seed(1)
 ```

-```{r 06-lots-of-ks, message = FALSE, fig.height = 3.5, fig.width = 4, fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
+```{r 06-lots-of-ks, message = FALSE, fig.height = 3.5, fig.width = 4, fig.align = "center", fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
 k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10))

 knn_results <- workflow() |>

@@ -1093,7 +1093,7 @@ new data: if we had a different training set, the predictions would be
 completely different. In general, if the model *is influenced too much* by the
 training data, it is said to **overfit** the data.

-```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.cap = "Effect of K in overfitting and underfitting."}
+```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Effect of K in overfitting and underfitting."}
 ks <- c(1, 7, 20, 300)
 plots <- list()


@@ -1256,7 +1256,7 @@ by maximizing estimated accuracy via cross-validation. After we have tuned the
 model we can use the test set to estimate its accuracy.
 The overall process is summarized in Figure \@ref(fig:06-overview).

-```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Overview of K-NN classification.", fig.retina = 2, out.width = "100%"}
+```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Overview of K-NN classification.", fig.retina = 2, out.width = "100%"}
 knitr::include_graphics("img/classification2/train-test-overview.png")
 ```


@@ -1344,7 +1344,7 @@ variables there are, the more (random) influence they have, and the more they
 corrupt the set of nearest neighbors that vote on the class of the new
 observation to predict.

-```{r 06-performance-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Effect of inclusion of irrelevant predictors."}
+```{r 06-performance-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.align = "center", fig.cap = "Effect of inclusion of irrelevant predictors."}
 # get accuracies after including k irrelevant features
 ks <- c(0, 5, 10, 15, 20, 40)
 fixedaccs <- list()

@@ -1418,7 +1418,7 @@ variables, the number of neighbors does not increase smoothly; but the general t
 Figure \@ref(fig:06-fixed-irrelevant-features) corroborates
 this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls off more quickly.

-```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors."}
+```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.align = "center", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors."}
 plt_irrelevant_nghbrs <- ggplot(res) +
 geom_line(mapping = aes(x=ks, y=nghbrs)) +
 labs(x = "Number of Irrelevant Predictors",

@@ -1428,7 +1428,7 @@ plt_irrelevant_nghbrs +
 plt_irrelevant_nghbrs
 ```

-```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
+```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.align = "center", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
 res_tmp <- res %>% pivot_longer(cols=c("accs", "fixedaccs"),
 names_to="Type",
 values_to="accuracy")

@@ -1657,7 +1657,7 @@ predictors from the model! It is always worth remembering, however, that what cr
 is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide
 where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.

-```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection.", fig.pos = "H"}
+```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.align = "center", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection.", fig.pos = "H"}

 fwd_sel_accuracies_plot <- accuracies |>
 ggplot(aes(x = size, y = accuracy)) +
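
Since every figure chunk receives the same option, an alternative (not what this commit does) would be to set the alignment once in a global setup chunk; a minimal sketch, assuming a standard knitr setup chunk near the top of each .Rmd file:

```{r setup, include = FALSE}
# hypothetical global default; any per-chunk fig.align setting still overrides it
knitr::opts_chunk$set(fig.align = "center")
```

Per-chunk options, as used throughout this commit, take precedence over this global default.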
