source/classification1.Rmd: 15 additions & 15 deletions
@@ -243,7 +243,7 @@ we select our own colorblind-friendly colors—`"darkorange"`
 for orange and `"steelblue"` for blue—and
 pass them as the `values` argument to the `scale_color_manual` function.
 
-```{r 05-scatter, fig.height = 3.5, fig.width = 4.5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
+```{r 05-scatter, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
 perim_concav <- cancer |>
   ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
   geom_point(alpha = 0.6) +
@@ -320,7 +320,7 @@ new observation, with standardized perimeter of `r new_point[1]` and standardize
 diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in
 Figure \@ref(fig:05-knn-1).
 
-```{r 05-knn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-knn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav_with_new_point <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
          Concavity = new_point[2],
@@ -349,7 +349,7 @@ then the perimeter and concavity values are similar, and so we may expect that
 they would have the same diagnosis.
 
 
-```{r 05-knn-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
+```{r 05-knn-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
 perim_concav_with_new_point +
   geom_segment(aes(
     x = new_point[1],
@@ -374,7 +374,7 @@ Does this seem like the right prediction to make for this observation? Probably
 not, if you consider the other nearby points.
 
 
-```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
+```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
 
 perim_concav_with_new_point2 <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
@@ -411,7 +411,7 @@ see that the diagnoses of 2 of the 3 nearest neighbors to our new observation
 are malignant. Therefore we take majority vote and classify our new red, diamond
 observation as malignant.
 
-```{r 05-knn-5, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors."}
+```{r 05-knn-5, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors."}
 perim_concav_with_new_point2 +
   geom_segment(aes(
     x = new_point[1], y = new_point[2],
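The context of the hunk above is the book's majority-vote step: 2 of the 3 nearest neighbors are malignant, so the prediction is malignant. As a language-neutral illustration of that vote (plain Python with made-up labels, not the book's R code), the decision is just a most-common count:

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common class label among a list of neighbor labels."""
    return Counter(labels).most_common(1)[0][0]

# Labels of the 3 nearest neighbors in the walkthrough above.
neighbor_labels = ["Malignant", "Malignant", "Benign"]
print(majority_vote(neighbor_labels))  # Malignant
```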
@@ -462,7 +462,7 @@ distance using the formula above: we square the differences between the two observations' perimeter
 and concavity coordinates, add the squared differences, and then take the square root.
 In order to find the $K=5$ nearest neighbors, we will use the `slice_min` function. \index{slice\_min}
 
-```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
          Concavity = new_point[2],
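This hunk's context describes the straight-line distance computation (square the coordinate differences, sum them, take the square root) and using `slice_min` to keep the $K=5$ smallest distances. A rough Python equivalent with invented standardized coordinates, purely for illustration and not the book's R code:

```python
import math

def euclidean(a, b):
    """Square the coordinate differences, sum them, and take the square root."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical standardized (perimeter, concavity) training points.
training = [(0.2, 0.5), (1.1, 0.9), (-0.3, 0.1), (0.4, 0.6), (2.0, 1.5), (0.0, 0.0)]
new_point = (0.5, 0.5)

# Analogue of slice_min: keep the 5 rows with the smallest distance.
nearest5 = sorted(training, key=lambda p: euclidean(p, new_point))[:5]
```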
@@ -540,7 +540,7 @@ The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are
 malignant; since this is the majority, we classify our new observation as malignant.
 These 5 neighbors are circled in Figure \@ref(fig:05-multiknn-3).
 
-```{r 05-multiknn-3, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
+```{r 05-multiknn-3, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
 perim_concav + annotate("path",
   x = new_point[1] + 1.4 * cos(seq(0, 2 * pi,
     length.out = 100
@@ -598,7 +598,7 @@ the new observation as malignant since 4 out of 5 of the nearest neighbors are f
 Figure \@ref(fig:05-more) shows what the data look like when we visualize them
 as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors.
 
-```{r 05-more, echo = FALSE, message = FALSE, fig.cap = "3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes.", fig.retina=2, out.width="100%"}
+```{r 05-more, echo = FALSE, message = FALSE, fig.align = "center", fig.cap = "3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes.", fig.retina=2, out.width="100%"}
 attrs <- c("Perimeter", "Concavity", "Symmetry")
 
 # create new scaled obs and get NNs
@@ -945,7 +945,7 @@ Standardizing your data should be a part of the preprocessing you do
 before predictive modeling and you should always think carefully about your problem domain and
 whether you need to standardize your data.
 
-```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data."}
+```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.align = "center", fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data."}
 cancer |> filter(Class == "Malignant") |> slice_head(n = 3)
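The context here stresses standardizing predictors before K-nearest neighbors, so that no variable dominates the distance purely because of its scale. Standardization (z-scoring) subtracts each variable's mean and divides by its standard deviation; a minimal Python sketch with made-up values, not the book's R preprocessing code:

```python
import statistics

def standardize(values):
    """Center to mean 0 and scale to standard deviation 1 (z-scores)."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical raw perimeter values on their original scale.
perimeter = [80.0, 100.0, 120.0]
print(standardize(perimeter))  # [-1.0, 0.0, 1.0]
```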
@@ -1130,7 +1130,7 @@ benign, and the benign vote will always win. For example, Figure \@ref(fig:05-upsample)
 shows what happens for a new tumor observation that is quite close to three observations
 in the training data that were tagged as malignant.
 
-```{r 05-upsample, echo=FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data with 7 nearest neighbors to a new observation highlighted."}
+```{r 05-upsample, echo=FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Imbalanced data with 7 nearest neighbors to a new observation highlighted."}
@@ -1179,7 +1179,7 @@ each area of the plot to the predictions the K-nearest neighbors
 classifier would make. We can see that the decision is
 always "benign," corresponding to the blue color.
 
-```{r 05-upsample-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data."}
+```{r 05-upsample-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.align = "center", fig.cap = "Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data."}
@@ -1259,7 +1259,7 @@ classifier would make. We can see that the decision is more reasonable; when the new observations are closer
 to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.
 
-```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
+```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
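The imbalance hunks above describe oversampling the rare class so that the malignant vote can win when the new point sits near malignant neighbors. One common form of this is resampling the minority class with replacement until the classes balance; a hypothetical Python sketch of that idea, not the book's R code:

```python
import random

def upsample(rows, labels, minority):
    """Resample minority-class rows with replacement until classes balance."""
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    majority_rows = [r for r, l in zip(rows, labels) if l != minority]
    extra = [random.choice(minority_rows)
             for _ in range(len(majority_rows) - len(minority_rows))]
    return rows + extra, labels + [minority] * len(extra)

random.seed(1)
rows = [1, 2, 3, 4, 5, 6]                      # stand-ins for observations
labels = ["B", "B", "B", "B", "M", "M"]        # 4 benign, 2 malignant
new_rows, new_labels = upsample(rows, labels, "M")
```

After upsampling, both classes contribute four rows, so a nearby malignant cluster can now win a 7-nearest-neighbor vote.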
@@ -1455,7 +1455,7 @@ predict the label of each, and visualize the predictions with a colored scatter
 > textbook. It is included for those readers who would like to use similar
 > visualizations in their own data analyses.
 
-```{r 05-workflow-plot-show, fig.height = 3.5, fig.width = 4.6, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
+```{r 05-workflow-plot-show, fig.height = 3.5, fig.width = 4.6, fig.align = "center", fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
 # create the grid of area/smoothness vals, and arrange in a data frame
source/classification2.Rmd: 12 additions & 12 deletions
@@ -94,7 +94,7 @@ labels for new observations without known class labels.
 > is. Imagine how bad it would be to overestimate your classifier's accuracy
 > when predicting whether a patient's tumor is malignant or benign!
 
-```{r 06-training-test, echo = FALSE, warning = FALSE, fig.cap = "Splitting the data into training and testing sets.", fig.retina = 2, out.width = "100%"}
+```{r 06-training-test, echo = FALSE, warning = FALSE, fig.align = "center", fig.cap = "Splitting the data into training and testing sets.", fig.retina = 2, out.width = "100%"}
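The figure this hunk modifies illustrates splitting the data into training and testing sets. Conceptually, the split is a shuffled partition of the rows; a minimal Python sketch under that assumption (the function name and fraction here are invented for illustration, not the book's API):

```python
import random

def train_test_split(rows, train_frac=0.75, seed=2):
    """Shuffle the rows, then partition into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# 75% of 100 row indices go to training, the remaining 25% to testing.
train, test = train_test_split(range(100))
```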
@@ -989,7 +989,7 @@
 We can decide which number of neighbors is best by plotting the accuracy versus $K$,
 as shown in Figure \@ref(fig:06-find-k).
 
-```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
+```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.align = "center", fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
 accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
   geom_point() +
   geom_line() +
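The plot built in the hunk above shows estimated accuracy versus the number of neighbors; choosing the best $K$ then amounts to taking the maximizer of that curve. A toy Python sketch with invented accuracy estimates, not the book's tidymodels output:

```python
# Hypothetical (neighbors, estimated accuracy) pairs from cross-validation.
accuracies = [(1, 0.85), (3, 0.88), (5, 0.90), (7, 0.89), (9, 0.86)]

# The best K maximizes the estimated accuracy.
best_k, best_acc = max(accuracies, key=lambda pair: pair[1])
print(best_k)  # 5
```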
@@ -1049,7 +1049,7 @@ we vary $K$ from 1 to almost the number of observations in the training set.
 set.seed(1)
 ```
 
-```{r 06-lots-of-ks, message = FALSE, fig.height = 3.5, fig.width = 4, fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
+```{r 06-lots-of-ks, message = FALSE, fig.height = 3.5, fig.width = 4, fig.align = "center", fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
 k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10))
 
 knn_results <- workflow() |>
@@ -1093,7 +1093,7 @@ new data: if we had a different training set, the predictions would be
 completely different. In general, if the model *is influenced too much* by the
 training data, it is said to **overfit** the data.
 
-```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.cap = "Effect of K in overfitting and underfitting."}
+```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Effect of K in overfitting and underfitting."}
 ks <- c(1, 7, 20, 300)
 plots <- list()
 
@@ -1256,7 +1256,7 @@ by maximizing estimated accuracy via cross-validation. After we have tuned the
 model we can use the test set to estimate its accuracy.
 The overall process is summarized in Figure \@ref(fig:06-overview).
@@ -1431,4 +1431,4 @@
-```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
+```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.align = "center", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
 res_tmp <- res %>% pivot_longer(cols=c("accs", "fixedaccs"),
                                 names_to="Type",
                                 values_to="accuracy")
@@ -1657,7 +1657,7 @@ predictors from the model! It is always worth remembering, however, that what cross-validation gives you
 is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide
 where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.
 
-```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection.", fig.pos = "H"}
+```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.align = "center", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection.", fig.pos = "H"}
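This final hunk's context describes forward selection: repeatedly add the predictor that most improves estimated accuracy, then judge where the elbow occurs. A skeletal Python sketch of that greedy loop with a stand-in scoring function (the names and scores below are invented for illustration, not the book's code):

```python
def forward_select(candidates, score):
    """Greedily add the predictor that most improves the score at each step."""
    selected, history = [], []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda p: score(selected + [p]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(selected)))
    return history

# Toy score: pretend only "Perimeter" and "Concavity" carry signal.
def toy_score(preds):
    return 0.6 + 0.2 * ("Perimeter" in preds) + 0.1 * ("Concavity" in preds)

steps = forward_select(["Symmetry", "Perimeter", "Concavity"], toy_score)
```

In real use the score would be a cross-validated accuracy estimate, and you would stop at the elbow rather than exhausting all candidates.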