
Commit 886c03e

Merge pull request #429 from UBC-DSCI/dev
merging dev into master
2 parents fb1dc8e + 23b8755

File tree: 1 file changed

classification2.Rmd

Lines changed: 50 additions & 24 deletions
@@ -653,6 +653,11 @@ automatically. We set the `strata` argument to the categorical label variable
 (here, `Class`) to ensure that the training and validation subsets contain the
 right proportions of each category of observation.
 
+```{r 06-vfold-seed, echo = FALSE, warning = FALSE, message = FALSE}
+# hidden seed
+set.seed(14)
+```
+
 ```{r 06-vfold}
 cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
 cancer_vfold
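
As an aside on the hunk above (not part of this commit): the added chunk relies on `strata = Class` so that each fold preserves the class proportions. A minimal sketch of how one could check that, assuming the `cancer_vfold` object created above and the tidymodels packages:

```r
# Sketch only, not part of the commit: look at the class balance in each
# validation (assessment) set of the folds created by vfold_cv() above.
library(tidymodels)

purrr::map_dfr(
  cancer_vfold$splits,
  function(split) {
    assessment(split) |>        # the validation portion of this fold
      dplyr::count(Class) |>
      dplyr::mutate(prop = n / sum(n))
  },
  .id = "fold"
)
```
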
@@ -689,9 +694,9 @@ of the classifier's validation accuracy across the folds. You will find results
 related to the accuracy in the row with `accuracy` listed under the `.metric` column.
 You should consider the mean (`mean`) to be the estimated accuracy, while the standard
 error (`std_err`) is a measure of how uncertain we are in the mean value. A detailed treatment of this
-is beyond the scope of this chapter; but roughly, if your estimated mean is 0.88 and standard
-error is 0.02, you can expect the *true* average accuracy of the
-classifier to be somewhere roughly between 86% and 90% (although it may
+is beyond the scope of this chapter; but roughly, if your estimated mean is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2)` and standard
+error is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2)`, you can expect the *true* average accuracy of the
+classifier to be somewhere roughly between `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) - round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% and `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) + round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% (although it may
 fall outside this range). You may ignore the other columns in the metrics data frame,
 as they do not provide any additional insight.
 You can also ignore the entire second row with `roc_auc` in the `.metric` column,
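
A note for readers of the hunk above (not part of this commit): the inline `r ...` expressions all repeat the same computation. A minimal sketch of that computation on its own, assuming `knn_fit` is the `fit_resamples()` result used earlier in the chapter:

```r
# Sketch only, not part of the commit: pull out the accuracy estimate and its
# standard error, then form the rough "mean plus/minus one standard error"
# range described in the prose.
acc <- collect_metrics(knn_fit) |>
  dplyr::filter(.metric == "accuracy")

round(acc$mean, 2)     # estimated accuracy
round(acc$std_err, 2)  # uncertainty in that estimate
c(lower = acc$mean - acc$std_err,
  upper = acc$mean + acc$std_err)
```
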
@@ -707,11 +712,10 @@ accuracy estimate will be (lower standard error). However, we are limited
 by computational power: the
 more folds we choose, the more computation it takes, and hence the more time
 it takes to run the analysis. So when you do cross-validation, you need to
-consider the size of the data, and the speed of the algorithm (e.g., $K$-nearest
-neighbor) and the speed of your computer. In practice, this is a
-trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here we show
-how the standard error decreases when we use 10-fold cross-validation rather
-than 5-fold:
+consider the size of the data, the speed of the algorithm (e.g., $K$-nearest
+neighbors), and the speed of your computer. In practice, this is a
+trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
+we will try 10-fold cross-validation to see if we get a lower standard error:
 
 ```{r 06-10-fold}
 cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class)
@@ -724,7 +728,29 @@ vfold_metrics <- workflow() |>
 
 vfold_metrics
 ```
+In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error, although
+by only an insignificant amount. In fact, due to the randomness in how the data are split, sometimes
+you might even end up with a *higher* standard error when increasing the number of folds!
+We can make the reduction in standard error more dramatic by increasing the number of folds
+by a large amount. In the following code we show the result when $C = 50$;
+picking such a large number of folds often takes a long time to run in practice,
+so we usually stick to 5 or 10.
+
+```{r 06-50-fold-seed, echo = FALSE, warning = FALSE, message = FALSE}
+# hidden seed
+set.seed(1)
+```
 
+```{r 06-50-fold}
+cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class)
+
+vfold_metrics_50 <- workflow() |>
+  add_recipe(cancer_recipe) |>
+  add_model(knn_spec) |>
+  fit_resamples(resamples = cancer_vfold_50) |>
+  collect_metrics()
+vfold_metrics_50
+```
 
 ### Parameter value selection
 
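
As an aside on the two hunks above (not part of this commit): the prose frames the choice of $C$ as trial and error. A minimal sketch of one way to compare the standard error across a few fold counts, assuming the chapter's `cancer_train`, `cancer_recipe`, and `knn_spec` objects:

```r
# Sketch only, not part of the commit: estimate accuracy with C = 5, 10, and 50
# folds and compare the resulting standard errors. This refits the model for
# every fold, so it can take a while to run.
library(tidymodels)

fold_counts <- c(5, 10, 50)
std_errs <- vapply(fold_counts, function(v) {
  folds <- vfold_cv(cancer_train, v = v, strata = Class)
  workflow() |>
    add_recipe(cancer_recipe) |>
    add_model(knn_spec) |>
    fit_resamples(resamples = folds) |>
    collect_metrics() |>
    dplyr::filter(.metric == "accuracy") |>
    dplyr::pull(std_err)
}, numeric(1))

data.frame(folds = fold_counts, std_err = std_errs)
```
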
@@ -863,6 +889,17 @@ regardless of what the new observation looks like. In general, if the model
 *isn't influenced enough* by the training data, it is said to **underfit** the
 data.
 
+**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
+individual data point has a stronger and stronger vote regarding nearby points.
+Since the data themselves are noisy, this causes a more "jagged" boundary
+corresponding to a *less simple* model. If you take this case to the extreme,
+setting $K = 1$, then the classifier is essentially just matching each new
+observation to its closest neighbor in the training data set. This is just as
+problematic as the large $K$ case, because the classifier becomes unreliable on
+new data: if we had a different training set, the predictions would be
+completely different. In general, if the model *is influenced too much* by the
+training data, it is said to **overfit** the data.
+
 ```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.cap = "Effect of K in overfitting and underfitting."}
 ks <- c(1, 7, 20, 300)
 plots <- list()
@@ -918,17 +955,6 @@ p_grid <- plot_grid(plotlist = p_no_legend, ncol = 2)
 plot_grid(p_grid, legend, ncol = 1, rel_heights = c(1, 0.2))
 ```
 
-**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
-individual data point has a stronger and stronger vote regarding nearby points.
-Since the data themselves are noisy, this causes a more "jagged" boundary
-corresponding to a *less simple* model. If you take this case to the extreme,
-setting $K = 1$, then the classifier is essentially just matching each new
-observation to its closest neighbor in the training data set. This is just as
-problematic as the large $K$ case, because the classifier becomes unreliable on
-new data: if we had a different training set, the predictions would be
-completely different. In general, if the model *is influenced too much* by the
-training data, it is said to **overfit** the data.
-
 Both overfitting and underfitting are problematic and will lead to a model
 that does not generalize well to new data. When fitting a model, we need to strike
 a balance between the two. You can see these two effects in Figure
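
As an aside on the paragraph moved in the two hunks above (not part of this commit): a minimal sketch of specifying the two extremes of $K$ that it contrasts, assuming the chapter's `cancer_recipe` and `cancer_train` objects:

```r
# Sketch only, not part of the commit: K-nearest neighbors specifications at the
# two extremes discussed above. K = 1 tends to overfit (a jagged boundary that
# memorizes the training set), while a very large K tends to underfit.
library(tidymodels)

knn_spec_k1 <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_spec_k300 <- nearest_neighbor(weight_func = "rectangular", neighbors = 300) |>
  set_engine("kknn") |>
  set_mode("classification")

# fit one of them to see it in action
fit_k1 <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec_k1) |>
  fit(data = cancer_train)
fit_k1
```
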
@@ -1349,7 +1375,6 @@ for (i in 1:n_total) {
   selected <- c(selected, names[[jstar]])
   names <- names[-jstar]
 }
-
 accuracies
 ```
 
@@ -1369,11 +1394,8 @@ predictors from the model! It is always worth remembering, however, that what cr
 is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide
 where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.
 
-> **Note:** Since the choice of which variables to include as predictors is
-> part of tuning your classifier, you *cannot use your test data* for this
-> process!
+```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection.", fig.pos = "H"}
 
-```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection."}
 fwd_sel_accuracies_plot <- accuracies |>
   ggplot(aes(x = size, y = accuracy)) +
   geom_line() +
@@ -1383,6 +1405,10 @@ fwd_sel_accuracies_plot <- accuracies |>
 fwd_sel_accuracies_plot
 ```
 
+> **Note:** Since the choice of which variables to include as predictors is
+> part of tuning your classifier, you *cannot use your test data* for this
+> process!
+
 ## Exercises
 
 Practice exercises for the material covered in this chapter
