
Commit 886c03e

Merge pull request #429 from UBC-DSCI/dev
merging dev into master
2 parents fb1dc8e + 23b8755

File tree: 1 file changed

classification2.Rmd

Lines changed: 50 additions & 24 deletions
@@ -653,6 +653,11 @@ automatically. We set the `strata` argument to the categorical label variable
 (here, `Class`) to ensure that the training and validation subsets contain the
 right proportions of each category of observation.
 
+```{r 06-vfold-seed, echo = FALSE, warning = FALSE, message = FALSE}
+# hidden seed
+set.seed(14)
+```
+
 ```{r 06-vfold}
 cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
 cancer_vfold
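
As an aside on the hunk above (not part of this commit): the added chunk relies on `strata = Class` so that each fold preserves the class proportions. A minimal sketch of how one could check that, assuming the `cancer_vfold` object created above and the tidymodels packages:

```r
# Sketch only, not part of the commit: look at the class balance in each
# validation (assessment) set of the folds created by vfold_cv() above.
library(tidymodels)

purrr::map_dfr(
  cancer_vfold$splits,
  function(split) {
    assessment(split) |>        # the validation portion of this fold
      dplyr::count(Class) |>
      dplyr::mutate(prop = n / sum(n))
  },
  .id = "fold"
)
```
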
@@ -689,9 +694,9 @@ of the classifier's validation accuracy across the folds. You will find results
 related to the accuracy in the row with `accuracy` listed under the `.metric` column.
 You should consider the mean (`mean`) to be the estimated accuracy, while the standard
 error (`std_err`) is a measure of how uncertain we are in the mean value. A detailed treatment of this
-is beyond the scope of this chapter; but roughly, if your estimated mean is 0.88 and standard
-error is 0.02, you can expect the *true* average accuracy of the
-classifier to be somewhere roughly between 86% and 90% (although it may
+is beyond the scope of this chapter; but roughly, if your estimated mean is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2)` and standard
+error is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2)`, you can expect the *true* average accuracy of the
+classifier to be somewhere roughly between `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) - round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% and `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) + round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% (although it may
 fall outside this range). You may ignore the other columns in the metrics data frame,
 as they do not provide any additional insight.
 You can also ignore the entire second row with `roc_auc` in the `.metric` column,
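
A note for readers of the hunk above (not part of this commit): the inline `r ...` expressions all repeat the same computation. A minimal sketch of that computation on its own, assuming `knn_fit` is the `fit_resamples()` result used earlier in the chapter:

```r
# Sketch only, not part of the commit: pull out the accuracy estimate and its
# standard error, then form the rough "mean plus/minus one standard error"
# range described in the prose.
acc <- collect_metrics(knn_fit) |>
  dplyr::filter(.metric == "accuracy")

round(acc$mean, 2)     # estimated accuracy
round(acc$std_err, 2)  # uncertainty in that estimate
c(lower = acc$mean - acc$std_err,
  upper = acc$mean + acc$std_err)
```
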
@@ -707,11 +712,10 @@ accuracy estimate will be (lower standard error). However, we are limited
 by computational power: the
 more folds we choose, the more computation it takes, and hence the more time
 it takes to run the analysis. So when you do cross-validation, you need to
-consider the size of the data, and the speed of the algorithm (e.g., $K$-nearest
-neighbor) and the speed of your computer. In practice, this is a
-trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here we show
-how the standard error decreases when we use 10-fold cross-validation rather
-than 5-fold:
+consider the size of the data, the speed of the algorithm (e.g., $K$-nearest
+neighbors), and the speed of your computer. In practice, this is a
+trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
+we will try 10-fold cross-validation to see if we get a lower standard error:
 
 ```{r 06-10-fold}
 cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class)
@@ -724,7 +728,29 @@ vfold_metrics <- workflow() |>
 
 vfold_metrics
 ```
+In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error, although
+by only an insignificant amount. In fact, due to the randomness in how the data are split, sometimes
+you might even end up with a *higher* standard error when increasing the number of folds!
+We can make the reduction in standard error more dramatic by increasing the number of folds
+by a large amount. In the following code we show the result when $C = 50$;
+picking such a large number of folds often takes a long time to run in practice,
+so we usually stick to 5 or 10.
+
+```{r 06-50-fold-seed, echo = FALSE, warning = FALSE, message = FALSE}
+# hidden seed
+set.seed(1)
+```
 
+```{r 06-50-fold}
+cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class)
+
+vfold_metrics_50 <- workflow() |>
+  add_recipe(cancer_recipe) |>
+  add_model(knn_spec) |>
+  fit_resamples(resamples = cancer_vfold_50) |>
+  collect_metrics()
+vfold_metrics_50
+```
 
 ### Parameter value selection
 
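
As an aside on the two hunks above (not part of this commit): the prose frames the choice of $C$ as trial and error. A minimal sketch of one way to compare the standard error across a few fold counts, assuming the chapter's `cancer_train`, `cancer_recipe`, and `knn_spec` objects:

```r
# Sketch only, not part of the commit: estimate accuracy with C = 5, 10, and 50
# folds and compare the resulting standard errors. This refits the model for
# every fold, so it can take a while to run.
library(tidymodels)

fold_counts <- c(5, 10, 50)
std_errs <- vapply(fold_counts, function(v) {
  folds <- vfold_cv(cancer_train, v = v, strata = Class)
  workflow() |>
    add_recipe(cancer_recipe) |>
    add_model(knn_spec) |>
    fit_resamples(resamples = folds) |>
    collect_metrics() |>
    dplyr::filter(.metric == "accuracy") |>
    dplyr::pull(std_err)
}, numeric(1))

data.frame(folds = fold_counts, std_err = std_errs)
```
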
@@ -863,6 +889,17 @@ regardless of what the new observation looks like. In general, if the model
 *isn't influenced enough* by the training data, it is said to **underfit** the
 data.
 
+**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
+individual data point has a stronger and stronger vote regarding nearby points.
+Since the data themselves are noisy, this causes a more "jagged" boundary
+corresponding to a *less simple* model. If you take this case to the extreme,
+setting $K = 1$, then the classifier is essentially just matching each new
+observation to its closest neighbor in the training data set. This is just as
+problematic as the large $K$ case, because the classifier becomes unreliable on
+new data: if we had a different training set, the predictions would be
+completely different. In general, if the model *is influenced too much* by the
+training data, it is said to **overfit** the data.
+
 ```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.cap = "Effect of K in overfitting and underfitting."}
 ks <- c(1, 7, 20, 300)
 plots <- list()
@@ -918,17 +955,6 @@ p_grid <- plot_grid(plotlist = p_no_legend, ncol = 2)
 plot_grid(p_grid, legend, ncol = 1, rel_heights = c(1, 0.2))
 ```
 
-**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
-individual data point has a stronger and stronger vote regarding nearby points.
-Since the data themselves are noisy, this causes a more "jagged" boundary
-corresponding to a *less simple* model. If you take this case to the extreme,
-setting $K = 1$, then the classifier is essentially just matching each new
-observation to its closest neighbor in the training data set. This is just as
-problematic as the large $K$ case, because the classifier becomes unreliable on
-new data: if we had a different training set, the predictions would be
-completely different. In general, if the model *is influenced too much* by the
-training data, it is said to **overfit** the data.
-
 Both overfitting and underfitting are problematic and will lead to a model
 that does not generalize well to new data. When fitting a model, we need to strike
 a balance between the two. You can see these two effects in Figure
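
As an aside on the paragraph moved in the two hunks above (not part of this commit): a minimal sketch of specifying the two extremes of $K$ that it contrasts, assuming the chapter's `cancer_recipe` and `cancer_train` objects:

```r
# Sketch only, not part of the commit: K-nearest neighbors specifications at the
# two extremes discussed above. K = 1 tends to overfit (a jagged boundary that
# memorizes the training set), while a very large K tends to underfit.
library(tidymodels)

knn_spec_k1 <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_spec_k300 <- nearest_neighbor(weight_func = "rectangular", neighbors = 300) |>
  set_engine("kknn") |>
  set_mode("classification")

# fit one of them to see it in action
fit_k1 <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec_k1) |>
  fit(data = cancer_train)
fit_k1
```
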
@@ -1349,7 +1375,6 @@ for (i in 1:n_total) {
   selected <- c(selected, names[[jstar]])
   names <- names[-jstar]
 }
-
 accuracies
 ```
 
@@ -1369,11 +1394,8 @@ predictors from the model! It is always worth remembering, however, that what cr
 is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide
 where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.
 
-> **Note:** Since the choice of which variables to include as predictors is
-> part of tuning your classifier, you *cannot use your test data* for this
-> process!
+```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection.", fig.pos = "H"}
 
-```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection."}
 fwd_sel_accuracies_plot <- accuracies |>
   ggplot(aes(x = size, y = accuracy)) +
   geom_line() +
@@ -1383,6 +1405,10 @@ fwd_sel_accuracies_plot <- accuracies |>
 fwd_sel_accuracies_plot
 ```
 
+> **Note:** Since the choice of which variables to include as predictors is
+> part of tuning your classifier, you *cannot use your test data* for this
+> process!
+
 ## Exercises
 
 Practice exercises for the material covered in this chapter
