```{r}
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
cancer_vfold
```
of the classifier's validation accuracy across the folds. You will find results
related to the accuracy in the row with `accuracy` listed under the `.metric` column.
You should consider the mean (`mean`) to be the estimated accuracy, while the standard
error (`std_err`) is a measure of how uncertain we are in the mean value. A detailed treatment of this
is beyond the scope of this chapter; but roughly, if your estimated mean is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2)` and standard
error is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2)`, you can expect the *true* average accuracy of the
classifier to be somewhere roughly between `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) - round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% and `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) + round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% (although it may
fall outside this range). You may ignore the other columns in the metrics data frame,
as they do not provide any additional insight.
You can also ignore the entire second row with `roc_auc` in the `.metric` column,
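
As a quick, hedged illustration (not the chapter's verbatim code) of where these numbers come from, assuming `knn_fit` is the cross-validation result referred to above, the accuracy row can be pulled out of `collect_metrics()` like this (the helper names are only illustrative):

```{r}
# Sketch only: knn_fit is assumed to be the cross-validation result used above.
# collect_metrics() aggregates results across folds, one row per metric.
knn_metrics <- collect_metrics(knn_fit)

# Keep just the accuracy row; `mean` is the estimated accuracy and
# `std_err` measures how uncertain we are about that estimate.
knn_accuracy <- filter(knn_metrics, .metric == "accuracy")
knn_accuracy
```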
accuracy estimate will be (lower standard error). However, we are limited
by computational power: the
more folds we choose, the more computation it takes, and hence the more time
it takes to run the analysis. So when you do cross-validation, you need to
consider the size of the data, the speed of the algorithm (e.g., $K$-nearest
neighbors), and the speed of your computer. In practice, this is a
trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
we will try 10-fold cross-validation to see if we get a lower standard error:

```{r 06-10-fold}
cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class)
```
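
As a rough sketch (not the chapter's verbatim code), the 10-fold results can then be collected the same way as before, assuming the `knn_recipe` and `knn_spec` objects defined earlier in the chapter:

```{r}
# Sketch only: knn_recipe and knn_spec are assumed to be the preprocessing
# recipe and K-NN model specification built earlier in the chapter.
vfold_metrics <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit_resamples(resamples = cancer_vfold) |>
  collect_metrics()

# With 10 folds, the std_err of the accuracy estimate is typically
# smaller than with 5 folds.
filter(vfold_metrics, .metric == "accuracy")
```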
**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
individual data point has a stronger and stronger vote regarding nearby points.
Since the data themselves are noisy, this causes a more "jagged" boundary
corresponding to a *less simple* model. If you take this case to the extreme,
setting $K = 1$, then the classifier is essentially just matching each new
observation to its closest neighbor in the training data set. This is just as
problematic as the large $K$ case, because the classifier becomes unreliable on
new data: if we had a different training set, the predictions would be
completely different. In general, if the model *is influenced too much* by the
training data, it is said to **overfit** the data.

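To make the two extremes concrete, here is a minimal sketch, assuming the `kknn` engine is used with `parsnip` as elsewhere in this book, of how an overfitting and an underfitting specification might be written (the object names and the specific values of `neighbors` are only illustrative):

```{r}
# Sketch only: K = 1 tends to overfit (a jagged boundary driven by single
# points), while a very large K tends to underfit (an overly smooth boundary).
knn_overfit_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_underfit_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 200) |>
  set_engine("kknn") |>
  set_mode("classification")
```
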
Both overfitting and underfitting are problematic and will lead to a model
that does not generalize well to new data. When fitting a model, we need to strike
a balance between the two. You can see these two effects in Figure
```{r}
  selected <- c(selected, names[[jstar]])
  names <- names[-jstar]
}
accuracies
```

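The loop above shows only the bookkeeping at the end of each iteration: keep the best predictor found (`names[[jstar]]`) and remove it from the candidate list. As a hedged sketch of what a full greedy forward-selection procedure of this kind might look like, assuming the `cancer_train` data and `Class` label used throughout the chapter (the fixed `neighbors = 3`, helper names, and preprocessing choices here are illustrative, not the chapter's verbatim code):

```{r}
# Sketch only: greedy forward selection for a K-NN classifier, scoring each
# candidate set of predictors by its cross-validated accuracy.
names <- colnames(select(cancer_train, -Class))  # candidate predictors
selected <- c()                                  # predictors chosen so far
n_total <- length(names)
accuracies <- tibble(size = integer(),
                     selected_predictors = character(),
                     accuracy = numeric())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)

for (i in 1:n_total) {
  accs <- rep(0, length(names))
  # Try adding each remaining predictor to the ones selected so far.
  for (j in 1:length(names)) {
    preds <- c(selected, names[[j]])
    model_formula <- as.formula(paste("Class ~", paste(preds, collapse = " + ")))
    cancer_recipe <- recipe(model_formula, data = cancer_train) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())
    accs[[j]] <- workflow() |>
      add_recipe(cancer_recipe) |>
      add_model(knn_spec) |>
      fit_resamples(resamples = cancer_vfold) |>
      collect_metrics() |>
      filter(.metric == "accuracy") |>
      pull(mean)
  }
  # Keep the predictor that gives the best estimated accuracy at this step.
  jstar <- which.max(accs)
  accuracies <- accuracies |>
    add_row(size = i,
            selected_predictors = paste(c(selected, names[[jstar]]), collapse = ", "),
            accuracy = accs[[jstar]])
  selected <- c(selected, names[[jstar]])
  names <- names[-jstar]
}

accuracies
```

Because every remaining candidate is re-scored with `fit_resamples()` at each step, the run time of a procedure like this grows quickly with the number of predictors.
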
predictors from the model! It is always worth remembering, however, that what cross-validation gives you
is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide
where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.

> **Note:** Since the choice of which variables to include as predictors is
> part of tuning your classifier, you *cannot use your test data* for this
> process!

```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "65%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection.", fig.pos = "H"}