@@ -712,10 +712,10 @@ accuracy estimate will be (lower standard error). However, we are limited
by computational power: the
more folds we choose, the more computation it takes, and hence the more time
it takes to run the analysis. So when you do cross-validation, you need to
- consider the size of the data, and the speed of the algorithm (e.g., $K$-nearest
- neighbor) and the speed of your computer. In practice, this is a
- trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here we use 10-fold cross-validation rather
- than 5-fold and we see we get a lower standard error:
+ consider the size of the data, the speed of the algorithm (e.g., $K$-nearest
+ neighbors), and the speed of your computer. In practice, this is a
+ trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
+ we will try 10-fold cross-validation to see if we get a lower standard error:
``` {r 06-10-fold}
cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class)
@@ -728,19 +728,19 @@ vfold_metrics <- workflow() |>
vfold_metrics
```
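
As a quick sketch (not part of the original text), the standard error can be pulled out of the `collect_metrics()` output above as a single number for a direct comparison with the 5-fold result; this assumes the tidyverse and tidymodels packages loaded earlier in the chapter are attached.

``` {r 06-10-fold-stderr-sketch}
# extract the standard error of the accuracy estimate from vfold_metrics
vfold_metrics |>
  filter(.metric == "accuracy") |>
  pull(std_err)
```
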
-
- Increasing the number of folds will usually result in a lower standard error, though this is
- not always the case. Due to random noise, sometimes we might get a higher value. In this example,
- the standard error decreased slightly, but not by a lot.
+ In this case, using 10-fold instead of 5-fold cross-validation did reduce the standard error, although
+ by only an insignificant amount. In fact, due to the randomness in how the data are split, sometimes
+ you might even end up with a *higher* standard error when increasing the number of folds!
+ We can make the reduction in standard error more dramatic by increasing the number of folds
+ by a large amount. In the following code we show the result when $C = 50$;
+ picking such a large number of folds often takes a long time to run in practice,
+ so we usually stick to 5 or 10.
``` {r 06-50-fold-seed, echo = FALSE, warning = FALSE, message = FALSE}
# hidden seed
set.seed(1)
```
- We can see
- how the standard error decreases by a more meaningful amount when we use 50-fold cross-validation rather
- than 5-fold or 10-fold:
``` {r 06-50-fold}
cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class)
@@ -753,8 +753,6 @@ vfold_metrics_50 <- workflow() |>
vfold_metrics_50
```
- In practice, we usually have a lot of data and setting $C$ to such a large number often takes a long time to run, so we usually stick to 5 or 10 folds.
-
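To make the fold-count trade-off concrete, the following sketch (not part of the original text) refits the classifier for a few values of $C$ and collects the accuracy standard error for each. It assumes `knn_spec` and `cancer_recipe` are the model specification and recipe created earlier in the chapter (adjust the names to match your own objects), and it takes noticeably longer to run as the number of folds grows.

``` {r 06-fold-comparison-sketch, warning = FALSE, message = FALSE}
library(tidymodels)  # attaches dplyr, purrr, rsample, tune, workflows, ...

set.seed(1)

# refit under C = 5, 10, and 50 folds and keep the accuracy row for each
fold_comparison <- map_dfr(c(5, 10, 50), function(v) {
  folds <- vfold_cv(cancer_train, v = v, strata = Class)
  workflow() |>
    add_recipe(cancer_recipe) |>   # assumed: recipe defined earlier
    add_model(knn_spec) |>         # assumed: K-NN spec defined earlier
    fit_resamples(resamples = folds) |>
    collect_metrics() |>
    filter(.metric == "accuracy") |>
    mutate(folds = v)
})

fold_comparison
```

The pattern in this comparison matches the discussion above: more folds generally means a smaller standard error, at the cost of more computation.
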
### Parameter value selection
Using 5- and 10-fold cross-validation, we have estimated that the prediction