Commit 4ef5ed6

tc polish new text for increasing folds vs std error text
1 parent 0fd7a9c commit 4ef5ed6

1 file changed: +11 -13 lines changed

classification2.Rmd

Lines changed: 11 additions & 13 deletions
@@ -712,10 +712,10 @@ accuracy estimate will be (lower standard error). However, we are limited
 by computational power: the
 more folds we choose, the more computation it takes, and hence the more time
 it takes to run the analysis. So when you do cross-validation, you need to
-consider the size of the data, and the speed of the algorithm (e.g., $K$-nearest
-neighbor) and the speed of your computer. In practice, this is a
-trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here we use 10-fold cross-validation rather
-than 5-fold and we see we get a lower standard error:
+consider the size of the data, the speed of the algorithm (e.g., $K$-nearest
+neighbors), and the speed of your computer. In practice, this is a
+trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
+we will try 10-fold cross-validation to see if we get a lower standard error:
 
 ```{r 06-10-fold}
 cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class)
@@ -728,19 +728,19 @@ vfold_metrics <- workflow() |>
 
 vfold_metrics
 ```
-
-Increasing the number of folds will usually result in a lower standard error, though this is
-not always the case. Due to random noise, sometimes we might get a higher value. In this example,
-the standard error decreased slightly, but not by a lot.
+In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error, although
+by only an insignificant amount. In fact, due to the randomness in how the data are split, sometimes
+you might even end up with a *higher* standard error when increasing the number of folds!
+We can make the reduction in standard error more dramatic by increasing the number of folds
+by a large amount. In the following code we show the result when $C = 50$;
+picking such a large number of folds often takes a long time to run in practice,
+so we usually stick to 5 or 10.
 
 ```{r 06-50-fold-seed, echo = FALSE, warning = FALSE, message = FALSE}
 # hidden seed
 set.seed(1)
 ```
 
-We can see
-how the standard error decreases by a more meaningful amount when we use 50-fold cross-validation rather
-than 5-fold or 10-fold:
 ```{r 06-50-fold}
 cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class)
 
@@ -753,8 +753,6 @@ vfold_metrics_50 <- workflow() |>
 vfold_metrics_50
 ```
 
-In practice, we usually have a lot of data and setting $C$ to such a large number often takes a long time to run, so we usually stick to 5 or 10 folds.
-
 ### Parameter value selection
 
 Using 5- and 10-fold cross-validation, we have estimated that the prediction
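
A note on the folds-versus-standard-error text revised in this commit: the claim can be made concrete with a short formula. This is a sketch under an assumption, not something stated in the diff: it assumes the reported standard error is the sample standard deviation of the per-fold accuracy estimates divided by the square root of the number of folds. The symbols $a_1, \dots, a_C$ (per-fold accuracies), $\bar{a}$, and $s$ are introduced here only for illustration.

$$
\bar{a} = \frac{1}{C}\sum_{i=1}^{C} a_i, \qquad
s = \sqrt{\frac{1}{C-1}\sum_{i=1}^{C}\left(a_i - \bar{a}\right)^2}, \qquad
\text{standard error} = \frac{s}{\sqrt{C}}
$$

With a purely hypothetical $s = 0.03$ held fixed, this gives roughly $0.03/\sqrt{5} \approx 0.0134$ for $C = 5$, $0.03/\sqrt{10} \approx 0.0095$ for $C = 10$, and $0.03/\sqrt{50} \approx 0.0042$ for $C = 50$. In practice $s$ is not fixed: it is re-estimated from a different random split each time, and smaller validation folds tend to give noisier per-fold accuracies. That is consistent with the new text's point that going from 5 to 10 folds may change the standard error only slightly, or even raise it, while a much larger $C$ usually shows a clearer reduction.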
