Commit 82dd3a3

adding a higher number of folds
1 parent b951287 commit 82dd3a3

1 file changed: +26 -2 lines changed

classification2.Rmd

Lines changed: 26 additions & 2 deletions
@@ -714,8 +714,7 @@ more folds we choose, the more computation it takes, and hence the more time
 it takes to run the analysis. So when you do cross-validation, you need to
 consider the size of the data, and the speed of the algorithm (e.g., $K$-nearest
 neighbor) and the speed of your computer. In practice, this is a
-trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here we show
-how the standard error decreases when we use 10-fold cross-validation rather
+trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here we use 10-fold cross-validation rather
 than 5-fold:

 ```{r 06-10-fold}
@@ -730,6 +729,31 @@ vfold_metrics <- workflow() |>
 vfold_metrics
 ```

+Increasing the number of folds will usually result in a lower standard error, though this is
+not always the case. Due to random noise, sometimes we might get a higher value. In this example,
+the standard error went down slightly, but not by a lot.
+
+```{r 06-50-fold-seed, echo = FALSE, warning = FALSE, message = FALSE}
+# hidden seed
+set.seed(1)
+```
+
+We can see
+how the standard error decreases by a more meaningful amount when we use 50-fold cross-validation rather
+than 5-fold or 10-fold:
+```{r 06-50-fold}
+cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class)
+
+vfold_metrics_50 <- workflow() |>
+  add_recipe(cancer_recipe) |>
+  add_model(knn_spec) |>
+  fit_resamples(resamples = cancer_vfold_50) |>
+  collect_metrics()
+
+vfold_metrics_50
+```
+
+In practice, we usually have a lot of data and setting $C$ to such a large number often takes too long to run, so we usually stick to 5 or 10 folds.

 ### Parameter value selection

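The added text compares the `std_err` column that `collect_metrics()` reports for different numbers of folds. As a rough sketch of where that number comes from: to the best of my knowledge, it is the standard deviation of the per-fold estimates divided by the square root of the number of folds. The helper `cv_summary()` and the accuracy values below are hypothetical and chosen only for illustration; they are not taken from the cancer data or from this commit.

```r
# Hypothetical per-fold accuracy estimates from a 5-fold cross-validation.
# These numbers are invented for illustration only.
fold_accuracy <- c(0.88, 0.91, 0.86, 0.90, 0.89)

# Summarise per-fold estimates the way a cross-validation summary
# is commonly computed: the mean across folds, and the standard
# error of that mean (standard deviation / sqrt(number of folds)).
cv_summary <- function(estimates) {
  n <- length(estimates)
  data.frame(
    mean = mean(estimates),
    n = n,
    std_err = sd(estimates) / sqrt(n)
  )
}

cv_summary(fold_accuracy)
```

Because the divisor grows with the number of folds while the per-fold estimates themselves become noisier (each validation set is smaller), the reported standard error tends to shrink with more folds but is not guaranteed to, which is consistent with the hedged wording in the added paragraph.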