@@ -714,8 +714,7 @@ more folds we choose, the more computation it takes, and hence the more time
it takes to run the analysis. So when you do cross-validation, you need to
consider the size of the data, the speed of the algorithm (e.g., $K$-nearest
neighbors), and the speed of your computer. In practice, this is a
- trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here we show
- how the standard error decreases when we use 10-fold cross-validation rather
+ trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here we use 10-fold cross-validation rather
than 5-fold:
``` {r 06-10-fold}
@@ -730,6 +729,31 @@ vfold_metrics <- workflow() |>
vfold_metrics
```
+ Increasing the number of folds will usually result in a lower standard error,
+ though this is not always the case: since cross-validation involves random
+ splitting, the standard error can occasionally come out higher. In this
+ example, the standard error went down slightly, but not by much.
+
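+ To see why more folds tend to shrink the standard error, note that the
+ `std_err` that `collect_metrics()` reports is (roughly) the standard deviation
+ of the per-fold accuracies divided by the square root of the number of folds.
+ The minimal sketch below computes this quantity by hand; the per-fold
+ accuracies here are made up for illustration, not taken from our pipeline.
+
+ ``` {r 06-se-sketch, eval = FALSE}
+ # hypothetical per-fold accuracies from a 5-fold run (illustration only)
+ accuracies <- c(0.85, 0.88, 0.84, 0.87, 0.86)
+
+ # the standard error of the mean accuracy; with more folds, the
+ # denominator grows, so the standard error tends to shrink
+ sd(accuracies) / sqrt(length(accuracies))
+ ```
+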
+ ``` {r 06-50-fold-seed, echo = FALSE, warning = FALSE, message = FALSE}
+ # hidden seed
+ set.seed(1)
+ ```
+
+ We can see how the standard error decreases by a more meaningful amount when we
+ use 50-fold cross-validation rather than 5-fold or 10-fold:
+
+ ``` {r 06-50-fold}
+ cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class)
+
+ vfold_metrics_50 <- workflow() |>
+   add_recipe(cancer_recipe) |>
+   add_model(knn_spec) |>
+   fit_resamples(resamples = cancer_vfold_50) |>
+   collect_metrics()
+
+ vfold_metrics_50
+ ```
+
+ In practice, we usually have a lot of data, and setting $C$ to such a large
+ number often takes too long to run; in most cases we stick with 5 or 10 folds.
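+
+ If you want to gauge this time trade-off on your own machine, here is a minimal
+ sketch that times one cross-validation run for several choices of the number of
+ folds. It reuses the `cancer_train`, `cancer_recipe`, and `knn_spec` objects
+ from earlier in the chapter, and we do not evaluate it here since it can take a
+ while to run.
+
+ ``` {r 06-fold-timing, eval = FALSE}
+ library(tidymodels)  # already loaded earlier in the chapter
+
+ # time a single cross-validation run with v folds
+ time_folds <- function(v) {
+   folds <- vfold_cv(cancer_train, v = v, strata = Class)
+   system.time(
+     workflow() |>
+       add_recipe(cancer_recipe) |>
+       add_model(knn_spec) |>
+       fit_resamples(resamples = folds)
+   )["elapsed"]
+ }
+
+ # elapsed seconds for 5, 10, and 50 folds; more folds generally
+ # means more computation
+ sapply(c(5, 10, 50), time_folds)
+ ```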
### Parameter value selection