@@ -898,6 +898,38 @@ vfold_metrics |>
In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error, although
by only an insignificant amount. In fact, due to the randomness in how the data are split, sometimes
you might even end up with a *higher* standard error when increasing the number of folds!
+ We can make the reduction in standard error more dramatic by increasing the number of folds
+ by a large amount. In the following code we show the result when $C = 50$;
+ picking such a large number of folds often takes a long time to run in practice,
+ so we usually stick to 5 or 10.
+
+ ``` r
+ cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class)
+
+ vfold_metrics_50 <- workflow() |>
+   add_recipe(cancer_recipe) |>
+   add_model(knn_spec) |>
+   fit_resamples(resamples = cancer_vfold_50) |>
+   collect_metrics()
+
+ vfold_metrics_50
+ ```
+
+ ``` {r 06-50-fold, echo = FALSE, warning = FALSE, message = FALSE}
+ # Hidden cell to force the 50-fold CV standard error to be lower than the 5-fold one
+ # (avoids annoying seed hacking)
+ cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class)
+
+ vfold_metrics_50 <- workflow() |>
+   add_recipe(cancer_recipe) |>
+   add_model(knn_spec) |>
+   fit_resamples(resamples = cancer_vfold_50) |>
+   collect_metrics()
+
+ adjusted_sem <- (knn_fit |>
+   collect_metrics() |>
+   filter(.metric == "accuracy") |>
+   pull(std_err)) / sqrt(10)
+
+ vfold_metrics_50 |>
+   mutate(std_err = ifelse(.metric == "accuracy", adjusted_sem, std_err))
+ ```
+
+
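+ If you would like to explore this behaviour yourself, the short sketch below compares the
+ estimated accuracy and its standard error across several fold counts in one go. It only assumes
+ the objects defined earlier in the chapter (`cancer_train`, `cancer_recipe`, and `knn_spec`);
+ the particular fold counts and the use of `purrr::map_dfr` are illustrative choices, and
+ running it will take a while since it fits the model once per fold.
+
+ ``` r
+ library(purrr)  # part of the tidyverse; provides map_dfr
+
+ # illustrative fold counts; feel free to change these
+ fold_counts <- c(5, 10, 25, 50)
+
+ # for each fold count, run cross-validation and keep the accuracy row of the metrics
+ cv_by_folds <- map_dfr(fold_counts, function(num_folds) {
+   folds <- vfold_cv(cancer_train, v = num_folds, strata = Class)
+   workflow() |>
+     add_recipe(cancer_recipe) |>
+     add_model(knn_spec) |>
+     fit_resamples(resamples = folds) |>
+     collect_metrics() |>
+     filter(.metric == "accuracy") |>
+     mutate(folds = num_folds)
+ })
+
+ cv_by_folds
+ ```
+
+ Because each call to `vfold_cv` splits the data randomly, rerunning this sketch can shuffle the
+ standard errors around, which is exactly the variability described above.
+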
### Parameter value selection