classification2.Rmd: 7 additions & 7 deletions
@@ -4,6 +4,7 @@
 library(gridExtra)
 library(cowplot)
 library(stringr)
+library(knitr)
 
 knitr::opts_chunk$set(fig.align = "center")
 
@@ -750,7 +751,7 @@ Then instead of using `fit` or `fit_resamples`, we will use the `tune_grid` function
 to fit the model for each value in a range of parameter values.
 In particular, we first create a data frame with a `neighbors`
 variable that contains the sequence of values of $K$ to try; below we create the `k_vals`
-data frame with the `neighbors` variable containing each value from $K=1$ to $K=15$ using
+data frame with the `neighbors` variable containing values from 1 to 100 (stepping by 5) using
 the `seq` function.
 Then we pass that data frame to the `grid` argument of `tune_grid`.
 
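As a quick sanity check on the new grid described above, `seq(from = 1, to = 100, by = 5)` produces 20 candidate values of $K$, ending at 96 rather than 100 since the next step would overshoot:

    seq(from = 1, to = 100, by = 5)
    #>  [1]  1  6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96
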
@@ -760,7 +761,7 @@ set.seed(1)
 ```
 
 ```{r 06-range-cross-val-2}
-k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 1))
+k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))
 
 knn_results <- workflow() |>
   add_recipe(cancer_recipe) |>
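
This hunk cuts off mid-pipeline. For readers of the diff, the unchanged remainder presumably continues along these lines; this is a sketch only, and `knn_spec` and `cancer_vfold` are assumed names for a tuneable K-NN model specification and a cross-validation object defined earlier in the chapter, not code shown in this commit:

    library(tidymodels)

    # Assumed context (not in this hunk): a K-NN spec with a tuneable
    # neighbors parameter, and cross-validation folds for the cancer data.
    knn_spec <- nearest_neighbor(weight_func = "rectangular",
                                 neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification")

    # The pipeline then tunes over the k_vals grid and collects the
    # cross-validation metrics for each candidate K.
    knn_results <- workflow() |>
      add_recipe(cancer_recipe) |>
      add_model(knn_spec) |>
      tune_grid(resamples = cancer_vfold, grid = k_vals) |>
      collect_metrics()
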
@@ -774,9 +775,8 @@ accuracies <- knn_results |>
 accuracies
 ```
 
-We can select the best value of the number of neighbors (i.e., the one that results
-in the highest classifier accuracy estimate) by plotting the accuracy versus $K$
-in Figure \@ref(fig:06-find-k).
+We can decide which number of neighbors is best by plotting the accuracy versus $K$,
+as shown in Figure \@ref(fig:06-find-k).
 
 ```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
 accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
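
The hunk truncates the plotting code right after the `ggplot` call. A plausible completion, consistent with the chunk's figure caption (the specific layers are an assumption, since they are not shown in this diff):

    # Sketch: finish the accuracy-versus-K plot begun in the hunk above.
    accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "Accuracy Estimate")

    accuracy_vs_k
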
@@ -790,7 +790,7 @@ accuracy_vs_k
 Setting the number of
 neighbors to $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`
 provides the highest accuracy (`r (accuracies |> arrange(desc(mean)) |> slice(1) |> pull(mean) |> round(4))*100`%). But there is no exact or perfect answer here;
-any selection from $K = 3$ and $15$ would be reasonably justified, as all
+any selection from $K = 30$ and $60$ would be reasonably justified, as all
 of these differ in classifier accuracy by a small amount. Remember: the
 values you see on this plot are *estimates* of the true accuracy of our
 classifier. Although the
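
The inline R expressions quoted in this hunk select the best $K$ and its accuracy; written out as a standalone snippet, the same computation is:

    # From the inline code above: the K with the highest mean
    # cross-validation accuracy, and that accuracy as a percentage.
    best_k <- (accuracies |> arrange(desc(mean)) |> head(1))$neighbors
    best_acc <- (accuracies |> arrange(desc(mean)) |>
                   slice(1) |> pull(mean) |> round(4)) * 100
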
@@ -801,7 +801,7 @@ value! Generally, when selecting $K$ (and other parameters for other predictive
 models), we are looking for a value where:
 
 - we get roughly optimal accuracy, so that our model will likely be accurate
-- changing the value to a nearby one (e.g., adding or subtracting 1) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty
+- changing the value to a nearby one (e.g., adding or subtracting a small number) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty
 - the cost of training the model is not prohibitive (e.g., in our situation, if $K$ is too large, predicting becomes expensive!)
 
 We know that $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`