Commit f07636e

fixed K selection in clasfcn2
1 parent 71cc01c · commit f07636e

File tree

classification2.Rmd — 1 file changed: 6 additions & 7 deletions
@@ -751,7 +751,7 @@ Then instead of using `fit` or `fit_resamples`, we will use the `tune_grid` function
 to fit the model for each value in a range of parameter values.
 In particular, we first create a data frame with a `neighbors`
 variable that contains the sequence of values of $K$ to try; below we create the `k_vals`
-data frame with the `neighbors` variable containing each value from $K=1$ to $K=15$ using
+data frame with the `neighbors` variable containing values from 1 to 100 (stepping by 5) using
 the `seq` function.
 Then we pass that data frame to the `grid` argument of `tune_grid`.
 
@@ -761,7 +761,7 @@ set.seed(1)
 ```
 
 ```{r 06-range-cross-val-2}
-k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 1))
+k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))
 
 knn_results <- workflow() |>
   add_recipe(cancer_recipe) |>
@@ -775,9 +775,8 @@ accuracies <- knn_results |>
 accuracies
 ```
 
-We can select the best value of the number of neighbors (i.e., the one that results
-in the highest classifier accuracy estimate) by plotting the accuracy versus $K$
-in Figure \@ref(fig:06-find-k).
+We can decide which number of neighbors is best by plotting the accuracy versus $K$,
+as shown in Figure \@ref(fig:06-find-k).
 
 ```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
 accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
@@ -791,7 +790,7 @@ accuracy_vs_k
 Setting the number of
 neighbors to $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`
 provides the highest accuracy (`r (accuracies |> arrange(desc(mean)) |> slice(1) |> pull(mean) |> round(4))*100`%). But there is no exact or perfect answer here;
-any selection from $K = 3$ and $15$ would be reasonably justified, as all
+any selection from $K = 30$ and $60$ would be reasonably justified, as all
 of these differ in classifier accuracy by a small amount. Remember: the
 values you see on this plot are *estimates* of the true accuracy of our
 classifier. Although the
@@ -802,7 +801,7 @@ value! Generally, when selecting $K$ (and other parameters for other predictive
 models), we are looking for a value where:
 
 - we get roughly optimal accuracy, so that our model will likely be accurate
-- changing the value to a nearby one (e.g., adding or subtracting 1) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty
+- changing the value to a nearby one (e.g., adding or subtracting a small number) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty
 - the cost of training the model is not prohibitive (e.g., in our situation, if $K$ is too large, predicting becomes expensive!)
 
 We know that $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`
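For reference, the tuning pipeline that these hunks touch can be sketched end to end as below. This is a minimal sketch, not the chapter's verbatim code: the `knn_spec` model specification, the `cancer_vfold` cross-validation folds, and the plot details after the first `ggplot` line are assumptions based on the standard tidymodels pattern; only `cancer_recipe`, `k_vals`, and the lines visible in the hunks appear in this diff.

```r
library(tidymodels)

# Assumed K-NN specification with `neighbors` marked for tuning;
# the actual spec is defined earlier in the chapter, not in this diff.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# The new grid from this commit: K = 1, 6, 11, ..., 96
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

# Fit the workflow once per candidate K over the cross-validation folds.
# `cancer_recipe` and `cancer_vfold` are assumed from earlier in the chapter.
knn_results <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_vals) |>
  collect_metrics()

# Keep only the accuracy estimates (one row per candidate K)
accuracies <- knn_results |>
  filter(.metric == "accuracy")

# Plot estimated accuracy versus the number of neighbors
accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate")

# The K with the highest estimated accuracy, matching the inline
# `arrange(desc(mean))` expressions quoted in the diff
best_k <- accuracies |>
  arrange(desc(mean)) |>
  slice(1) |>
  pull(neighbors)
```

The coarser grid is presumably also why the last hunk relaxes "adding or subtracting 1" to "adding or subtracting a small number": with steps of 5, neighboring candidate values of $K$ now differ by 5 rather than 1.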