Commit 3992c98

Merge pull request #376 from UBC-DSCI/dev
fix choice of K example in classification 2
2 parents 44c73d3 + f07636e commit 3992c98

1 file changed: +7 -7 lines changed

classification2.Rmd

Lines changed: 7 additions & 7 deletions
@@ -4,6 +4,7 @@
 library(gridExtra)
 library(cowplot)
 library(stringr)
+library(knitr)
 
 knitr::opts_chunk$set(fig.align = "center")
 
@@ -750,7 +751,7 @@ Then instead of using `fit` or `fit_resamples`, we will use the `tune_grid` function
 to fit the model for each value in a range of parameter values.
 In particular, we first create a data frame with a `neighbors`
 variable that contains the sequence of values of $K$ to try; below we create the `k_vals`
-data frame with the `neighbors` variable containing each value from $K=1$ to $K=15$ using
+data frame with the `neighbors` variable containing values from 1 to 100 (stepping by 5) using
 the `seq` function.
 Then we pass that data frame to the `grid` argument of `tune_grid`.
 
@@ -760,7 +761,7 @@ set.seed(1)
 ```
 
 ```{r 06-range-cross-val-2}
-k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 1))
+k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))
 
 knn_results <- workflow() |>
   add_recipe(cancer_recipe) |>
@@ -774,9 +775,8 @@ accuracies <- knn_results |>
 accuracies
 ```
 
-We can select the best value of the number of neighbors (i.e., the one that results
-in the highest classifier accuracy estimate) by plotting the accuracy versus $K$
-in Figure \@ref(fig:06-find-k).
+We can decide which number of neighbors is best by plotting the accuracy versus $K$,
+as shown in Figure \@ref(fig:06-find-k).
 
 ```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
 accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
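The two hunks above show the updated grid of `neighbors` values and the collected `accuracies` table, but the middle of the pipeline (the model specification and the `tune_grid` call) falls outside the diff context. A minimal sketch of how these pieces typically fit together in tidymodels, assuming a `knn_spec` specification with `neighbors = tune()` and a `cancer_vfold` cross-validation object (both names are assumptions; they do not appear in this commit):

```r
library(tidymodels)

# Grid of K values to try, as updated in this commit.
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

# Assumed model spec: K-NN with the number of neighbors marked for tuning.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Fit the workflow once per grid value on each cross-validation fold,
# then collect the resampled performance metrics.
knn_results <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_vals) |>
  collect_metrics()

# Keep only the accuracy estimates (one row per value of K).
accuracies <- knn_results |>
  filter(.metric == "accuracy")
```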
@@ -790,7 +790,7 @@ accuracy_vs_k
 Setting the number of
 neighbors to $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`
 provides the highest accuracy (`r (accuracies |> arrange(desc(mean)) |> slice(1) |> pull(mean) |> round(4))*100`%). But there is no exact or perfect answer here;
-any selection from $K = 3$ and $15$ would be reasonably justified, as all
+any selection from $K = 30$ to $60$ would be reasonably justified, as all
 of these differ in classifier accuracy by a small amount. Remember: the
 values you see on this plot are *estimates* of the true accuracy of our
 classifier. Although the
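The inline `r` expressions in this hunk extract the best $K$ from the `accuracies` table; written out as a standalone chunk (using only the `neighbors` and `mean` columns that appear in the diff), the same logic is:

```r
# Row of the accuracies table with the highest estimated accuracy.
best_row <- accuracies |>
  arrange(desc(mean)) |>
  slice(1)

best_k <- best_row |> pull(neighbors)  # the K reported in the text
best_acc <- best_row |> pull(mean)     # its cross-validation accuracy estimate
```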
@@ -801,7 +801,7 @@ value! Generally, when selecting $K$ (and other parameters for other predictive
 models), we are looking for a value where:
 
 - we get roughly optimal accuracy, so that our model will likely be accurate
-- changing the value to a nearby one (e.g., adding or subtracting 1) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty
+- changing the value to a nearby one (e.g., adding or subtracting a small number) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty
 - the cost of training the model is not prohibitive (e.g., in our situation, if $K$ is too large, predicting becomes expensive!)
 
 We know that $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`
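The second criterion in the list above can also be checked numerically. A small sketch of one way to do so (hypothetical, not part of this commit), reusing `best_k` from the earlier snippet:

```r
# Hypothetical sanity check: among grid values within 10 neighbors of best_k,
# how much does the estimated accuracy vary?
accuracies |>
  filter(abs(neighbors - best_k) <= 10) |>
  summarize(accuracy_range = max(mean) - min(mean))
```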
