Commit 154a7fb

evaluating on the test set in clsfcn2
1 parent ad07ec5 commit 154a7fb


source/classification2.Rmd

Lines changed: 87 additions & 9 deletions
@@ -491,7 +491,7 @@ cancer_test_predictions <- predict(knn_fit, cancer_test) |>
 cancer_test_predictions
 ```
 
-### Evaluate performance
+### Evaluate performance {#eval-performance-cls2}
 
 Finally, we can assess our classifier's performance. First, we will examine
 accuracy. To do this we use the
@@ -941,14 +941,29 @@ accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
 accuracy_vs_k
 ```
 
+We can also obtain the number of neighbors with the highest accuracy
+programmatically by accessing the `neighbors` variable in the `accuracies` data
+frame where the `mean` variable is highest.
+Note that it is still useful to visualize the results as
+we did above since this provides additional information on how the model
+performance varies.
+
+```{r 06-extract-k}
+best_k <- accuracies |>
+  arrange(desc(mean)) |>
+  head(1) |>
+  pull(neighbors)
+best_k
+```
+
 Setting the number of
-neighbors to $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`
+neighbors to $K =$ `r best_k`
 provides the highest accuracy (`r (accuracies |> arrange(desc(mean)) |> slice(1) |> pull(mean) |> round(4))*100`%). But there is no exact or perfect answer here;
 any selection from $K = 30$ to $60$ would be reasonably justified, as all
 of these differ in classifier accuracy by a small amount. Remember: the
 values you see on this plot are *estimates* of the true accuracy of our
 classifier. Although the
-$K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors` value is
+$K =$ `r best_k` value is
 higher than the others on this plot,
 that doesn't mean the classifier is actually more accurate with this parameter
 value! Generally, when selecting $K$ (and other parameters for other predictive
@@ -958,12 +973,12 @@ models), we are looking for a value where:
 - changing the value to a nearby one (e.g., adding or subtracting a small number) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty;
 - the cost of training the model is not prohibitive (e.g., in our situation, if $K$ is too large, predicting becomes expensive!).
 
-We know that $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`
+We know that $K =$ `r best_k`
 provides the highest estimated accuracy. Further, Figure \@ref(fig:06-find-k) shows that the estimated accuracy
-changes by only a small amount if we increase or decrease $K$ near $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`.
-And finally, $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors` does not create a prohibitively expensive
+changes by only a small amount if we increase or decrease $K$ near $K =$ `r best_k`.
+And finally, $K =$ `r best_k` does not create a prohibitively expensive
 computational cost of training. Considering these three points, we would indeed select
-$K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors` for the classifier.
+$K =$ `r best_k` for the classifier.
 
 ### Under/Overfitting
 
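One of the criteria above is that the accuracy estimate should not change much for values of $K$ near the chosen one. The short sketch below (illustrative only, not taken from the committed file) shows one way to inspect this directly from the tuning results; it assumes the `accuracies` data frame and the `best_k` value created in the chunks above, whose `neighbors`, `mean`, and `std_err` columns come from `collect_metrics()`.

```r
# Illustrative sketch only: look at the cross-validation accuracy estimates
# for neighbor counts near the chosen best_k. Assumes `accuracies` and
# `best_k` exist as created earlier in the chapter.
library(dplyr)

accuracies |>
  filter(abs(neighbors - best_k) <= 10) |>   # keep K values within 10 of best_k
  arrange(neighbors) |>
  select(neighbors, mean, std_err)           # estimate and its standard error
```
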
@@ -987,10 +1002,10 @@ knn_results <- workflow() |>
   tune_grid(resamples = cancer_vfold, grid = k_lots) |>
   collect_metrics()
 
-accuracies <- knn_results |>
+accuracies_lots <- knn_results |>
   filter(.metric == "accuracy")
 
-accuracy_vs_k_lots <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
+accuracy_vs_k_lots <- ggplot(accuracies_lots, aes(x = neighbors, y = mean)) +
   geom_point() +
   geom_line() +
   labs(x = "Neighbors", y = "Accuracy Estimate") +
@@ -1082,6 +1097,69 @@ a balance between the two. You can see these two effects in Figure
 \@ref(fig:06-decision-grid-K), which shows how the classifier changes as
 we set the number of neighbors $K$ to 1, 7, 20, and 300.
 
+### Evaluating on the test set
+
+Now that we have tuned the KNN classifier and set $K =$ `r best_k`,
+we are done building the model and it is time to evaluate the quality of its predictions on the held out
+test data, as we did earlier in Section \@ref(eval-performance-cls2).
+We first need to retrain the KNN classifier
+on the entire training data set using the selected number of neighbors.
+
+```{r 06-eval-on-test-set-after-tuning, message = FALSE, warning = FALSE}
+cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |>
+  step_scale(all_predictors()) |>
+  step_center(all_predictors())
+
+knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
+  set_engine("kknn") |>
+  set_mode("classification")
+
+knn_fit <- workflow() |>
+  add_recipe(cancer_recipe) |>
+  add_model(knn_spec) |>
+  fit(data = cancer_train)
+
+knn_fit
+```
+
+Then to make predictions and assess the estimated accuracy of the best model on the test data, we use the
+`predict` and `conf_mat` functions as we did earlier in this chapter.
+
+```{r 06-predictions-after-tuning, message = FALSE, warning = FALSE}
+cancer_test_predictions <- predict(knn_fit, cancer_test) |>
+  bind_cols(cancer_test)
+
+cancer_test_predictions |>
+  metrics(truth = Class, estimate = .pred_class) |>
+  filter(.metric == "accuracy")
+```
+
+```{r 06-predictions-after-tuning-acc-save-hidden, echo = FALSE, message = FALSE, warning = FALSE}
+cancer_acc_tuned <- cancer_test_predictions |>
+  metrics(truth = Class, estimate = .pred_class) |>
+  filter(.metric == "accuracy") |>
+  pull(.estimate)
+```
+
+```{r 06-confusion-matrix-after-tuning, message = FALSE, warning = FALSE}
+confusion <- cancer_test_predictions |>
+  conf_mat(truth = Class, estimate = .pred_class)
+confusion
+```
+
+At first glance, this is a bit surprising: the performance of the classifier
+has not changed much despite tuning the number of neighbors! For example, our first model
+with $K =$ 3 (before we knew how to tune) had an estimated accuracy of `r round(100*cancer_acc_1$.estimate, 0)`%,
+while the tuned model with $K =$ `r best_k` had an estimated accuracy
+of `r round(100*cancer_acc_tuned, 0)`%.
+But upon examining Figure \@ref(fig:06-find-k) again closely&mdash;to revisit the
+cross validation accuracy estimates for a range of neighbors&mdash;this result
+becomes much less surprising. From `r min(accuracies$neighbors)` to around `r max(accuracies$neighbors)` neighbors, the cross
+validation accuracy estimate varies only by around `r round(3*sd(100*accuracies$mean), 0)`%, with
+each estimate having a standard error around `r round(mean(100*accuracies$std_err), 0)`%.
+Since the cross-validation accuracy estimates the test set accuracy,
+the fact that the test set accuracy also doesn't change much is expected.
+
 ## Summary
 
 Classification algorithms use one or more quantitative variables to predict the
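
The closing paragraph added in this hunk argues that the test accuracy should land close to the cross-validation estimates because those estimates barely vary across the tuned range of $K$. The inline R expressions embedded in that paragraph compute exactly this; the sketch below (illustrative only, not taken from the committed file) just runs the same summaries on their own, assuming the `accuracies` data frame from the tuning step, filtered to the accuracy metric.

```r
# Illustrative sketch only: summarize the spread of the cross-validation
# accuracy estimates, mirroring the inline expressions in the paragraph above.
# Assumes `accuracies` comes from tune_grid() |> collect_metrics().
range(accuracies$neighbors)                  # the neighbor values that were tried
round(3 * sd(100 * accuracies$mean), 0)      # rough spread of the estimates, in percent
round(mean(100 * accuracies$std_err), 0)     # average standard error, in percent
```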
