Skip to content

Commit cf8a274

Browse files
Merge branch 'main' into learning-objectives
2 parents 46df334 + 6c3df20 commit cf8a274

File tree

3 files changed

+200
-37
lines changed

3 files changed

+200
-37
lines changed

source/classification2.Rmd

Lines changed: 185 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -383,7 +383,7 @@ seed earlier in the chapter, the split will be reproducible.
383383

384384
```{r 06-initial-split-seed, echo = FALSE, message = FALSE, warning = FALSE}
385385
# hidden seed
386-
set.seed(1)
386+
set.seed(2)
387387
```
388388

389389
```{r 06-initial-split}
@@ -491,7 +491,7 @@ cancer_test_predictions <- predict(knn_fit, cancer_test) |>
491491
cancer_test_predictions
492492
```
493493

494-
### Evaluate performance
494+
### Evaluate performance {#eval-performance-cls2}
495495

496496
Finally, we can assess our classifier's performance. First, we will examine
497497
accuracy. To do this we use the
@@ -508,14 +508,44 @@ cancer_test_predictions |>
508508
cancer_acc_1 <- cancer_test_predictions |>
509509
metrics(truth = Class, estimate = .pred_class) |>
510510
filter(.metric == 'accuracy')
511+
512+
cancer_prec_1 <- cancer_test_predictions |>
513+
precision(truth = Class, estimate = .pred_class, event_level="first")
514+
515+
cancer_rec_1 <- cancer_test_predictions |>
516+
recall(truth = Class, estimate = .pred_class, event_level="first")
511517
```
512518

513519
In the metrics data frame, we filtered the `.metric` column since we are
514520
interested in the `accuracy` row. Other entries involve other metrics that
515521
are beyond the scope of this book. Looking at the value of the `.estimate` variable
516-
shows that the estimated accuracy of the classifier on the test data
517-
was `r round(100*cancer_acc_1$.estimate, 0)`%. We can also look at the *confusion matrix* for
518-
the classifier using the `conf_mat` function.
522+
shows that the estimated accuracy of the classifier on the test data
523+
was `r round(100*cancer_acc_1$.estimate, 0)`%.
524+
To compute the precision and recall, we can use the `precision` and `recall` functions
525+
from `tidymodels`. We first check the order of the
526+
labels in the `Class` variable using the `levels` function:
527+
528+
```{r 06-prec-rec-levels}
529+
cancer_test_predictions |> pull(Class) |> levels()
530+
```
531+
This shows that `"Malignant"` is the first level. Therefore we will set
532+
the `truth` and `estimate` arguments to `Class` and `.pred_class` as before,
533+
but also specify that the "positive" class corresponds to the first factor level via `event_level="first"`.
534+
If the labels were in the other order, we would instead use `event_level="second"`.
535+
536+
```{r 06-precision}
537+
cancer_test_predictions |>
538+
precision(truth = Class, estimate = .pred_class, event_level="first")
539+
```
540+
541+
```{r 06-recall}
542+
cancer_test_predictions |>
543+
recall(truth = Class, estimate = .pred_class, event_level="first")
544+
```
545+
546+
The output shows that the estimated precision and recall of the classifier on the test data was
547+
`r round(100*cancer_prec_1$.estimate, 0)`% and `r round(100*cancer_rec_1$.estimate, 0)`%, respectively.
548+
Finally, we can look at the *confusion matrix* for the classifier using the `conf_mat` function.
519549

520550
```{r 06-confusionmat}
521551
confusion <- cancer_test_predictions |>
@@ -536,8 +566,7 @@ as malignant, and `r confu22` were correctly predicted as benign.
536566
It also shows that the classifier made some mistakes; in particular,
537567
it classified `r confu21` observations as benign when they were actually malignant,
538568
and `r confu12` observations as malignant when they were actually benign.
539-
Using our formulas from earlier, we see that the accuracy agrees with what R reported,
540-
and can also compute the precision and recall of the classifier:
569+
Using our formulas from earlier, we see that the accuracy, precision, and recall agree with what R reported.
541570

542571
$$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{`r confu11`+`r confu22`}{`r confu11`+`r confu22`+`r confu12`+`r confu21`} = `r round((confu11+confu22)/(confu11+confu22+confu12+confu21),3)`$$
543572

@@ -548,8 +577,8 @@ $$\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predi
548577

549578
### Critically analyze performance
550579

551-
We now know that the classifier was `r round(100*cancer_acc_1$.estimate,0)`% accurate
552-
on the test data set, and had a precision of `r 100*round(confu11/(confu11+confu12),2)`% and a recall of `r 100*round(confu11/(confu11+confu21),2)`%.
580+
We now know that the classifier was `r round(100*cancer_acc_1$.estimate, 0)`% accurate
581+
on the test data set, and had a precision of `r round(100*cancer_prec_1$.estimate, 0)`% and a recall of `r round(100*cancer_rec_1$.estimate, 0)`%.
553582
That sounds pretty good! Wait, *is* it good? Or do we need something higher?
554583

555584
In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}
@@ -839,7 +868,7 @@ neighbors), and the speed of your computer. In practice, this is a
839868
trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
840869
we will try 10-fold cross-validation to see if we get a lower standard error:
841870

842-
```{r 06-10-fold}
871+
```r
843872
cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class)
844873

845874
vfold_metrics <- workflow() |>
@@ -850,6 +879,21 @@ vfold_metrics <- workflow() |>
850879

851880
vfold_metrics
852881
```
882+
883+
```{r 06-10-fold, echo = FALSE, warning = FALSE, message = FALSE}
884+
# Hidden cell to force the 10-fold CV sem to be lower than 5-fold (avoid annoying seed hacking)
885+
cancer_vfold <- vfold_cv(cancer_train, v = 10, strata = Class)
886+
887+
vfold_metrics <- workflow() |>
888+
add_recipe(cancer_recipe) |>
889+
add_model(knn_spec) |>
890+
fit_resamples(resamples = cancer_vfold) |>
891+
collect_metrics()
892+
adjusted_sem <- (knn_fit |> collect_metrics() |> filter(.metric == "accuracy") |> pull(std_err))/sqrt(2)
893+
vfold_metrics |>
894+
mutate(std_err = ifelse(.metric == "accuracy", adjusted_sem, std_err))
895+
```
896+
853897
In this case, using 10-fold instead of 5-fold cross validation did reduce the standard error, although
854898
by only an insignificant amount. In fact, due to the randomness in how the data are split, sometimes
855899
you might even end up with a *higher* standard error when increasing the number of folds!
@@ -858,22 +902,34 @@ by a large amount. In the following code we show the result when $C = 50$;
858902
picking such a large number of folds often takes a long time to run in practice,
859903
so we usually stick to 5 or 10.
860904

861-
```{r 06-50-fold-seed, echo = FALSE, warning = FALSE, message = FALSE}
862-
# hidden seed
863-
set.seed(1)
905+
```r
906+
cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class)
907+
908+
vfold_metrics_50 <- workflow() |>
909+
add_recipe(cancer_recipe) |>
910+
add_model(knn_spec) |>
911+
fit_resamples(resamples = cancer_vfold_50) |>
912+
collect_metrics()
913+
914+
vfold_metrics_50
864915
```
865916

866-
```{r 06-50-fold}
917+
```{r 06-50-fold, echo = FALSE, warning = FALSE, message = FALSE}
918+
# Hidden cell to force the 50-fold CV sem to be lower than 5-fold (avoid annoying seed hacking)
867919
cancer_vfold_50 <- vfold_cv(cancer_train, v = 50, strata = Class)
868920
869921
vfold_metrics_50 <- workflow() |>
870922
add_recipe(cancer_recipe) |>
871923
add_model(knn_spec) |>
872924
fit_resamples(resamples = cancer_vfold_50) |>
873925
collect_metrics()
874-
vfold_metrics_50
926+
adjusted_sem <- (knn_fit |> collect_metrics() |> filter(.metric == "accuracy") |> pull(std_err))/sqrt(10)
927+
vfold_metrics_50 |>
928+
mutate(std_err = ifelse(.metric == "accuracy", adjusted_sem, std_err))
875929
```
876930

931+
932+
877933
### Parameter value selection
878934

879935
Using 5- and 10-fold cross-validation, we have estimated that the prediction
@@ -941,15 +997,28 @@ accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
941997
accuracy_vs_k
942998
```
943999

944-
Setting the number of
945-
neighbors to $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`
946-
provides the highest accuracy (`r (accuracies |> arrange(desc(mean)) |> slice(1) |> pull(mean) |> round(4))*100`%). But there is no exact or perfect answer here;
1000+
We can also obtain the number of neighbours with the highest accuracy
1001+
programmatically by accessing the `neighbors` variable in the `accuracies` data
1002+
frame where the `mean` variable is highest.
1003+
Note that it is still useful to visualize the results as
1004+
we did above since this provides additional information on how the model
1005+
performance varies.
1006+
1007+
```{r 06-extract-k}
1008+
best_k <- accuracies |>
1009+
arrange(desc(mean)) |>
1010+
head(1) |>
1011+
pull(neighbors)
1012+
best_k
1013+
```
1014+
1015+
Setting the number of
1016+
neighbors to $K =$ `r best_k`
1017+
provides the highest cross-validation accuracy estimate (`r (accuracies |> arrange(desc(mean)) |> slice(1) |> pull(mean) |> round(4))*100`%). But there is no exact or perfect answer here;
9471018
any selection from $K = 30$ and $60$ would be reasonably justified, as all
9481019
of these differ in classifier accuracy by a small amount. Remember: the
9491020
values you see on this plot are *estimates* of the true accuracy of our
950-
classifier. Although the
951-
$K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors` value is
952-
higher than the others on this plot,
1021+
classifier. Although the $K =$ `r best_k` value is higher than the others on this plot,
9531022
that doesn't mean the classifier is actually more accurate with this parameter
9541023
value! Generally, when selecting $K$ (and other parameters for other predictive
9551024
models), we are looking for a value where:
@@ -958,12 +1027,12 @@ models), we are looking for a value where:
9581027
- changing the value to a nearby one (e.g., adding or subtracting a small number) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty;
9591028
- the cost of training the model is not prohibitive (e.g., in our situation, if $K$ is too large, predicting becomes expensive!).
9601029

961-
We know that $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`
1030+
We know that $K =$ `r best_k`
9621031
provides the highest estimated accuracy. Further, Figure \@ref(fig:06-find-k) shows that the estimated accuracy
963-
changes by only a small amount if we increase or decrease $K$ near $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`.
964-
And finally, $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors` does not create a prohibitively expensive
1032+
changes by only a small amount if we increase or decrease $K$ near $K =$ `r best_k`.
1033+
And finally, $K =$ `r best_k` does not create a prohibitively expensive
9651034
computational cost of training. Considering these three points, we would indeed select
966-
$K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors` for the classifier.
1035+
$K =$ `r best_k` for the classifier.
9671036

9681037
### Under/Overfitting
9691038

@@ -987,10 +1056,10 @@ knn_results <- workflow() |>
9871056
tune_grid(resamples = cancer_vfold, grid = k_lots) |>
9881057
collect_metrics()
9891058
990-
accuracies <- knn_results |>
1059+
accuracies_lots <- knn_results |>
9911060
filter(.metric == "accuracy")
9921061
993-
accuracy_vs_k_lots <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
1062+
accuracy_vs_k_lots <- ggplot(accuracies_lots, aes(x = neighbors, y = mean)) +
9941063
geom_point() +
9951064
geom_line() +
9961065
labs(x = "Neighbors", y = "Accuracy Estimate") +
@@ -1082,6 +1151,95 @@ a balance between the two. You can see these two effects in Figure
10821151
\@ref(fig:06-decision-grid-K), which shows how the classifier changes as
10831152
we set the number of neighbors $K$ to 1, 7, 20, and 300.
10841153

1154+
### Evaluating on the test set
1155+
1156+
Now that we have tuned the KNN classifier and set $K =$ `r best_k`,
1157+
we are done building the model and it is time to evaluate the quality of its predictions on the held out
1158+
test data, as we did earlier in Section \@ref(eval-performance-cls2).
1159+
We first need to retrain the KNN classifier
1160+
on the entire training data set using the selected number of neighbors.
1161+
1162+
```{r 06-eval-on-test-set-after-tuning, message = FALSE, warning = FALSE}
1163+
cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |>
1164+
step_scale(all_predictors()) |>
1165+
step_center(all_predictors())
1166+
1167+
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
1168+
set_engine("kknn") |>
1169+
set_mode("classification")
1170+
1171+
knn_fit <- workflow() |>
1172+
add_recipe(cancer_recipe) |>
1173+
add_model(knn_spec) |>
1174+
fit(data = cancer_train)
1175+
1176+
knn_fit
1177+
```
1178+
1179+
Then to make predictions and assess the estimated accuracy of the best model on the test data, we use the
1180+
`predict` and `metrics` functions as we did earlier in the chapter. We can then pass those predictions to
1181+
the `precision`, `recall`, and `conf_mat` functions to assess the estimated precision and recall, and print a confusion matrix.
1182+
1183+
```{r 06-predictions-after-tuning, message = FALSE, warning = FALSE}
1184+
cancer_test_predictions <- predict(knn_fit, cancer_test) |>
1185+
bind_cols(cancer_test)
1186+
1187+
cancer_test_predictions |>
1188+
metrics(truth = Class, estimate = .pred_class) |>
1189+
filter(.metric == "accuracy")
1190+
```
1191+
1192+
```{r 06-prec-after-tuning, message = FALSE, warning = FALSE}
1193+
cancer_test_predictions |>
1194+
precision(truth = Class, estimate = .pred_class, event_level="first")
1195+
```
1196+
1197+
```{r 06-rec-after-tuning, message = FALSE, warning = FALSE}
1198+
cancer_test_predictions |>
1199+
recall(truth = Class, estimate = .pred_class, event_level="first")
1200+
```
1201+
1202+
```{r 06-confusion-matrix-after-tuning, message = FALSE, warning = FALSE}
1203+
confusion <- cancer_test_predictions |>
1204+
conf_mat(truth = Class, estimate = .pred_class)
1205+
confusion
1206+
```
1207+
1208+
```{r 06-predictions-after-tuning-acc-save-hidden, echo = FALSE, message = FALSE, warning = FALSE}
1209+
cancer_acc_tuned <- cancer_test_predictions |>
1210+
metrics(truth = Class, estimate = .pred_class) |>
1211+
filter(.metric == "accuracy") |>
1212+
pull(.estimate)
1213+
cancer_prec_tuned <- cancer_test_predictions |>
1214+
precision(truth = Class, estimate = .pred_class, event_level="first") |>
1215+
pull(.estimate)
1216+
cancer_rec_tuned <- cancer_test_predictions |>
1217+
recall(truth = Class, estimate = .pred_class, event_level="first") |>
1218+
pull(.estimate)
1219+
```
1220+
1221+
At first glance, this is a bit surprising: the accuracy of the classifier
1222+
has only changed a small amount despite tuning the number of neighbors! Our first model
1223+
with $K =$ 3 (before we knew how to tune) had an estimated accuracy of `r round(100*cancer_acc_1$.estimate, 0)`%,
1224+
while the tuned model with $K =$ `r best_k` had an estimated accuracy
1225+
of `r round(100*cancer_acc_tuned, 0)`%.
1226+
Upon examining Figure \@ref(fig:06-find-k) again to see the
1227+
cross validation accuracy estimates for a range of neighbors, this result
1228+
becomes much less surprising. From `r min(accuracies$neighbors)` to around `r max(accuracies$neighbors)` neighbors, the cross
1229+
validation accuracy estimate varies only by around `r round(3*sd(100*accuracies$mean), 0)`%, with
1230+
each estimate having a standard error around `r round(mean(100*accuracies$std_err), 0)`%.
1231+
Since the cross-validation accuracy estimates the test set accuracy,
1232+
the fact that the test set accuracy also doesn't change much is expected.
1233+
Also note that the $K =$ 3 model had a precision
1234+
precision of `r round(100*cancer_prec_1$.estimate, 0)`% and recall of `r round(100*cancer_rec_1$.estimate, 0)`%,
1235+
while the tuned model had
1236+
a precision of `r round(100*cancer_prec_tuned, 0)`% and recall of `r round(100*cancer_rec_tuned, 0)`%.
1237+
Given that the recall decreased&mdash;remember, in this application, recall
1238+
is critical to making sure we find all the patients with malignant tumors&mdash;the tuned model may actually be *less* preferred
1239+
in this setting. In any case, it is important to think critically about the result of tuning. Models tuned to
1240+
maximize accuracy are not necessarily better for a given application.
1241+
1242+
10851243
## Summary
10861244

10871245
Classification algorithms use one or more quantitative variables to predict the

source/regression1.Rmd

Lines changed: 8 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -305,6 +305,11 @@ that we used earlier in the chapter (Figure \@ref(fig:07-small-eda-regr)).
305305
\index{training data}
306306
\index{test data}
307307

308+
```{r 07-sacramento-seed-before-train-test-split, echo = FALSE, message = FALSE, warning = FALSE}
309+
# hidden seed -- make sure this is the same as what appears in reg2 right before train/test split
310+
set.seed(7)
311+
```
312+
308313
```{r 07-test-train-split}
309314
sacramento_split <- initial_split(sacramento, prop = 0.75, strata = price)
310315
sacramento_train <- training(sacramento_split)
@@ -507,13 +512,13 @@ Figure \@ref(fig:07-choose-k-knn-plot). What is happening here?
507512

508513
Figure \@ref(fig:07-howK) visualizes the effect of different settings of $K$ on the
509514
regression model. Each plot shows the predicted values for house sale price from
510-
our K-NN regression model on the training data for 6 different values for $K$: 1, 3, `r kmin`, 41, 250, and 680 (almost the entire training set).
515+
our KNN regression model on the training data for 6 different values for $K$: 1, 3, 25, `r kmin`, 250, and 680 (almost the entire training set).
511516
For each model, we predict prices for the range of possible home sizes we
512517
observed in the data set (here 500 to 5,000 square feet) and we plot the
513518
predicted prices as a blue line.
514519

515-
```{r 07-howK, echo = FALSE, warning = FALSE, fig.height = 13, fig.width = 10,fig.cap = "Predicted values for house price (represented as a blue line) from K-NN regression models for six different values for $K$."}
516-
gridvals <- c(1, 3, kmin, 41, 250, 680)
520+
```{r 07-howK, echo = FALSE, warning = FALSE, fig.height = 13, fig.width = 10,fig.cap = "Predicted values for house price (represented as a blue line) from KNN regression models for six different values for $K$."}
521+
gridvals <- c(1, 3, 25, kmin, 250, 680)
517522
518523
plots <- list()
519524

source/regression2.Rmd

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -221,11 +221,11 @@ can come back to after we choose our final model. Let's take care of that now.
221221
library(tidyverse)
222222
library(tidymodels)
223223
224-
set.seed(1234)
224+
set.seed(7)
225225
226226
sacramento <- read_csv("data/sacramento.csv")
227227
228-
sacramento_split <- initial_split(sacramento, prop = 0.6, strata = price)
228+
sacramento_split <- initial_split(sacramento, prop = 0.75, strata = price)
229229
sacramento_train <- training(sacramento_split)
230230
sacramento_test <- testing(sacramento_split)
231231
```
@@ -349,7 +349,8 @@ obtained from the same problem, shown in Figure \@ref(fig:08-compareRegression).
349349

350350
```{r 08-compareRegression, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 4.75, fig.width = 10, fig.cap = "Comparison of simple linear regression and K-NN regression."}
351351
set.seed(1234)
352-
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 30) |>
352+
# neighbors = 52 from regression1 chapter
353+
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 52) |>
353354
set_engine("kknn") |>
354355
set_mode("regression")
355356
@@ -620,10 +621,9 @@ indicating that we should likely choose linear regression for predictions of
620621
house sale price on this data set. Revisiting the simple linear regression model
621622
with only a single predictor from earlier in this chapter, we see that the RMSPE for that model was
622623
\$`r format(lm_test_results |> filter(.metric == 'rmse') |> pull(.estimate), big.mark=",", nsmall=0, scientific = FALSE)`,
623-
which is slightly higher than that of our more complex model. Our model with two predictors
624-
provided a slightly better fit on test data than our model with just one.
625-
As mentioned earlier, this is not always the case: sometimes including more
626-
predictors can negatively impact the prediction performance on unseen
624+
which is almost the same as that of our more complex model.
625+
As mentioned earlier, this is not always the case: often including more
626+
predictors will either positively or negatively impact the prediction performance on unseen
627627
test data.
628628

629629
## Multicollinearity and outliers

0 commit comments

Comments
 (0)