@@ -45,7 +45,7 @@ theme_update(axis.title = element_text(size = 12)) # modify axis label size in p
## Overview
This chapter continues the introduction to predictive modeling through
classification. While the previous chapter covered training and data
- preprocessing, this chapter focuses on how to evaluate the accuracy of
+ preprocessing, this chapter focuses on how to evaluate the performance of
a classifier, as well as how to improve the classifier (where possible)
to maximize its accuracy.

@@ -56,11 +56,13 @@ By the end of the chapter, readers will be able to do the following:
- Split data into training, validation, and test data sets.
- Describe what a random seed is and its importance in reproducible data analysis.
- Set the random seed in R using the `set.seed` function.
- - Evaluate classification accuracy in R using a validation data set and appropriate metrics.
+ - Describe and interpret accuracy, precision, recall, and confusion matrices.
+ - Evaluate classification accuracy in R using a validation data set.
+ - Produce a confusion matrix in R.
- Execute cross-validation in R to choose the number of neighbors in a $K$-nearest neighbors classifier.
- Describe the advantages and disadvantages of the $K$-nearest neighbors classification algorithm.

- ## Evaluating accuracy
+ ## Evaluating performance

Sometimes our classifier might make the wrong prediction. A classifier does not
need to be right 100\% of the time to be useful, though we don't want the
@@ -71,13 +73,15 @@ and think about how our classifier will be used in practice. A biopsy will be
performed on a *new* patient's tumor, the resulting image will be analyzed,
and the classifier will be asked to decide whether the tumor is benign or
malignant. The key word here is *new*: our classifier is "good" if it provides
- accurate predictions on data *not seen during training*. But then, how can we
- evaluate our classifier without visiting the hospital to collect more
+ accurate predictions on data *not seen during training*, as this implies that
+ it has actually learned about the relationship between the predictor variables and response variable,
+ as opposed to simply memorizing the labels of individual training data examples.
+ But then, how can we evaluate our classifier without visiting the hospital to collect more
tumor images?

The trick is to split the data into a **training set** \index{training set} and **test set** \index{test set} (Figure \@ref(fig:06-training-test))
and use only the **training set** when building the classifier.
- Then, to evaluate the accuracy of the classifier, we first set aside the true labels from the **test set**,
+ Then, to evaluate the performance of the classifier, we first set aside the true labels from the **test set**,
and then use the classifier to predict the labels in the **test set**. If our predictions match the true
labels for the observations in the **test set**, then we have some
confidence that our classifier might also accurately predict the class
@@ -95,23 +99,116 @@ knitr::include_graphics("img/classification2/training_test.jpeg")
How exactly can we assess how well our predictions match the true labels for
the observations in the test set? One way we can do this is to calculate the
- **prediction accuracy**. \index{prediction accuracy|see{accuracy}}\index{accuracy} This is the fraction of examples for which the
+ prediction **accuracy**. \index{prediction accuracy|see{accuracy}}\index{accuracy} This is the fraction of examples for which the
classifier made the correct prediction. To calculate this, we divide the number
of correct predictions by the number of predictions made.
-
- $$ \mathrm{prediction \; accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} $$
-
-
The process for assessing if our predictions match the true labels in the
- test set is illustrated in Figure \@ref(fig:06-ML-paradigm-test). Note that there
- are other measures for how well classifiers perform, such as *precision* and *recall*;
- these will not be discussed here, but you will likely encounter them in other more advanced
- books on this topic.
+ test set is illustrated in Figure \@ref(fig:06-ML-paradigm-test).
+
+ $$ \mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} $$

```{r 06-ML-paradigm-test, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Process for splitting the data and finding the prediction accuracy.", fig.retina = 2, out.width = "100%"}
knitr::include_graphics("img/classification2/ML-paradigm-test.png")
```

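As a quick illustration of the formula above, here is a minimal sketch of how accuracy could be computed by hand in R (the labels below are a small hypothetical example, not the tumor data):

```{r 06-accuracy-by-hand, eval = FALSE}
# hypothetical true and predicted labels for six test observations
true_labels      <- c("malignant", "benign", "benign", "malignant", "benign", "benign")
predicted_labels <- c("malignant", "benign", "malignant", "benign", "benign", "benign")

# accuracy: number of correct predictions divided by total number of predictions
sum(predicted_labels == true_labels) / length(true_labels)  # 4 correct out of 6, i.e., 0.667
```
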
+ Accuracy is a convenient, general-purpose way to summarize the performance of a classifier with
+ a single number. But prediction accuracy by itself does not tell the whole
+ story. In particular, accuracy alone only tells us how often the classifier
+ makes mistakes in general, but does not tell us anything about the *kinds* of
+ mistakes the classifier makes. A more comprehensive view of performance can be
+ obtained by additionally examining the **confusion matrix**. The confusion
+ matrix shows how many test set labels of each type are predicted correctly and
+ incorrectly, which gives us more detail about the kinds of mistakes the
+ classifier tends to make. Table \@ref(tab:confusion-matrix) shows an example
+ of what a confusion matrix might look like for the tumor image data with
+ a test set of 65 observations.
+
+ Table: (\#tab:confusion-matrix) An example confusion matrix for the tumor image data.
+
+ |                          | Truly Malignant | Truly Benign |
+ | ------------------------ | --------------- | ------------ |
+ | **Predicted Malignant**  | 1               | 4            |
+ | **Predicted Benign**     | 3               | 57           |
+
+ In the example in Table \@ref(tab:confusion-matrix), we see that there was
+ 1 malignant observation that was correctly classified as malignant (top left corner),
+ and 57 benign observations that were correctly classified as benign (bottom right corner).
+ However, we can also see that the classifier made some mistakes:
+ it classified 3 malignant observations as benign, and 4 benign observations as
+ malignant. The accuracy of this classifier is roughly
+ 89%, given by the formula
+
+ $$ \mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{1+57}{1+57+4+3} = 0.892 $$
+
+ But we can also see that the classifier only identified 1 out of 4 total malignant
+ tumors; in other words, it misclassified 75% of the malignant cases present in the
+ data set! In this example, misclassifying a malignant tumor is a potentially
+ disastrous error, since it may lead to a patient who requires treatment not receiving it.
+ Since we are particularly interested in identifying malignant cases, this
+ classifier would likely be unacceptable even with an accuracy of 89%.
+
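To make the arithmetic concrete, here is a minimal sketch that enters the confusion matrix above by hand and computes the accuracy from it (the counts come from the example table, not from a fitted model):

```{r 06-confusion-by-hand, eval = FALSE}
# the example confusion matrix from the table above, entered by hand
confusion <- matrix(c(1, 4,
                      3, 57),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("Predicted Malignant", "Predicted Benign"),
                                    c("Truly Malignant", "Truly Benign")))

# accuracy: correct predictions (the diagonal) divided by all predictions
sum(diag(confusion)) / sum(confusion)  # (1 + 57) / 65 = 0.892
```
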
+ Focusing more on one label than the other is
+ common in classification problems. In such cases, we typically refer to the label we are more
+ interested in identifying as the *positive* label, and the other as the
+ *negative* label. In the tumor example, we would refer to malignant
+ observations as *positive*, and benign observations as *negative*. We can then
+ use the following terms to talk about the four kinds of prediction that the
+ classifier can make, corresponding to the four entries in the confusion matrix:
+
+ - **True Positive:** A malignant observation that was classified as malignant (top left in Table \@ref(tab:confusion-matrix)).
+ - **False Positive:** A benign observation that was classified as malignant (top right in Table \@ref(tab:confusion-matrix)).
+ - **True Negative:** A benign observation that was classified as benign (bottom right in Table \@ref(tab:confusion-matrix)).
+ - **False Negative:** A malignant observation that was classified as benign (bottom left in Table \@ref(tab:confusion-matrix)).
+
+ A perfect classifier would have zero false negatives and false positives (and
+ therefore, 100% accuracy). However, classifiers in practice will almost always
+ make some errors. So you should think about which kinds of error are most
+ important in your application, and use the confusion matrix to quantify and
+ report them. Two commonly used metrics that we can compute using the confusion
+ matrix are the **precision** and **recall** of the classifier. These are often
+ reported together with accuracy. *Precision* quantifies how many of the
+ positive predictions the classifier made were actually positive. Intuitively,
+ we would like a classifier to have a *high* precision: for a classifier with
+ high precision, if the classifier reports that a new observation is positive,
+ we can trust that that new observation is indeed positive. We can compute the
+ precision of a classifier using the entries in the confusion matrix, with the
+ formula
+
+ $$ \mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}}. $$
+
+ *Recall* quantifies how many of the positive observations in the test set were
+ identified as positive. Intuitively, we would like a classifier to have a
+ *high* recall: for a classifier with high recall, if there is a positive
+ observation in the test data, we can trust that the classifier will find it.
+ We can also compute the recall of the classifier using the entries in the
+ confusion matrix, with the formula
+
+ $$ \mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}}. $$
+
+ In the example presented in Table \@ref(tab:confusion-matrix), we have that the precision and recall are
+
+ $$ \mathrm{precision} = \frac{1}{1+4} = 0.20, \quad \mathrm{recall} = \frac{1}{1+3} = 0.25. $$
+
+ So even with an accuracy of 89%, the precision and recall of the classifier
+ were both relatively low. For this data analysis context, recall is
+ particularly important: if someone has a malignant tumor, we certainly want to
+ identify it. A recall of just 25% would likely be unacceptable!
+
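Continuing the hand-entered sketch from earlier, precision and recall can be computed directly from the confusion matrix counts:

```{r 06-precision-recall-by-hand, eval = FALSE}
# using the hand-entered confusion matrix from the earlier sketch
tp <- confusion["Predicted Malignant", "Truly Malignant"]  # true positives
fp <- confusion["Predicted Malignant", "Truly Benign"]     # false positives
fn <- confusion["Predicted Benign", "Truly Malignant"]     # false negatives

tp / (tp + fp)  # precision = 1 / (1 + 4) = 0.20
tp / (tp + fn)  # recall    = 1 / (1 + 3) = 0.25
```
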
+ > **Note:** It is difficult to achieve both high precision and high recall at
+ > the same time; models with high precision tend to have low recall and vice
+ > versa. As an example, we can easily make a classifier that has *perfect
+ > recall*: just *always* guess positive! This classifier will of course find
+ > every positive observation in the test set, but it will make lots of false
+ > positive predictions along the way and have low precision. Similarly, we can
+ > easily make a classifier that has *perfect precision*: *never* guess
+ > positive! This classifier will never incorrectly identify an observation as
+ > positive, but it will make a lot of false negative predictions along the way.
+ > In fact, this classifier will have 0% recall! Of course, most real
+ > classifiers fall somewhere in between these two extremes. But these examples
+ > serve to show that in settings where one of the classes is of interest (i.e.,
+ > there is a *positive* label), there is a trade-off between precision and recall that one has to
+ > make when designing a classifier.
+
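For instance, applying the "always guess positive" strategy from the note to the hypothetical 65-observation test set above (4 truly malignant, 61 truly benign) would give perfect recall but very poor precision:

```{r 06-always-positive-sketch, eval = FALSE}
# "always guess malignant" applied to the example test set above
tp <- 4    # all 4 truly malignant observations are predicted malignant
fp <- 61   # all 61 truly benign observations are also predicted malignant
fn <- 0    # no malignant observation is missed

tp / (tp + fp)  # precision = 4 / 65, roughly 0.06
tp / (tp + fn)  # recall    = 1
```
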
## Randomness and seeds {#randomseeds}
Beginning in this chapter, our data analyses will often involve the use
of *randomness*. \index{random} We use randomness any time we need to make a decision in our
@@ -210,7 +307,7 @@ Different argument values in `set.seed` lead to different patterns of randomness
you pick the same argument value your result will be the same.
In the remainder of the textbook, we will set the seed once at the beginning of each chapter.

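As a minimal sketch of what reproducibility with `set.seed` looks like in practice (using a small `sample` call purely for illustration):

```{r 06-seed-sketch, eval = FALSE}
# setting the seed before generating random numbers makes the result reproducible
set.seed(1)
sample(1:10, 3)   # some three numbers

set.seed(1)
sample(1:10, 3)   # the same three numbers again, because the seed value is the same
```
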
- ## Evaluating accuracy with `tidymodels`
+ ## Evaluating performance with `tidymodels`
Back to evaluating classifiers now!
In R, we can use the `tidymodels` package \index{tidymodels} not only to perform $K$-nearest neighbors
classification, but also to assess how well our classification worked.
@@ -394,11 +491,12 @@ cancer_test_predictions <- predict(knn_fit, cancer_test) |>
cancer_test_predictions
```

- ### Compute the accuracy
+ ### Evaluate performance

- Finally, we can assess our classifier's accuracy. To do this we use the `metrics` function \index{tidymodels!metrics}
- from `tidymodels` to get the statistics about the quality of our model, specifying
- the `truth` and `estimate` arguments:
+ Finally, we can assess our classifier's performance. First, we will examine
+ accuracy. To do this we use the
+ `metrics` function \index{tidymodels!metrics} from `tidymodels`,
+ specifying the `truth` and `estimate` arguments:

```{r 06-accuracy}
cancer_test_predictions |>
@@ -413,7 +511,7 @@ cancer_acc_1 <- cancer_test_predictions |>
```

In the metrics data frame, we filtered the `.metric` column since we are
- interested in the `accuracy` row. Other entries involve more advanced metrics that
+ interested in the `accuracy` row. The other entries involve metrics that
are beyond the scope of this book. Looking at the value of the `.estimate` variable
shows that the estimated accuracy of the classifier on the test data
was `r round(100*cancer_acc_1$.estimate, 0)`%.
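For reference, the filtering step described here is an ordinary `dplyr` filter on the metrics data frame. A sketch of what it could look like (the actual chunk is elided in this diff; the `Class` and `.pred_class` column names are the conventional `tidymodels` ones and are assumed here):

```{r 06-accuracy-filter-sketch, eval = FALSE}
# compute the full metrics table, then keep only the accuracy row
cancer_test_predictions |>
  metrics(truth = Class, estimate = .pred_class) |>
  filter(.metric == "accuracy")
```
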
@@ -436,31 +534,41 @@ confu22 <- (confusionmt |> filter(name == "cell_2_2"))$value
```

The confusion matrix shows `r confu11` observations were correctly predicted
- as malignant, and `r confu22` were correctly predicted as benign. Therefore the classifier labeled
- `r confu11` + `r confu22` = `r confu11+confu22` observations
- correctly. It also shows that the classifier made some mistakes; in particular,
+ as malignant, and `r confu22` were correctly predicted as benign.
+ It also shows that the classifier made some mistakes; in particular,
it classified `r confu21` observations as benign when they were truly malignant,
and `r confu12` observations as malignant when they were truly benign.
+ Using our formulas from earlier, we see that the accuracy agrees with what R reported,
+ and we can also compute the precision and recall of the classifier:
+
+ $$ \mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{`r confu11`+`r confu22`}{`r confu11`+`r confu22`+`r confu12`+`r confu21`} = `r round((confu11+confu22)/(confu11+confu22+confu12+confu21),3)` $$
+
+ $$ \mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}} = \frac{`r confu11`}{`r confu11` + `r confu12`} = `r round(confu11/(confu11+confu12), 3)` $$
+
+ $$ \mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}} = \frac{`r confu11`}{`r confu11`+`r confu21`} = `r round(confu11/(confu11+confu21),3)` $$
+
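Rather than computing precision and recall by hand from the confusion matrix, you can also ask `tidymodels` for them directly: the `yardstick` package (loaded as part of `tidymodels`) provides `precision` and `recall` functions that use the same `truth`/`estimate` interface as `metrics`. A sketch, assuming the malignant (positive) class is the first level of the `Class` factor:

```{r 06-precision-recall-yardstick, eval = FALSE}
# precision and recall computed by yardstick, treating the first factor
# level (assumed here to be malignant) as the positive class
cancer_test_predictions |>
  precision(truth = Class, estimate = .pred_class, event_level = "first")

cancer_test_predictions |>
  recall(truth = Class, estimate = .pred_class, event_level = "first")
```
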

### Critically analyze performance

We now know that the classifier was `r round(100*cancer_acc_1$.estimate,0)`% accurate
- on the test data set. That sounds pretty good! Wait, *is* it good?
- Or do we need something higher?
-
- In general, what a *good* value for accuracy \index{accuracy!assessment} is depends on the application.
- For instance, suppose you are predicting whether a tumor is benign or malignant
- for a type of tumor that is benign 99% of the time. It is very easy to obtain
- a 99% accuracy just by guessing benign for every observation. In this case,
- 99% accuracy is probably not good enough. And beyond just accuracy,
- sometimes the *kind* of mistake the classifier makes is important as well. In
- the previous example, it might be very bad for the classifier to predict
- "benign" when the true class is "malignant", as this might result in a patient
- not receiving appropriate medical attention. On the other hand, it might be
- less bad for the classifier to guess "malignant" when the true class is
- "benign", as the patient will then likely see a doctor who can provide an
- expert diagnosis. This is why it is important not only to look at accuracy, but
- also the confusion matrix.
+ on the test data set, and had a precision of `r 100*round(confu11/(confu11+confu12),2)`% and a recall of `r 100*round(confu11/(confu11+confu21),2)`%.
+ That sounds pretty good! Wait, *is* it good? Or do we need something higher?
+
+ In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}
+ depends on the application; you must critically analyze your accuracy in the context of the problem
+ you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99%
+ of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!).
+ And beyond just accuracy, we need to consider the precision and recall: as mentioned
+ earlier, the *kind* of mistake the classifier makes is
+ important in many applications as well. In the previous example with 99% benign observations, it might be very bad for the
+ classifier to predict "benign" when the true class is "malignant" (a false negative), as this
+ might result in a patient not receiving appropriate medical attention. In other
+ words, in this context, we need the classifier to have a *high recall*. On the
+ other hand, it might be less bad for the classifier to guess "malignant" when
+ the true class is "benign" (a false positive), as the patient will then likely see a doctor who
+ can provide an expert diagnosis. In other words, we are fine with sacrificing
+ some precision in the interest of achieving high recall. This is why it is
+ important not only to look at accuracy, but also the confusion matrix.

However, there is always an easy baseline that you can compare to for any
classification problem: the *majority classifier*. The majority classifier \index{classification!majority}
@@ -518,8 +626,9 @@ By picking different values of $K$, we create different classifiers
that make different predictions.

So then, how do we pick the *best* value of $K$, i.e., *tune* the model?
- And is it possible to make this selection in a principled way? Ideally,
- we want somehow to maximize the performance of our classifier on data *it
+ And is it possible to make this selection in a principled way? In this book,
+ we will focus on maximizing the accuracy of the classifier. Ideally,
+ we want somehow to maximize the accuracy of our classifier on data *it
hasn't seen yet*. But we cannot use our test data set in the process of building
our model. So we will play the same trick we did before when evaluating
our classifier: we'll split our *training data itself* into two subsets,