
Commit 532fd87

Merge pull request #508 from UBC-DSCI/confusion-precision-recall
Confusion matrix explanation improvement, precision, and recall
2 parents 2cc40db + f4b1488 commit 532fd87

2 files changed: 153 additions, 42 deletions


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -10,3 +10,5 @@ _bookdown_files
 docs/**
 .local/**
 *.log
+_main.Rmd
+_main_files/**

source/classification2.Rmd

Lines changed: 151 additions & 42 deletions
@@ -45,7 +45,7 @@ theme_update(axis.title = element_text(size = 12)) # modify axis label size in p
 ## Overview
 This chapter continues the introduction to predictive modeling through
 classification. While the previous chapter covered training and data
-preprocessing, this chapter focuses on how to evaluate the accuracy of
+preprocessing, this chapter focuses on how to evaluate the performance of
 a classifier, as well as how to improve the classifier (where possible)
 to maximize its accuracy.

@@ -56,11 +56,13 @@ By the end of the chapter, readers will be able to do the following:
 - Split data into training, validation, and test data sets.
 - Describe what a random seed is and its importance in reproducible data analysis.
 - Set the random seed in R using the `set.seed` function.
-- Evaluate classification accuracy in R using a validation data set and appropriate metrics.
+- Describe and interpret accuracy, precision, recall, and confusion matrices.
+- Evaluate classification accuracy in R using a validation data set.
+- Produce a confusion matrix in R.
 - Execute cross-validation in R to choose the number of neighbors in a $K$-nearest neighbors classifier.
 - Describe the advantages and disadvantages of the $K$-nearest neighbors classification algorithm.
 
-## Evaluating accuracy
+## Evaluating performance
 
 Sometimes our classifier might make the wrong prediction. A classifier does not
 need to be right 100\% of the time to be useful, though we don't want the
@@ -71,13 +73,15 @@ and think about how our classifier will be used in practice. A biopsy will be
 performed on a *new* patient's tumor, the resulting image will be analyzed,
 and the classifier will be asked to decide whether the tumor is benign or
 malignant. The key word here is *new*: our classifier is "good" if it provides
-accurate predictions on data *not seen during training*. But then, how can we
-evaluate our classifier without visiting the hospital to collect more
+accurate predictions on data *not seen during training*, as this implies that
+it has actually learned about the relationship between the predictor variables and response variable,
+as opposed to simply memorizing the labels of individual training data examples.
+But then, how can we evaluate our classifier without visiting the hospital to collect more
 tumor images?
 
 The trick is to split the data into a **training set** \index{training set} and **test set** \index{test set} (Figure \@ref(fig:06-training-test))
 and use only the **training set** when building the classifier.
-Then, to evaluate the accuracy of the classifier, we first set aside the true labels from the **test set**,
+Then, to evaluate the performance of the classifier, we first set aside the true labels from the **test set**,
 and then use the classifier to predict the labels in the **test set**. If our predictions match the true
 labels for the observations in the **test set**, then we have some
 confidence that our classifier might also accurately predict the class
@@ -95,23 +99,116 @@ knitr::include_graphics("img/classification2/training_test.jpeg")
 
 How exactly can we assess how well our predictions match the true labels for
 the observations in the test set? One way we can do this is to calculate the
-**prediction accuracy**. \index{prediction accuracy|see{accuracy}}\index{accuracy} This is the fraction of examples for which the
+prediction **accuracy**. \index{prediction accuracy|see{accuracy}}\index{accuracy} This is the fraction of examples for which the
 classifier made the correct prediction. To calculate this, we divide the number
 of correct predictions by the number of predictions made.
-
-$$\mathrm{prediction \; accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}}$$
-
-
 The process for assessing if our predictions match the true labels in the
-test set is illustrated in Figure \@ref(fig:06-ML-paradigm-test). Note that there
-are other measures for how well classifiers perform, such as *precision* and *recall*;
-these will not be discussed here, but you will likely encounter them in other more advanced
-books on this topic.
+test set is illustrated in Figure \@ref(fig:06-ML-paradigm-test).
+
+$$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}}$$
 
 ```{r 06-ML-paradigm-test, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Process for splitting the data and finding the prediction accuracy.", fig.retina = 2, out.width = "100%"}
 knitr::include_graphics("img/classification2/ML-paradigm-test.png")
 ```
 
+Accuracy is a convenient, general-purpose way to summarize the performance of a classifier with
+a single number. But prediction accuracy by itself does not tell the whole
+story. In particular, accuracy alone only tells us how often the classifier
+makes mistakes in general, but does not tell us anything about the *kinds* of
+mistakes the classifier makes. A more comprehensive view of performance can be
+obtained by additionally examining the **confusion matrix**. The confusion
+matrix shows how many test set labels of each type are predicted correctly and
+incorrectly, which gives us more detail about the kinds of mistakes the
+classifier tends to make. Table \@ref(tab:confusion-matrix) shows an example
+of what a confusion matrix might look like for the tumor image data with
+a test set of 65 observations.
+
+Table: (\#tab:confusion-matrix) An example confusion matrix for the tumor image data.
+
+| | Truly Malignant | Truly Benign |
+| ---------------------- | --------------- | -------------- |
+| **Predicted Malignant** | 1 | 4 |
+| **Predicted Benign** | 3 | 57 |
+
+In the example in Table \@ref(tab:confusion-matrix), we see that there was
+1 malignant observation that was correctly classified as malignant (top left corner),
+and 57 benign observations that were correctly classified as benign (bottom right corner).
+However, we can also see that the classifier made some mistakes:
+it classified 3 malignant observations as benign, and 4 benign observations as
+malignant. The accuracy of this classifier is roughly
+89%, given by the formula
+
+$$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{1+57}{1+57+4+3} = 0.892$$
+
+But we can also see that the classifier only identified 1 out of 4 total malignant
+tumors; in other words, it misclassified 75% of the malignant cases present in the
+data set! In this example, misclassifying a malignant tumor is a potentially
+disastrous error, since it may lead to a patient who requires treatment not receiving it.
+Since we are particularly interested in identifying malignant cases, this
+classifier would likely be unacceptable even with an accuracy of 89%.
+
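To make the arithmetic in the added text above concrete, here is a minimal base R sketch (an illustration only, not part of the committed chapter source; the variable names are invented) that recomputes the example's accuracy from the four counts in the confusion matrix above:

```r
# Counts from the example confusion matrix above (65 test observations).
true_positives  <- 1   # truly malignant, predicted malignant
false_positives <- 4   # truly benign,    predicted malignant
false_negatives <- 3   # truly malignant, predicted benign
true_negatives  <- 57  # truly benign,    predicted benign

# Accuracy: correct predictions divided by all predictions.
(true_positives + true_negatives) /
  (true_positives + false_positives + false_negatives + true_negatives)
#> [1] 0.8923077

# Fraction of the truly malignant cases that the classifier missed.
false_negatives / (true_positives + false_negatives)
#> [1] 0.75
```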
+Focusing more on one label than the other is
+common in classification problems. In such cases, we typically refer to the label we are more
+interested in identifying as the *positive* label, and the other as the
+*negative* label. In the tumor example, we would refer to malignant
+observations as *positive*, and benign observations as *negative*. We can then
+use the following terms to talk about the four kinds of prediction that the
+classifier can make, corresponding to the four entries in the confusion matrix:
+
+- **True Positive:** A malignant observation that was classified as malignant (top left in Table \@ref(tab:confusion-matrix)).
+- **False Positive:** A benign observation that was classified as malignant (top right in Table \@ref(tab:confusion-matrix)).
+- **True Negative:** A benign observation that was classified as benign (bottom right in Table \@ref(tab:confusion-matrix)).
+- **False Negative:** A malignant observation that was classified as benign (bottom left in Table \@ref(tab:confusion-matrix)).
+
+A perfect classifier would have zero false negatives and false positives (and
+therefore, 100% accuracy). However, classifiers in practice will almost always
+make some errors. So you should think about which kinds of error are most
+important in your application, and use the confusion matrix to quantify and
+report them. Two commonly used metrics that we can compute using the confusion
+matrix are the **precision** and **recall** of the classifier. These are often
+reported together with accuracy. *Precision* quantifies how many of the
+positive predictions the classifier made were actually positive. Intuitively,
+we would like a classifier to have a *high* precision: for a classifier with
+high precision, if the classifier reports that a new observation is positive,
+we can trust that that new observation is indeed positive. We can compute the
+precision of a classifier using the entries in the confusion matrix, with the
+formula
+
+$$\mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}}.$$
+
+*Recall* quantifies how many of the positive observations in the test set were
+identified as positive. Intuitively, we would like a classifier to have a
+*high* recall: for a classifier with high recall, if there is a positive
+observation in the test data, we can trust that the classifier will find it.
+We can also compute the recall of the classifier using the entries in the
+confusion matrix, with the formula
+
+$$\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}}.$$
+
+In the example presented in Table \@ref(tab:confusion-matrix), we have that the precision and recall are
+
+$$\mathrm{precision} = \frac{1}{1+4} = 0.20, \quad \mathrm{recall} = \frac{1}{1+3} = 0.25.$$
+
+So even with an accuracy of 89%, the precision and recall of the classifier
+were both relatively low. For this data analysis context, recall is
+particularly important: if someone has a malignant tumor, we certainly want to
+identify it. A recall of just 25% would likely be unacceptable!
+
+> **Note:** It is difficult to achieve both high precision and high recall at
+> the same time; models with high precision tend to have low recall and vice
+> versa. As an example, we can easily make a classifier that has *perfect
+> recall*: just *always* guess positive! This classifier will of course find
+> every positive observation in the test set, but it will make lots of false
+> positive predictions along the way and have low precision. Similarly, we can
+> easily make a classifier that has *perfect precision*: *never* guess
+> positive! This classifier will never incorrectly identify an observation as
+> positive, but it will make a lot of false negative predictions along the way.
+> In fact, this classifier will have 0% recall! Of course, most real
+> classifiers fall somewhere in between these two extremes. But these examples
+> serve to show that in settings where one of the classes is of interest (i.e.,
+> there is a *positive* label), there is a trade-off between precision and recall that one has to
+> make when designing a classifier.
+
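Continuing the same illustration (again not part of the committed source), a short base R sketch of the precision and recall computations above, plus the "always guess positive" extreme described in the note:

```r
# Same counts as in the example confusion matrix above.
tp <- 1; fp <- 4; fn <- 3; tn <- 57

tp / (tp + fp)   # precision = 1/5 = 0.20
tp / (tp + fn)   # recall    = 1/4 = 0.25

# "Always guess positive" on the same 65 test observations: every truly
# malignant case is predicted malignant, so recall is perfect, but every
# truly benign case becomes a false positive, so precision collapses to the
# prevalence of the positive class.
(tp + fn) / (tp + fn)               # recall    = 1.00
(tp + fn) / (tp + fn + fp + tn)     # precision = 4/65, about 0.06
```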
 ## Randomness and seeds {#randomseeds}
 Beginning in this chapter, our data analyses will often involve the use
 of *randomness*. \index{random} We use randomness any time we need to make a decision in our
@@ -210,7 +307,7 @@ Different argument values in `set.seed` lead to different patterns of randomness
 you pick the same argument value your result will be the same.
 In the remainder of the textbook, we will set the seed once at the beginning of each chapter.
 
-## Evaluating accuracy with `tidymodels`
+## Evaluating performance with `tidymodels`
 Back to evaluating classifiers now!
 In R, we can use the `tidymodels` package \index{tidymodels} not only to perform $K$-nearest neighbors
 classification, but also to assess how well our classification worked.
@@ -394,11 +491,12 @@ cancer_test_predictions <- predict(knn_fit, cancer_test) |>
 cancer_test_predictions
 ```
 
-### Compute the accuracy
+### Evaluate performance
 
-Finally, we can assess our classifier's accuracy. To do this we use the `metrics` function \index{tidymodels!metrics}
-from `tidymodels` to get the statistics about the quality of our model, specifying
-the `truth` and `estimate` arguments:
+Finally, we can assess our classifier's performance. First, we will examine
+accuracy. To do this we use the
+`metrics` function \index{tidymodels!metrics} from `tidymodels`,
+specifying the `truth` and `estimate` arguments:
 
 ```{r 06-accuracy}
 cancer_test_predictions |>
@@ -413,7 +511,7 @@ cancer_acc_1 <- cancer_test_predictions |>
 ```
 
 In the metrics data frame, we filtered the `.metric` column since we are
-interested in the `accuracy` row. Other entries involve more advanced metrics that
+interested in the `accuracy` row. Other entries involve other metrics that
 are beyond the scope of this book. Looking at the value of the `.estimate` variable
 shows that the estimated accuracy of the classifier on the test data
 was `r round(100*cancer_acc_1$.estimate, 0)`%.
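For readers who also want precision and recall from `tidymodels`, the `yardstick` package (loaded with `tidymodels`) provides `precision` and `recall` functions that take the same `truth` and `estimate` arguments as `metrics`. A hedged sketch, assuming the `Class` and `.pred_class` column names used in the chunks above and that the positive class is the first factor level (otherwise pass `event_level = "second"`):

```r
# Sketch only: class-level metrics for the test set predictions.
library(tidymodels)

cancer_test_predictions |>
  precision(truth = Class, estimate = .pred_class)

cancer_test_predictions |>
  recall(truth = Class, estimate = .pred_class)
```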
@@ -436,31 +534,41 @@ confu22 <- (confusionmt |> filter(name == "cell_2_2"))$value
 ```
 
 The confusion matrix shows `r confu11` observations were correctly predicted
-as malignant, and `r confu22` were correctly predicted as benign. Therefore the classifier labeled
-`r confu11` + `r confu22` = `r confu11+confu22` observations
-correctly. It also shows that the classifier made some mistakes; in particular,
+as malignant, and `r confu22` were correctly predicted as benign.
+It also shows that the classifier made some mistakes; in particular,
 it classified `r confu21` observations as benign when they were truly malignant,
 and `r confu12` observations as malignant when they were truly benign.
+Using our formulas from earlier, we see that the accuracy agrees with what R reported,
+and can also compute the precision and recall of the classifier:
+
+$$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}} = \frac{`r confu11`+`r confu22`}{`r confu11`+`r confu22`+`r confu12`+`r confu21`} = `r round((confu11+confu22)/(confu11+confu22+confu12+confu21),3)`$$
+
+$$\mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}} = \frac{`r confu11`}{`r confu11` + `r confu12`} = `r round(confu11/(confu11+confu12), 3)`$$
+
+$$\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}} = \frac{`r confu11`}{`r confu11`+`r confu21`} = `r round(confu11/(confu11+confu21),3)`$$
+
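The inline R expressions above refer to cell values such as `confu11` and `confu21`; the hunk context line at the top of this diff shows they are pulled out of a tidied confusion matrix named `confusionmt`. A sketch of how the remaining cells might be extracted, assuming `confusionmt` comes from tidying a `conf_mat` result (with cells named `cell_<row>_<column>`) and the same column names as above:

```r
# Sketch only: extracting individual confusion matrix cells.
library(tidymodels)

confusion <- cancer_test_predictions |>
  conf_mat(truth = Class, estimate = .pred_class)

confusionmt <- tidy(confusion)  # columns: name ("cell_1_1", ...) and value

confu11 <- (confusionmt |> filter(name == "cell_1_1"))$value  # row 1, column 1
confu12 <- (confusionmt |> filter(name == "cell_1_2"))$value  # row 1, column 2
confu21 <- (confusionmt |> filter(name == "cell_2_1"))$value  # row 2, column 1
```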

 ### Critically analyze performance
 
 We now know that the classifier was `r round(100*cancer_acc_1$.estimate,0)`% accurate
-on the test data set. That sounds pretty good! Wait, *is* it good?
-Or do we need something higher?
-
-In general, what a *good* value for accuracy \index{accuracy!assessment} is depends on the application.
-For instance, suppose you are predicting whether a tumor is benign or malignant
-for a type of tumor that is benign 99% of the time. It is very easy to obtain
-a 99% accuracy just by guessing benign for every observation. In this case,
-99% accuracy is probably not good enough. And beyond just accuracy,
-sometimes the *kind* of mistake the classifier makes is important as well. In
-the previous example, it might be very bad for the classifier to predict
-"benign" when the true class is "malignant", as this might result in a patient
-not receiving appropriate medical attention. On the other hand, it might be
-less bad for the classifier to guess "malignant" when the true class is
-"benign", as the patient will then likely see a doctor who can provide an
-expert diagnosis. This is why it is important not only to look at accuracy, but
-also the confusion matrix.
+on the test data set, and had a precision of `r 100*round(confu11/(confu11+confu12),2)`% and a recall of `r 100*round(confu11/(confu11+confu21),2)`%.
+That sounds pretty good! Wait, *is* it good? Or do we need something higher?
+
+In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}
+depends on the application; you must critically analyze your accuracy in the context of the problem
+you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99%
+of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!).
+And beyond just accuracy, we need to consider the precision and recall: as mentioned
+earlier, the *kind* of mistake the classifier makes is
+important in many applications as well. In the previous example with 99% benign observations, it might be very bad for the
+classifier to predict "benign" when the true class is "malignant" (a false negative), as this
+might result in a patient not receiving appropriate medical attention. In other
+words, in this context, we need the classifier to have a *high recall*. On the
+other hand, it might be less bad for the classifier to guess "malignant" when
+the true class is "benign" (a false positive), as the patient will then likely see a doctor who
+can provide an expert diagnosis. In other words, we are fine with sacrificing
+some precision in the interest of achieving high recall. This is why it is
+important not only to look at accuracy, but also the confusion matrix.
 
 However, there is always an easy baseline that you can compare to for any
 classification problem: the *majority classifier*. The majority classifier \index{classification!majority}
@@ -518,8 +626,9 @@ By picking different values of $K$, we create different classifiers
 that make different predictions.
 
 So then, how do we pick the *best* value of $K$, i.e., *tune* the model?
-And is it possible to make this selection in a principled way? Ideally,
-we want somehow to maximize the performance of our classifier on data *it
+And is it possible to make this selection in a principled way? In this book,
+we will focus on maximizing the accuracy of the classifier. Ideally,
+we want somehow to maximize the accuracy of our classifier on data *it
 hasn't seen yet*. But we cannot use our test data set in the process of building
 our model. So we will play the same trick we did before when evaluating
 our classifier: we'll split our *training data itself* into two subsets,
