source/classification2.Rmd: 38 additions & 24 deletions
@@ -160,40 +160,54 @@ classifier can make, corresponding to the four entries in the confusion matrix:
-**True Negative:** A benign observation that was classified as benign (bottom right in Table \@ref(tab:confusion-matrix)).
-**False Negative:** A malignant observation that was classified as benign (bottom left in Table \@ref(tab:confusion-matrix)).
-A perfect classifier would have zero false negatives and false positives (and therefore, 100% accuracy).
-However, real classifiers in practice will almost always make some mistakes, so it is important to think
-about what type of error is more harmful. Two commonly used metrics that we can compute using the confusion matrix
-are the **precision** and **recall** of the classifier. These are often reported together with accuracy.
-*Precision* quantifies how many of the positive predictions the classifier made were actually positive. Intuitively,
-we would like a classifier to have a *high* precision: for a classifier with high precision, if the
-classifier reports that a new observation is positive, we can trust that that
-new observation is indeed positive. We can compute
-the precision of a classifier using the entries in the confusion matrix, with the formula
+A perfect classifier would have zero false negatives and false positives (and
+therefore, 100% accuracy). However, classifiers in practice will almost always
+make some errors. So you should think about which kinds of error are most
+important in your application, and use the confusion matrix to quantify and
+report them. Two commonly used metrics that we can compute using the confusion
+matrix are the **precision** and **recall** of the classifier. These are often
+reported together with accuracy. *Precision* quantifies how many of the
+positive predictions the classifier made were actually positive. Intuitively,
+we would like a classifier to have a *high* precision: for a classifier with
+high precision, if the classifier reports that a new observation is positive,
+we can trust that that new observation is indeed positive. We can compute the
+precision of a classifier using the entries in the confusion matrix, with the
+formula
 
 $$\mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}}.$$
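The precision formula above can be sketched directly from confusion-matrix entries. This chapter's own code is in R, but as an illustration here is a minimal Python version; the counts used are hypothetical, not from the chapter's tumor data.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Correct positive predictions divided by all positive predictions made."""
    return true_positives / (true_positives + false_positives)

# Hypothetical classifier: 4 positive predictions, 3 of them correct.
print(precision(true_positives=3, false_positives=1))  # 0.75
```

Note that precision only looks at the *predictions* the classifier made; positive observations it missed (false negatives) do not enter the formula.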
 
-*Recall* quantifies how many of the positive observations in the test set were identified as positive. Intuitively, we would like
-a classifier to have a *high* recall: for a classifier with high recall, if there is a positive observation in the test data, we can trust
-that the classifier will find it.
-We can also compute the recall of the classifier using the entries in the confusion matrix, with the formula
+*Recall* quantifies how many of the positive observations in the test set were
+identified as positive. Intuitively, we would like a classifier to have a
+*high* recall: for a classifier with high recall, if there is a positive
+observation in the test data, we can trust that the classifier will find it.
+We can also compute the recall of the classifier using the entries in the
+confusion matrix, with the formula
 
 $$\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}}.$$
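The recall formula can be sketched the same way. Again this is an illustrative Python version with hypothetical counts (the chapter's own analysis is in R): the denominator is now the number of positive observations in the test set, i.e. true positives plus false negatives.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Correct positive predictions divided by all positive test set observations."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical test set with 4 positive observations, 3 of which were found.
print(recall(true_positives=3, false_negatives=1))  # 0.75
```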
 
 In the example presented in Table \@ref(tab:confusion-matrix), we have that the precision and recall are
-So even with an accuracy of 89%, the precision and recall of the classifier were both relatively low. For this data analysis
-context, recall is particularly important: if someone has a malignant tumor, we certainly want to identify it.
-A recall of just 25% would likely be unacceptable!
-
-> **Note:** It is difficult to achieve both high precision and high recall at the same time; models with high precision tend to have low recall and vice versa.
-> As an example, we can easily make a classifier that has *perfect recall*: just *always* guess positive! This classifier will of course find every
-> positive observation in the test set, but it will make lots of false positive predictions along the way and have low precision. Similarly, we can easily
-> make a classifier that has *perfect precision*: *never* guess positive! This classifier will never incorrectly identify an obsevation as positive,
-> but it will make a lot of false negative predictions along the way. In fact, this classifier will have 0% recall! Of course, most real classifiers fall somewhere
-> in between these two extremes. But these examples serve to show that in settings where one of the classes is of interest (i.e., there is a *positive* label),
-> there is a trade-off between precision and recall that one has to make when designing a classifier.
+So even with an accuracy of 89%, the precision and recall of the classifier
+were both relatively low. For this data analysis context, recall is
+particularly important: if someone has a malignant tumor, we certainly want to
+identify it. A recall of just 25% would likely be unacceptable!
+
+> **Note:** It is difficult to achieve both high precision and high recall at
+> the same time; models with high precision tend to have low recall and vice
+> versa. As an example, we can easily make a classifier that has *perfect
+> recall*: just *always* guess positive! This classifier will of course find
+> every positive observation in the test set, but it will make lots of false
+> positive predictions along the way and have low precision. Similarly, we can
+> easily make a classifier that has *perfect precision*: *never* guess
+> positive! This classifier will never incorrectly identify an observation as
+> positive, but it will make a lot of false negative predictions along the way.
+> In fact, this classifier will have 0% recall! Of course, most real
+> classifiers fall somewhere in between these two extremes. But these examples
+> serve to show that in settings where one of the classes is of interest (i.e.,
+> there is a *positive* label), there is a trade-off between precision and
+> recall that one has to make when designing a classifier.
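The note's two extremes can be sketched on a toy test set. This is a Python illustration with made-up labels (the chapter itself uses R); note that for the never-guess-positive classifier the precision formula is strictly 0/0, so the sketch reports it as undefined rather than perfect.

```python
# 1 = positive (e.g., malignant), 0 = negative (e.g., benign); labels are hypothetical.
test_labels = [1, 0, 0, 1, 0, 0, 0, 1]

def scores(predictions, labels):
    """Return (precision, recall) for a list of 0/1 predictions against 0/1 labels."""
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    prec = tp / (tp + fp) if tp + fp > 0 else None  # undefined when nothing is predicted positive
    rec = tp / (tp + fn)
    return prec, rec

always_positive = [1] * len(test_labels)  # finds every positive, but many false positives
never_positive = [0] * len(test_labels)   # no false positives, but finds nothing

print(scores(always_positive, test_labels))  # (0.375, 1.0): perfect recall, low precision
print(scores(never_positive, test_labels))   # (None, 0.0): 0% recall
```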
 
 ## Randomness and seeds {#randomseeds}
 
 Beginning in this chapter, our data analyses will often involve the use