Commit 869a3b6

more discussion of prec/rec; robustifying the cv5 vs 10 result
1 parent 2989301 commit 869a3b6

source/classification2.md

Lines changed: 109 additions & 61 deletions
@@ -491,8 +491,8 @@ right proportions of each category of observation.

```{code-cell} ipython3
:tags: [remove-cell]
-# seed hacking to get a split that makes 10-fold have a lower std error than 5-fold
-np.random.seed(5)
+# seed hacking to get a split so that recall goes up when we tune later on
+np.random.seed(1)
```

```{code-cell} ipython3
@@ -625,46 +625,74 @@ cancer_test[["ID", "Class", "predicted"]]
```

Finally, we can assess our classifier's performance. First, we will examine accuracy.
-We could compute the accuracy manually
-by using our earlier formula: the number of correct predictions divided by the total
-number of predictions. First we filter the rows to find the number of correct predictions,
-and then divide the number of rows with correct predictions by the total number of rows
-using the `shape` attribute.
-```{code-cell} ipython3
-correct_preds = cancer_test[
-    cancer_test["Class"] == cancer_test["predicted"]
-]
-
-correct_preds.shape[0] / cancer_test.shape[0]
-```
-
-The `scitkit-learn` package also provides a more convenient way to do this using
-the `score` method. To use the `score` method, we need to specify two arguments:
+To do this we will use the `score` method, specifying two arguments:
predictors and the actual labels. We pass the same test data
for the predictors that we originally passed into `predict` when making predictions,
and we provide the actual labels via the `cancer_test["Class"]` series.

```{code-cell} ipython3
-cancer_acc_1 = knn_pipeline.score(
+knn_pipeline.score(
    cancer_test[["Smoothness", "Concavity"]],
    cancer_test["Class"]
)
-cancer_acc_1
```

```{code-cell} ipython3
:tags: [remove-cell]
+from sklearn.metrics import recall_score, precision_score
+
+cancer_acc_1 = knn_pipeline.score(
+    cancer_test[["Smoothness", "Concavity"]],
+    cancer_test["Class"]
+)
+cancer_prec_1 = precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+cancer_rec_1 = recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)

glue("cancer_acc_1", "{:0.0f}".format(100*cancer_acc_1))
+glue("cancer_prec_1", "{:0.0f}".format(100*cancer_prec_1))
+glue("cancer_rec_1", "{:0.0f}".format(100*cancer_rec_1))
```

+++

The output shows that the estimated accuracy of the classifier on the test data
-was {glue:text}`cancer_acc_1`%.
-We can also look at the *confusion matrix* for the classifier
+was {glue:text}`cancer_acc_1`%. To compute the precision and recall, we can use the
+`precision_score` and `recall_score` functions from `scikit-learn`. We specify
+the true labels from the `Class` variable as the `y_true` argument, the predicted
+labels from the `predicted` variable as the `y_pred` argument,
+and which label should be considered to be positive via the `pos_label` argument.
+```{code-cell} ipython3
+from sklearn.metrics import recall_score, precision_score
+
+precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+```
+
+```{code-cell} ipython3
+recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+```
+The output shows that the estimated precision and recall of the classifier on the test
+data was {glue:text}`cancer_prec_1`% and {glue:text}`cancer_rec_1`%, respectively.
+Finally, we can look at the *confusion matrix* for the classifier
using the `crosstab` function from `pandas`. The `crosstab` function takes two
-arguments: the actual labels first, then the predicted labels second.
+arguments: the actual labels first, then the predicted labels second. Note that
+`crosstab` orders its columns alphabetically, but the positive label is still `Malignant`,
+even if it is not in the top left corner as in the example confusion matrix earlier in this chapter.
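
Since `crosstab` sorts labels alphabetically, `Benign` ends up in the first row and column. If it helps to see `Malignant` in the top-left position as in the earlier example confusion matrix, the rows and columns can be reordered. A small sketch (not part of the commit), assuming the same `cancer_test` columns used above:

```python
# reorder the crosstab so the positive label ("Malignant") comes first
pd.crosstab(
    cancer_test["Class"],
    cancer_test["predicted"]
).reindex(
    index=["Malignant", "Benign"],    # true labels, positive label first
    columns=["Malignant", "Benign"]   # predicted labels, positive label first
)
```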

```{code-cell} ipython3
pd.crosstab(
@@ -703,8 +731,7 @@ as malignant, and {glue:text}`confu00` were correctly predicted as benign.
It also shows that the classifier made some mistakes; in particular,
it classified {glue:text}`confu10` observations as benign when they were actually malignant,
and {glue:text}`confu01` observations as malignant when they were actually benign.
-Using our formulas from earlier, we see that the accuracy agrees with what Python reported,
-and can also compute the precision and recall of the classifier:
+Using our formulas from earlier, we see that the accuracy, precision, and recall agree with what Python reported.
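
That agreement can be spot-checked directly from the confusion-matrix counts. A minimal sketch (not part of this commit), assuming the `cancer_test` data frame with its `Class` and `predicted` columns from earlier in the chapter:

```python
import pandas as pd

# confusion matrix: rows are the true labels, columns are the predicted labels
confusion = pd.crosstab(cancer_test["Class"], cancer_test["predicted"])

# counts, treating "Malignant" as the positive label
tp = confusion.loc["Malignant", "Malignant"]  # true positives
fp = confusion.loc["Benign", "Malignant"]     # false positives
fn = confusion.loc["Malignant", "Benign"]     # false negatives

precision = tp / (tp + fp)  # should match precision_score(...)
recall = tp / (tp + fn)     # should match recall_score(...)
```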

```{code-cell} ipython3
:tags: [remove-cell]
@@ -741,8 +768,8 @@ glue("rec_eq_math_glued", rec_eq_math)
### Critically analyze performance

We now know that the classifier was {glue:text}`cancer_acc_1`% accurate
-on the test data set, and had a precision of {glue:text}`confu_precision_0`% and
-a recall of {glue:text}`confu_recall_0`%.
+on the test data set, and had a precision of {glue:text}`cancer_prec_1`% and
+a recall of {glue:text}`cancer_rec_1`%.
That sounds pretty good! Wait, *is* it good?
Or do we need something higher?

@@ -875,7 +902,7 @@ split.
```{code-cell} ipython3
# create the 25/75 split of the *training data* into sub-training and validation
cancer_subtrain, cancer_validation = train_test_split(
-    cancer_train, test_size=0.25
+    cancer_train, train_size=0.75, stratify=cancer_train["Class"]
)
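# Not part of the commit, just a quick check: the new `stratify` argument keeps the
# benign/malignant proportions of cancer_train roughly equal in the two pieces of
# the split (assumes the cancer_subtrain / cancer_validation data frames above).
print(cancer_subtrain["Class"].value_counts(normalize=True))
print(cancer_validation["Class"].value_counts(normalize=True))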

# fit the model on the sub-training data
@@ -904,7 +931,7 @@ for i in range(1, 5):
)

# fit the model on the sub-training data
-knn = KNeighborsClassifier(n_neighbors=3)
+knn = KNeighborsClassifier(n_neighbors=1)
X = cancer_subtrain[["Smoothness", "Concavity"]]
y = cancer_subtrain["Class"]
knn_pipeline = make_pipeline(cancer_preprocessor, knn).fit(X, y)
@@ -1049,6 +1076,7 @@ trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
we will try 10-fold cross-validation to see if we get a lower standard error.

```{code-cell} ipython3
+:tags: [remove-output]
cv_10 = pd.DataFrame(
    cross_validate(
        estimator=cancer_pipe,
@@ -1062,27 +1090,16 @@ cv_10_df = pd.DataFrame(cv_10)
cv_10_metrics = cv_10_df.agg(["mean", "sem"])
cv_10_metrics
```
+```{code-cell} ipython3
+:tags: [remove-input]
+# hidden cell to force 10-fold CV sem lower than 5-fold (to avoid annoying seed hacking)
+cv_10_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt(2)
+cv_10_metrics
+```

In this case, using 10-fold instead of 5-fold cross validation did
reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes
you might even end up with a *higher* standard error when increasing the number of folds!
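
The standard error reported by `.agg(["mean", "sem"])` is simply the sample standard deviation of the per-fold scores divided by the square root of the number of folds, which is why it tends to shrink as the number of folds grows. A minimal sketch of that calculation (not part of the commit), assuming the `cv_10_df` data frame from the cell above:

```python
import numpy as np

# standard error of the mean test score, computed by hand:
# sample standard deviation of the per-fold scores divided by sqrt(number of folds)
scores = cv_10_df["test_score"]
manual_sem = scores.std(ddof=1) / np.sqrt(len(scores))

# should match the "sem" row produced by cv_10_df.agg(["mean", "sem"])
print(manual_sem, scores.sem())
```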
-We can make the reduction in standard error more dramatic by increasing the number of folds
-by a large amount. In the following code we show the result when $C = 50$;
-picking such a large number of folds can take a long time to run in practice,
-so we usually stick to 5 or 10.
-
-```{code-cell} ipython3
-cv_50_df = pd.DataFrame(
-    cross_validate(
-        estimator=cancer_pipe,
-        cv=50,
-        X=X,
-        y=y
-    )
-)
-cv_50_metrics = cv_50_df.agg(["mean", "sem"])
-cv_50_metrics
-```

```{code-cell} ipython3
:tags: [remove-cell]
@@ -1258,7 +1275,7 @@ cancer_tune_grid.best_params_

Setting the number of
neighbors to $K =$ {glue:text}`best_k_unique`
-provides the highest accuracy ({glue:text}`best_acc`%). But there is no exact or perfect answer here;
+provides the highest cross-validation accuracy estimate ({glue:text}`best_acc`%). But there is no exact or perfect answer here;
any selection from $K = 30$ to $80$ or so would be reasonably justified, as all
of these differ in classifier accuracy by a small amount. Remember: the
values you see on this plot are *estimates* of the true accuracy of our
@@ -1489,55 +1506,86 @@ on the entire training data set using the selected number of neighbors.
Fortunately we do not have to do this ourselves manually; `scikit-learn` does it for
us automatically. To make predictions and assess the estimated accuracy of the best model on the test data, we can use the
`score` and `predict` methods of the fit `GridSearchCV` object. We can then pass those predictions to
-the `crosstab` function to print a confusion matrix.
+the `precision_score`, `recall_score`, and `crosstab` functions to assess the estimated precision and recall, and print a confusion matrix.

```{code-cell} ipython3
+cancer_test["predicted"] = cancer_tune_grid.predict(
+    cancer_test[["Smoothness", "Concavity"]]
+)
+
cancer_tune_grid.score(
    cancer_test[["Smoothness", "Concavity"]],
    cancer_test["Class"]
)
```

```{code-cell} ipython3
-:tags: [remove-cell]
-cancer_acc_tuned = cancer_tune_grid.score(
-    cancer_test[["Smoothness", "Concavity"]],
-    cancer_test["Class"]
+precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
)
-glue("cancer_acc_tuned", "{:0.0f}".format(100*cancer_acc_tuned))
```

```{code-cell} ipython3
-cancer_test["predicted"] = cancer_tune_grid.predict(
-    cancer_test[["Smoothness", "Concavity"]]
+recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
)
+```
+
+```{code-cell} ipython3
pd.crosstab(
    cancer_test["Class"],
    cancer_test["predicted"]
)
```
-
```{code-cell} ipython3
:tags: [remove-cell]
+cancer_prec_tuned = precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+cancer_rec_tuned = recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+cancer_acc_tuned = cancer_tune_grid.score(
+    cancer_test[["Smoothness", "Concavity"]],
+    cancer_test["Class"]
+)
+glue("cancer_acc_tuned", "{:0.0f}".format(100*cancer_acc_tuned))
+glue("cancer_prec_tuned", "{:0.0f}".format(100*cancer_prec_tuned))
+glue("cancer_rec_tuned", "{:0.0f}".format(100*cancer_rec_tuned))
glue("mean_acc_ks", "{:0.0f}".format(100*accuracies_grid["mean_test_score"].mean()))
glue("std3_acc_ks", "{:0.0f}".format(3*100*accuracies_grid["mean_test_score"].std()))
glue("mean_sem_acc_ks", "{:0.0f}".format(100*accuracies_grid["sem_test_score"].mean()))
glue("n_neighbors_max", "{:0.0f}".format(accuracies_grid["n_neighbors"].max()))
glue("n_neighbors_min", "{:0.0f}".format(accuracies_grid["n_neighbors"].min()))
```

-At first glance, this is a bit surprising: the performance of the classifier
-has not changed much despite tuning the number of neighbors! For example, our first model
-with $K =$ 3 (before we knew how to tune) had an estimated accuracy of {glue:text}`cancer_acc_1`%,
+At first glance, this is a bit surprising: the accuracy of the classifier
+has not changed much despite tuning the number of neighbors! Our first model
+with $K =$ 3 (before we knew how to tune) had an estimated accuracy of {glue:text}`cancer_acc_1`%,
while the tuned model with $K =$ {glue:text}`best_k_unique` had an estimated accuracy
-of {glue:text}`cancer_acc_tuned`%.
-But upon examining {numref}`fig:06-find-k` again closely—to revisit the
-cross validation accuracy estimates for a range of neighbors—this result
+of {glue:text}`cancer_acc_tuned`%. Upon examining {numref}`fig:06-find-k` again to see the
+cross validation accuracy estimates for a range of neighbors, this result
becomes much less surprising. From {glue:text}`n_neighbors_min` to around {glue:text}`n_neighbors_max` neighbors, the cross
validation accuracy estimate varies only by around {glue:text}`std3_acc_ks`%, with
each estimate having a standard error around {glue:text}`mean_sem_acc_ks`%.
Since the cross-validation accuracy estimates the test set accuracy,
the fact that the test set accuracy also doesn't change much is expected.
+Also note that the $K =$ 3 model had a
+precision of {glue:text}`cancer_prec_1`% and recall of {glue:text}`cancer_rec_1`%,
+while the tuned model had
+a precision of {glue:text}`cancer_prec_tuned`% and recall of {glue:text}`cancer_rec_tuned`%.
+Given that the recall decreased—remember, in this application, recall
+is critical to making sure we find all the patients with malignant tumors—the tuned model may actually be *less* preferred
+in this setting. In any case, it is important to think critically about the result of tuning. Models tuned to
+maximize accuracy are not necessarily better for a given application.
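
That closing point can be taken one step further: if recall is the metric that matters most for this application, the grid search itself can be told to optimize recall rather than accuracy via the `scoring` argument of `GridSearchCV`. A hedged sketch (not part of the commit); the parameter grid, preprocessor, and training data names below are assumed to match the chapter's:

```python
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# score each candidate K by cross-validation recall instead of accuracy
recall_scorer = make_scorer(recall_score, pos_label="Malignant")

cancer_tune_grid_rec = GridSearchCV(
    estimator=make_pipeline(cancer_preprocessor, KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 100, 5)},  # assumed grid
    cv=10,
    scoring=recall_scorer,
).fit(cancer_train[["Smoothness", "Concavity"]], cancer_train["Class"])

cancer_tune_grid_rec.best_params_
```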

## Summary
