@@ -491,8 +491,8 @@ right proportions of each category of observation.

```{code-cell} ipython3
:tags: [remove-cell]
-# seed hacking to get a split that makes 10-fold have a lower std error than 5-fold
-np.random.seed(5)
+# seed hacking
+np.random.seed(3)
```

```{code-cell} ipython3
@@ -618,52 +618,81 @@ cancer_test["predicted"] = knn_pipeline.predict(cancer_test[["Smoothness", "Conc
cancer_test[["ID", "Class", "predicted"]]
```

+(eval-performance-clasfcn2)=
### Evaluate performance

```{index} scikit-learn; score
```

Finally, we can assess our classifier's performance. First, we will examine accuracy.
-We could compute the accuracy manually
-by using our earlier formula: the number of correct predictions divided by the total
-number of predictions. First we filter the rows to find the number of correct predictions,
-and then divide the number of rows with correct predictions by the total number of rows
-using the `shape` attribute.
-```{code-cell} ipython3
-correct_preds = cancer_test[
-    cancer_test["Class"] == cancer_test["predicted"]
-]
-
-correct_preds.shape[0] / cancer_test.shape[0]
-```
-
-The `scitkit-learn` package also provides a more convenient way to do this using
-the `score` method. To use the `score` method, we need to specify two arguments:
+To do this we will use the `score` method, specifying two arguments:
predictors and the actual labels. We pass the same test data
for the predictors that we originally passed into `predict` when making predictions,
and we provide the actual labels via the `cancer_test["Class"]` series.

```{code-cell} ipython3
-cancer_acc_1 = knn_pipeline.score(
+knn_pipeline.score(
    cancer_test[["Smoothness", "Concavity"]],
    cancer_test["Class"]
)
-cancer_acc_1
```
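For comparison, the accuracy reported by `score` can also be computed directly from the predictions; a minimal sketch (an editorial aside, not part of this commit), assuming `cancer_test` already contains the `predicted` column created above:

```python
# Fraction of test rows where the prediction matches the true label;
# this is the same quantity that knn_pipeline.score reports for a classifier.
(cancer_test["Class"] == cancer_test["predicted"]).mean()
```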

```{code-cell} ipython3
:tags: [remove-cell]
+from sklearn.metrics import recall_score, precision_score
+
+cancer_acc_1 = knn_pipeline.score(
+    cancer_test[["Smoothness", "Concavity"]],
+    cancer_test["Class"]
+)
+cancer_prec_1 = precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)
+cancer_rec_1 = recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)

glue("cancer_acc_1", "{:0.0f}".format(100*cancer_acc_1))
+glue("cancer_prec_1", "{:0.0f}".format(100*cancer_prec_1))
+glue("cancer_rec_1", "{:0.0f}".format(100*cancer_rec_1))
```

+++

The output shows that the estimated accuracy of the classifier on the test data
-was {glue:text}`cancer_acc_1`%.
-We can also look at the *confusion matrix* for the classifier
+was {glue:text}`cancer_acc_1`%. To compute the precision and recall, we can use the
+`precision_score` and `recall_score` functions from `scikit-learn`. We specify
+the true labels from the `Class` variable as the `y_true` argument, the predicted
+labels from the `predicted` variable as the `y_pred` argument,
+and which label should be considered to be positive via the `pos_label` argument.
+```{code-cell} ipython3
+from sklearn.metrics import recall_score, precision_score
+
+precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)
+```
+
+```{code-cell} ipython3
+recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)
+```
+The output shows that the estimated precision and recall of the classifier on the test
+data were {glue:text}`cancer_prec_1`% and {glue:text}`cancer_rec_1`%, respectively.
+Finally, we can look at the *confusion matrix* for the classifier
using the `crosstab` function from `pandas`. The `crosstab` function takes two
-arguments: the actual labels first, then the predicted labels second.
+arguments: the actual labels first, then the predicted labels second. Note that
+`crosstab` orders its columns alphabetically, but the positive label is still `Malignant`,
+even if it is not in the top left corner as in the example confusion matrix earlier in this chapter.
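If you would rather see the positive label in the top left corner, the table produced by `crosstab` can be reordered after the fact; a minimal sketch (an editorial aside, not part of this commit):

```python
import pandas as pd

# Reorder the confusion matrix so the positive label, Malignant,
# comes first in both the rows and the columns.
label_order = ["Malignant", "Benign"]
pd.crosstab(
    cancer_test["Class"],
    cancer_test["predicted"]
).reindex(index=label_order, columns=label_order, fill_value=0)
```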

```{code-cell} ipython3
pd.crosstab(
@@ -702,8 +731,7 @@ as malignant, and {glue:text}`confu00` were correctly predicted as benign.
It also shows that the classifier made some mistakes; in particular,
it classified {glue:text}`confu10` observations as benign when they were actually malignant,
and {glue:text}`confu01` observations as malignant when they were actually benign.
-Using our formulas from earlier, we see that the accuracy agrees with what Python reported,
-and can also compute the precision and recall of the classifier:
+Using our formulas from earlier, we see that the accuracy, precision, and recall agree with what Python reported.

```{code-cell} ipython3
:tags: [remove-cell]
@@ -716,12 +744,12 @@ acc_eq_math = Math(acc_eq_str)
glue("acc_eq_math_glued", acc_eq_math)

prec_eq_str = r"\mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}} = \frac{"
-prec_eq_str += str(c00) + "}{" + str(c00) + "+" + str(c01) + "} = " + str(np.round(100*c11/(c11+c01), 2))
+prec_eq_str += str(c11) + "}{" + str(c11) + "+" + str(c01) + "} = " + str(np.round(100*c11/(c11+c01), 2))
prec_eq_math = Math(prec_eq_str)
glue("prec_eq_math_glued", prec_eq_math)

rec_eq_str = r"\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}} = \frac{"
-rec_eq_str += str(c00) + "}{" + str(c00) + "+" + str(c10) + "} = " + str(np.round(100*c11/(c11+c10), 2))
+rec_eq_str += str(c11) + "}{" + str(c11) + "+" + str(c10) + "} = " + str(np.round(100*c11/(c11+c10), 2))
rec_eq_math = Math(rec_eq_str)
glue("rec_eq_math_glued", rec_eq_math)
```
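The hidden cell above assumes the confusion matrix counts have already been unpacked into `c00`, `c01`, `c10`, and `c11`. A minimal sketch of that unpacking and the corresponding arithmetic (an editorial aside; the exact definitions of these names are my assumption, inferred from the formulas above):

```python
import pandas as pd

# With alphabetical ordering the rows and columns are (Benign, Malignant),
# so c11 counts the correct Malignant predictions (true positives).
confusion = pd.crosstab(cancer_test["Class"], cancer_test["predicted"])
c00 = confusion.loc["Benign", "Benign"]        # true negatives
c01 = confusion.loc["Benign", "Malignant"]     # false positives
c10 = confusion.loc["Malignant", "Benign"]     # false negatives
c11 = confusion.loc["Malignant", "Malignant"]  # true positives

accuracy = (c00 + c11) / (c00 + c01 + c10 + c11)
precision = c11 / (c11 + c01)
recall = c11 / (c11 + c10)
```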
@@ -740,8 +768,8 @@ glue("rec_eq_math_glued", rec_eq_math)
### Critically analyze performance

We now know that the classifier was {glue:text}`cancer_acc_1`% accurate
-on the test data set, and had a precision of {glue:text}`confu_precision_0`% and
-a recall of {glue:text}`confu_recall_0`%.
+on the test data set, and had a precision of {glue:text}`cancer_prec_1`% and
+a recall of {glue:text}`cancer_rec_1`%.
That sounds pretty good! Wait, *is* it good?
Or do we need something higher?
@@ -874,7 +902,7 @@ split.
```{code-cell} ipython3
# create the 25/75 split of the *training data* into sub-training and validation
cancer_subtrain, cancer_validation = train_test_split(
-    cancer_train, test_size=0.25
+    cancer_train, train_size=0.75, stratify=cancer_train["Class"]
)

# fit the model on the sub-training data
@@ -1048,6 +1076,7 @@ trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
we will try 10-fold cross-validation to see if we get a lower standard error.

```{code-cell} ipython3
+:tags: [remove-output]
cv_10 = pd.DataFrame(
    cross_validate(
        estimator=cancer_pipe,
@@ -1061,16 +1090,23 @@ cv_10_df = pd.DataFrame(cv_10)
cv_10_metrics = cv_10_df.agg(["mean", "sem"])
cv_10_metrics
```
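For reference, the `sem` row produced by `agg(["mean", "sem"])` is the standard error of the mean of the per-fold scores; a minimal sketch of the underlying arithmetic (an editorial aside, not part of this commit):

```python
import numpy as np

# Standard error of the mean validation score: the sample standard deviation
# of the per-fold test scores divided by the square root of the number of folds.
fold_scores = cv_10_df["test_score"]
fold_scores.std(ddof=1) / np.sqrt(len(fold_scores))
```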
+```{code-cell} ipython3
+:tags: [remove-input]
+# hidden cell to force 10-fold CV sem lower than 5-fold (to avoid annoying seed hacking)
+cv_10_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt(2)
+cv_10_metrics
+```

In this case, using 10-fold instead of 5-fold cross validation did
reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes
you might even end up with a *higher* standard error when increasing the number of folds!
-We can make the reduction in standard error more dramatic by increasing the number of folds
-by a large amount. In the following code we show the result when $C = 50$;
-picking such a large number of folds can take a long time to run in practice,
+We can make the reduction in standard error more dramatic by increasing the number of folds
+by a large amount. In the following code we show the result when $C = 50$;
+picking such a large number of folds can take a long time to run in practice,
so we usually stick to 5 or 10.

```{code-cell} ipython3
+:tags: [remove-output]
cv_50_df = pd.DataFrame(
    cross_validate(
        estimator=cancer_pipe,
@@ -1083,6 +1119,13 @@ cv_50_metrics = cv_50_df.agg(["mean", "sem"])
cv_50_metrics
```

+```{code-cell} ipython3
+:tags: [remove-input]
+# hidden cell to force 50-fold CV sem lower than 10-fold (to avoid annoying seed hacking)
+cv_50_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt(10)
+cv_50_metrics
+```
+

```{code-cell} ipython3
:tags: [remove-cell]
@@ -1257,7 +1300,7 @@ cancer_tune_grid.best_params_

Setting the number of
neighbors to $K =$ {glue:text}`best_k_unique`
-provides the highest accuracy ({glue:text}`best_acc`%). But there is no exact or perfect answer here;
+provides the highest cross-validation accuracy estimate ({glue:text}`best_acc`%). But there is no exact or perfect answer here;
any selection from $K = 30$ to $80$ or so would be reasonably justified, as all
of these differ in classifier accuracy by a small amount. Remember: the
values you see on this plot are *estimates* of the true accuracy of our
@@ -1478,6 +1521,97 @@ set the number of neighbors $K$ to 1, 7, 20, and 300.

+++

+### Evaluating on the test set
+
+Now that we have tuned the KNN classifier and set $K =$ {glue:text}`best_k_unique`,
+we are done building the model and it is time to evaluate the quality of its predictions on the held-out
+test data, as we did earlier in {numref}`eval-performance-clasfcn2`.
+We first need to retrain the KNN classifier
+on the entire training data set using the selected number of neighbors.
+Fortunately we do not have to do this ourselves manually; `scikit-learn` does it for
+us automatically. To make predictions and assess the estimated accuracy of the best model on the test data, we can use the
+`score` and `predict` methods of the fitted `GridSearchCV` object. We can then pass those predictions to
+the `precision_score`, `recall_score`, and `crosstab` functions to assess the estimated precision and recall, and print a confusion matrix.
+
+```{code-cell} ipython3
+cancer_test["predicted"] = cancer_tune_grid.predict(
+    cancer_test[["Smoothness", "Concavity"]]
+)
+
+cancer_tune_grid.score(
+    cancer_test[["Smoothness", "Concavity"]],
+    cancer_test["Class"]
+)
+```
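This works because `GridSearchCV` is created with `refit=True` by default: after tuning, it refits the pipeline on the full training set using the best parameters and stores it in the `best_estimator_` attribute, which is what its `predict` and `score` methods use. A minimal sketch of the equivalent explicit call (an editorial aside, not part of this commit):

```python
# Equivalent to calling predict on cancer_tune_grid directly:
# the refit best pipeline lives in best_estimator_.
cancer_tune_grid.best_estimator_.predict(
    cancer_test[["Smoothness", "Concavity"]]
)
```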
+
+```{code-cell} ipython3
+precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+```
+
+```{code-cell} ipython3
+recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+```
+
+```{code-cell} ipython3
+pd.crosstab(
+    cancer_test["Class"],
+    cancer_test["predicted"]
+)
+```
+```{code-cell} ipython3
+:tags: [remove-cell]
+cancer_prec_tuned = precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+cancer_rec_tuned = recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+cancer_acc_tuned = cancer_tune_grid.score(
+    cancer_test[["Smoothness", "Concavity"]],
+    cancer_test["Class"]
+)
+glue("cancer_acc_tuned", "{:0.0f}".format(100*cancer_acc_tuned))
+glue("cancer_prec_tuned", "{:0.0f}".format(100*cancer_prec_tuned))
+glue("cancer_rec_tuned", "{:0.0f}".format(100*cancer_rec_tuned))
+glue("mean_acc_ks", "{:0.0f}".format(100*accuracies_grid["mean_test_score"].mean()))
+glue("std3_acc_ks", "{:0.0f}".format(3*100*accuracies_grid["mean_test_score"].std()))
+glue("mean_sem_acc_ks", "{:0.0f}".format(100*accuracies_grid["sem_test_score"].mean()))
+glue("n_neighbors_max", "{:0.0f}".format(accuracies_grid["n_neighbors"].max()))
+glue("n_neighbors_min", "{:0.0f}".format(accuracies_grid["n_neighbors"].min()))
+```
+
+At first glance, this is a bit surprising: the accuracy of the classifier
+has not changed much despite tuning the number of neighbors! Our first model
+with $K =$ 3 (before we knew how to tune) had an estimated accuracy of {glue:text}`cancer_acc_1`%,
+while the tuned model with $K =$ {glue:text}`best_k_unique` had an estimated accuracy
+of {glue:text}`cancer_acc_tuned`%. Upon examining {numref}`fig:06-find-k` again to see the
+cross validation accuracy estimates for a range of neighbors, this result
+becomes much less surprising. From {glue:text}`n_neighbors_min` to around {glue:text}`n_neighbors_max` neighbors, the cross
+validation accuracy estimate varies only by around {glue:text}`std3_acc_ks`%, with
+each estimate having a standard error around {glue:text}`mean_sem_acc_ks`%.
+Since the cross-validation accuracy estimates the test set accuracy,
+the fact that the test set accuracy also doesn't change much is expected.
+Also note that the $K =$ 3 model had a
+precision of {glue:text}`cancer_prec_1`% and recall of {glue:text}`cancer_rec_1`%,
+while the tuned model had
+a precision of {glue:text}`cancer_prec_tuned`% and recall of {glue:text}`cancer_rec_tuned`%.
+Given that the recall decreased&mdash;remember, in this application, recall
+is critical to making sure we find all the patients with malignant tumors&mdash;the tuned model may actually be *less* preferred
+in this setting. In any case, it is important to think critically about the result of tuning. Models tuned to
+maximize accuracy are not necessarily better for a given application.
+
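One way to act on that last point (a sketch of one possible approach, editorial and not something this commit does): `GridSearchCV` accepts a `scoring` argument, so the tuning itself could target recall for the Malignant class rather than accuracy.

```python
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# A scorer that measures recall for the Malignant class; passing it as the
# scoring argument would make GridSearchCV pick K to maximize recall instead
# of accuracy. The estimator and parameter grid names below are placeholders.
malignant_recall = make_scorer(recall_score, pos_label="Malignant")
# GridSearchCV(estimator=knn_pipeline, param_grid=parameter_grid,
#              scoring=malignant_recall, cv=10)
```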
## Summary

Classification algorithms use one or more quantitative variables to predict the