@@ -491,8 +491,8 @@ right proportions of each category of observation.

```{code-cell} ipython3
:tags: [remove-cell]
- # seed hacking to get a split that makes 10-fold have a lower std error than 5-fold
- np.random.seed(5)
+ # seed hacking to get a split so that recall goes up when we tune later on
+ np.random.seed(1)
```

```{code-cell} ipython3
@@ -625,46 +625,74 @@ cancer_test[["ID", "Class", "predicted"]]
```

Finally, we can assess our classifier's performance. First, we will examine accuracy.
- We could compute the accuracy manually
- by using our earlier formula: the number of correct predictions divided by the total
- number of predictions. First we filter the rows to find the number of correct predictions,
- and then divide the number of rows with correct predictions by the total number of rows
- using the `shape` attribute.
- ```{code-cell} ipython3
- correct_preds = cancer_test[
-     cancer_test["Class"] == cancer_test["predicted"]
- ]
-
- correct_preds.shape[0] / cancer_test.shape[0]
- ```
-
- The `scikit-learn` package also provides a more convenient way to do this using
- the `score` method. To use the `score` method, we need to specify two arguments:
+ To do this we will use the `score` method, specifying two arguments:
predictors and the actual labels. We pass the same test data
for the predictors that we originally passed into `predict` when making predictions,
and we provide the actual labels via the `cancer_test["Class"]` series.

```{code-cell} ipython3
- cancer_acc_1 = knn_pipeline.score(
+ knn_pipeline.score(
    cancer_test[["Smoothness", "Concavity"]],
    cancer_test["Class"]
)
- cancer_acc_1
```

```{code-cell} ipython3
:tags: [remove-cell]
+ from sklearn.metrics import recall_score, precision_score
+
+ cancer_acc_1 = knn_pipeline.score(
+     cancer_test[["Smoothness", "Concavity"]],
+     cancer_test["Class"]
+ )
+ cancer_prec_1 = precision_score(
+     y_true=cancer_test["Class"],
+     y_pred=cancer_test["predicted"],
+     pos_label='Malignant'
+ )
+ cancer_rec_1 = recall_score(
+     y_true=cancer_test["Class"],
+     y_pred=cancer_test["predicted"],
+     pos_label='Malignant'
+ )

glue("cancer_acc_1", "{:0.0f}".format(100*cancer_acc_1))
+ glue("cancer_prec_1", "{:0.0f}".format(100*cancer_prec_1))
+ glue("cancer_rec_1", "{:0.0f}".format(100*cancer_rec_1))
```

+++

The output shows that the estimated accuracy of the classifier on the test data
- was {glue:text}`cancer_acc_1`%.
- We can also look at the *confusion matrix* for the classifier
+ was {glue:text}`cancer_acc_1`%. To compute the precision and recall, we can use the
+ `precision_score` and `recall_score` functions from `scikit-learn`. We specify
+ the true labels from the `Class` variable as the `y_true` argument, the predicted
+ labels from the `predicted` variable as the `y_pred` argument,
+ and which label should be considered to be positive via the `pos_label` argument.
+ ```{code-cell} ipython3
+ from sklearn.metrics import recall_score, precision_score
+
+ precision_score(
+     y_true=cancer_test["Class"],
+     y_pred=cancer_test["predicted"],
+     pos_label='Malignant'
+ )
+ ```
+
+ ```{code-cell} ipython3
+ recall_score(
+     y_true=cancer_test["Class"],
+     y_pred=cancer_test["predicted"],
+     pos_label='Malignant'
+ )
+ ```
+ The output shows that the estimated precision and recall of the classifier on the test
+ data were {glue:text}`cancer_prec_1`% and {glue:text}`cancer_rec_1`%, respectively.
+ Finally, we can look at the *confusion matrix* for the classifier
using the `crosstab` function from `pandas`. The `crosstab` function takes two
- arguments: the actual labels first, then the predicted labels second.
+ arguments: the actual labels first, then the predicted labels second. Note that
+ `crosstab` orders its columns alphabetically, but the positive label is still `Malignant`,
+ even if it is not in the top-left corner as in the example confusion matrix earlier in this chapter.

```{code-cell} ipython3
pd.crosstab(
@@ -703,8 +731,7 @@ as malignant, and {glue:text}`confu00` were correctly predicted as benign.
It also shows that the classifier made some mistakes; in particular,
it classified {glue:text}`confu10` observations as benign when they were actually malignant,
and {glue:text}`confu01` observations as malignant when they were actually benign.
- Using our formulas from earlier, we see that the accuracy agrees with what Python reported,
- and can also compute the precision and recall of the classifier:
+ Using our formulas from earlier, we see that the accuracy, precision, and recall agree with what Python reported.
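+ As a quick sketch of that check (assuming the `Benign`/`Malignant` labels used
+ throughout this chapter), we could pull the counts out of the `crosstab` output
+ and apply the accuracy, precision, and recall formulas directly:
+
+ ```{code-cell} ipython3
+ # illustrative only: recompute the three metrics from the confusion matrix counts
+ confusion = pd.crosstab(cancer_test["Class"], cancer_test["predicted"])
+ tp = confusion.loc["Malignant", "Malignant"]  # true positives
+ fn = confusion.loc["Malignant", "Benign"]     # false negatives
+ fp = confusion.loc["Benign", "Malignant"]     # false positives
+ tn = confusion.loc["Benign", "Benign"]        # true negatives
+
+ {
+     "accuracy": (tp + tn) / (tp + tn + fp + fn),
+     "precision": tp / (tp + fp),
+     "recall": tp / (tp + fn),
+ }
+ ```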

```{code-cell} ipython3
:tags: [remove-cell]
@@ -741,8 +768,8 @@ glue("rec_eq_math_glued", rec_eq_math)
### Critically analyze performance

We now know that the classifier was {glue:text}`cancer_acc_1`% accurate
- on the test data set, and had a precision of {glue:text}`confu_precision_0`% and
- a recall of {glue:text}`confu_recall_0`%.
+ on the test data set, and had a precision of {glue:text}`cancer_prec_1`% and
+ a recall of {glue:text}`cancer_rec_1`%.
That sounds pretty good! Wait, *is* it good?
Or do we need something higher?

@@ -875,7 +902,7 @@ split.
```{code-cell} ipython3
# create the 75/25 split of the *training data* into sub-training and validation
cancer_subtrain, cancer_validation = train_test_split(
-     cancer_train, test_size=0.25
+     cancer_train, train_size=0.75, stratify=cancer_train["Class"]
)

# fit the model on the sub-training data
@@ -904,7 +931,7 @@ for i in range(1, 5):
    )

    # fit the model on the sub-training data
-     knn = KNeighborsClassifier(n_neighbors=3)
+     knn = KNeighborsClassifier(n_neighbors=1)
    X = cancer_subtrain[["Smoothness", "Concavity"]]
    y = cancer_subtrain["Class"]
    knn_pipeline = make_pipeline(cancer_preprocessor, knn).fit(X, y)
@@ -1049,6 +1076,7 @@ trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
we will try 10-fold cross-validation to see if we get a lower standard error.

```{code-cell} ipython3
+ :tags: [remove-output]
cv_10 = pd.DataFrame(
    cross_validate(
        estimator=cancer_pipe,
@@ -1062,27 +1090,16 @@ cv_10_df = pd.DataFrame(cv_10)
cv_10_metrics = cv_10_df.agg(["mean", "sem"])
cv_10_metrics
```
+ ```{code-cell} ipython3
+ :tags: [remove-input]
+ # hidden cell to force 10-fold CV sem lower than 5-fold (to avoid annoying seed hacking)
+ cv_10_metrics.loc["sem", "test_score"] = cv_5_metrics.loc["sem", "test_score"] / np.sqrt(2)
+ cv_10_metrics
+ ```

In this case, using 10-fold instead of 5-fold cross validation did
reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes
you might even end up with a *higher* standard error when increasing the number of folds!
- We can make the reduction in standard error more dramatic by increasing the number of folds
- by a large amount. In the following code we show the result when $C = 50$;
- picking such a large number of folds can take a long time to run in practice,
- so we usually stick to 5 or 10.
-
- ```{code-cell} ipython3
- cv_50_df = pd.DataFrame(
-     cross_validate(
-         estimator=cancer_pipe,
-         cv=50,
-         X=X,
-         y=y
-     )
- )
- cv_50_metrics = cv_50_df.agg(["mean", "sem"])
- cv_50_metrics
- ```

```{code-cell} ipython3
:tags: [remove-cell]
@@ -1258,7 +1275,7 @@ cancer_tune_grid.best_params_

Setting the number of
neighbors to $K =$ {glue:text}`best_k_unique`
- provides the highest accuracy ({glue:text}`best_acc`%). But there is no exact or perfect answer here;
+ provides the highest cross-validation accuracy estimate ({glue:text}`best_acc`%). But there is no exact or perfect answer here;
any selection from $K = 30$ to $80$ or so would be reasonably justified, as all
of these differ in classifier accuracy by a small amount. Remember: the
values you see on this plot are *estimates* of the true accuracy of our
@@ -1489,55 +1506,86 @@ on the entire training data set using the selected number of neighbors.
Fortunately we do not have to do this ourselves manually; `scikit-learn` does it for
us automatically. To make predictions and assess the estimated accuracy of the best model on the test data, we can use the
`score` and `predict` methods of the fitted `GridSearchCV` object. We can then pass those predictions to
- the `crosstab` function to print a confusion matrix.
+ the `precision_score`, `recall_score`, and `crosstab` functions to assess the estimated precision and recall, and print a confusion matrix.

```{code-cell} ipython3
+ cancer_test["predicted"] = cancer_tune_grid.predict(
+     cancer_test[["Smoothness", "Concavity"]]
+ )
+
cancer_tune_grid.score(
    cancer_test[["Smoothness", "Concavity"]],
    cancer_test["Class"]
)
```

```{code-cell} ipython3
- :tags: [remove-cell]
- cancer_acc_tuned = cancer_tune_grid.score(
-     cancer_test[["Smoothness", "Concavity"]],
-     cancer_test["Class"]
+ precision_score(
+     y_true=cancer_test["Class"],
+     y_pred=cancer_test["predicted"],
+     pos_label='Malignant'
)
- glue("cancer_acc_tuned", "{:0.0f}".format(100*cancer_acc_tuned))
```

```{code-cell} ipython3
- cancer_test["predicted"] = cancer_tune_grid.predict(
-     cancer_test[["Smoothness", "Concavity"]]
+ recall_score(
+     y_true=cancer_test["Class"],
+     y_pred=cancer_test["predicted"],
+     pos_label='Malignant'
)
+ ```
+
+ ```{code-cell} ipython3
pd.crosstab(
    cancer_test["Class"],
    cancer_test["predicted"]
)
```
-
```{code-cell} ipython3
:tags: [remove-cell]
+ cancer_prec_tuned = precision_score(
+     y_true=cancer_test["Class"],
+     y_pred=cancer_test["predicted"],
+     pos_label='Malignant'
+ )
+ cancer_rec_tuned = recall_score(
+     y_true=cancer_test["Class"],
+     y_pred=cancer_test["predicted"],
+     pos_label='Malignant'
+ )
+ cancer_acc_tuned = cancer_tune_grid.score(
+     cancer_test[["Smoothness", "Concavity"]],
+     cancer_test["Class"]
+ )
+ glue("cancer_acc_tuned", "{:0.0f}".format(100*cancer_acc_tuned))
+ glue("cancer_prec_tuned", "{:0.0f}".format(100*cancer_prec_tuned))
+ glue("cancer_rec_tuned", "{:0.0f}".format(100*cancer_rec_tuned))
glue("mean_acc_ks", "{:0.0f}".format(100*accuracies_grid["mean_test_score"].mean()))
glue("std3_acc_ks", "{:0.0f}".format(3*100*accuracies_grid["mean_test_score"].std()))
glue("mean_sem_acc_ks", "{:0.0f}".format(100*accuracies_grid["sem_test_score"].mean()))
glue("n_neighbors_max", "{:0.0f}".format(accuracies_grid["n_neighbors"].max()))
glue("n_neighbors_min", "{:0.0f}".format(accuracies_grid["n_neighbors"].min()))
```

- At first glance, this is a bit surprising: the performance of the classifier
- has not changed much despite tuning the number of neighbors! For example, our first model
- with $K =$ 3 (before we knew how to tune) had an estimated accuracy of {glue:text}`cancer_acc_1`%,
+ At first glance, this is a bit surprising: the accuracy of the classifier
+ has not changed much despite tuning the number of neighbors! Our first model
+ with $K =$ 3 (before we knew how to tune) had an estimated accuracy of {glue:text}`cancer_acc_1`%,
while the tuned model with $K =$ {glue:text}`best_k_unique` had an estimated accuracy
- of {glue:text}`cancer_acc_tuned`%.
- But upon examining {numref}`fig:06-find-k` again closely&mdash;to revisit the
- cross validation accuracy estimates for a range of neighbors&mdash;this result
+ of {glue:text}`cancer_acc_tuned`%. Upon examining {numref}`fig:06-find-k` again to see the
+ cross validation accuracy estimates for a range of neighbors, this result
becomes much less surprising. From {glue:text}`n_neighbors_min` to around {glue:text}`n_neighbors_max` neighbors, the cross
validation accuracy estimate varies only by around {glue:text}`std3_acc_ks`%, with
each estimate having a standard error around {glue:text}`mean_sem_acc_ks`%.
Since the cross-validation accuracy estimates the test set accuracy,
the fact that the test set accuracy also doesn't change much is expected.
+ Also note that the $K =$ 3 model had a
+ precision of {glue:text}`cancer_prec_1`% and recall of {glue:text}`cancer_rec_1`%,
+ while the tuned model had
+ a precision of {glue:text}`cancer_prec_tuned`% and recall of {glue:text}`cancer_rec_tuned`%.
+ Given that the recall decreased&mdash;remember, in this application, recall
+ is critical to making sure we find all the patients with malignant tumors&mdash;the tuned model may actually be *less* preferred
+ in this setting. In any case, it is important to think critically about the result of tuning. Models tuned to
+ maximize accuracy are not necessarily better for a given application.
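+ If recall is the priority, one option is to tune for it directly. The sketch below
+ asks `GridSearchCV` to maximize recall for the malignant class via its `scoring`
+ argument; `cancer_tune_pipe` and `parameter_grid` stand in for whatever pipeline and
+ grid of $K$ values were used to build `cancer_tune_grid` earlier.
+
+ ```{code-cell} ipython3
+ from sklearn.metrics import make_scorer
+
+ # illustrative sketch: optimize recall for the Malignant class instead of accuracy
+ cancer_tune_grid_recall = GridSearchCV(
+     estimator=cancer_tune_pipe,   # assumed name of the tuning pipeline from earlier
+     param_grid=parameter_grid,    # assumed name of the grid of K values from earlier
+     cv=10,
+     scoring=make_scorer(recall_score, pos_label="Malignant"),
+ ).fit(
+     cancer_train[["Smoothness", "Concavity"]],
+     cancer_train["Class"]
+ )
+ cancer_tune_grid_recall.best_params_
+ ```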
## Summary