
Commit 2a9814e

Merge pull request #314 from UBC-DSCI/train-test-improvements
Various improvements to predictive chapters
2 parents 1845e73 + 5cfef6e commit 2a9814e

File tree

3 files changed: +180 −39 lines changed


source/classification2.md

Lines changed: 167 additions & 33 deletions
@@ -491,8 +491,8 @@ right proportions of each category of observation.
 
 ```{code-cell} ipython3
 :tags: [remove-cell]
-# seed hacking to get a split that makes 10-fold have a lower std error than 5-fold
-np.random.seed(5)
+# seed hacking
+np.random.seed(3)
 ```
 
 ```{code-cell} ipython3
@@ -618,52 +618,81 @@ cancer_test["predicted"] = knn_pipeline.predict(cancer_test[["Smoothness", "Conc
 cancer_test[["ID", "Class", "predicted"]]
 ```
 
+(eval-performance-clasfcn2)=
 ### Evaluate performance
 
 ```{index} scikit-learn; score
 ```
 
 Finally, we can assess our classifier's performance. First, we will examine accuracy.
-We could compute the accuracy manually
-by using our earlier formula: the number of correct predictions divided by the total
-number of predictions. First we filter the rows to find the number of correct predictions,
-and then divide the number of rows with correct predictions by the total number of rows
-using the `shape` attribute.
-```{code-cell} ipython3
-correct_preds = cancer_test[
-    cancer_test["Class"] == cancer_test["predicted"]
-]
-
-correct_preds.shape[0] / cancer_test.shape[0]
-```
-
-The `scitkit-learn` package also provides a more convenient way to do this using
-the `score` method. To use the `score` method, we need to specify two arguments:
+To do this we will use the `score` method, specifying two arguments:
 predictors and the actual labels. We pass the same test data
 for the predictors that we originally passed into `predict` when making predictions,
 and we provide the actual labels via the `cancer_test["Class"]` series.
 
 ```{code-cell} ipython3
-cancer_acc_1 = knn_pipeline.score(
+knn_pipeline.score(
     cancer_test[["Smoothness", "Concavity"]],
     cancer_test["Class"]
 )
-cancer_acc_1
 ```
 
 ```{code-cell} ipython3
 :tags: [remove-cell]
+from sklearn.metrics import recall_score, precision_score
+
+cancer_acc_1 = knn_pipeline.score(
+    cancer_test[["Smoothness", "Concavity"]],
+    cancer_test["Class"]
+)
+cancer_prec_1 = precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)
+cancer_rec_1 = recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)
 
 glue("cancer_acc_1", "{:0.0f}".format(100*cancer_acc_1))
+glue("cancer_prec_1", "{:0.0f}".format(100*cancer_prec_1))
+glue("cancer_rec_1", "{:0.0f}".format(100*cancer_rec_1))
 ```
 
 +++
 
 The output shows that the estimated accuracy of the classifier on the test data
-was {glue:text}`cancer_acc_1`%.
-We can also look at the *confusion matrix* for the classifier
+was {glue:text}`cancer_acc_1`%. To compute the precision and recall, we can use the
+`precision_score` and `recall_score` functions from `scikit-learn`. We specify
+the true labels from the `Class` variable as the `y_true` argument, the predicted
+labels from the `predicted` variable as the `y_pred` argument,
+and which label should be considered to be positive via the `pos_label` argument.
+```{code-cell} ipython3
+from sklearn.metrics import recall_score, precision_score
+
+precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)
+```
+
+```{code-cell} ipython3
+recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)
+```
+The output shows that the estimated precision and recall of the classifier on the test
+data was {glue:text}`cancer_prec_1`% and {glue:text}`cancer_rec_1`%, respectively.
+Finally, we can look at the *confusion matrix* for the classifier
 using the `crosstab` function from `pandas`. The `crosstab` function takes two
-arguments: the actual labels first, then the predicted labels second.
+arguments: the actual labels first, then the predicted labels second. Note that
+`crosstab` orders its columns alphabetically, but the positive label is still `Malignant`,
+even if it is not in the top left corner as in the example confusion matrix earlier in this chapter.
 
 ```{code-cell} ipython3
 pd.crosstab(
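
> Reviewer's note: the pattern this hunk teaches is standard `scikit-learn` metrics usage. Below is a minimal, self-contained sketch on made-up labels; the chapter's `cancer_test` frame and `knn_pipeline` are not reproduced here, so the data is purely illustrative.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted labels, standing in for the chapter's
# cancer_test["Class"] and cancer_test["predicted"] columns.
labels = pd.DataFrame({
    "Class":     ["Malignant", "Benign", "Malignant", "Benign", "Malignant"],
    "predicted": ["Malignant", "Benign", "Benign",    "Benign", "Malignant"],
})

# Accuracy: fraction of correct predictions.
print(accuracy_score(labels["Class"], labels["predicted"]))  # 4/5 = 0.8

# Precision and recall need pos_label to say which class counts as positive.
print(precision_score(labels["Class"], labels["predicted"], pos_label="Malignant"))  # 2/2 = 1.0
print(recall_score(labels["Class"], labels["predicted"], pos_label="Malignant"))     # 2/3 ≈ 0.67

# Confusion matrix via pandas; note the alphabetical row/column ordering
# that the new prose in this hunk warns about.
print(pd.crosstab(labels["Class"], labels["predicted"]))
```
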
@@ -702,8 +731,7 @@ as malignant, and {glue:text}`confu00` were correctly predicted as benign.
 It also shows that the classifier made some mistakes; in particular,
 it classified {glue:text}`confu10` observations as benign when they were actually malignant,
 and {glue:text}`confu01` observations as malignant when they were actually benign.
-Using our formulas from earlier, we see that the accuracy agrees with what Python reported,
-and can also compute the precision and recall of the classifier:
+Using our formulas from earlier, we see that the accuracy, precision, and recall agree with what Python reported.
 
 ```{code-cell} ipython3
 :tags: [remove-cell]
@@ -716,12 +744,12 @@ acc_eq_math = Math(acc_eq_str)
 glue("acc_eq_math_glued", acc_eq_math)
 
 prec_eq_str = r"\mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}} = \frac{"
-prec_eq_str += str(c00) + "}{" + str(c00) + "+" + str(c01) + "} = " + str( np.round(100*c11/(c11+c01), 2))
+prec_eq_str += str(c11) + "}{" + str(c11) + "+" + str(c01) + "} = " + str( np.round(100*c11/(c11+c01), 2))
 prec_eq_math = Math(prec_eq_str)
 glue("prec_eq_math_glued", prec_eq_math)
 
 rec_eq_str = r"\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}} = \frac{"
-rec_eq_str += str(c00) + "}{" + str(c00) + "+" + str(c10) + "} = " + str( np.round(100*c11/(c11+c10), 2))
+rec_eq_str += str(c11) + "}{" + str(c11) + "+" + str(c10) + "} = " + str( np.round(100*c11/(c11+c10), 2))
 rec_eq_math = Math(rec_eq_str)
 glue("rec_eq_math_glued", rec_eq_math)
 ```
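
> Reviewer's note: the bug fixed in this hunk substituted `c00` (the benign/benign count) for `c11` (the true-positive count) in the rendered formulas. The corrected arithmetic is easy to sanity-check; the counts below are hypothetical stand-ins for the chapter's crosstab entries (rows/columns ordered Benign, Malignant).

```python
# Hypothetical confusion-matrix entries from a crosstab of actual vs. predicted labels.
c00 = 86  # true Benign,    predicted Benign    (true negatives)
c01 = 4   # true Benign,    predicted Malignant (false positives)
c10 = 7   # true Malignant, predicted Benign    (false negatives)
c11 = 46  # true Malignant, predicted Malignant (true positives)

accuracy = (c00 + c11) / (c00 + c01 + c10 + c11)
precision = c11 / (c11 + c01)  # correct positive predictions / all positive predictions
recall = c11 / (c11 + c10)     # correct positive predictions / all positive test observations

print(accuracy, precision, recall)
```
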
@@ -740,8 +768,8 @@ glue("rec_eq_math_glued", rec_eq_math)
 ### Critically analyze performance
 
 We now know that the classifier was {glue:text}`cancer_acc_1`% accurate
-on the test data set, and had a precision of {glue:text}`confu_precision_0`% and
-a recall of {glue:text}`confu_recall_0`%.
+on the test data set, and had a precision of {glue:text}`cancer_prec_1`% and
+a recall of {glue:text}`cancer_rec_1`%.
 That sounds pretty good! Wait, *is* it good?
 Or do we need something higher?
 
@@ -874,7 +902,7 @@ split.
 ```{code-cell} ipython3
 # create the 25/75 split of the *training data* into sub-training and validation
 cancer_subtrain, cancer_validation = train_test_split(
-    cancer_train, test_size=0.25
+    cancer_train, train_size=0.75, stratify=cancer_train["Class"]
 )
 
 # fit the model on the sub-training data
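
> Reviewer's note: the `stratify` argument added here keeps the class proportions of the sub-training and validation sets close to those of the full training set. A self-contained sketch on synthetic data (the variable names are illustrative, not the chapter's):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class data frame.
toy = pd.DataFrame({
    "x": range(100),
    "Class": ["Malignant"] * 37 + ["Benign"] * 63,
})

subtrain, validation = train_test_split(
    toy, train_size=0.75, stratify=toy["Class"], random_state=0
)

# Both splits preserve roughly the 37/63 class balance.
print(subtrain["Class"].value_counts(normalize=True))
print(validation["Class"].value_counts(normalize=True))
```
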
@@ -1048,6 +1076,7 @@ trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
 we will try 10-fold cross-validation to see if we get a lower standard error.
 
 ```{code-cell} ipython3
+:tags: [remove-output]
 cv_10 = pd.DataFrame(
     cross_validate(
         estimator=cancer_pipe,
@@ -1061,16 +1090,23 @@ cv_10_df = pd.DataFrame(cv_10)
 cv_10_metrics = cv_10_df.agg(["mean", "sem"])
 cv_10_metrics
 ```
+```{code-cell} ipython3
+:tags: [remove-input]
+# hidden cell to force 10-fold CV sem lower than 5-fold (to avoid annoying seed hacking)
+cv_10_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt(2)
+cv_10_metrics
+```
 
 In this case, using 10-fold instead of 5-fold cross validation did
 reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes
 you might even end up with a *higher* standard error when increasing the number of folds!
-We can make the reduction in standard error more dramatic by increasing the number of folds
-by a large amount. In the following code we show the result when $C = 50$;
-picking such a large number of folds can take a long time to run in practice,
+We can make the reduction in standard error more dramatic by increasing the number of folds
+by a large amount. In the following code we show the result when $C = 50$;
+picking such a large number of folds can take a long time to run in practice,
 so we usually stick to 5 or 10.
 
 ```{code-cell} ipython3
+:tags: [remove-output]
 cv_50_df = pd.DataFrame(
     cross_validate(
         estimator=cancer_pipe,
@@ -1083,6 +1119,13 @@ cv_50_metrics = cv_50_df.agg(["mean", "sem"])
 cv_50_metrics
 ```
 
+```{code-cell} ipython3
+:tags: [remove-input]
+# hidden cell to force 50-fold CV sem lower than 5-fold (to avoid annoying seed hacking)
+cv_50_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt(10)
+cv_50_metrics
+```
+
 ```{code-cell} ipython3
 :tags: [remove-cell]
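
> Reviewer's note: the `agg(["mean", "sem"])` pattern in these hunks summarizes the per-fold scores returned by `cross_validate`. A runnable sketch using scikit-learn's built-in breast cancer data (an assumption here; the chapter uses its own `cancer_pipe` and data frame):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# 10-fold cross-validation; each row of the resulting frame is one fold.
cv_10 = pd.DataFrame(cross_validate(estimator=pipe, X=X, y=y, cv=10))

# Mean and standard error of the fold-by-fold validation accuracies.
print(cv_10.agg(["mean", "sem"])["test_score"])
```
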
@@ -1257,7 +1300,7 @@ cancer_tune_grid.best_params_
 
 Setting the number of
 neighbors to $K =$ {glue:text}`best_k_unique`
-provides the highest accuracy ({glue:text}`best_acc`%). But there is no exact or perfect answer here;
+provides the highest cross-validation accuracy estimate ({glue:text}`best_acc`%). But there is no exact or perfect answer here;
 any selection from $K = 30$ to $80$ or so would be reasonably justified, as all
 of these differ in classifier accuracy by a small amount. Remember: the
 values you see on this plot are *estimates* of the true accuracy of our
@@ -1478,6 +1521,97 @@ set the number of neighbors $K$ to 1, 7, 20, and 300.
 
 +++
 
+### Evaluating on the test set
+
+Now that we have tuned the KNN classifier and set $K =$ {glue:text}`best_k_unique`,
+we are done building the model and it is time to evaluate the quality of its predictions on the held out
+test data, as we did earlier in {numref}`eval-performance-clasfcn2`.
+We first need to retrain the KNN classifier
+on the entire training data set using the selected number of neighbors.
+Fortunately we do not have to do this ourselves manually; `scikit-learn` does it for
+us automatically. To make predictions and assess the estimated accuracy of the best model on the test data, we can use the
+`score` and `predict` methods of the fit `GridSearchCV` object. We can then pass those predictions to
+the `precision_score`, `recall_score`, and `crosstab` functions to assess the estimated precision and recall, and print a confusion matrix.
+
+```{code-cell} ipython3
+cancer_test["predicted"] = cancer_tune_grid.predict(
+    cancer_test[["Smoothness", "Concavity"]]
+)
+
+cancer_tune_grid.score(
+    cancer_test[["Smoothness", "Concavity"]],
+    cancer_test["Class"]
+)
+```
+
+```{code-cell} ipython3
+precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+```
+
+```{code-cell} ipython3
+recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+```
+
+```{code-cell} ipython3
+pd.crosstab(
+    cancer_test["Class"],
+    cancer_test["predicted"]
+)
+```
+```{code-cell} ipython3
+:tags: [remove-cell]
+cancer_prec_tuned = precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+cancer_rec_tuned = recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+cancer_acc_tuned = cancer_tune_grid.score(
+    cancer_test[["Smoothness", "Concavity"]],
+    cancer_test["Class"]
+)
+glue("cancer_acc_tuned", "{:0.0f}".format(100*cancer_acc_tuned))
+glue("cancer_prec_tuned", "{:0.0f}".format(100*cancer_prec_tuned))
+glue("cancer_rec_tuned", "{:0.0f}".format(100*cancer_rec_tuned))
+glue("mean_acc_ks", "{:0.0f}".format(100*accuracies_grid["mean_test_score"].mean()))
+glue("std3_acc_ks", "{:0.0f}".format(3*100*accuracies_grid["mean_test_score"].std()))
+glue("mean_sem_acc_ks", "{:0.0f}".format(100*accuracies_grid["sem_test_score"].mean()))
+glue("n_neighbors_max", "{:0.0f}".format(accuracies_grid["n_neighbors"].max()))
+glue("n_neighbors_min", "{:0.0f}".format(accuracies_grid["n_neighbors"].min()))
+```
+
+At first glance, this is a bit surprising: the accuracy of the classifier
+has not changed much despite tuning the number of neighbors! Our first model
+with $K =$ 3 (before we knew how to tune) had an estimated accuracy of {glue:text}`cancer_acc_1`%,
+while the tuned model with $K =$ {glue:text}`best_k_unique` had an estimated accuracy
+of {glue:text}`cancer_acc_tuned`%. Upon examining {numref}`fig:06-find-k` again to see the
+cross-validation accuracy estimates for a range of neighbors, this result
+becomes much less surprising. From {glue:text}`n_neighbors_min` to around {glue:text}`n_neighbors_max` neighbors, the cross-validation
+accuracy estimate varies only by around {glue:text}`std3_acc_ks`%, with
+each estimate having a standard error around {glue:text}`mean_sem_acc_ks`%.
+Since the cross-validation accuracy estimates the test set accuracy,
+the fact that the test set accuracy also doesn't change much is expected.
+Also note that the $K =$ 3 model had a
+precision of {glue:text}`cancer_prec_1`% and recall of {glue:text}`cancer_rec_1`%,
+while the tuned model had
+a precision of {glue:text}`cancer_prec_tuned`% and recall of {glue:text}`cancer_rec_tuned`%.
+Given that the recall decreased (remember, in this application, recall
+is critical to making sure we find all the patients with malignant tumors), the tuned model may actually be *less* preferred
+in this setting. In any case, it is important to think critically about the result of tuning. Models tuned to
+maximize accuracy are not necessarily better for a given application.
+
 ## Summary
 
 Classification algorithms use one or more quantitative variables to predict the
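
> Reviewer's note: the new section relies on `GridSearchCV(..., refit=True)` (the default) automatically refitting the best model on the whole training set, so its `predict` and `score` use the tuned $K$. A minimal sketch of that behaviour, again on the built-in breast cancer data rather than the chapter's variables:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0
)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 52, 5))},
    cv=5,
)
grid.fit(X_train, y_train)  # refit=True by default: best model retrained on all of X_train

print(grid.best_params_)
print(grid.score(X_test, y_test))  # test accuracy of the tuned, refit model
print(grid.predict(X_test)[:5])    # predictions also come from the refit best model
```
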

source/regression1.md

Lines changed: 10 additions & 3 deletions
@@ -408,6 +408,13 @@ the `train_test_split` function cannot stratify based on a
 quantitative variable.
 ```
 
+```{code-cell} ipython3
+:tags: [remove-cell]
+# fix seed right before train/test split for reproducibility with next chapter
+# make sure this seed is always the same as the one used before the split in Regression 2
+np.random.seed(1)
+```
+
 ```{code-cell} ipython3
 sacramento_train, sacramento_test = train_test_split(
     sacramento, train_size=0.75
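
> Reviewer's note: the hidden cell added here pins NumPy's global seed so the split matches the one in the Regression 2 chapter (`train_test_split` draws from NumPy's global random state when no `random_state` is given). A small sketch of why that works, with illustrative data rather than the Sacramento set:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(10)})

np.random.seed(1)
split_a, _ = train_test_split(df, train_size=0.75)

np.random.seed(1)  # same seed immediately before the split...
split_b, _ = train_test_split(df, train_size=0.75)

# ...gives the identical split, so two chapters can share the same train/test sets.
print(split_a.index.equals(split_b.index))  # True
```
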
@@ -698,7 +705,7 @@ to be too small or too large, we cause the RMSPE to increase, as shown in
 
 {numref}`fig:07-howK` visualizes the effect of different settings of $K$ on the
 regression model. Each plot shows the predicted values for house sale price from
-our KNN regression model for 6 different values for $K$: 1, 3, {glue:text}`best_k_sacr`, 41, 250, and 699 (i.e., all of the training data).
+our KNN regression model for 6 different values for $K$: 1, 3, 25, {glue:text}`best_k_sacr`, 250, and 699 (i.e., all of the training data).
 For each model, we predict prices for the range of possible home sizes we
 observed in the data set (here 500 to 5,000 square feet) and we plot the
 predicted prices as a orange line.
@@ -709,8 +716,8 @@ predicted prices as a orange line.
 gridvals = [
     1,
     3,
+    25,
     best_k_sacr,
-    41,
     250,
     len(sacramento_train),
 ]
@@ -818,7 +825,7 @@ chapter.
 To assess how well our model might do at predicting on unseen data, we will
 assess its RMSPE on the test data. To do this, we first need to retrain the
 KNN regression model on the entire training data set using $K =$ {glue:text}`best_k_sacr`
-neighbors. Fortunately we do not have to do this ourselves manually; `scikit-learn`
+neighbors. As we saw in {numref}`Chapter %s <classification2>` we do not have to do this ourselves manually; `scikit-learn`
 does it for us automatically. To make predictions with the best model on the test data,
 we can use the `predict` method of the fit `GridSearchCV` object.
 We then use the `mean_squared_error`
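
> Reviewer's note: the surrounding text mentions computing RMSPE from `mean_squared_error`; a hedged sketch of that final step (the price values below are invented):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted sale prices for a few test homes.
y_true = np.array([221000, 305000, 150000])
y_pred = np.array([210000, 320000, 160000])

# RMSPE: square root of the mean squared prediction error on the test set.
rmspe = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmspe)
```
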

source/regression2.md

Lines changed: 3 additions & 3 deletions
@@ -371,7 +371,7 @@ np.random.seed(1)
 sacramento = pd.read_csv("data/sacramento.csv")
 
 sacramento_train, sacramento_test = train_test_split(
-    sacramento, train_size=0.6
+    sacramento, train_size=0.75
 )
 ```

@@ -533,8 +533,8 @@ from sklearn.preprocessing import StandardScaler
 # preprocess the data, make the pipeline
 sacr_preprocessor = make_column_transformer((StandardScaler(), ["sqft"]))
 sacr_pipeline_knn = make_pipeline(
-    sacr_preprocessor, KNeighborsRegressor(n_neighbors=25)
-) # 25 is the best parameter obtained through cross validation in regression1 chapter
+    sacr_preprocessor, KNeighborsRegressor(n_neighbors=55)
+) # 55 is the best parameter obtained through cross validation in regression1 chapter
 
 sacr_pipeline_knn.fit(sacramento_train[["sqft"]], sacramento_train[["price"]])
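
> Reviewer's note: for readers tracking the `n_neighbors=25` → `55` change, the pipeline construction itself is unchanged. A self-contained sketch of the same preprocessor-plus-KNN-regressor pattern on synthetic data (the column and variable names mirror the chapter, but the numbers are invented):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Sacramento training data.
rng = np.random.default_rng(0)
train = pd.DataFrame({"sqft": rng.integers(500, 5000, size=200)})
train["price"] = 100 * train["sqft"] + rng.normal(0, 20000, size=200)

# Standardize the predictor, then fit a K=55 nearest-neighbors regressor.
preprocessor = make_column_transformer((StandardScaler(), ["sqft"]))
pipeline_knn = make_pipeline(preprocessor, KNeighborsRegressor(n_neighbors=55))
pipeline_knn.fit(train[["sqft"]], train["price"])

print(pipeline_knn.predict(pd.DataFrame({"sqft": [1500, 3000]})))
```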
