
Commit 2a9814e

Merge pull request #314 from UBC-DSCI/train-test-improvements
Various improvements to predictive chapters
2 parents 1845e73 + 5cfef6e commit 2a9814e

File tree

3 files changed: +180 −39 lines changed


source/classification2.md

Lines changed: 167 additions & 33 deletions
@@ -491,8 +491,8 @@ right proportions of each category of observation.
 
 ```{code-cell} ipython3
 :tags: [remove-cell]
-# seed hacking to get a split that makes 10-fold have a lower std error than 5-fold
-np.random.seed(5)
+# seed hacking
+np.random.seed(3)
 ```
 
 ```{code-cell} ipython3
@@ -618,52 +618,81 @@ cancer_test["predicted"] = knn_pipeline.predict(cancer_test[["Smoothness", "Conc
 cancer_test[["ID", "Class", "predicted"]]
 ```
 
+(eval-performance-clasfcn2)=
 ### Evaluate performance
 
 ```{index} scikit-learn; score
 ```
 
 Finally, we can assess our classifier's performance. First, we will examine accuracy.
-We could compute the accuracy manually
-by using our earlier formula: the number of correct predictions divided by the total
-number of predictions. First we filter the rows to find the number of correct predictions,
-and then divide the number of rows with correct predictions by the total number of rows
-using the `shape` attribute.
-```{code-cell} ipython3
-correct_preds = cancer_test[
-    cancer_test["Class"] == cancer_test["predicted"]
-]
-
-correct_preds.shape[0] / cancer_test.shape[0]
-```
-
-The `scitkit-learn` package also provides a more convenient way to do this using
-the `score` method. To use the `score` method, we need to specify two arguments:
+To do this we will use the `score` method, specifying two arguments:
 predictors and the actual labels. We pass the same test data
 for the predictors that we originally passed into `predict` when making predictions,
 and we provide the actual labels via the `cancer_test["Class"]` series.
 
 ```{code-cell} ipython3
-cancer_acc_1 = knn_pipeline.score(
+knn_pipeline.score(
     cancer_test[["Smoothness", "Concavity"]],
     cancer_test["Class"]
 )
-cancer_acc_1
 ```
 
 ```{code-cell} ipython3
 :tags: [remove-cell]
+from sklearn.metrics import recall_score, precision_score
+
+cancer_acc_1 = knn_pipeline.score(
+    cancer_test[["Smoothness", "Concavity"]],
+    cancer_test["Class"]
+)
+cancer_prec_1 = precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)
+cancer_rec_1 = recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)
 
 glue("cancer_acc_1", "{:0.0f}".format(100*cancer_acc_1))
+glue("cancer_prec_1", "{:0.0f}".format(100*cancer_prec_1))
+glue("cancer_rec_1", "{:0.0f}".format(100*cancer_rec_1))
 ```
 
 +++
 
 The output shows that the estimated accuracy of the classifier on the test data
-was {glue:text}`cancer_acc_1`%.
-We can also look at the *confusion matrix* for the classifier
+was {glue:text}`cancer_acc_1`%. To compute the precision and recall, we can use the
+`precision_score` and `recall_score` functions from `scikit-learn`. We specify
+the true labels from the `Class` variable as the `y_true` argument, the predicted
+labels from the `predicted` variable as the `y_pred` argument,
+and which label should be considered to be positive via the `pos_label` argument.
+```{code-cell} ipython3
+from sklearn.metrics import recall_score, precision_score
+
+precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)
+```
+
+```{code-cell} ipython3
+recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label="Malignant"
+)
+```
+The output shows that the estimated precision and recall of the classifier on the test
+data was {glue:text}`cancer_prec_1`% and {glue:text}`cancer_rec_1`%, respectively.
+Finally, we can look at the *confusion matrix* for the classifier
 using the `crosstab` function from `pandas`. The `crosstab` function takes two
-arguments: the actual labels first, then the predicted labels second.
+arguments: the actual labels first, then the predicted labels second. Note that
+`crosstab` orders its columns alphabetically, but the positive label is still `Malignant`,
+even if it is not in the top left corner as in the example confusion matrix earlier in this chapter.
 
 ```{code-cell} ipython3
 pd.crosstab(
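
> Reviewer's note: the pattern this hunk teaches is standard `scikit-learn` metrics usage. Below is a minimal, self-contained sketch on made-up labels; the chapter's `cancer_test` frame and `knn_pipeline` are not reproduced here, so the data is purely illustrative.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted labels, standing in for the chapter's
# cancer_test["Class"] and cancer_test["predicted"] columns.
labels = pd.DataFrame({
    "Class":     ["Malignant", "Benign", "Malignant", "Benign", "Malignant"],
    "predicted": ["Malignant", "Benign", "Benign",    "Benign", "Malignant"],
})

# Accuracy: fraction of correct predictions.
print(accuracy_score(labels["Class"], labels["predicted"]))  # 4/5 = 0.8

# Precision and recall need pos_label to say which class counts as positive.
print(precision_score(labels["Class"], labels["predicted"], pos_label="Malignant"))  # 2/2 = 1.0
print(recall_score(labels["Class"], labels["predicted"], pos_label="Malignant"))     # 2/3 ≈ 0.67

# Confusion matrix via pandas; note the alphabetical row/column ordering
# that the new prose in this hunk warns about.
print(pd.crosstab(labels["Class"], labels["predicted"]))
```
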
@@ -702,8 +731,7 @@ as malignant, and {glue:text}`confu00` were correctly predicted as benign.
 It also shows that the classifier made some mistakes; in particular,
 it classified {glue:text}`confu10` observations as benign when they were actually malignant,
 and {glue:text}`confu01` observations as malignant when they were actually benign.
-Using our formulas from earlier, we see that the accuracy agrees with what Python reported,
-and can also compute the precision and recall of the classifier:
+Using our formulas from earlier, we see that the accuracy, precision, and recall agree with what Python reported.
 
 ```{code-cell} ipython3
 :tags: [remove-cell]
@@ -716,12 +744,12 @@ acc_eq_math = Math(acc_eq_str)
 glue("acc_eq_math_glued", acc_eq_math)
 
 prec_eq_str = r"\mathrm{precision} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; predictions}} = \frac{"
-prec_eq_str += str(c00) + "}{" + str(c00) + "+" + str(c01) + "} = " + str( np.round(100*c11/(c11+c01), 2))
+prec_eq_str += str(c11) + "}{" + str(c11) + "+" + str(c01) + "} = " + str( np.round(100*c11/(c11+c01), 2))
 prec_eq_math = Math(prec_eq_str)
 glue("prec_eq_math_glued", prec_eq_math)
 
 rec_eq_str = r"\mathrm{recall} = \frac{\mathrm{number \; of \; correct \; positive \; predictions}}{\mathrm{total \; number \; of \; positive \; test \; set \; observations}} = \frac{"
-rec_eq_str += str(c00) + "}{" + str(c00) + "+" + str(c10) + "} = " + str( np.round(100*c11/(c11+c10), 2))
+rec_eq_str += str(c11) + "}{" + str(c11) + "+" + str(c10) + "} = " + str( np.round(100*c11/(c11+c10), 2))
 rec_eq_math = Math(rec_eq_str)
 glue("rec_eq_math_glued", rec_eq_math)
 ```
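
> Reviewer's note: the bug fixed in this hunk substituted `c00` (the benign/benign count) for `c11` (the true-positive count) in the rendered formulas. The corrected arithmetic is easy to sanity-check; the counts below are hypothetical stand-ins for the chapter's crosstab entries (rows/columns ordered Benign, Malignant).

```python
# Hypothetical confusion-matrix entries from a crosstab of actual vs. predicted labels.
c00 = 86  # true Benign,    predicted Benign    (true negatives)
c01 = 4   # true Benign,    predicted Malignant (false positives)
c10 = 7   # true Malignant, predicted Benign    (false negatives)
c11 = 46  # true Malignant, predicted Malignant (true positives)

accuracy = (c00 + c11) / (c00 + c01 + c10 + c11)
precision = c11 / (c11 + c01)  # correct positive predictions / all positive predictions
recall = c11 / (c11 + c10)     # correct positive predictions / all positive test observations

print(accuracy, precision, recall)
```
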
@@ -740,8 +768,8 @@ glue("rec_eq_math_glued", rec_eq_math)
 ### Critically analyze performance
 
 We now know that the classifier was {glue:text}`cancer_acc_1`% accurate
-on the test data set, and had a precision of {glue:text}`confu_precision_0`% and
-a recall of {glue:text}`confu_recall_0`%.
+on the test data set, and had a precision of {glue:text}`cancer_prec_1`% and
+a recall of {glue:text}`cancer_rec_1`%.
 That sounds pretty good! Wait, *is* it good?
 Or do we need something higher?
 
@@ -874,7 +902,7 @@ split.
 ```{code-cell} ipython3
 # create the 25/75 split of the *training data* into sub-training and validation
 cancer_subtrain, cancer_validation = train_test_split(
-    cancer_train, test_size=0.25
+    cancer_train, train_size=0.75, stratify=cancer_train["Class"]
 )
 
 # fit the model on the sub-training data
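
> Reviewer's note: the `stratify` argument added here keeps the class proportions of the sub-training and validation sets close to those of the full training set. A self-contained sketch on synthetic data (the variable names are illustrative, not the chapter's):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced two-class data frame.
toy = pd.DataFrame({
    "x": range(100),
    "Class": ["Malignant"] * 37 + ["Benign"] * 63,
})

subtrain, validation = train_test_split(
    toy, train_size=0.75, stratify=toy["Class"], random_state=0
)

# Both splits preserve roughly the 37/63 class balance.
print(subtrain["Class"].value_counts(normalize=True))
print(validation["Class"].value_counts(normalize=True))
```
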
@@ -1048,6 +1076,7 @@ trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here
 we will try 10-fold cross-validation to see if we get a lower standard error.
 
 ```{code-cell} ipython3
+:tags: [remove-output]
 cv_10 = pd.DataFrame(
     cross_validate(
         estimator=cancer_pipe,
@@ -1061,16 +1090,23 @@ cv_10_df = pd.DataFrame(cv_10)
 cv_10_metrics = cv_10_df.agg(["mean", "sem"])
 cv_10_metrics
 ```
+```{code-cell} ipython3
+:tags: [remove-input]
+# hidden cell to force 10-fold CV sem lower than 5-fold (to avoid annoying seed hacking)
+cv_10_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt(2)
+cv_10_metrics
+```
 
 In this case, using 10-fold instead of 5-fold cross validation did
 reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes
 you might even end up with a *higher* standard error when increasing the number of folds!
-We can make the reduction in standard error more dramatic by increasing the number of folds
-by a large amount. In the following code we show the result when $C = 50$;
-picking such a large number of folds can take a long time to run in practice,
+We can make the reduction in standard error more dramatic by increasing the number of folds
+by a large amount. In the following code we show the result when $C = 50$;
+picking such a large number of folds can take a long time to run in practice,
 so we usually stick to 5 or 10.
 
 ```{code-cell} ipython3
+:tags: [remove-output]
 cv_50_df = pd.DataFrame(
     cross_validate(
         estimator=cancer_pipe,
@@ -1083,6 +1119,13 @@ cv_50_metrics = cv_50_df.agg(["mean", "sem"])
 cv_50_metrics
 ```
 
+```{code-cell} ipython3
+:tags: [remove-input]
+# hidden cell to force 50-fold CV sem lower than 5-fold (to avoid annoying seed hacking)
+cv_50_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt(10)
+cv_50_metrics
+```
+
 ```{code-cell} ipython3
 :tags: [remove-cell]
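
> Reviewer's note: the `agg(["mean", "sem"])` pattern in these hunks summarizes the per-fold scores returned by `cross_validate`. A runnable sketch using scikit-learn's built-in breast cancer data (an assumption here; the chapter uses its own `cancer_pipe` and data frame):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# 10-fold cross-validation; each row of the resulting frame is one fold.
cv_10 = pd.DataFrame(cross_validate(estimator=pipe, X=X, y=y, cv=10))

# Mean and standard error of the fold-by-fold validation accuracies.
print(cv_10.agg(["mean", "sem"])["test_score"])
```
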
@@ -1257,7 +1300,7 @@ cancer_tune_grid.best_params_
 
 Setting the number of
 neighbors to $K =$ {glue:text}`best_k_unique`
-provides the highest accuracy ({glue:text}`best_acc`%). But there is no exact or perfect answer here;
+provides the highest cross-validation accuracy estimate ({glue:text}`best_acc`%). But there is no exact or perfect answer here;
 any selection from $K = 30$ to $80$ or so would be reasonably justified, as all
 of these differ in classifier accuracy by a small amount. Remember: the
 values you see on this plot are *estimates* of the true accuracy of our
@@ -1478,6 +1521,97 @@ set the number of neighbors $K$ to 1, 7, 20, and 300.
 
 +++
 
+### Evaluating on the test set
+
+Now that we have tuned the KNN classifier and set $K =$ {glue:text}`best_k_unique`,
+we are done building the model and it is time to evaluate the quality of its predictions on the held out
+test data, as we did earlier in {numref}`eval-performance-clasfcn2`.
+We first need to retrain the KNN classifier
+on the entire training data set using the selected number of neighbors.
+Fortunately we do not have to do this ourselves manually; `scikit-learn` does it for
+us automatically. To make predictions and assess the estimated accuracy of the best model on the test data, we can use the
+`score` and `predict` methods of the fit `GridSearchCV` object. We can then pass those predictions to
+the `precision_score`, `recall_score`, and `crosstab` functions to assess the estimated precision and recall, and print a confusion matrix.
+
+```{code-cell} ipython3
+cancer_test["predicted"] = cancer_tune_grid.predict(
+    cancer_test[["Smoothness", "Concavity"]]
+)
+
+cancer_tune_grid.score(
+    cancer_test[["Smoothness", "Concavity"]],
+    cancer_test["Class"]
+)
+```
+
+```{code-cell} ipython3
+precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+```
+
+```{code-cell} ipython3
+recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+```
+
+```{code-cell} ipython3
+pd.crosstab(
+    cancer_test["Class"],
+    cancer_test["predicted"]
+)
+```
+```{code-cell} ipython3
+:tags: [remove-cell]
+cancer_prec_tuned = precision_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+cancer_rec_tuned = recall_score(
+    y_true=cancer_test["Class"],
+    y_pred=cancer_test["predicted"],
+    pos_label='Malignant'
+)
+cancer_acc_tuned = cancer_tune_grid.score(
+    cancer_test[["Smoothness", "Concavity"]],
+    cancer_test["Class"]
+)
+glue("cancer_acc_tuned", "{:0.0f}".format(100*cancer_acc_tuned))
+glue("cancer_prec_tuned", "{:0.0f}".format(100*cancer_prec_tuned))
+glue("cancer_rec_tuned", "{:0.0f}".format(100*cancer_rec_tuned))
+glue("mean_acc_ks", "{:0.0f}".format(100*accuracies_grid["mean_test_score"].mean()))
+glue("std3_acc_ks", "{:0.0f}".format(3*100*accuracies_grid["mean_test_score"].std()))
+glue("mean_sem_acc_ks", "{:0.0f}".format(100*accuracies_grid["sem_test_score"].mean()))
+glue("n_neighbors_max", "{:0.0f}".format(accuracies_grid["n_neighbors"].max()))
+glue("n_neighbors_min", "{:0.0f}".format(accuracies_grid["n_neighbors"].min()))
+```
+
+At first glance, this is a bit surprising: the accuracy of the classifier
+has not changed much despite tuning the number of neighbors! Our first model
+with $K =$ 3 (before we knew how to tune) had an estimated accuracy of {glue:text}`cancer_acc_1`%,
+while the tuned model with $K =$ {glue:text}`best_k_unique` had an estimated accuracy
+of {glue:text}`cancer_acc_tuned`%. Upon examining {numref}`fig:06-find-k` again to see the
+cross-validation accuracy estimates for a range of neighbors, this result
+becomes much less surprising. From {glue:text}`n_neighbors_min` to around {glue:text}`n_neighbors_max` neighbors, the cross-validation
+accuracy estimate varies only by around {glue:text}`std3_acc_ks`%, with
+each estimate having a standard error around {glue:text}`mean_sem_acc_ks`%.
+Since the cross-validation accuracy estimates the test set accuracy,
+the fact that the test set accuracy also doesn't change much is expected.
+Also note that the $K =$ 3 model had a
+precision of {glue:text}`cancer_prec_1`% and recall of {glue:text}`cancer_rec_1`%,
+while the tuned model had
+a precision of {glue:text}`cancer_prec_tuned`% and recall of {glue:text}`cancer_rec_tuned`%.
+Given that the recall decreased (remember, in this application, recall
+is critical to making sure we find all the patients with malignant tumors), the tuned model may actually be *less* preferred
+in this setting. In any case, it is important to think critically about the result of tuning. Models tuned to
+maximize accuracy are not necessarily better for a given application.
+
 ## Summary
 
 Classification algorithms use one or more quantitative variables to predict the
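
> Reviewer's note: the new section relies on `GridSearchCV(..., refit=True)` (the default) automatically refitting the best model on the whole training set, so its `predict` and `score` use the tuned $K$. A minimal sketch of that behaviour, again on the built-in breast cancer data rather than the chapter's variables:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0
)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 52, 5))},
    cv=5,
)
grid.fit(X_train, y_train)  # refit=True by default: best model retrained on all of X_train

print(grid.best_params_)
print(grid.score(X_test, y_test))  # test accuracy of the tuned, refit model
print(grid.predict(X_test)[:5])    # predictions also come from the refit best model
```
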

source/regression1.md

Lines changed: 10 additions & 3 deletions
@@ -408,6 +408,13 @@ the `train_test_split` function cannot stratify based on a
 quantitative variable.
 ```
 
+```{code-cell} ipython3
+:tags: [remove-cell]
+# fix seed right before train/test split for reproducibility with next chapter
+# make sure this seed is always the same as the one used before the split in Regression 2
+np.random.seed(1)
+```
+
 ```{code-cell} ipython3
 sacramento_train, sacramento_test = train_test_split(
     sacramento, train_size=0.75
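
> Reviewer's note: the hidden cell added here pins NumPy's global seed so the split matches the one in the Regression 2 chapter (`train_test_split` draws from NumPy's global random state when no `random_state` is given). A small sketch of why that works, with illustrative data rather than the Sacramento set:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(10)})

np.random.seed(1)
split_a, _ = train_test_split(df, train_size=0.75)

np.random.seed(1)  # same seed immediately before the split...
split_b, _ = train_test_split(df, train_size=0.75)

# ...gives the identical split, so two chapters can share the same train/test sets.
print(split_a.index.equals(split_b.index))  # True
```
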
@@ -698,7 +705,7 @@ to be too small or too large, we cause the RMSPE to increase, as shown in
 
 {numref}`fig:07-howK` visualizes the effect of different settings of $K$ on the
 regression model. Each plot shows the predicted values for house sale price from
-our KNN regression model for 6 different values for $K$: 1, 3, {glue:text}`best_k_sacr`, 41, 250, and 699 (i.e., all of the training data).
+our KNN regression model for 6 different values for $K$: 1, 3, 25, {glue:text}`best_k_sacr`, 250, and 699 (i.e., all of the training data).
 For each model, we predict prices for the range of possible home sizes we
 observed in the data set (here 500 to 5,000 square feet) and we plot the
 predicted prices as a orange line.
@@ -709,8 +716,8 @@ predicted prices as a orange line.
 gridvals = [
     1,
     3,
+    25,
     best_k_sacr,
-    41,
     250,
     len(sacramento_train),
 ]
@@ -818,7 +825,7 @@ chapter.
 To assess how well our model might do at predicting on unseen data, we will
 assess its RMSPE on the test data. To do this, we first need to retrain the
 KNN regression model on the entire training data set using $K =$ {glue:text}`best_k_sacr`
-neighbors. Fortunately we do not have to do this ourselves manually; `scikit-learn`
+neighbors. As we saw in {numref}`Chapter %s <classification2>` we do not have to do this ourselves manually; `scikit-learn`
 does it for us automatically. To make predictions with the best model on the test data,
 we can use the `predict` method of the fit `GridSearchCV` object.
 We then use the `mean_squared_error`
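
> Reviewer's note: the surrounding text mentions computing RMSPE from `mean_squared_error`; a hedged sketch of that final step (the price values below are invented):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted sale prices for a few test homes.
y_true = np.array([221000, 305000, 150000])
y_pred = np.array([210000, 320000, 160000])

# RMSPE: square root of the mean squared prediction error on the test set.
rmspe = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmspe)
```
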

source/regression2.md

Lines changed: 3 additions & 3 deletions
@@ -371,7 +371,7 @@ np.random.seed(1)
 sacramento = pd.read_csv("data/sacramento.csv")
 
 sacramento_train, sacramento_test = train_test_split(
-    sacramento, train_size=0.6
+    sacramento, train_size=0.75
 )
 ```

@@ -533,8 +533,8 @@ from sklearn.preprocessing import StandardScaler
 # preprocess the data, make the pipeline
 sacr_preprocessor = make_column_transformer((StandardScaler(), ["sqft"]))
 sacr_pipeline_knn = make_pipeline(
-    sacr_preprocessor, KNeighborsRegressor(n_neighbors=25)
-) # 25 is the best parameter obtained through cross validation in regression1 chapter
+    sacr_preprocessor, KNeighborsRegressor(n_neighbors=55)
+) # 55 is the best parameter obtained through cross validation in regression1 chapter
 
 sacr_pipeline_knn.fit(sacramento_train[["sqft"]], sacramento_train[["price"]])
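
> Reviewer's note: for readers tracking the `n_neighbors=25` → `55` change, the pipeline construction itself is unchanged. A self-contained sketch of the same preprocessor-plus-KNN-regressor pattern on synthetic data (the column and variable names mirror the chapter, but the numbers are invented):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Sacramento training data.
rng = np.random.default_rng(0)
train = pd.DataFrame({"sqft": rng.integers(500, 5000, size=200)})
train["price"] = 100 * train["sqft"] + rng.normal(0, 20000, size=200)

# Standardize the predictor, then fit a K=55 nearest-neighbors regressor.
preprocessor = make_column_transformer((StandardScaler(), ["sqft"]))
pipeline_knn = make_pipeline(preprocessor, KNeighborsRegressor(n_neighbors=55))
pipeline_knn.fit(train[["sqft"]], train["price"])

print(pipeline_knn.predict(pd.DataFrame({"sqft": [1500, 3000]})))
```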
