@@ -121,6 +121,9 @@ $$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\
Process for splitting the data and finding the prediction accuracy.
```

+ ```{index} confusion matrix
+ ```
+
Accuracy is a convenient, general-purpose way to summarize the performance of a classifier with
a single number. But prediction accuracy by itself does not tell the whole
story. In particular, accuracy alone only tells us how often the classifier
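The accuracy formula above is simple to compute directly from a column of predictions. A minimal sketch with made-up labels (the `Class` and `predicted` column names follow the chapter's cancer example; the values here are hypothetical):

```python
import pandas as pd

# Toy stand-in for the chapter's cancer_test data frame (hypothetical values).
labels = pd.DataFrame({
    "Class": ["Malignant", "Benign", "Benign", "Malignant"],
    "predicted": ["Malignant", "Benign", "Malignant", "Malignant"],
})

# accuracy = number of correct predictions / total number of predictions
accuracy = (labels["Class"] == labels["predicted"]).mean()
print(accuracy)  # 3 correct out of 4 -> 0.75
```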
@@ -165,6 +168,9 @@ disastrous error, since it may lead to a patient who requires treatment not rece
Since we are particularly interested in identifying malignant cases, this
classifier would likely be unacceptable even with an accuracy of 89%.

+ ```{index} positive label, negative label, true positive, true negative, false positive, false negative
+ ```
+
Focusing more on one label than the other is
common in classification problems. In such cases, we typically refer to the label we are more
interested in identifying as the *positive* label, and the other as the
@@ -178,6 +184,9 @@ classifier can make, corresponding to the four entries in the confusion matrix:
- **True Negative:** A benign observation that was classified as benign (bottom right in {numref}`confusion-matrix-table`).
- **False Negative:** A malignant observation that was classified as benign (top right in {numref}`confusion-matrix-table`).

+ ```{index} precision, recall
+ ```
+
A perfect classifier would have zero false negatives and false positives (and
therefore, 100% accuracy). However, classifiers in practice will almost always
make some errors. So you should think about which kinds of error are most
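As a concrete illustration of the four outcome types, one can tally them directly and derive precision and recall from the counts. A sketch with made-up labels, where `Malignant` is the positive label:

```python
import pandas as pd

# Hypothetical actual and predicted labels; "Malignant" is the positive label.
actual = pd.Series(["Malignant", "Malignant", "Benign", "Benign", "Benign"])
predicted = pd.Series(["Malignant", "Benign", "Malignant", "Benign", "Benign"])

tp = ((actual == "Malignant") & (predicted == "Malignant")).sum()  # true positives
fn = ((actual == "Malignant") & (predicted == "Benign")).sum()     # false negatives
fp = ((actual == "Benign") & (predicted == "Malignant")).sum()     # false positives
tn = ((actual == "Benign") & (predicted == "Benign")).sum()        # true negatives

precision = tp / (tp + fp)  # of predicted positives, the fraction truly positive
recall = tp / (tp + fn)     # of actual positives, the fraction we caught
print(tp, fp, fn, tn, precision, recall)
```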
@@ -358,6 +367,12 @@ in `np.random.seed` will lead to different patterns of randomness, but as long a
value your analysis results will be the same. In the remainder of the textbook,
we will set the seed once at the beginning of each chapter.

+ ```{index} RandomState
+ ```
+
+ ```{index} see: RandomState; seed
+ ```
+
````{note}
When you use `np.random.seed`, you are really setting the seed for the `numpy`
package's *default random number generator*. Using the global default random
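A short sketch of the behavior the note describes: seeding `numpy`'s global default generator makes subsequent draws reproducible, while a dedicated generator object keeps its randomness local instead of touching global state (the particular draws shown are just for illustration):

```python
import numpy as np

# Seeding the global default generator: the same seed reproduces the same draws.
np.random.seed(1)
print(np.random.random(3))
np.random.seed(1)
print(np.random.random(3))  # identical to the line above

# Alternative: a generator object with its own local seed, independent
# of the global default generator.
rng = np.random.RandomState(1)
print(rng.random_sample(3))
```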
@@ -516,7 +531,7 @@ glue("cancer_train_nrow", "{:d}".format(len(cancer_train)))
glue("cancer_test_nrow", "{:d}".format(len(cancer_test)))
```

- ```{index} info
+ ```{index} DataFrame; info
```

We can see from the `info` method above that the training set contains {glue:text}`cancer_train_nrow` observations,
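For reference, a sketch of the `info` call the text refers to, assuming the `cancer_train` data frame produced by the chapter's train/test split:

```python
# Prints the row count, column names, dtypes, and non-null counts.
cancer_train.info()
```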
@@ -525,7 +540,7 @@ a train / test split of 75% / 25%, as desired. Recall from {numref}`Chapter %s <
that we use the `info` method to preview the number of rows, the variable names, their data types, and
missing entries of a data frame.

- ```{index} groupby, count
+ ```{index} Series; value_counts
```

We can use the `value_counts` method with the `normalize` argument set to `True`
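A sketch of the proportion computation described here, assuming a `cancer_train` data frame with a `Class` column as in the chapter:

```python
# Proportion of observations in each class; the values sum to 1.
cancer_train["Class"].value_counts(normalize=True)
```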
@@ -557,7 +572,7 @@ training and test data sets.

+++

- ```{index} pipeline, pipeline; make_column_transformer, pipeline; StandardScaler
+ ```{index} scikit-learn; Pipeline, scikit-learn; make_column_transformer, scikit-learn; StandardScaler
```

Fortunately, `scikit-learn` helps us handle this properly as long as we wrap our
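A minimal sketch of the wrapping the text describes: the preprocessing (standardizing the two predictor columns used in the chapter) goes into a `Pipeline` with the model, so the scaler is learned from training data only whenever the pipeline is fit. The column names and the choice of 3 neighbors follow the chapter's example:

```python
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize only the predictor columns.
cancer_preprocessor = make_column_transformer(
    (StandardScaler(), ["Smoothness", "Concavity"]),
)

# Chaining preprocessor and model means any later fit (including on
# cross-validation training folds) re-learns the scaling from scratch.
knn_pipeline = make_pipeline(cancer_preprocessor, KNeighborsClassifier(n_neighbors=3))
```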
@@ -603,7 +618,7 @@ knn_pipeline

### Predict the labels in the test set

- ```{index} pandas.concat
+ ```{index} scikit-learn; predict
```

Now that we have a K-nearest neighbors classifier object, we can use it to
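The prediction step itself is a single method call. A sketch assuming the fitted pipeline and the train/test data frames from the chapter:

```python
# Fit on the training data, then predict a label for each test observation.
knn_pipeline.fit(cancer_train[["Smoothness", "Concavity"]], cancer_train["Class"])
cancer_test["predicted"] = knn_pipeline.predict(
    cancer_test[["Smoothness", "Concavity"]]
)
```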
@@ -622,7 +637,7 @@ cancer_test[["ID", "Class", "predicted"]]
(eval-performance-clasfcn2)=
### Evaluate performance

- ```{index} scikit-learn; score
+ ```{index} scikit-learn; score, scikit-learn; precision_score, scikit-learn; recall_score
```

Finally, we can assess our classifier's performance. First, we will examine accuracy.
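A sketch of the accuracy computation via `score`, which predicts labels for the given predictors internally and compares them against the true labels:

```python
# Mean accuracy on the test set: the fraction of test observations
# whose predicted label matches the true label.
accuracy = knn_pipeline.score(
    cancer_test[["Smoothness", "Concavity"]],
    cancer_test["Class"],
)
```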
@@ -695,6 +710,9 @@ arguments: the actual labels first, then the predicted labels second. Note that
`crosstab` orders its columns alphabetically, but the positive label is still `Malignant`,
even if it is not in the top left corner as in the example confusion matrix earlier in this chapter.

+ ```{index} crosstab
+ ```
+
```{code-cell} ipython3
pd.crosstab(
    cancer_test["Class"],
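The cell above is truncated by the diff; given the prose (actual labels first, predicted labels second), the complete call presumably looks like this sketch:

```python
import pandas as pd

# Rows: actual Class; columns: predicted Class (ordered alphabetically).
pd.crosstab(
    cancer_test["Class"],
    cancer_test["predicted"],
)
```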
@@ -774,7 +792,7 @@ a recall of {glue:text}`cancer_rec_1`%.
That sounds pretty good! Wait, *is* it good?
Or do we need something higher?

- ```{index} accuracy; assessment
+ ```{index} accuracy; assessment, precision; assessment, recall; assessment
```

In general, a *good* value for accuracy (as well as precision and recall, if applicable)
@@ -1026,6 +1044,12 @@ cv_5_df = pd.DataFrame(
cv_5_df
```

+ ```{index} see: sem; standard error
+ ```
+
+ ```{index} standard error, DataFrame; agg
+ ```
+
The validation scores we are interested in are contained in the `test_score` column.
We can then aggregate the *mean* and *standard error*
of the classifier's validation accuracy across the folds.
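A sketch of that aggregation, assuming `cv_5_df` holds the output of scikit-learn's `cross_validate` converted to a data frame as in the cell above:

```python
# Mean and standard error of each numeric column across the 5 folds;
# "sem" is pandas' built-in standard-error aggregation.
cv_5_metrics = cv_5_df.agg(["mean", "sem"])
cv_5_metrics["test_score"]
```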
@@ -1098,6 +1122,9 @@ cv_10_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt
cv_10_metrics
```

+ ```{index} cross-validation; folds
+ ```
+
In this case, using 10-fold instead of 5-fold cross-validation did
reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes
you might even end up with a *higher* standard error when increasing the number of folds!
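For comparison, the number of folds is controlled by the `cv` argument of `cross_validate`. A sketch assuming the chapter's pipeline and training data:

```python
import pandas as pd
from sklearn.model_selection import cross_validate

# Same setup as the 5-fold run, but with 10 folds.
cv_10 = cross_validate(
    estimator=knn_pipeline,
    X=cancer_train[["Smoothness", "Concavity"]],
    y=cancer_train["Class"],
    cv=10,
)
pd.DataFrame(cv_10).agg(["mean", "sem"])
```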
@@ -1153,6 +1180,11 @@ functionality, named `GridSearchCV`, to automatically handle the details for us.
Before we use `GridSearchCV`, we need to create a new pipeline
with a `KNeighborsClassifier` that has the number of neighbors left unspecified.

+ ```{index} see: make_pipeline; scikit-learn
+ ```
+ ```{index} scikit-learn; make_pipeline
+ ```
+
```{code-cell} ipython3
knn = KNeighborsClassifier()
cancer_tune_pipe = make_pipeline(cancer_preprocessor, knn)
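A sketch of how this pipeline then feeds into `GridSearchCV`: the parameter key combines the pipeline step name and the estimator argument, and the grid values are the neighbor counts to try (the particular range and fold count here are illustrative):

```python
from sklearn.model_selection import GridSearchCV

# Keys follow scikit-learn's "<step name>__<parameter>" convention.
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 5),
}
cancer_tune_grid = GridSearchCV(
    estimator=cancer_tune_pipe,
    param_grid=parameter_grid,
    cv=10,  # cross-validation folds evaluated per candidate value
)
cancer_tune_grid.fit(
    cancer_train[["Smoothness", "Concavity"]],
    cancer_train["Class"],
)
```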
@@ -1534,6 +1566,9 @@ us automatically. To make predictions and assess the estimated accuracy of the b
`score` and `predict` methods of the fitted `GridSearchCV` object. We can then pass those predictions to
the `precision_score`, `recall_score`, and `crosstab` functions to assess the estimated precision and recall, and print a confusion matrix.

+ ```{index} scikit-learn; predict, scikit-learn; score, scikit-learn; precision_score, scikit-learn; recall_score, crosstab
+ ```
+
```{code-cell} ipython3
cancer_test["predicted"] = cancer_tune_grid.predict(
    cancer_test[["Smoothness", "Concavity"]]
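A sketch of the full assessment described above, assuming the fitted `cancer_tune_grid` and the chapter's test set; `pos_label` marks `Malignant` as the positive label:

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Predict with the best estimator found by the grid search.
cancer_test["predicted"] = cancer_tune_grid.predict(
    cancer_test[["Smoothness", "Concavity"]]
)
accuracy = cancer_tune_grid.score(
    cancer_test[["Smoothness", "Concavity"]],
    cancer_test["Class"],
)
precision = precision_score(
    cancer_test["Class"], cancer_test["predicted"], pos_label="Malignant"
)
recall = recall_score(
    cancer_test["Class"], cancer_test["predicted"], pos_label="Malignant"
)
confusion = pd.crosstab(cancer_test["Class"], cancer_test["predicted"])
print(accuracy, precision, recall)
print(confusion)
```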
@@ -1637,7 +1672,7 @@ Overview of K-NN classification.

+++

- ```{index} scikit-learn, pipeline, cross-validation, K-nearest neighbors; classification, classification
+ ```{index} scikit-learn; Pipeline, cross-validation, K-nearest neighbors; classification, classification
```

The overall workflow for performing K-nearest neighbors classification using `scikit-learn` is as follows:
@@ -1755,19 +1790,7 @@ for i in range(len(ks)):
cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
param_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 21),
- } ## double check: in R textbook, it is tune_grid(..., grid=20), so I guess it matches RandomizedSearchCV
- ## instead of GridSeachCV?
- # param_grid_rand = {
- #     "kneighborsclassifier__n_neighbors": range(1, 100),
- # }
- # cancer_tune_grid = RandomizedSearchCV(
- #     estimator=cancer_tune_pipe,
- #     param_distributions=param_grid_rand,
- #     n_iter=20,
- #     cv=5,
- #     n_jobs=-1,
- #     return_train_score=True,
- # )
+ }
cancer_tune_grid = GridSearchCV(
    estimator=cancer_tune_pipe,
    param_grid=param_grid,
@@ -1980,7 +2003,10 @@ where to learn more about advanced predictor selection methods.

+++

- ### Forward selection in `scikit-learn`
+ ### Forward selection in Python
+
+ ```{index} variable selection; implementation
+ ```

We now turn to implementing forward selection in Python.
First we will extract a smaller set of predictors to work with in this illustrative example&mdash;`Smoothness`,
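The chapter goes on to implement forward selection manually; as a point of comparison only (not the chapter's implementation), scikit-learn ships a built-in greedy selector. A sketch, where `cancer_train_X` and `cancer_train_y` are hypothetical stand-ins for the training predictors and labels:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Greedy forward selection: repeatedly add the predictor that most
# improves the cross-validated score of the K-NN classifier.
selector = SequentialFeatureSelector(
    KNeighborsClassifier(),
    n_features_to_select=3,  # illustrative stopping point
    direction="forward",
)
selector.fit(cancer_train_X, cancer_train_y)  # hypothetical predictor/label objects
print(selector.get_support())  # boolean mask of the selected predictors
```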