Commit 7c9a46a

Merge pull request #316 from UBC-DSCI/index-update
Index Update
2 parents c729fa8 + dc5f6b5 commit 7c9a46a

15 files changed, with 418 additions and 223 deletions.

source/_config.yml

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ html:
 extra_navbar: Powered by <a href="https://jupyterbook.org">Jupyter Book</a> # Will be displayed underneath the left navbar.
 extra_footer: "" # Will be displayed underneath the footer.
 google_analytics_id: "G-7XBFF4RSN2" # A GA id that can be used to track book views.
-home_page_in_navbar: true # Whether to include your home page in the left Navigation Bar
+home_page_in_navbar: false # Whether to include your home page in the left Navigation Bar
 baseurl: "" # The base URL where your book will be hosted. Used for creating image previews and social links. e.g.: https://mypage.com/mybook/
 comments:
 hypothesis: false

source/classification1.md

Lines changed: 34 additions & 21 deletions
@@ -144,7 +144,7 @@ In this case, the file containing the breast cancer data set is a `.csv`
 file with headers. We'll use the `read_csv` function with no additional
 arguments, and then inspect its contents:

-```{index} read function; read\_csv
+```{index} read function; read_csv
 ```

 ```{code-cell} ipython3
@@ -183,7 +183,7 @@ total set of variables per image in this data set is:

 +++

-```{index} info
+```{index} DataFrame; info
 ```

 Below we use the `info` method to preview the data frame. This method can
@@ -195,7 +195,7 @@ as well as their data types and the number of non-missing entries.
 cancer.info()
 ```

-```{index} unique
+```{index} Series; unique
 ```

 From the summary of the data above, we can see that `Class` is of type `object`.
@@ -213,7 +213,7 @@ method. The `replace` method takes one argument: a dictionary that maps
 previous values to desired new values.
 We will verify the result using the `unique` method.

-```{index} replace
+```{index} Series; replace
 ```

 ```{code-cell} ipython3
@@ -227,7 +227,7 @@ cancer["Class"].unique()

 ### Exploring the cancer data

-```{index} groupby, count
+```{index} DataFrame; groupby, Series;size
 ```

 ```{code-cell} ipython3
@@ -239,9 +239,9 @@ glue("malignant_pct", "{:0.0f}".format(100*cancer["Class"].value_counts(normaliz
 ```

 Before we start doing any modeling, let's explore our data set. Below we use
-the `groupby` and `count` methods to find the number and percentage
+the `groupby` and `size` methods to find the number and percentage
 of benign and malignant tumor observations in our data set. When paired with
-`groupby`, `count` counts the number of observations for each value of the `Class`
+`groupby`, `size` counts the number of observations for each value of the `Class`
 variable. Then we calculate the percentage in each group by dividing by the total
 number of observations and multiplying by 100.
 The total number of observations equals the number of rows in the data frame,
@@ -256,7 +256,7 @@ tumor observations.
 100 * cancer.groupby("Class").size() / cancer.shape[0]
 ```

-```{index} value_counts
+```{index} Series; value_counts
 ```

 The `pandas` package also has a more convenient specialized `value_counts` method for
@@ -621,8 +621,6 @@ glue("fig:05-multiknn-1", perim_concav_with_new_point3)
 Scatter plot of concavity versus perimeter with new observation represented as a red diamond.
 :::

-```{index} pandas.DataFrame; assign
-```

 ```{code-cell} ipython3
 new_obs_Perimeter = 0
@@ -952,7 +950,7 @@ knn = KNeighborsClassifier(n_neighbors=5)
 knn
 ```

-```{index} scikit-learn; X & y
+```{index} scikit-learn; fit, scikit-learn; predictors, scikit-learn; response
 ```

 In order to fit the model on the breast cancer data, we need to call `fit` on
@@ -1061,10 +1059,13 @@ predictors (colored by diagnosis) for both the unstandardized data we just
 loaded, and the standardized version of that same data. But first, we need to
 standardize the `unscaled_cancer` data set with `scikit-learn`.

-```{index} pipeline, scikit-learn; make_column_transformer
+```{index} see: Pipeline; scikit-learn
+```
+
+```{index} see: make_column_transformer; scikit-learn
 ```

-```{index} double: scikit-learn; pipeline
+```{index} scikit-learn;Pipeline, scikit-learn; make_column_transformer
 ```

 The `scikit-learn` framework provides a collection of *preprocessors* used to manipulate
@@ -1090,13 +1091,13 @@ preprocessor = make_column_transformer(
 preprocessor
 ```

-```{index} scikit-learn; ColumnTransformer, scikit-learn; StandardScaler, scikit-learn; fit_transform
+```{index} scikit-learn; make_column_transformer, scikit-learn; StandardScaler
 ```

-```{index} ColumnTransformer; StandardScaler
+```{index} see: StandardScaler; scikit-learn
 ```

-```{index} scikit-learn; fit, scikit-learn; transform
+```{index} scikit-learn; fit, scikit-learn; make_column_selector, scikit-learn; StandardScaler
 ```

 You can see that the preprocessor includes a single standardization step
@@ -1119,7 +1120,10 @@ preprocessor = make_column_transformer(
 preprocessor
 ```

-```{index} see: fit, transform, fit_transform; scikit-learn
+```{index} see: fit ; scikit-learn
+```
+
+```{index} scikit-learn; transform
 ```

 We are now ready to standardize the numerical predictor columns in the `unscaled_cancer` data frame.
@@ -1409,6 +1413,9 @@ detection, there are many cases in which the "important" class to identify
 (presence of disease, malicious email) is much rarer than the "unimportant"
 class (no disease, normal email).

+```{index} concat
+```
+
 To better illustrate the problem, let's revisit the scaled breast cancer data,
 `cancer`; except now we will remove many of the observations of malignant tumors, simulating
 what the data would look like if the cancer was rare. We will do this by
@@ -1603,7 +1610,7 @@ Imbalanced data with background color indicating the decision of the classifier

 +++

-```{index} oversampling, scikit-learn; sample
+```{index} oversampling, DataFrame; sample
 ```

 Despite the simplicity of the problem, solving it in a statistically sound manner is actually
@@ -1747,6 +1754,9 @@ entries, one option is to simply remove those observations prior to building
 the K-nearest neighbors classifier. We can accomplish this by using the
 `dropna` method prior to working with the data.

+```{index} missing data; dropna
+```
+
 ```{code-cell} ipython3
 no_missing_cancer = missing_cancer.dropna()
 no_missing_cancer
@@ -1758,8 +1768,11 @@ possible approach is to *impute* the missing entries, i.e., fill in synthetic
 values based on the other observations in the data set. One reasonable choice
 is to perform *mean imputation*, where missing entries are filled in using the
 mean of the present entries in each variable. To perform mean imputation, we
-use a `SimpleImputer` transformer with the default arguments, and wrap it in a
-`ColumnTransformer` to indicate which columns need imputation.
+use a `SimpleImputer` transformer with the default arguments, and use
+`make_column_transformer` to indicate which columns need imputation.
+
+```{index} scikit-learn; SimpleImputer, missing data;mean imputation
+```

 ```{code-cell} ipython3
 from sklearn.impute import SimpleImputer
@@ -1792,7 +1805,7 @@ question you are answering.
 (08:puttingittogetherworkflow)=
 ## Putting it together in a `Pipeline`

-```{index} scikit-learn; pipeline
+```{index} scikit-learn; Pipeline
 ```

 The `scikit-learn` package collection also provides the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline),
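
Note on the `count` to `size` wording change in the hunk at line 239: the two methods are not interchangeable after `groupby`, which is presumably why the prose was brought in line with the `cancer.groupby("Class").size()` call shown at line 256. The following is a minimal sketch of the difference on a hypothetical toy data frame (the `toy` frame and its values are made up for illustration and are not the breast cancer data set from the book).

```python
import pandas as pd

# Hypothetical toy data; one Perimeter entry is deliberately missing.
toy = pd.DataFrame({
    "Class": ["Benign", "Benign", "Malignant", "Benign"],
    "Perimeter": [1.0, None, 2.5, 0.3],
})

# size() gives one number per group: the row count, missing entries included.
group_sizes = toy.groupby("Class").size()

# count() gives one column per remaining variable and skips missing entries.
group_counts = toy.groupby("Class").count()

# Percentage of observations per class, mirroring the expression in the diff.
percentages = 100 * group_sizes / toy.shape[0]

print(group_sizes)
print(group_counts)
print(percentages)
```

With a missing value present, `size` and `count` disagree for the `Benign` group, so only `size` matches the description "counts the number of observations for each value of the `Class` variable".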

source/classification2.md

Lines changed: 47 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -121,6 +121,9 @@ $$\mathrm{accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\
121121
Process for splitting the data and finding the prediction accuracy.
122122
```
123123

124+
```{index} confusion matrix
125+
```
126+
124127
Accuracy is a convenient, general-purpose way to summarize the performance of a classifier with
125128
a single number. But prediction accuracy by itself does not tell the whole
126129
story. In particular, accuracy alone only tells us how often the classifier
@@ -165,6 +168,9 @@ disastrous error, since it may lead to a patient who requires treatment not rece
165168
Since we are particularly interested in identifying malignant cases, this
166169
classifier would likely be unacceptable even with an accuracy of 89%.
167170

171+
```{index} positive label, negative label, true positive, true negative, false positive, false negative
172+
```
173+
168174
Focusing more on one label than the other is
169175
common in classification problems. In such cases, we typically refer to the label we are more
170176
interested in identifying as the *positive* label, and the other as the
@@ -178,6 +184,9 @@ classifier can make, corresponding to the four entries in the confusion matrix:
178184
- **True Negative:** A benign observation that was classified as benign (bottom right in {numref}`confusion-matrix-table`).
179185
- **False Negative:** A malignant observation that was classified as benign (top right in {numref}`confusion-matrix-table`).
180186

187+
```{index} precision, recall
188+
```
189+
181190
A perfect classifier would have zero false negatives and false positives (and
182191
therefore, 100% accuracy). However, classifiers in practice will almost always
183192
make some errors. So you should think about which kinds of error are most
@@ -358,6 +367,12 @@ in `np.random.seed` will lead to different patterns of randomness, but as long a
358367
value your analysis results will be the same. In the remainder of the textbook,
359368
we will set the seed once at the beginning of each chapter.
360369

370+
```{index} RandomState
371+
```
372+
373+
```{index} see: RandomState; seed
374+
```
375+
361376
````{note}
362377
When you use `np.random.seed`, you are really setting the seed for the `numpy`
363378
package's *default random number generator*. Using the global default random
@@ -516,7 +531,7 @@ glue("cancer_train_nrow", "{:d}".format(len(cancer_train)))
516531
glue("cancer_test_nrow", "{:d}".format(len(cancer_test)))
517532
```
518533

519-
```{index} info
534+
```{index} DataFrame; info
520535
```
521536

522537
We can see from the `info` method above that the training set contains {glue:text}`cancer_train_nrow` observations,
@@ -525,7 +540,7 @@ a train / test split of 75% / 25%, as desired. Recall from {numref}`Chapter %s <
525540
that we use the `info` method to preview the number of rows, the variable names, their data types, and
526541
missing entries of a data frame.
527542

528-
```{index} groupby, count
543+
```{index} Series; value_counts
529544
```
530545

531546
We can use the `value_counts` method with the `normalize` argument set to `True`
@@ -557,7 +572,7 @@ training and test data sets.
557572

558573
+++
559574

560-
```{index} pipeline, pipeline; make_column_transformer, pipeline; StandardScaler
575+
```{index} scikit-learn; Pipeline, scikit-learn; make_column_transformer, scikit-learn; StandardScaler
561576
```
562577

563578
Fortunately, `scikit-learn` helps us handle this properly as long as we wrap our
@@ -603,7 +618,7 @@ knn_pipeline
603618

604619
### Predict the labels in the test set
605620

606-
```{index} pandas.concat
621+
```{index} scikit-learn; predict
607622
```
608623

609624
Now that we have a K-nearest neighbors classifier object, we can use it to
@@ -622,7 +637,7 @@ cancer_test[["ID", "Class", "predicted"]]
622637
(eval-performance-clasfcn2)=
623638
### Evaluate performance
624639

625-
```{index} scikit-learn; score
640+
```{index} scikit-learn; score, scikit-learn; precision_score, scikit-learn; recall_score
626641
```
627642

628643
Finally, we can assess our classifier's performance. First, we will examine accuracy.
@@ -695,6 +710,9 @@ arguments: the actual labels first, then the predicted labels second. Note that
695710
`crosstab` orders its columns alphabetically, but the positive label is still `Malignant`,
696711
even if it is not in the top left corner as in the example confusion matrix earlier in this chapter.
697712

713+
```{index} crosstab
714+
```
715+
698716
```{code-cell} ipython3
699717
pd.crosstab(
700718
cancer_test["Class"],
@@ -774,7 +792,7 @@ a recall of {glue:text}`cancer_rec_1`%.
774792
That sounds pretty good! Wait, *is* it good?
775793
Or do we need something higher?
776794

777-
```{index} accuracy; assessment
795+
```{index} accuracy;assessment, precision;assessment, recall;assessment
778796
```
779797

780798
In general, a *good* value for accuracy (as well as precision and recall, if applicable)
@@ -1026,6 +1044,12 @@ cv_5_df = pd.DataFrame(
10261044
cv_5_df
10271045
```
10281046

1047+
```{index} see: sem;standard error
1048+
```
1049+
1050+
```{index} standard error, DataFrame;agg
1051+
```
1052+
10291053
The validation scores we are interested in are contained in the `test_score` column.
10301054
We can then aggregate the *mean* and *standard error*
10311055
of the classifier's validation accuracy across the folds.
@@ -1098,6 +1122,9 @@ cv_10_metrics["test_score"]["sem"] = cv_5_metrics["test_score"]["sem"] / np.sqrt
10981122
cv_10_metrics
10991123
```
11001124

1125+
```{index} cross-validation; folds
1126+
```
1127+
11011128
In this case, using 10-fold instead of 5-fold cross validation did
11021129
reduce the standard error very slightly. In fact, due to the randomness in how the data are split, sometimes
11031130
you might even end up with a *higher* standard error when increasing the number of folds!
@@ -1153,6 +1180,11 @@ functionality, named `GridSearchCV`, to automatically handle the details for us.
11531180
Before we use `GridSearchCV`, we need to create a new pipeline
11541181
with a `KNeighborsClassifier` that has the number of neighbors left unspecified.
11551182

1183+
```{index} see: make_pipeline; scikit-learn
1184+
```
1185+
```{index} scikit-learn;make_pipeline
1186+
```
1187+
11561188
```{code-cell} ipython3
11571189
knn = KNeighborsClassifier()
11581190
cancer_tune_pipe = make_pipeline(cancer_preprocessor, knn)
@@ -1534,6 +1566,9 @@ us automatically. To make predictions and assess the estimated accuracy of the b
15341566
`score` and `predict` methods of the fit `GridSearchCV` object. We can then pass those predictions to
15351567
the `precision`, `recall`, and `crosstab` functions to assess the estimated precision and recall, and print a confusion matrix.
15361568

1569+
```{index} scikit-learn;predict, scikit-learn;score, scikit-learn;precision_score, scikit-learn;recall_score, crosstab
1570+
```
1571+
15371572
```{code-cell} ipython3
15381573
cancer_test["predicted"] = cancer_tune_grid.predict(
15391574
cancer_test[["Smoothness", "Concavity"]]
@@ -1637,7 +1672,7 @@ Overview of K-NN classification.
16371672

16381673
+++
16391674

1640-
```{index} scikit-learn, pipeline, cross-validation, K-nearest neighbors; classification, classification
1675+
```{index} scikit-learn;Pipeline, cross-validation, K-nearest neighbors; classification, classification
16411676
```
16421677

16431678
The overall workflow for performing K-nearest neighbors classification using `scikit-learn` is as follows:
@@ -1755,19 +1790,7 @@ for i in range(len(ks)):
17551790
cancer_tune_pipe = make_pipeline(cancer_preprocessor, KNeighborsClassifier())
17561791
param_grid = {
17571792
"kneighborsclassifier__n_neighbors": range(1, 21),
1758-
} ## double check: in R textbook, it is tune_grid(..., grid=20), so I guess it matches RandomizedSearchCV
1759-
## instead of GridSeachCV?
1760-
# param_grid_rand = {
1761-
# "kneighborsclassifier__n_neighbors": range(1, 100),
1762-
# }
1763-
# cancer_tune_grid = RandomizedSearchCV(
1764-
# estimator=cancer_tune_pipe,
1765-
# param_distributions=param_grid_rand,
1766-
# n_iter=20,
1767-
# cv=5,
1768-
# n_jobs=-1,
1769-
# return_train_score=True,
1770-
# )
1793+
}
17711794
cancer_tune_grid = GridSearchCV(
17721795
estimator=cancer_tune_pipe,
17731796
param_grid=param_grid,
@@ -1980,7 +2003,10 @@ where to learn more about advanced predictor selection methods.
19802003

19812004
+++
19822005

1983-
### Forward selection in `scikit-learn`
2006+
### Forward selection in Python
2007+
2008+
```{index} variable selection; implementation
2009+
```
19842010

19852011
We now turn to implementing forward selection in Python.
19862012
First we will extract a smaller set of predictors to work with in this illustrative example&mdash;`Smoothness`,
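
Note on the new `precision`, `recall` and `crosstab` index entries in this file: the chapter text they annotate reads precision and recall off a confusion matrix whose positive label is `Malignant`. As a rough illustration of what those entries point at, here is a minimal sketch that builds a confusion matrix with `pd.crosstab` and computes both metrics by hand. The labels below are hypothetical toy values, not results from the book, and the hand computation stands in for the `precision_score` and `recall_score` functions that the index entries name.

```python
import pandas as pd

# Hypothetical toy labels: 4 malignant and 6 benign observations.
actual = pd.Series(
    ["Malignant"] * 4 + ["Benign"] * 6,
    name="Class",
)
predicted = pd.Series(
    ["Malignant", "Malignant", "Malignant", "Benign",           # malignant cases
     "Malignant", "Malignant", "Benign", "Benign", "Benign", "Benign"],  # benign cases
    name="predicted",
)

# Actual labels first, predicted labels second; rows and columns are sorted
# alphabetically, so "Benign" comes before "Malignant".
confusion = pd.crosstab(actual, predicted)

# Treat "Malignant" as the positive label.
true_pos = confusion.loc["Malignant", "Malignant"]
false_pos = confusion.loc["Benign", "Malignant"]
false_neg = confusion.loc["Malignant", "Benign"]

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print(confusion)
print(precision, recall)
```

Looking entries up by label rather than by position avoids depending on which corner of the table the positive label lands in, which matters because `crosstab` orders the labels alphabetically.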
