@@ -46,8 +46,8 @@ By the end of the chapter, readers will be able to do the following:
- Describe what a training data set is and how it is used in classification.
- Interpret the output of a classifier.
- Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables.
- - Explain the $K$-nearest neighbor classification algorithm.
- - Perform $K$-nearest neighbor classification in Python using `scikit-learn`.
+ - Explain the K-nearest neighbors classification algorithm.
+ - Perform K-nearest neighbors classification in Python using `scikit-learn`.
- Use methods from `scikit-learn` to center, scale, balance, and impute data as a preprocessing step.
- Combine preprocessing and model training into a `Pipeline` using `make_pipeline`.

@@ -88,7 +88,7 @@ the classifier to make predictions on new data for which we do not know the clas

There are many possible methods that we could use to predict
a categorical class/label for an observation. In this book, we will
- focus on the widely used **$K$-nearest neighbors** algorithm {cite:p}`knnfix,knncover`.
+ focus on the widely used **K-nearest neighbors** algorithm {cite:p}`knnfix,knncover`.
In your future studies, you might encounter decision trees, support vector machines (SVMs),
logistic regression, neural networks, and more; see the additional resources
section at the end of the next chapter for where to begin learning more about
@@ -317,7 +317,7 @@ tumor images with unknown diagnoses.

+++

- ## Classification with $K$-nearest neighbors
+ ## Classification with K-nearest neighbors

```{code-cell} ipython3
:tags: [remove-cell]
@@ -342,15 +342,15 @@ my_distances = euclidean_distances(perim_concav_with_new_point_df[attrs])[

In order to actually make predictions for new observations in practice, we
will need a classification algorithm.
- In this book, we will use the $K$-nearest neighbors classification algorithm.
+ In this book, we will use the K-nearest neighbors classification algorithm.
To predict the label of a new observation (here, classify it as either benign
- or malignant), the $K$-nearest neighbors classifier generally finds the $K$
+ or malignant), the K-nearest neighbors classifier generally finds the $K$
"nearest" or "most similar" observations in our training set, and then uses
their diagnoses to make a prediction for the new observation's diagnosis. $K$
is a number that we must choose in advance; for now, we will assume that someone has chosen
$K$ for us. We will cover how to choose $K$ ourselves in the next chapter.

- To illustrate the concept of $K$-nearest neighbors classification, we
+ To illustrate the concept of K-nearest neighbors classification, we
will walk through an example. Suppose we have a
new observation, with standardized perimeter
of {glue:text}`new_point_1_0` and standardized concavity
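
To make "nearest" concrete: with two predictors, the distance in question is the ordinary straight-line (Euclidean) distance. A minimal sketch with made-up standardized values (not the chapter's glued-in numbers):

```python
import numpy as np

# Hypothetical standardized values, invented for illustration.
new_point = np.array([2.0, 4.0])     # (perimeter, concavity) of the new observation
train_point = np.array([0.5, 2.5])   # one observation from the training set

# Straight-line (Euclidean) distance with two predictors:
# sqrt((x1 - x2)^2 + (y1 - y2)^2)
distance = np.sqrt(np.sum((new_point - train_point) ** 2))
distance
```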
@@ -716,7 +716,7 @@ Scatter plot of concavity versus perimeter with 5 nearest neighbors circled.
### More than two explanatory variables

Although the above description is directed toward two predictor variables,
- exactly the same $K$-nearest neighbors algorithm applies when you
+ exactly the same K-nearest neighbors algorithm applies when you
have a higher number of predictor variables. Each predictor variable may give us new
information to help create our classifier. The only difference is the formula
for the distance between points. Suppose we have $m$ predictor
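
With $m$ predictors, the distance between observations $a$ and $b$ becomes $\sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_m - b_m)^2}$. A small sketch with three hypothetical predictor values:

```python
import numpy as np

# Two observations described by m = 3 hypothetical predictors
# (e.g., standardized perimeter, concavity, and symmetry).
a = np.array([0.4, 1.9, 0.2])
b = np.array([0.5, 2.8, -0.7])

# Euclidean distance with m predictors: square root of the sum of squared differences.
distance = np.sqrt(np.sum((a - b) ** 2))
distance
```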
@@ -872,30 +872,30 @@ nearest neighbors look like, for learning purposes.

+++

- ### Summary of $K$-nearest neighbors algorithm
+ ### Summary of K-nearest neighbors algorithm

- In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following:
+ In order to classify a new observation using a K-nearest neighbors classifier, we have to do the following:

1. Compute the distance between the new observation and each observation in the training set.
2. Find the $K$ rows corresponding to the $K$ smallest distances.
3. Classify the new observation based on a majority vote of the neighbor classes.

+++
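
These three steps translate almost directly into code. The following is only an illustrative sketch on a tiny made-up data set—the chapter itself uses `scikit-learn`, introduced next—but it mirrors the summary exactly:

```python
import numpy as np
import pandas as pd

# A tiny, made-up training set with two standardized predictors and a label.
train = pd.DataFrame({
    "Perimeter": [0.2, 0.5, 2.1, 1.9, -0.3],
    "Concavity": [0.1, 0.4, 2.5, 1.8, -0.2],
    "Class": ["Benign", "Benign", "Malignant", "Malignant", "Benign"],
})
new_obs = np.array([2.0, 2.0])
K = 3

# 1. Compute the distance between the new observation and each training observation.
diffs = train[["Perimeter", "Concavity"]].to_numpy() - new_obs
distances = np.sqrt((diffs ** 2).sum(axis=1))

# 2. Find the K rows corresponding to the K smallest distances.
nearest = train.iloc[np.argsort(distances)[:K]]

# 3. Classify the new observation based on a majority vote of the neighbor classes.
prediction = nearest["Class"].mode()[0]
prediction
```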

- ## $K$-nearest neighbors with `scikit-learn`
+ ## K-nearest neighbors with `scikit-learn`

```{index} scikit-learn
```

- Coding the $K$-nearest neighbors algorithm in Python ourselves can get complicated,
+ Coding the K-nearest neighbors algorithm in Python ourselves can get complicated,
especially if we want to handle multiple classes, more than two variables,
or predict the class for multiple new observations. Thankfully, in Python,
- the $K$-nearest neighbors algorithm is
+ the K-nearest neighbors algorithm is
implemented in [the `scikit-learn` Python package](https://scikit-learn.org/stable/index.html) {cite:p}`sklearn_api` along with
many [other models](https://scikit-learn.org/stable/user_guide.html) that you will encounter in this and future chapters of the book. Using the functions
in the `scikit-learn` package (named `sklearn` in Python) will help keep our code simple, readable and accurate; the
less we have to code ourselves, the fewer mistakes we will likely make.
- Before getting started with $K$-nearest neighbors, we need to tell the `sklearn` package
+ Before getting started with K-nearest neighbors, we need to tell the `sklearn` package
that we prefer using `pandas` data frames over regular arrays via the `set_config` function.
```{note}
You will notice a new way of importing functions in the code below: `from ... import ...`. This lets us
@@ -913,14 +913,14 @@ from sklearn import set_config
set_config(transform_output="pandas")
```

- We can now get started with $K$-nearest neighbors. The first step is to
+ We can now get started with K-nearest neighbors. The first step is to
import the `KNeighborsClassifier` from the `sklearn.neighbors` module.

```{code-cell} ipython3
from sklearn.neighbors import KNeighborsClassifier
```

- Let's walk through how to use `KNeighborsClassifier` to perform $K$-nearest neighbors classification.
+ Let's walk through how to use `KNeighborsClassifier` to perform K-nearest neighbors classification.
We will use the `cancer` data set from above, with
perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then
we will use the classifier to predict the diagnosis label for a new observation with
@@ -935,7 +935,7 @@ cancer_train
```{index} scikit-learn; model object, scikit-learn; KNeighborsClassifier
```

- Next, we create a *model object* for $K$-nearest neighbors classification
+ Next, we create a *model object* for K-nearest neighbors classification
by creating a `KNeighborsClassifier` instance, specifying that we want to use $K = 5$ neighbors;
we will discuss how to choose $K$ in the next chapter.

@@ -974,7 +974,7 @@ knn.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"]);

After using the `fit` function, we can make a prediction on a new observation
by calling `predict` on the classifier object, passing the new observation
- itself. As above, when we ran the $K$-nearest neighbors classification
+ itself. As above, when we ran the K-nearest neighbors classification
algorithm manually, the `knn` model object classifies the new observation as
"Malignant". Note that the `predict` function outputs an `array` with the
model's prediction; you can actually make multiple predictions at the same
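
Condensed into a single self-contained snippet (with a small made-up data frame standing in for `cancer_train`), the fit-then-predict workflow described here looks roughly like this:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in for cancer_train; the chapter uses the full data set.
cancer_train = pd.DataFrame({
    "Perimeter": [0.2, 0.5, 2.1, 1.9, -0.3, 1.5],
    "Concavity": [0.1, 0.4, 2.5, 1.8, -0.2, 2.0],
    "Class": ["Benign", "Benign", "Malignant", "Malignant", "Benign", "Malignant"],
})

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"])

# predict accepts a data frame of new observations and returns an array of labels.
new_obs = pd.DataFrame({"Perimeter": [2.0], "Concavity": [4.0]})
knn.predict(new_obs)
```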
@@ -1000,7 +1000,7 @@ learn ways to quantify how accurate we think our predictions are.
```{index} scaling
```

- When using $K$-nearest neighbor classification, the *scale* of each variable
+ When using K-nearest neighbors classification, the *scale* of each variable
(i.e., its size and range of values) matters. Since the classifier predicts
classes by identifying observations nearest to it, any variables with
a large scale will have a much larger effect than variables with a small
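
A quick numerical illustration of why scale matters, with two made-up observations measured in dollars and years: on the raw scale the dollar variable completely dominates the distance, while after dividing by a (made-up) standard deviation both variables contribute equally.

```python
import numpy as np

# Two hypothetical observations: (salary in dollars, years of education).
a = np.array([60_000.0, 12.0])
b = np.array([75_000.0, 15.0])

# On the raw scale, the salary difference of 15,000 swamps the education difference of 3.
raw_distance = np.sqrt(np.sum((a - b) ** 2))

# After dividing each variable by a (made-up) standard deviation,
# both differences become 1 and contribute equally to the distance.
std_devs = np.array([15_000.0, 3.0])
scaled_distance = np.sqrt(np.sum(((a - b) / std_devs) ** 2))

raw_distance, scaled_distance
```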
@@ -1026,7 +1026,7 @@ degrees Celsius, the two variables would differ by a constant shift of 273
hypothetical job classification example, we would likely see that the center of
the salary variable is in the tens of thousands, while the center of the years
of education variable is in the single digits. Although this doesn't affect the
- $K$-nearest neighbor classification algorithm, this large shift can change the
+ K-nearest neighbors classification algorithm, this large shift can change the
outcome of using many other predictive models.

```{index} standardization; K-nearest neighbors
@@ -1038,8 +1038,8 @@ set of numbers) and *standard deviation* (a number quantifying how spread out va
For each observed value of the variable, we subtract the mean (i.e., center the variable)
and divide by the standard deviation (i.e., scale the variable). When we do this, the data
is said to be *standardized*, and all variables in a data set will have a mean of 0
- and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest
- neighbor algorithm, we will read in the original, unstandardized Wisconsin breast
+ and a standard deviation of 1. To illustrate the effect that standardization can have on the K-nearest
+ neighbors algorithm, we will read in the original, unstandardized Wisconsin breast
cancer data set; we have been using a standardized version of the data set up
until now. We will apply the same initial wrangling steps as we did earlier,
and to keep things simple we will just use the `Area`, `Smoothness`, and `Class`
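
As a rough sketch of what standardization does (not the chapter's own preprocessing code, which follows), here is the by-hand version alongside `scikit-learn`'s `StandardScaler` on made-up values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up unstandardized values standing in for Area and Smoothness.
unscaled = pd.DataFrame({
    "Area": [500.0, 750.0, 1000.0, 1250.0],
    "Smoothness": [0.08, 0.10, 0.12, 0.14],
})

# By hand: subtract the mean, divide by the standard deviation.
by_hand = (unscaled - unscaled.mean()) / unscaled.std()

# StandardScaler performs the same centering and scaling
# (it divides by the population standard deviation, so its output
# differs from the by-hand version by a constant factor).
scaler = StandardScaler()
scaled = scaler.fit_transform(unscaled)

by_hand, scaled
```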
@@ -1173,7 +1173,7 @@ scaled_cancer_all

You may wonder why we are doing so much work just to center and
scale our variables. Can't we just manually scale and center the `Area` and
- `Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well,
+ `Smoothness` variables ourselves before building our K-nearest neighbors model? Well,
technically *yes*; but doing so is error-prone. In particular, we might
accidentally forget to apply the same centering / scaling when making
predictions, or accidentally apply a *different* centering / scaling than what
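
The remedy the chapter builds toward is to let a pipeline remember the centering and scaling, so that exactly the same transformation is reapplied at prediction time. A hedged sketch of that idea on a made-up data frame (the chapter's own pipeline code, with its column-selection preprocessor, appears later):

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Made-up unstandardized data standing in for the cancer data.
unscaled_cancer = pd.DataFrame({
    "Area": [500.0, 750.0, 1000.0, 1250.0, 1500.0],
    "Smoothness": [0.08, 0.10, 0.12, 0.14, 0.09],
    "Class": ["Benign", "Benign", "Malignant", "Malignant", "Benign"],
})

# The pipeline standardizes first, then fits the classifier; when we call
# predict, the *same* centering/scaling learned from the training data is
# applied to the new observations automatically.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pipeline.fit(X=unscaled_cancer[["Area", "Smoothness"]], y=unscaled_cancer["Class"])

new_obs = pd.DataFrame({"Area": [900.0], "Smoothness": [0.13]})
knn_pipeline.predict(new_obs)
```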
@@ -1400,7 +1400,7 @@ Close-up of three nearest neighbors for unstandardized data.

Another potential issue in a data set for a classifier is *class imbalance*,
i.e., when one label is much more common than another. Since classifiers like
- the $K$-nearest neighbor algorithm use the labels of nearby points to predict
+ the K-nearest neighbors algorithm use the labels of nearby points to predict
the label of a new point, if there are many more data points with one label
overall, the algorithm is more likely to pick that label in general (even if
the "pattern" of data suggests otherwise). Class imbalance is actually quite a
@@ -1451,7 +1451,7 @@ rare_cancer["Class"].value_counts()

+++

- Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification.
+ Suppose we now decided to use $K = 7$ in K-nearest neighbors classification.
With only 3 observations of malignant tumors, the classifier
will *always predict that the tumor is benign, no matter what its concavity and perimeter
are!* This is because in a majority vote of 7 observations, at most 3 will be
@@ -1525,7 +1525,7 @@ Imbalanced data with 7 nearest neighbors to a new observation highlighted.
+++

{numref}`fig:05-upsample-2` shows what happens if we set the background color of
- each area of the plot to the predictions the $K$-nearest neighbor
+ each area of the plot to the predictions the K-nearest neighbors
classifier would make. We can see that the decision is
always "benign," corresponding to the blue color.

@@ -1610,7 +1610,7 @@ Despite the simplicity of the problem, solving it in a statistically sound manne
fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook.
For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class.
In other words, we will replicate rare observations multiple times in our data set to give them more
- voting power in the $K$-nearest neighbor algorithm. In order to do this, we will
+ voting power in the K-nearest neighbors algorithm. In order to do this, we will
first separate the classes out into their own data frames by filtering.
Then, we will
use the `sample` method on the rare class data frame to increase the number of `Malignant` observations to be the same as the number
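
A hedged sketch of that rebalancing on a made-up imbalanced data frame—the variable names mirror the chapter's, but the values and the `random_state` are invented:

```python
import pandas as pd

# Made-up imbalanced data standing in for rare_cancer.
rare_cancer = pd.DataFrame({
    "Perimeter": [0.2, 0.5, -0.3, 1.0, 0.1, 2.1, 1.9],
    "Concavity": [0.1, 0.4, -0.2, 0.9, 0.3, 2.5, 1.8],
    "Class": ["Benign"] * 5 + ["Malignant"] * 2,
})

# Separate the classes, then sample the rare class *with replacement*
# until it has as many rows as the common class.
malignant_cancer = rare_cancer[rare_cancer["Class"] == "Malignant"]
benign_cancer = rare_cancer[rare_cancer["Class"] == "Benign"]
malignant_cancer_upsample = malignant_cancer.sample(
    n=len(benign_cancer), replace=True, random_state=42
)

upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer))
upsampled_cancer["Class"].value_counts()
```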
@@ -1638,9 +1638,9 @@ upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer))
upsampled_cancer["Class"].value_counts()
```

- Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data.
+ Now suppose we train our K-nearest neighbors classifier with $K=7$ on this *balanced* data.
{numref}`fig:05-upsample-plot` shows what happens now when we set the background color
- of each area of our scatter plot to the decision the $K$-nearest neighbor
+ of each area of our scatter plot to the decision the K-nearest neighbors
classifier would make. We can see that the decision is more reasonable; when the points are close
to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
closer to the benign tumor observations.
@@ -1738,13 +1738,13 @@ missing_cancer["Class"] = missing_cancer["Class"].replace({
missing_cancer
```

- Recall that K-nearest neighbor classification makes predictions by computing
+ Recall that K-nearest neighbors classification makes predictions by computing
the straight-line distance to nearby training observations, and hence requires
access to the values of *all* variables for *all* observations in the training
- data. So how can we perform K-nearest neighbor classification in the presence
+ data. So how can we perform K-nearest neighbors classification in the presence
of missing data? Well, since there are not too many observations with missing
entries, one option is to simply remove those observations prior to building
- the K-nearest neighbor classifier. We can accomplish this by using the
+ the K-nearest neighbors classifier. We can accomplish this by using the
`dropna` method prior to working with the data.
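
A hedged sketch of the two options on a small made-up data frame with missing entries: drop incomplete rows with `dropna`, or impute the missing values (for example with `sklearn`'s `SimpleImputer`) so that no observations are discarded:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with a few missing predictor values.
missing_cancer = pd.DataFrame({
    "Perimeter": [0.2, np.nan, 2.1, 1.9],
    "Concavity": [0.1, 0.4, np.nan, 1.8],
    "Class": ["Benign", "Benign", "Malignant", "Malignant"],
})

# Option 1: drop any row that has a missing entry.
no_missing_cancer = missing_cancer.dropna()

# Option 2: fill each missing value with the mean of its column,
# keeping every observation.
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(missing_cancer[["Perimeter", "Concavity"]])

no_missing_cancer, imputed
```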

```{code-cell} ipython3
@@ -1809,7 +1809,7 @@ unscaled_cancer["Class"] = unscaled_cancer["Class"].replace({
})
unscaled_cancer

- # create the KNN model
+ # create the K-NN model
knn = KNeighborsClassifier(n_neighbors=7)

# create the centering / scaling preprocessor
@@ -1859,7 +1859,7 @@ prediction

The classifier predicts that the first observation is benign, while the second is
malignant. {numref}`fig:05-workflow-plot` visualizes the predictions that this
- trained $K$-nearest neighbor model will make on a large range of new observations.
+ trained K-nearest neighbors model will make on a large range of new observations.
Although you have seen colored prediction map visualizations like this a few times now,
we have not included the code to generate them, as it is a little bit complicated.
For the interested reader who wants a learning challenge, we now include it below.
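
The chapter's own plotting code is not shown in this diff. As a rough, hypothetical sketch of the general idea (using `matplotlib` rather than the book's plotting library): predict the class at every point of a fine grid of predictor values and color the background by the predicted label.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data standing in for the cancer data set.
rng = np.random.default_rng(seed=1)
cancer_train = pd.DataFrame({
    "Area": np.concatenate([rng.normal(500, 100, 50), rng.normal(1000, 150, 50)]),
    "Smoothness": np.concatenate([rng.normal(0.09, 0.01, 50), rng.normal(0.12, 0.01, 50)]),
    "Class": ["Benign"] * 50 + ["Malignant"] * 50,
})

knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
knn_pipeline.fit(cancer_train[["Area", "Smoothness"]], cancer_train["Class"])

# Build a fine grid covering the range of both predictors and predict at every point.
area_grid = np.linspace(cancer_train["Area"].min(), cancer_train["Area"].max(), 200)
smooth_grid = np.linspace(cancer_train["Smoothness"].min(), cancer_train["Smoothness"].max(), 200)
aa, ss = np.meshgrid(area_grid, smooth_grid)
grid = pd.DataFrame({"Area": aa.ravel(), "Smoothness": ss.ravel()})
pred = knn_pipeline.predict(grid)

# Color the background by the predicted label and overlay the training points.
plt.contourf(aa, ss, (pred == "Malignant").astype(float).reshape(aa.shape),
             alpha=0.2, cmap="coolwarm")
plt.scatter(cancer_train["Area"], cancer_train["Smoothness"],
            c=(cancer_train["Class"] == "Malignant").astype(int),
            cmap="coolwarm", edgecolor="k")
plt.xlabel("Area")
plt.ylabel("Smoothness")
plt.show()
```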