@@ -65,8 +65,8 @@ By the end of the chapter, readers will be able to do the following:
- Describe what a training data set is and how it is used in classification.
- Interpret the output of a classifier.
- Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables.
- - Explain the $K$-nearest neighbor classification algorithm.
- - Perform $K$-nearest neighbor classification in R using `tidymodels`.
+ - Explain the K-nearest neighbors classification algorithm.
+ - Perform K-nearest neighbors classification in R using `tidymodels`.
- Use a `recipe` to center, scale, balance, and impute data as a preprocessing step.
- Combine preprocessing and model training using a `workflow`.
@@ -93,7 +93,7 @@ the classifier to make predictions on new data for which we do not know the clas
There are many possible methods that we could use to predict
a categorical class/label for an observation. In this book, we will
- focus on the widely used **$K$-nearest neighbors** \index{K-nearest neighbors} algorithm [@knnfix; @knncover].
+ focus on the widely used **K-nearest neighbors** \index{K-nearest neighbors} algorithm [@knnfix; @knncover].
In your future studies, you might encounter decision trees, support vector machines (SVMs),
logistic regression, neural networks, and more; see the additional resources
section at the end of the next chapter for where to begin learning more about
@@ -272,7 +272,7 @@ malignant. Based on our visualization, it seems like it may be possible
to make accurate predictions of the `Class` variable (i.e., a diagnosis) for
tumor images with unknown diagnoses.

- ## Classification with $K$-nearest neighbors
+ ## Classification with K-nearest neighbors

```{r 05-knn-0, echo = FALSE}
## Find the distance between new point and all others in data set
@@ -306,15 +306,15 @@ neighbors <- cancer[order(my_distances$Distance), ]
In order to actually make predictions for new observations in practice, we
will need a classification algorithm.
- In this book, we will use the $K$-nearest neighbors \index{K-nearest neighbors!classification} classification algorithm.
+ In this book, we will use the K-nearest neighbors \index{K-nearest neighbors!classification} classification algorithm.
To predict the label of a new observation (here, classify it as either benign
- or malignant), the $K$-nearest neighbors classifier generally finds the $K$
+ or malignant), the K-nearest neighbors classifier generally finds the $K$
"nearest" or "most similar" observations in our training set, and then uses
their diagnoses to make a prediction for the new observation's diagnosis. $K$
is a number that we must choose in advance; for now, we will assume that someone has chosen
$K$ for us. We will cover how to choose $K$ ourselves in the next chapter.

- To illustrate the concept of $K$-nearest neighbors classification, we
+ To illustrate the concept of K-nearest neighbors classification, we
will walk through an example. Suppose we have a
new observation, with standardized perimeter of `r new_point[1]` and standardized concavity of `r new_point[2]`, whose
diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in
@@ -554,7 +554,7 @@ perim_concav + annotate("path",
### More than two explanatory variables

Although the above description is directed toward two predictor variables,
- exactly the same $K$-nearest neighbors algorithm applies when you
+ exactly the same K-nearest neighbors algorithm applies when you
have a higher number of predictor variables. Each predictor variable may give us new
information to help create our classifier. The only difference is the formula
for the distance between points. Suppose we have $m$ predictor
@@ -675,22 +675,22 @@ if(!is_latex_output()){
}
```

- ### Summary of $K$-nearest neighbors algorithm
+ ### Summary of K-nearest neighbors algorithm

- In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following:
+ In order to classify a new observation using a K-nearest neighbors classifier, we have to do the following:

1. Compute the distance between the new observation and each observation in the training set.
2. Sort the data table in ascending order according to the distances.
3. Choose the top $K$ rows of the sorted table.
4. Classify the new observation based on a majority vote of the neighbor classes.
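
To make these four steps concrete, the following is a minimal sketch written with `tidyverse` verbs. It is only an illustration, not the book's own implementation: the column names, the value of $K$, and the new observation's coordinates are assumptions.

```r
library(tidyverse)

# hypothetical new observation (standardized perimeter and concavity)
new_obs <- c(Perimeter = 0, Concavity = 3.5)
K <- 5

cancer |>
  # 1. compute the distance from each training observation to the new one
  mutate(dist_from_new = sqrt((Perimeter - new_obs["Perimeter"])^2 +
                                (Concavity - new_obs["Concavity"])^2)) |>
  # 2. sort by distance and 3. keep the top K rows
  slice_min(dist_from_new, n = K) |>
  # 4. take a majority vote among the K neighbors' classes
  count(Class, name = "votes") |>
  slice_max(votes, n = 1)
```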

- ## $K$-nearest neighbors with `tidymodels`
+ ## K-nearest neighbors with `tidymodels`

- Coding the $K$-nearest neighbors algorithm in R ourselves can get complicated,
+ Coding the K-nearest neighbors algorithm in R ourselves can get complicated,
especially if we want to handle multiple classes, more than two variables,
or predict the class for multiple new observations. Thankfully, in R,
- the $K$-nearest neighbors algorithm is
+ the K-nearest neighbors algorithm is
implemented in [the `parsnip` R package](https://parsnip.tidymodels.org/) [@parsnip]
included in `tidymodels`, along with
many [other models](https://www.tidymodels.org/find/parsnip/) \index{tidymodels}\index{parsnip}
@@ -704,7 +704,7 @@ start by loading `tidymodels`.
library(tidymodels)
```

- Let's walk through how to use `tidymodels` to perform $K$-nearest neighbors classification.
+ Let's walk through how to use `tidymodels` to perform K-nearest neighbors classification.
We will use the `cancer` data set from above, with
perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then
we will use the classifier to predict the diagnosis label for a new observation with
@@ -717,7 +717,7 @@ cancer_train <- cancer |>
cancer_train
```

- Next, we create a *model specification* for \index{tidymodels!model specification} $K$-nearest neighbors classification
+ Next, we create a *model specification* for \index{tidymodels!model specification} K-nearest neighbors classification
by calling the `nearest_neighbor` function, specifying that we want to use $K = 5$ neighbors
(we will discuss how to choose $K$ in the next chapter) and that each neighboring point should have the same weight when voting
(`weight_func = "rectangular"`). The `weight_func` argument controls
@@ -726,7 +726,7 @@ each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other
which weigh each neighbor's vote differently, can be found on
[the `parsnip` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html).
In the `set_engine` \index{tidymodels!engine} argument, we specify which package or system will be used for training
- the model. Here `kknn` is the R package we will use for performing $K$-nearest neighbors classification.
+ the model. Here `kknn` is the R package we will use for performing K-nearest neighbors classification.
Finally, we specify that this is a classification problem with the `set_mode` function.

```{r 05-tidymodels-3}
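# A sketch of the specification described in the prose above; treat it as an
# illustration rather than the definitive chunk contents. nearest_neighbor()
# sets the number of neighbors and the voting weights, set_engine("kknn")
# picks the kknn package, and set_mode() declares a classification problem.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_spec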
@@ -766,7 +766,7 @@ hidden_print(knn_fit)
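As a hedged sketch of the fitting step that produces the summary discussed below (the formula and the `cancer_train` data frame follow the earlier setup, but treat the exact column names as assumptions):

```r
knn_fit <- knn_spec |>
  fit(Class ~ Perimeter + Concavity, data = cancer_train)

knn_fit
```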

Here you can see the final trained model summary. It confirms that the computational engine used
to train the model was `kknn::train.kknn`. It also shows the fraction of errors made by
- the nearest neighbor model, but we will ignore this for now and discuss it in more detail
+ the K-nearest neighbors model, but we will ignore this for now and discuss it in more detail
in the next chapter.
Finally, it shows (somewhat confusingly) that the "best" weight function
was "rectangular" and "best" setting of $K$ was 5; but since we specified these earlier,
@@ -775,7 +775,7 @@ let R find the value of $K$ for us.
Finally, we make the prediction on the new observation by calling the `predict` \index{tidymodels!predict} function,
passing both the fit object we just created and the new observation itself. As above,
- when we ran the $K$-nearest neighbors
+ when we ran the K-nearest neighbors
classification algorithm manually, the `knn_fit` object classifies the new observation as
malignant. Note that the `predict` function outputs a data frame with a single
variable named `.pred_class`.
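
As a minimal sketch of that call (the tibble name and its values are assumptions; the new observation must contain the same predictor columns the model was fit on):

```r
new_obs <- tibble(Perimeter = 0, Concavity = 3.5)
predict(knn_fit, new_obs)
```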
@@ -795,7 +795,7 @@ learn ways to quantify how accurate we think our predictions are.
### Centering and scaling

- When using $K$-nearest neighbor classification, the *scale* \index{scaling} of each variable
+ When using K-nearest neighbors classification, the *scale* \index{scaling} of each variable
(i.e., its size and range of values) matters. Since the classifier predicts
classes by identifying the observations nearest to the new observation, any variables with
a large scale will have a much larger effect than variables with a small
@@ -816,7 +816,7 @@ degrees Celsius, the two variables would differ by a constant shift of 273
hypothetical job classification example, we would likely see that the center of
the salary variable is in the tens of thousands, while the center of the years
of education variable is in the single digits. Although this doesn't affect the
- $K$-nearest neighbor classification algorithm, this large shift can change the
+ K-nearest neighbors classification algorithm, this large shift can change the
outcome of using many other predictive models. \index{centering}
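
To see numerically why a variable with a large scale dominates the distance calculation, here is a small sketch with made-up salary and education values:

```r
# two hypothetical people: $10,000 apart in salary, 2 years apart in education
person_a <- c(salary = 50000, education = 10)
person_b <- c(salary = 60000, education = 12)

# the Euclidean distance is driven almost entirely by the large-scale salary variable
sqrt(sum((person_a - person_b)^2))  # about 10000
```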

To scale and center our data, we need to find
@@ -825,8 +825,8 @@ set of numbers) and *standard deviation* (a number quantifying how spread out va
For each observed value of the variable, we subtract the mean (i.e., center the variable)
and divide by the standard deviation (i.e., scale the variable). When we do this, the data
is said to be *standardized*, \index{standardization!K-nearest neighbors} and all variables in a data set will have a mean of 0
- and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest
- neighbor algorithm, we will read in the original, unstandardized Wisconsin breast
+ and a standard deviation of 1. To illustrate the effect that standardization can have on the K-nearest
+ neighbors algorithm, we will read in the original, unstandardized Wisconsin breast
cancer data set; we have been using a standardized version of the data set up
until now. As before, we will convert the `Class` variable to the factor type
and rename the values to "Malignant" and "Benign."
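
As a quick sketch of the standardization arithmetic described above, applied to a made-up numeric vector:

```r
x <- c(10, 20, 30, 40)
x_std <- (x - mean(x)) / sd(x)  # subtract the mean, divide by the standard deviation
mean(x_std)  # 0
sd(x_std)    # 1
```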
@@ -918,7 +918,7 @@ It may seem redundant that we had to both `bake` *and* `prep` to scale and cente
You may wonder why we are doing so much work just to center and
scale our variables. Can't we just manually scale and center the `Area` and
- `Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well,
+ `Smoothness` variables ourselves before building our K-nearest neighbors model? Well,
technically *yes*; but doing so is error-prone. In particular, we might
accidentally forget to apply the same centering / scaling when making
predictions, or accidentally apply a *different* centering / scaling than what
@@ -1074,7 +1074,7 @@ ggplot(unscaled_cancer, aes(x = Area,
Another potential issue in a data set for a classifier is *class imbalance*, \index{balance}\index{imbalance}
i.e., when one label is much more common than another. Since classifiers like
- the $K$-nearest neighbor algorithm use the labels of nearby points to predict
+ the K-nearest neighbors algorithm use the labels of nearby points to predict
the label of a new point, if there are many more data points with one label
overall, the algorithm is more likely to pick that label in general (even if
the "pattern" of data suggests otherwise). Class imbalance is actually quite a
@@ -1121,7 +1121,7 @@ rare_plot <- rare_cancer |>
rare_plot
```

- Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification.
+ Suppose we now decided to use $K = 7$ in K-nearest neighbors classification.
With only 3 observations of malignant tumors, the classifier
will *always predict that the tumor is benign, no matter what its concavity and perimeter
are!* This is because in a majority vote of 7 observations, at most 3 will be
@@ -1175,7 +1175,7 @@ rare_plot + geom_point(aes(x = new_point[1], y = new_point[2]),
```

Figure \@ref(fig:05-upsample-2) shows what happens if we set the background color of
- each area of the plot to the predictions the $K$-nearest neighbor
+ each area of the plot to the predictions the K-nearest neighbors
classifier would make. We can see that the decision is
always "benign," corresponding to the blue color.
@@ -1226,7 +1226,7 @@ Despite the simplicity of the problem, solving it in a statistically sound manne
fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook.
For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. \index{oversampling}
In other words, we will replicate rare observations multiple times in our data set to give them more
- voting power in the $K$-nearest neighbor algorithm. In order to do this, we will add an oversampling
+ voting power in the K-nearest neighbors algorithm. In order to do this, we will add an oversampling
step to the earlier `uc_recipe` recipe with the `step_upsample` function from the `themis` R package. \index{recipe!step\_upsample}
We show below how to do this, and also
use the `group_by` and `summarize` functions to see that our classes are now balanced:
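
A hedged sketch of that oversampling step is shown here as a standalone recipe on the imbalanced `rare_cancer` data (the recipe name `ups_recipe` is an assumption; `over_ratio = 1` requests equal class counts, and `skip = FALSE` makes sure the step is applied when the recipe is baked):

```r
library(themis)

ups_recipe <- recipe(Class ~ ., data = rare_cancer) |>
  step_upsample(Class, over_ratio = 1, skip = FALSE) |>
  prep()

upsampled_cancer <- bake(ups_recipe, rare_cancer)
```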
@@ -1252,9 +1252,9 @@ upsampled_cancer |>
  summarize(n = n())
```

- Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data.
+ Now suppose we train our K-nearest neighbors classifier with $K=7$ on this *balanced* data.
Figure \@ref(fig:05-upsample-plot) shows what happens now when we set the background color
- of each area of our scatter plot to the decision the $K$-nearest neighbor
+ of each area of our scatter plot to the decision the K-nearest neighbors
classifier would make. We can see that the decision is more reasonable; when the points are close
to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
closer to the benign tumor observations.
@@ -1322,13 +1322,13 @@ missing_cancer <- read_csv("data/wdbc_missing.csv") |>
  mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
missing_cancer
```

- Recall that K-nearest neighbor classification makes predictions by computing
+ Recall that K-nearest neighbors classification makes predictions by computing
the straight-line distance to nearby training observations, and hence requires
access to the values of *all* variables for *all* observations in the training
- data. So how can we perform K-nearest neighbor classification in the presence
+ data. So how can we perform K-nearest neighbors classification in the presence
of missing data? Well, since there are not too many observations with missing
entries, one option is to simply remove those observations prior to building
- the K-nearest neighbor classifier. We can accomplish this by using the
+ the K-nearest neighbors classifier. We can accomplish this by using the
`drop_na` function from `tidyverse` prior to working with the data.

```{r 05-naomit}
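# A sketch of the removal step described above: drop_na() (from tidyr, loaded
# with the tidyverse) keeps only the rows that have no missing entries. The
# object name no_missing_cancer is an assumption.
no_missing_cancer <- missing_cancer |>
  drop_na()
no_missing_cancer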
@@ -1386,7 +1386,7 @@ unscaled_cancer <- read_csv("data/wdbc_unscaled.csv") |>
  mutate(Class = as_factor(Class)) |>
  mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))

- # create the KNN model
+ # create the K-NN model
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
  set_engine("kknn") |>
  set_mode("classification")
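
# A sketch of how these pieces might be combined: a recipe that scales and centers
# the predictors, added to a workflow together with the model specification, and
# then fit on the training data. The recipe name and the predictor choice follow
# the earlier discussion, but treat them as assumptions.
uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_fit <- workflow() |>
  add_recipe(uc_recipe) |>
  add_model(knn_spec) |>
  fit(data = unscaled_cancer)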
@@ -1440,7 +1440,7 @@ prediction
The classifier predicts that the first observation is benign, while the second is
malignant. Figure \@ref(fig:05-workflow-plot-show) visualizes the predictions that this
- trained $K$-nearest neighbor model will make on a large range of new observations.
+ trained K-nearest neighbors model will make on a large range of new observations.
Although you have seen colored prediction map visualizations like this a few times now,
we have not included the code to generate them, as it is a little bit complicated.
For the interested reader who wants a learning challenge, we now include it below.