Commit 05c2de9

knn uniformization
1 parent: f288f4a

File tree: 4 files changed, +122 -122 lines changed


source/classification1.Rmd

Lines changed: 34 additions & 34 deletions
@@ -65,8 +65,8 @@ By the end of the chapter, readers will be able to do the following:
 - Describe what a training data set is and how it is used in classification.
 - Interpret the output of a classifier.
 - Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables.
-- Explain the $K$-nearest neighbor classification algorithm.
-- Perform $K$-nearest neighbor classification in R using `tidymodels`.
+- Explain the K-nearest neighbors classification algorithm.
+- Perform K-nearest neighbors classification in R using `tidymodels`.
 - Use a `recipe` to center, scale, balance, and impute data as a preprocessing step.
 - Combine preprocessing and model training using a `workflow`.

@@ -93,7 +93,7 @@ the classifier to make predictions on new data for which we do not know the clas

 There are many possible methods that we could use to predict
 a categorical class/label for an observation. In this book, we will
-focus on the widely used **$K$-nearest neighbors** \index{K-nearest neighbors} algorithm [@knnfix; @knncover].
+focus on the widely used **K-nearest neighbors** \index{K-nearest neighbors} algorithm [@knnfix; @knncover].
 In your future studies, you might encounter decision trees, support vector machines (SVMs),
 logistic regression, neural networks, and more; see the additional resources
 section at the end of the next chapter for where to begin learning more about
@@ -272,7 +272,7 @@ malignant. Based on our visualization, it seems like it may be possible
 to make accurate predictions of the `Class` variable (i.e., a diagnosis) for
 tumor images with unknown diagnoses.

-## Classification with $K$-nearest neighbors
+## Classification with K-nearest neighbors

 ```{r 05-knn-0, echo = FALSE}
 ## Find the distance between new point and all others in data set
@@ -306,15 +306,15 @@ neighbors <- cancer[order(my_distances$Distance), ]

 In order to actually make predictions for new observations in practice, we
 will need a classification algorithm.
-In this book, we will use the $K$-nearest neighbors \index{K-nearest neighbors!classification} classification algorithm.
+In this book, we will use the K-nearest neighbors \index{K-nearest neighbors!classification} classification algorithm.
 To predict the label of a new observation (here, classify it as either benign
-or malignant), the $K$-nearest neighbors classifier generally finds the $K$
+or malignant), the K-nearest neighbors classifier generally finds the $K$
 "nearest" or "most similar" observations in our training set, and then uses
 their diagnoses to make a prediction for the new observation's diagnosis. $K$
 is a number that we must choose in advance; for now, we will assume that someone has chosen
 $K$ for us. We will cover how to choose $K$ ourselves in the next chapter.

-To illustrate the concept of $K$-nearest neighbors classification, we
+To illustrate the concept of K-nearest neighbors classification, we
 will walk through an example. Suppose we have a
 new observation, with standardized perimeter of `r new_point[1]` and standardized concavity of `r new_point[2]`, whose
 diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in
@@ -554,7 +554,7 @@ perim_concav + annotate("path",
 ### More than two explanatory variables

 Although the above description is directed toward two predictor variables,
-exactly the same $K$-nearest neighbors algorithm applies when you
+exactly the same K-nearest neighbors algorithm applies when you
 have a higher number of predictor variables. Each predictor variable may give us new
 information to help create our classifier. The only difference is the formula
 for the distance between points. Suppose we have $m$ predictor
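(The hunk ends just before the formula itself. For reference, with $m$ predictors the straight-line (Euclidean) distance between two observations $a$ and $b$ takes the general form

$$\mathrm{Distance} = \sqrt{(a_{1} - b_{1})^2 + (a_{2} - b_{2})^2 + \cdots + (a_{m} - b_{m})^2},$$

where $a_{j}$ and $b_{j}$ denote the values of the $j$-th predictor for the two observations; the chapter's own notation may differ slightly.)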
@@ -675,22 +675,22 @@ if(!is_latex_output()){
 }
 ```

-### Summary of $K$-nearest neighbors algorithm
+### Summary of K-nearest neighbors algorithm

-In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following:
+In order to classify a new observation using a K-nearest neighbors classifier, we have to do the following:

 1. Compute the distance between the new observation and each observation in the training set.
 2. Sort the data table in ascending order according to the distances.
 3. Choose the top $K$ rows of the sorted table.
 4. Classify the new observation based on a majority vote of the neighbor classes.
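(For reference, the four steps above can be carried out directly with `tidyverse` verbs. The following is a minimal sketch rather than the chapter's own code: the `cancer` data frame and its `Perimeter`, `Concavity`, and `Class` columns come from the chapter, while the `new_point` values and `K = 5` are placeholder choices.)

```r
library(tidyverse)

new_point <- c(Perimeter = 2, Concavity = 4)  # hypothetical standardized values
K <- 5

neighbors <- cancer |>
  # 1. compute the distance from each observation to the new point
  mutate(dist_from_new = sqrt((Perimeter - new_point["Perimeter"])^2 +
                                (Concavity - new_point["Concavity"])^2)) |>
  # 2. sort in ascending order of distance
  arrange(dist_from_new) |>
  # 3. keep the top K rows
  slice(1:K)

# 4. classify by majority vote of the neighbors' classes
neighbors |>
  count(Class) |>
  slice_max(n, n = 1)
```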


-## $K$-nearest neighbors with `tidymodels`
+## K-nearest neighbors with `tidymodels`

-Coding the $K$-nearest neighbors algorithm in R ourselves can get complicated,
+Coding the K-nearest neighbors algorithm in R ourselves can get complicated,
 especially if we want to handle multiple classes, more than two variables,
 or predict the class for multiple new observations. Thankfully, in R,
-the $K$-nearest neighbors algorithm is
+the K-nearest neighbors algorithm is
 implemented in [the `parsnip` R package](https://parsnip.tidymodels.org/) [@parsnip]
 included in `tidymodels`, along with
 many [other models](https://www.tidymodels.org/find/parsnip/) \index{tidymodels}\index{parsnip}
@@ -704,7 +704,7 @@ start by loading `tidymodels`.
 library(tidymodels)
 ```

-Let's walk through how to use `tidymodels` to perform $K$-nearest neighbors classification.
+Let's walk through how to use `tidymodels` to perform K-nearest neighbors classification.
 We will use the `cancer` data set from above, with
 perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then
 we will use the classifier to predict the diagnosis label for a new observation with
@@ -717,7 +717,7 @@ cancer_train <- cancer |>
 cancer_train
 ```

-Next, we create a *model specification* for \index{tidymodels!model specification} $K$-nearest neighbors classification
+Next, we create a *model specification* for \index{tidymodels!model specification} K-nearest neighbors classification
 by calling the `nearest_neighbor` function, specifying that we want to use $K = 5$ neighbors
 (we will discuss how to choose $K$ in the next chapter) and that each neighboring point should have the same weight when voting
 (`weight_func = "rectangular"`). The `weight_func` argument controls
@@ -726,7 +726,7 @@ each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other
 which weigh each neighbor's vote differently, can be found on
 [the `parsnip` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html).
 In the `set_engine` \index{tidymodels!engine} argument, we specify which package or system will be used for training
-the model. Here `kknn` is the R package we will use for performing $K$-nearest neighbors classification.
+the model. Here `kknn` is the R package we will use for performing K-nearest neighbors classification.
 Finally, we specify that this is a classification problem with the `set_mode` function.

 ```{r 05-tidymodels-3}
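(The body of the `05-tidymodels-3` chunk is not part of this diff. Based on the surrounding prose — $K = 5$, rectangular voting weights, the `kknn` engine, classification mode — it presumably contains a specification roughly like the following sketch.)

```r
# parsnip model specification for a K-nearest neighbors classifier
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_spec
```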
@@ -766,7 +766,7 @@ hidden_print(knn_fit)

 Here you can see the final trained model summary. It confirms that the computational engine used
 to train the model was `kknn::train.kknn`. It also shows the fraction of errors made by
-the nearest neighbor model, but we will ignore this for now and discuss it in more detail
+the K-nearest neighbors model, but we will ignore this for now and discuss it in more detail
 in the next chapter.
 Finally, it shows (somewhat confusingly) that the "best" weight function
 was "rectangular" and "best" setting of $K$ was 5; but since we specified these earlier,
@@ -775,7 +775,7 @@ let R find the value of $K$ for us.

 Finally, we make the prediction on the new observation by calling the `predict` \index{tidymodels!predict} function,
 passing both the fit object we just created and the new observation itself. As above,
-when we ran the $K$-nearest neighbors
+when we ran the K-nearest neighbors
 classification algorithm manually, the `knn_fit` object classifies the new observation as
 malignant. Note that the `predict` function outputs a data frame with a single
 variable named `.pred_class`.
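(For reference, that prediction step is a single call; a minimal sketch follows, in which the values of `new_obs` are placeholders rather than the chapter's actual new observation.)

```r
# a hypothetical new observation with standardized predictor values
new_obs <- tibble(Perimeter = 0, Concavity = 1)

# returns a data frame with a single column named .pred_class
predict(knn_fit, new_data = new_obs)
```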
@@ -795,7 +795,7 @@ learn ways to quantify how accurate we think our predictions are.

 ### Centering and scaling

-When using $K$-nearest neighbor classification, the *scale* \index{scaling} of each variable
+When using K-nearest neighbors classification, the *scale* \index{scaling} of each variable
 (i.e., its size and range of values) matters. Since the classifier predicts
 classes by identifying observations nearest to it, any variables with
 a large scale will have a much larger effect than variables with a small
@@ -816,7 +816,7 @@ degrees Celsius, the two variables would differ by a constant shift of 273
 hypothetical job classification example, we would likely see that the center of
 the salary variable is in the tens of thousands, while the center of the years
 of education variable is in the single digits. Although this doesn't affect the
-$K$-nearest neighbor classification algorithm, this large shift can change the
+K-nearest neighbors classification algorithm, this large shift can change the
 outcome of using many other predictive models. \index{centering}

 To scale and center our data, we need to find
@@ -825,8 +825,8 @@ set of numbers) and *standard deviation* (a number quantifying how spread out va
 For each observed value of the variable, we subtract the mean (i.e., center the variable)
 and divide by the standard deviation (i.e., scale the variable). When we do this, the data
 is said to be *standardized*, \index{standardization!K-nearest neighbors} and all variables in a data set will have a mean of 0
-and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest
-neighbor algorithm, we will read in the original, unstandardized Wisconsin breast
+and a standard deviation of 1. To illustrate the effect that standardization can have on the K-nearest
+neighbors algorithm, we will read in the original, unstandardized Wisconsin breast
 cancer data set; we have been using a standardized version of the data set up
 until now. As before, we will convert the `Class` variable to the factor type
 and rename the values to "Malignant" and "Benign."
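(For reference, the centering and scaling described above is handled later in the chapter with a `recipe`. A minimal sketch, assuming the unscaled data frame is `unscaled_cancer` and the predictors are `Area` and `Smoothness` — names that appear in other hunks of this diff:)

```r
# specify the preprocessing: subtract the mean and divide by the standard deviation
uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# prep() estimates the means and standard deviations from the data;
# bake() applies them to produce the standardized data frame
scaled_cancer <- uc_recipe |>
  prep() |>
  bake(new_data = unscaled_cancer)
```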
@@ -918,7 +918,7 @@ It may seem redundant that we had to both `bake` *and* `prep` to scale and cente

 You may wonder why we are doing so much work just to center and
 scale our variables. Can't we just manually scale and center the `Area` and
-`Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well,
+`Smoothness` variables ourselves before building our K-nearest neighbors model? Well,
 technically *yes*; but doing so is error-prone. In particular, we might
 accidentally forget to apply the same centering / scaling when making
 predictions, or accidentally apply a *different* centering / scaling than what
@@ -1074,7 +1074,7 @@ ggplot(unscaled_cancer, aes(x = Area,

 Another potential issue in a data set for a classifier is *class imbalance*, \index{balance}\index{imbalance}
 i.e., when one label is much more common than another. Since classifiers like
-the $K$-nearest neighbor algorithm use the labels of nearby points to predict
+the K-nearest neighbors algorithm use the labels of nearby points to predict
 the label of a new point, if there are many more data points with one label
 overall, the algorithm is more likely to pick that label in general (even if
 the "pattern" of data suggests otherwise). Class imbalance is actually quite a
@@ -1121,7 +1121,7 @@ rare_plot <- rare_cancer |>
 rare_plot
 ```

-Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification.
+Suppose we now decided to use $K = 7$ in K-nearest neighbors classification.
 With only 3 observations of malignant tumors, the classifier
 will *always predict that the tumor is benign, no matter what its concavity and perimeter
 are!* This is because in a majority vote of 7 observations, at most 3 will be
@@ -1175,7 +1175,7 @@ rare_plot + geom_point(aes(x = new_point[1], y = new_point[2]),
 ```

 Figure \@ref(fig:05-upsample-2) shows what happens if we set the background color of
-each area of the plot to the predictions the $K$-nearest neighbor
+each area of the plot to the predictions the K-nearest neighbors
 classifier would make. We can see that the decision is
 always "benign," corresponding to the blue color.
@@ -1226,7 +1226,7 @@ Despite the simplicity of the problem, solving it in a statistically sound manne
 fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook.
 For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class. \index{oversampling}
 In other words, we will replicate rare observations multiple times in our data set to give them more
-voting power in the $K$-nearest neighbor algorithm. In order to do this, we will add an oversampling
+voting power in the K-nearest neighbors algorithm. In order to do this, we will add an oversampling
 step to the earlier `uc_recipe` recipe with the `step_upsample` function from the `themis` R package. \index{recipe!step\_upsample}
 We show below how to do this, and also
 use the `group_by` and `summarize` functions to see that our classes are now balanced:
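(The chunk that follows in the file is not shown in this diff. Based on the prose, an oversampling step added to a recipe looks roughly like the sketch below; `rare_cancer` and `upsampled_cancer` are names taken from the surrounding hunks, and the `step_upsample` arguments shown are assumptions.)

```r
library(themis)

# add an oversampling step so both classes end up equally represented
ups_recipe <- recipe(Class ~ ., data = rare_cancer) |>
  step_upsample(Class, over_ratio = 1, skip = FALSE) |>
  prep()

upsampled_cancer <- bake(ups_recipe, rare_cancer)

# confirm that the classes are now balanced
upsampled_cancer |>
  group_by(Class) |>
  summarize(n = n())
```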
@@ -1252,9 +1252,9 @@ upsampled_cancer |>
 summarize(n = n())
 ```

-Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data.
+Now suppose we train our K-nearest neighbors classifier with $K=7$ on this *balanced* data.
 Figure \@ref(fig:05-upsample-plot) shows what happens now when we set the background color
-of each area of our scatter plot to the decision the $K$-nearest neighbor
+of each area of our scatter plot to the decision the K-nearest neighbors
 classifier would make. We can see that the decision is more reasonable; when the points are close
 to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.
@@ -1322,13 +1322,13 @@ missing_cancer <- read_csv("data/wdbc_missing.csv") |>
 mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
 missing_cancer
 ```
-Recall that K-nearest neighbor classification makes predictions by computing
+Recall that K-nearest neighbors classification makes predictions by computing
 the straight-line distance to nearby training observations, and hence requires
 access to the values of *all* variables for *all* observations in the training
-data. So how can we perform K-nearest neighbor classification in the presence
+data. So how can we perform K-nearest neighbors classification in the presence
 of missing data? Well, since there are not too many observations with missing
 entries, one option is to simply remove those observations prior to building
-the K-nearest neighbor classifier. We can accomplish this by using the
+the K-nearest neighbors classifier. We can accomplish this by using the
 `drop_na` function from `tidyverse` prior to working with the data.

 ```{r 05-naomit}
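(The `05-naomit` chunk body is not included in the diff; based on the prose it presumably amounts to something like this sketch, where the result name is hypothetical.)

```r
# remove observations with missing entries before fitting the classifier
no_missing_cancer <- missing_cancer |>
  drop_na()

no_missing_cancer
```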
@@ -1386,7 +1386,7 @@ unscaled_cancer <- read_csv("data/wdbc_unscaled.csv") |>
 mutate(Class = as_factor(Class)) |>
 mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))

-# create the KNN model
+# create the K-NN model
 knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
 set_engine("kknn") |>
 set_mode("classification")
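(Elsewhere in this chunk — not shown in the diff — the specification above is combined with a preprocessing recipe and trained. As an illustrative sketch, assuming the recipe object is the `uc_recipe` mentioned earlier, the `workflow` pipeline would look roughly like:)

```r
knn_fit <- workflow() |>
  add_recipe(uc_recipe) |>
  add_model(knn_spec) |>
  fit(data = unscaled_cancer)

knn_fit
```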
@@ -1440,7 +1440,7 @@ prediction

 The classifier predicts that the first observation is benign, while the second is
 malignant. Figure \@ref(fig:05-workflow-plot-show) visualizes the predictions that this
-trained $K$-nearest neighbor model will make on a large range of new observations.
+trained K-nearest neighbors model will make on a large range of new observations.
 Although you have seen colored prediction map visualizations like this a few times now,
 we have not included the code to generate them, as it is a little bit complicated.
 For the interested reader who wants a learning challenge, we now include it below.
