Skip to content

Commit db68ece

k-nn uniformization
1 parent 67c15b8 commit db68ece

4 files changed, +123 -123 lines changed

source/classification1.md

Lines changed: 34 additions & 34 deletions
@@ -46,8 +46,8 @@ By the end of the chapter, readers will be able to do the following:
 - Describe what a training data set is and how it is used in classification.
 - Interpret the output of a classifier.
 - Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables.
-- Explain the $K$-nearest neighbor classification algorithm.
-- Perform $K$-nearest neighbor classification in Python using `scikit-learn`.
+- Explain the K-nearest neighbors classification algorithm.
+- Perform K-nearest neighbors classification in Python using `scikit-learn`.
 - Use methods from `scikit-learn` to center, scale, balance, and impute data as a preprocessing step.
 - Combine preprocessing and model training into a `Pipeline` using `make_pipeline`.
@@ -88,7 +88,7 @@ the classifier to make predictions on new data for which we do not know the clas

 There are many possible methods that we could use to predict
 a categorical class/label for an observation. In this book, we will
-focus on the widely used **$K$-nearest neighbors** algorithm {cite:p}`knnfix,knncover`.
+focus on the widely used **K-nearest neighbors** algorithm {cite:p}`knnfix,knncover`.
 In your future studies, you might encounter decision trees, support vector machines (SVMs),
 logistic regression, neural networks, and more; see the additional resources
 section at the end of the next chapter for where to begin learning more about
@@ -317,7 +317,7 @@ tumor images with unknown diagnoses.

 +++

-## Classification with $K$-nearest neighbors
+## Classification with K-nearest neighbors

 ```{code-cell} ipython3
 :tags: [remove-cell]
@@ -342,15 +342,15 @@ my_distances = euclidean_distances(perim_concav_with_new_point_df[attrs])[

 In order to actually make predictions for new observations in practice, we
 will need a classification algorithm.
-In this book, we will use the $K$-nearest neighbors classification algorithm.
+In this book, we will use the K-nearest neighbors classification algorithm.
 To predict the label of a new observation (here, classify it as either benign
-or malignant), the $K$-nearest neighbors classifier generally finds the $K$
+or malignant), the K-nearest neighbors classifier generally finds the $K$
 "nearest" or "most similar" observations in our training set, and then uses
 their diagnoses to make a prediction for the new observation's diagnosis. $K$
 is a number that we must choose in advance; for now, we will assume that someone has chosen
 $K$ for us. We will cover how to choose $K$ ourselves in the next chapter.

-To illustrate the concept of $K$-nearest neighbors classification, we
+To illustrate the concept of K-nearest neighbors classification, we
 will walk through an example. Suppose we have a
 new observation, with standardized perimeter
 of {glue:text}`new_point_1_0` and standardized concavity
@@ -716,7 +716,7 @@ Scatter plot of concavity versus perimeter with 5 nearest neighbors circled.
 ### More than two explanatory variables

 Although the above description is directed toward two predictor variables,
-exactly the same $K$-nearest neighbors algorithm applies when you
+exactly the same K-nearest neighbors algorithm applies when you
 have a higher number of predictor variables. Each predictor variable may give us new
 information to help create our classifier. The only difference is the formula
 for the distance between points. Suppose we have $m$ predictor
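For concreteness, the generalized straight-line distance that this hunk leads into can be sketched with `numpy` (made-up values for two observations with $m = 3$ standardized predictors; not code from the book):

```python
import numpy as np

# Two hypothetical observations, each described by m = 3 standardized predictors.
a = np.array([0.5, 2.0, -1.2])
b = np.array([1.5, 0.0, -0.2])

# Euclidean distance generalized to m predictors:
# sqrt((a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_m - b_m)^2)
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)  # equivalently, np.linalg.norm(a - b)
```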
@@ -872,30 +872,30 @@ nearest neighbors look like, for learning purposes.

 +++

-### Summary of $K$-nearest neighbors algorithm
+### Summary of K-nearest neighbors algorithm

-In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following:
+In order to classify a new observation using a K-nearest neighbors classifier, we have to do the following:

 1. Compute the distance between the new observation and each observation in the training set.
 2. Find the $K$ rows corresponding to the $K$ smallest distances.
 3. Classify the new observation based on a majority vote of the neighbor classes.

 +++

-## $K$-nearest neighbors with `scikit-learn`
+## K-nearest neighbors with `scikit-learn`

 ```{index} scikit-learn
 ```

-Coding the $K$-nearest neighbors algorithm in Python ourselves can get complicated,
+Coding the K-nearest neighbors algorithm in Python ourselves can get complicated,
 especially if we want to handle multiple classes, more than two variables,
 or predict the class for multiple new observations. Thankfully, in Python,
-the $K$-nearest neighbors algorithm is
+the K-nearest neighbors algorithm is
 implemented in [the `scikit-learn` Python package](https://scikit-learn.org/stable/index.html) {cite:p}`sklearn_api` along with
 many [other models](https://scikit-learn.org/stable/user_guide.html) that you will encounter in this and future chapters of the book. Using the functions
 in the `scikit-learn` package (named `sklearn` in Python) will help keep our code simple, readable and accurate; the
 less we have to code ourselves, the fewer mistakes we will likely make.
-Before getting started with $K$-nearest neighbors, we need to tell the `sklearn` package
+Before getting started with K-nearest neighbors, we need to tell the `sklearn` package
 that we prefer using `pandas` data frames over regular arrays via the `set_config` function.
 ```{note}
 You will notice a new way of importing functions in the code below: `from ... import ...`. This lets us
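For reference, a minimal by-hand sketch of the three-step summary in the hunk above, using a tiny made-up training set rather than the book's `cancer` data (illustration only):

```python
import numpy as np
import pandas as pd

# Tiny made-up standardized training set (illustration only).
train = pd.DataFrame({
    "Perimeter": [0.2, 1.3, -0.5, 2.0, 0.9],
    "Concavity": [0.7, 1.9, -1.0, 2.4, 1.1],
    "Class": ["Benign", "Malignant", "Benign", "Malignant", "Malignant"],
})
new_obs = {"Perimeter": 1.0, "Concavity": 1.5}

# 1. Compute the distance between the new observation and each training observation.
distances = np.sqrt(
    (train["Perimeter"] - new_obs["Perimeter"]) ** 2
    + (train["Concavity"] - new_obs["Concavity"]) ** 2
)

# 2. Find the K rows corresponding to the K smallest distances (here K = 3).
neighbors = train.loc[distances.nsmallest(3).index]

# 3. Classify by majority vote of the neighbor classes.
print(neighbors["Class"].mode()[0])  # "Malignant" for these made-up values
```

The `scikit-learn` workflow in the following hunks wraps these same three steps into a single `fit`/`predict` interface.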
@@ -913,14 +913,14 @@ from sklearn import set_config
 set_config(transform_output="pandas")
 ```

-We can now get started with $K$-nearest neighbors. The first step is to
+We can now get started with K-nearest neighbors. The first step is to
 import the `KNeighborsClassifier` from the `sklearn.neighbors` module.

 ```{code-cell} ipython3
 from sklearn.neighbors import KNeighborsClassifier
 ```

-Let's walk through how to use `KNeighborsClassifier` to perform $K$-nearest neighbors classification.
+Let's walk through how to use `KNeighborsClassifier` to perform K-nearest neighbors classification.
 We will use the `cancer` data set from above, with
 perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then
 we will use the classifier to predict the diagnosis label for a new observation with
@@ -935,7 +935,7 @@ cancer_train
 ```{index} scikit-learn; model object, scikit-learn; KNeighborsClassifier
 ```

-Next, we create a *model object* for $K$-nearest neighbors classification
+Next, we create a *model object* for K-nearest neighbors classification
 by creating a `KNeighborsClassifier` instance, specifying that we want to use $K = 5$ neighbors;
 we will discuss how to choose $K$ in the next chapter.
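The model-object step described in this hunk presumably comes down to a single constructor call; a hedged sketch using `KNeighborsClassifier`'s real `n_neighbors` parameter:

```python
from sklearn.neighbors import KNeighborsClassifier

# A model object for K-nearest neighbors classification with K = 5.
# Nothing is learned from data until fit() is called later.
knn = KNeighborsClassifier(n_neighbors=5)
print(knn)
```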
@@ -974,7 +974,7 @@ knn.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"]);

 After using the `fit` function, we can make a prediction on a new observation
 by calling `predict` on the classifier object, passing the new observation
-itself. As above, when we ran the $K$-nearest neighbors classification
+itself. As above, when we ran the K-nearest neighbors classification
 algorithm manually, the `knn` model object classifies the new observation as
 "Malignant". Note that the `predict` function outputs an `array` with the
 model's prediction; you can actually make multiple predictions at the same
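A self-contained sketch of the `fit`/`predict` pattern described here, with a tiny made-up stand-in for `cancer_train` (the real data and code are in the diff itself):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Made-up standardized training data standing in for cancer_train.
cancer_train = pd.DataFrame({
    "Perimeter": [0.2, 1.3, -0.5, 2.0, 0.9],
    "Concavity": [0.7, 1.9, -1.0, 2.4, 1.1],
    "Class": ["Benign", "Malignant", "Benign", "Malignant", "Malignant"],
})

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"])

# predict() takes a data frame of new observations with the same predictor
# columns and returns an array with one predicted label per row, so several
# predictions can be made in a single call.
new_observations = pd.DataFrame({"Perimeter": [0.2, 1.4], "Concavity": [0.5, 2.0]})
print(knn.predict(new_observations))
```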
@@ -1000,7 +1000,7 @@ learn ways to quantify how accurate we think our predictions are.
 ```{index} scaling
 ```

-When using $K$-nearest neighbor classification, the *scale* of each variable
+When using K-nearest neighbors classification, the *scale* of each variable
 (i.e., its size and range of values) matters. Since the classifier predicts
 classes by identifying observations nearest to it, any variables with
 a large scale will have a much larger effect than variables with a small
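A quick made-up illustration of the scale effect described in this hunk, using hypothetical salary (dollars) and years-of-education values rather than anything from the book's data:

```python
import numpy as np

# Two hypothetical observations: (salary in dollars, years of education).
a = np.array([65000, 12])
b = np.array([70000, 20])

# On the raw scale, the salary difference dominates: the distance is
# essentially the $5000 salary gap, while the 8-year difference in
# education barely registers.
print(np.sqrt(np.sum((a - b) ** 2)))  # ~5000.006
```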
@@ -1026,7 +1026,7 @@ degrees Celsius, the two variables would differ by a constant shift of 273
 hypothetical job classification example, we would likely see that the center of
 the salary variable is in the tens of thousands, while the center of the years
 of education variable is in the single digits. Although this doesn't affect the
-$K$-nearest neighbor classification algorithm, this large shift can change the
+K-nearest neighbors classification algorithm, this large shift can change the
 outcome of using many other predictive models.

 ```{index} standardization; K-nearest neighbors
@@ -1038,8 +1038,8 @@ set of numbers) and *standard deviation* (a number quantifying how spread out va
 For each observed value of the variable, we subtract the mean (i.e., center the variable)
 and divide by the standard deviation (i.e., scale the variable). When we do this, the data
 is said to be *standardized*, and all variables in a data set will have a mean of 0
-and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest
-neighbor algorithm, we will read in the original, unstandardized Wisconsin breast
+and a standard deviation of 1. To illustrate the effect that standardization can have on the K-nearest
+neighbors algorithm, we will read in the original, unstandardized Wisconsin breast
 cancer data set; we have been using a standardized version of the data set up
 until now. We will apply the same initial wrangling steps as we did earlier,
 and to keep things simple we will just use the `Area`, `Smoothness`, and `Class`
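The centering and scaling arithmetic described in this hunk can be sketched directly in `pandas` with a few made-up `Area` and `Smoothness` values (the book itself uses `scikit-learn`'s preprocessing tools for this):

```python
import pandas as pd

# Made-up unstandardized values standing in for Area and Smoothness.
unscaled = pd.DataFrame({
    "Area": [1000.0, 750.0, 460.0, 560.0],
    "Smoothness": [0.12, 0.11, 0.09, 0.08],
})

# Standardize each column: subtract its mean, then divide by its standard deviation.
standardized = (unscaled - unscaled.mean()) / unscaled.std()

# Each standardized column now has mean 0 and standard deviation 1.
print(standardized.mean().round(10))
print(standardized.std())
```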
@@ -1173,7 +1173,7 @@ scaled_cancer_all

 You may wonder why we are doing so much work just to center and
 scale our variables. Can't we just manually scale and center the `Area` and
-`Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well,
+`Smoothness` variables ourselves before building our K-nearest neighbors model? Well,
 technically *yes*; but doing so is error-prone. In particular, we might
 accidentally forget to apply the same centering / scaling when making
 predictions, or accidentally apply a *different* centering / scaling than what
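A hedged sketch of the less error-prone alternative the book builds toward: fit a `StandardScaler` inside a `make_column_transformer` on the training data once, so the very same means and standard deviations are reused when transforming new observations (made-up values below):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Made-up unstandardized training data.
train = pd.DataFrame({
    "Area": [1000.0, 750.0, 460.0, 560.0],
    "Smoothness": [0.12, 0.11, 0.09, 0.08],
})

# The preprocessor stores the training means and standard deviations when fit...
preprocessor = make_column_transformer(
    (StandardScaler(), ["Area", "Smoothness"]),
)
preprocessor.fit(train)

# ...so new observations are centered and scaled with exactly those values,
# with no chance of accidentally applying a different transformation.
new_obs = pd.DataFrame({"Area": [500.0], "Smoothness": [0.10]})
print(preprocessor.transform(new_obs))
```

Fitting the transformer once on the training set and reusing it is exactly the bookkeeping that is easy to get wrong when scaling by hand.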
@@ -1400,7 +1400,7 @@ Close-up of three nearest neighbors for unstandardized data.

 Another potential issue in a data set for a classifier is *class imbalance*,
 i.e., when one label is much more common than another. Since classifiers like
-the $K$-nearest neighbor algorithm use the labels of nearby points to predict
+the K-nearest neighbors algorithm use the labels of nearby points to predict
 the label of a new point, if there are many more data points with one label
 overall, the algorithm is more likely to pick that label in general (even if
 the "pattern" of data suggests otherwise). Class imbalance is actually quite a
@@ -1451,7 +1451,7 @@ rare_cancer["Class"].value_counts()

 +++

-Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification.
+Suppose we now decided to use $K = 7$ in K-nearest neighbors classification.
 With only 3 observations of malignant tumors, the classifier
 will *always predict that the tumor is benign, no matter what its concavity and perimeter
 are!* This is because in a majority vote of 7 observations, at most 3 will be
@@ -1525,7 +1525,7 @@ Imbalanced data with 7 nearest neighbors to a new observation highlighted.
 +++

 {numref}`fig:05-upsample-2` shows what happens if we set the background color of
-each area of the plot to the predictions the $K$-nearest neighbor
+each area of the plot to the predictions the K-nearest neighbors
 classifier would make. We can see that the decision is
 always "benign," corresponding to the blue color.
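The majority-vote arithmetic behind this behavior is easy to see with hypothetical neighbor labels:

```python
import pandas as pd

# With only 3 malignant observations in the entire training set, any group of
# 7 nearest neighbors contains at least 4 benign labels.
neighbor_labels = pd.Series(
    ["Malignant", "Malignant", "Malignant", "Benign", "Benign", "Benign", "Benign"]
)
print(neighbor_labels.value_counts())
print(neighbor_labels.mode()[0])  # the vote is "Benign" no matter where the point lies
```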
@@ -1610,7 +1610,7 @@ Despite the simplicity of the problem, solving it in a statistically sound manne
 fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook.
 For the present purposes, it will suffice to rebalance the data by *oversampling* the rare class.
 In other words, we will replicate rare observations multiple times in our data set to give them more
-voting power in the $K$-nearest neighbor algorithm. In order to do this, we will
+voting power in the K-nearest neighbors algorithm. In order to do this, we will
 first separate the classes out into their own data frames by filtering.
 Then, we will
 use the `sample` method on the rare class data frame to increase the number of `Malignant` observations to be the same as the number
@@ -1638,9 +1638,9 @@ upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer))
 upsampled_cancer["Class"].value_counts()
 ```

-Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data.
+Now suppose we train our K-nearest neighbors classifier with $K=7$ on this *balanced* data.
 {numref}`fig:05-upsample-plot` shows what happens now when we set the background color
-of each area of our scatter plot to the decision the $K$-nearest neighbor
+of each area of our scatter plot to the decision the K-nearest neighbors
 classifier would make. We can see that the decision is more reasonable; when the points are close
 to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.
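For reference, the oversampling recipe described in the hunks above can be sketched end to end with a made-up stand-in for `rare_cancer`; the `sample` call with `replace=True` is the key step:

```python
import pandas as pd

# Made-up imbalanced data: 6 benign and 3 malignant observations.
rare_cancer = pd.DataFrame({
    "Perimeter": [0.1, 0.5, 0.9, -0.2, 0.3, 1.1, 1.4, 1.8, 2.0],
    "Concavity": [0.2, 0.4, 0.8, -0.1, 0.2, 1.0, 1.5, 1.9, 2.2],
    "Class": ["Benign"] * 6 + ["Malignant"] * 3,
})

# Separate the classes by filtering.
malignant = rare_cancer[rare_cancer["Class"] == "Malignant"]
benign = rare_cancer[rare_cancer["Class"] == "Benign"]

# Oversample the rare class with replacement until it matches the common class.
malignant_upsampled = malignant.sample(n=len(benign), replace=True, random_state=42)

# Recombine and check that the classes are now balanced.
upsampled = pd.concat((malignant_upsampled, benign))
print(upsampled["Class"].value_counts())
```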
@@ -1738,13 +1738,13 @@ missing_cancer["Class"] = missing_cancer["Class"].replace({
 missing_cancer
 ```

-Recall that K-nearest neighbor classification makes predictions by computing
+Recall that K-nearest neighbors classification makes predictions by computing
 the straight-line distance to nearby training observations, and hence requires
 access to the values of *all* variables for *all* observations in the training
-data. So how can we perform K-nearest neighbor classification in the presence
+data. So how can we perform K-nearest neighbors classification in the presence
 of missing data? Well, since there are not too many observations with missing
 entries, one option is to simply remove those observations prior to building
-the K-nearest neighbor classifier. We can accomplish this by using the
+the K-nearest neighbors classifier. We can accomplish this by using the
 `dropna` method prior to working with the data.

 ```{code-cell} ipython3
@@ -1809,7 +1809,7 @@ unscaled_cancer["Class"] = unscaled_cancer["Class"].replace({
 })
 unscaled_cancer

-# create the KNN model
+# create the K-NN model
 knn = KNeighborsClassifier(n_neighbors=7)

 # create the centering / scaling preprocessor
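A self-contained sketch of the full pattern this code cell is assembling: the centering/scaling preprocessor and the K-NN model chained with `make_pipeline`, so that fitting and predicting always apply the same preprocessing (made-up data standing in for `unscaled_cancer`; the book's actual workflow is in the diff itself):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up unstandardized training data standing in for unscaled_cancer.
unscaled_cancer = pd.DataFrame({
    "Area": [1000.0, 750.0, 460.0, 560.0, 880.0, 520.0, 980.0],
    "Smoothness": [0.12, 0.11, 0.09, 0.08, 0.13, 0.09, 0.12],
    "Class": ["Malignant", "Malignant", "Benign", "Benign",
              "Malignant", "Benign", "Malignant"],
})

# Chain the preprocessor and the K-NN model into one object.
preprocessor = make_column_transformer((StandardScaler(), ["Area", "Smoothness"]))
knn = KNeighborsClassifier(n_neighbors=7)
knn_pipeline = make_pipeline(preprocessor, knn)

# fit() standardizes the predictors and then fits the classifier;
# predict() applies the identical standardization before classifying.
knn_pipeline.fit(unscaled_cancer[["Area", "Smoothness"]], unscaled_cancer["Class"])
new_obs = pd.DataFrame({"Area": [500.0, 1500.0], "Smoothness": [0.075, 0.1]})
print(knn_pipeline.predict(new_obs))
```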
@@ -1859,7 +1859,7 @@ prediction

 The classifier predicts that the first observation is benign, while the second is
 malignant. {numref}`fig:05-workflow-plot` visualizes the predictions that this
-trained $K$-nearest neighbor model will make on a large range of new observations.
+trained K-nearest neighbors model will make on a large range of new observations.
 Although you have seen colored prediction map visualizations like this a few times now,
 we have not included the code to generate them, as it is a little bit complicated.
 For the interested reader who wants a learning challenge, we now include it below.
