@@ -610,7 +610,7 @@ especially if we want to handle multiple classes, more than two variables,
 or predicting the class for multiple new observations. Thankfully, in R,
 the $K$-nearest neighbors algorithm is implemented in the `parsnip` package
 included in the
-[`tidymodels` meta package](https://www.tidymodels.org/), along with
+[`tidymodels` package](https://www.tidymodels.org/), along with
 many [other models](https://www.tidymodels.org/find/parsnip/)
 that you will encounter in this and future chapters of the book. The `tidymodels` collection
 provides tools to help make and use models, such as classifiers. Using the packages
@@ -627,7 +627,7 @@ We will use the `cancer` data set from above, with
 perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then
 we will use the classifier to predict the diagnosis label for a new observation with
 perimeter 0, concavity 3.5, and an unknown diagnosis label. Let's pick out our two desired
-predictor variables and class label and store it as a new dataset named `cancer_train`:
+predictor variables and class label and store them as a new data set named `cancer_train`:

 ```{r 05-tidymodels-2}
 cancer_train <- cancer |>
@@ -655,7 +655,7 @@ knn_spec
 ```

 In order to fit the model on the breast cancer data, we need to pass the model specification
-and the dataset to the `fit` function. We also need to specify what variables to use as predictors
+and the data set to the `fit` function. We also need to specify what variables to use as predictors
 and what variable to use as the target. Below, the `Class ~ Perimeter + Concavity` argument specifies
 that `Class` is the target variable (the one we want to predict),
 and both `Perimeter` and `Concavity` are to be used as the predictors.
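As an aside, the formula notation discussed in this hunk is plain base R, so it can be explored without `tidymodels`. The sketch below (base R only, reusing the variable names from the text, with no data attached) shows how a formula object separates the target from the predictors:

```r
# A formula records which variable is the target and which are predictors.
f <- Class ~ Perimeter + Concavity

all.vars(f)  # the variables involved: "Class", "Perimeter", "Concavity"
f[[2]]       # the left-hand side of the formula, i.e., the target: Class
```

Functions like `fit` inspect the formula in exactly this way to decide which columns of the data to use.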
@@ -698,8 +698,8 @@ predict(knn_fit, new_obs)

 Is this predicted malignant label the true class for this observation?
 Well, we don't know because we do not have this
-observation's diagnosis&mdash;that is what we were trying to predict.
-In the next chapter, we will
+observation's diagnosis&mdash;that is what we were trying to predict! The
+classifier's prediction is not necessarily correct, but in the next chapter, we will
 learn ways to quantify how accurate we think our predictions are.

 ## Data preprocessing with `tidymodels`
@@ -731,7 +731,8 @@ $K$-nearest neighbor classification algorithm, this large shift can change the
 outcome of using many other predictive models.

 To scale and center our data, we need to find
-our variables' mean and *standard deviation* (a number quantifying how spread out values are).
+our variables' *mean* (the average, which quantifies the "central" value of a
+set of numbers) and *standard deviation* (a number quantifying how spread out values are).
 For each observed value of the variable, we subtract the mean (center the variable)
 and divide by the standard deviation (scale the variable). When we do this, the data
 is said to be *standardized*, and all variables in a data set will have a mean of 0
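To make the arithmetic concrete, here is a minimal base-R sketch of standardization (the area values are made up for illustration, not taken from the real `cancer` data):

```r
# Standardize a variable: subtract the mean, then divide by the
# standard deviation.
area <- c(1001, 748, 520, 1130, 575)
area_std <- (area - mean(area)) / sd(area)

mean(area_std)  # effectively 0 (up to floating-point rounding)
sd(area_std)    # exactly 1
```

This is the same computation that the `recipes` preprocessing steps described below carry out for every selected column.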
@@ -795,7 +796,7 @@ For example:
 You can find [a full set of all the steps and variable selection functions](https://tidymodels.github.io/recipes/reference/index.html)
 on the recipes home page.

-Here we have calculated the required statistics based on the data input into the
+At this point, we have calculated the required statistics based on the data input into the
 recipe, but the data are not yet scaled and centered. To actually scale and center
 the data, we need to apply the `bake` function to the unscaled data.

@@ -805,10 +806,10 @@ scaled_cancer
 ```

 It may seem redundant that we had to both `bake` *and* `prep` to scale and center the data.
-However, we do this in two steps so we could specify a different data set in the `bake` step
-if desired, say, new data you want to predict, which were not part of the training set.
+However, we do this in two steps so we can specify a different data set in the `bake` step,
+for instance, new data that were not part of the training set.

-At this point, you may wonder why we are doing so much work just to center and
+You may wonder why we are doing so much work just to center and
 scale our variables. Can't we just manually scale and center the `Area` and
 `Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well,
 technically *yes*; but doing so is error-prone. In particular, we might
@@ -951,7 +952,10 @@ ggplot(unscaled_cancer, aes(x = Area, y = Smoothness, group = Class, color = Class
     xend = unlist(neighbors[3, attrs[1]]),
     yend = unlist(neighbors[3, attrs[2]])
   ), color = "black") + theme_light() +
-  facet_zoom(xlim = c(399.7, 401.6), ylim = c(0.08, 0.14), zoom.size = 2)
+  # facet_zoom(xlim = c(399.7, 401.6), ylim = c(0.08, 0.14), zoom.size = 2) +
+  facet_zoom(x = (Area > 380 & Area < 420),
+             y = (Smoothness > 0.08 & Smoothness < 0.14), zoom.size = 2) +
+  theme_bw()
 ```

 ### Balancing
@@ -1000,7 +1004,10 @@ rare_plot
 > process, which then guarantees the same result, i.e., the same choice of 3
 > observations, each time the code is run. In general, when your code involves
 > random numbers, if you want *the same result* each time, you should use
-> `set.seed`; if you want a *different result* each time, you should not.
+> `set.seed`; if you want a *different result* each time, you should not.
+> You only need to `set.seed` once at the beginning of your analysis, so the
+> rest of the analysis uses seemingly random numbers.
+

 Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification.
 With only 3 observations of malignant tumors, the classifier
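The reproducibility behavior that the note above describes is easy to check directly in base R; this sketch (arbitrary seed value chosen for illustration) shows that the same seed yields the same draws:

```r
# Setting the same seed before each call makes the "random" draws repeat.
set.seed(2021)
first <- sample(1:100, 3)

set.seed(2021)
second <- sample(1:100, 3)

identical(first, second)  # TRUE: same seed, same result

# Continuing without resetting the seed produces fresh draws, which is
# why one set.seed call at the start of an analysis is enough.
third <- sample(1:100, 3)
```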