@@ -25,7 +25,7 @@ predictions from our classifier are, as well as how to improve our classifier
- Perform K-nearest neighbour classification in R using `tidymodels`
- Explain why one should center, scale, and balance data in predictive modelling
- Preprocess data to center, scale, and balance a dataset using a `recipe`
- - Combine preprocessing and model training using a `workflow`
+ - Combine preprocessing and model training using a `tidymodels` `workflow`

## The classification problem
@@ -483,7 +483,7 @@ many [other models](https://www.tidymodels.org/find/parsnip/)
that you will encounter in this and future classes. The `tidymodels` collection
provides tools to help make and use models, such as classifiers. Using the packages
in this collection will help keep our code simple, readable and accurate; the
- less we have to code ourselves, the less mistakes we are likely to make. We
+ less we have to code ourselves, the fewer mistakes we are likely to make. We
start off by loading `tidymodels`:

```{r 05-tidymodels}
@@ -504,7 +504,12 @@ head(cancer_train)

Next, we create a *model specification* for K-nearest neighbours classification
by calling the `nearest_neighbor` function, specifying that we want to use $K = 5$ neighbours
- (we will discuss how to choose $K$ in the next chapter) and the straight-line distance (`weight_func = "rectangular"`).
+ (we will discuss how to choose $K$ in the next chapter) and the straight-line
+ distance (`weight_func = "rectangular"`). The `weight_func` argument controls
+ how neighbours vote when classifying a new observation; by setting it to `"rectangular"`,
+ each of the $K$ nearest neighbours gets exactly 1 vote as described above. Other choices,
+ which weight each neighbour's vote differently, can be found on
+ [the tidymodels website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html).
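+ For instance, a specification where closer neighbours get larger votes could
+ swap in the "triangular" kernel; this is an illustrative sketch, not part of
+ the original text:
+
+ ```{r 05-knn-weighted}
+ # Hypothetical alternative: weight votes by distance instead of equally
+ nearest_neighbor(weight_func = "triangular", neighbors = 5)
+ ```
+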
We specify the particular computational
engine (in this case, the `kknn` engine) for training the model with the `set_engine` function.
Finally, we specify that this is a classification problem with the `set_mode` function.
@@ -513,6 +518,7 @@ Finally we specify that this is a classification problem with the `set_mode` fun
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")
+ knn_spec
```

In order to fit the model on the breast cancer data, we need to pass the model specification
@@ -526,6 +532,15 @@ knn_fit <- knn_spec %>%
  fit(Class ~ ., data = cancer_train)
knn_fit
```
+ Here you can see the final trained model summary. It confirms that the computational engine used
+ to train the model was `kknn::train.kknn`. It also shows the fraction of errors made by
+ the nearest neighbour model, but we will ignore this for now and discuss it in more detail
+ in the next chapter.
+ Finally, it shows (somewhat confusingly) that the "best" weight function
+ was "rectangular" and the "best" setting of $K$ was 5; but since we specified these earlier,
+ R is just repeating those settings to us here. In the next chapter, we will actually
+ let R tune the model for us.
+
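+ If you want to inspect the underlying engine object yourself, one option is
+ sketched below (assuming your installed version of `parsnip` provides the
+ `extract_fit_engine` function; the chunk name is just illustrative):
+
+ ```{r 05-inspect-engine}
+ # Pull out the raw kknn::train.kknn object that parsnip wraps
+ knn_fit %>% extract_fit_engine()
+ ```
+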
Finally, we make the prediction on the new observation by calling the `predict` function,
passing the fit object we just created. As above when we ran the K-nearest neighbours
classification algorithm manually, the `knn_fit` object classifies the new observation as
@@ -623,8 +638,7 @@ For example:

You can find [a full set of all the steps and variable selection functions](https://tidymodels.github.io/recipes/reference/index.html)
on the recipes home page.
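For instance, here is a sketch (not part of the original text) of how a
selector function can apply the same step to every predictor at once,
assuming the `unscaled_cancer` dataframe from above:

```{r 05-selector-sketch}
# Hypothetical recipe: center and scale all predictors via the all_predictors() selector
recipe(Class ~ ., data = unscaled_cancer) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())
```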
- We now use the `prep` function to create an object that represents how to apply the recipe
- to our `unscaled_cancer` dataframe, and then the `bake` function to apply the recipe.
+ We finally use the `bake` function to apply the recipe.
```{r 05-scaling-4}
scaled_cancer <- bake(uc_recipe, unscaled_cancer)
head(scaled_cancer)
@@ -908,6 +922,11 @@ knn_fit <- workflow() %>%
  fit(data = unscaled_cancer)
knn_fit
```
+ As before, the fit object lists the function that trains the model as well as the "best" settings
+ for the number of neighbours and weight function (for now, these are just the values we chose
+ manually when we created `knn_spec` above). But now the fit object also includes information about
+ the overall workflow, including the centering and scaling preprocessing steps.
+
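+ If you ever need the individual pieces of the workflow back, one option is
+ sketched below (assuming your installed version of `workflows` provides the
+ `extract_recipe` and `extract_fit_parsnip` functions; the chunk name is just
+ illustrative):
+
+ ```{r 05-extract-workflow}
+ # Recover the trained recipe (centering and scaling estimates) from the workflow
+ knn_fit %>% extract_recipe()
+
+ # Recover the fitted model object from the workflow
+ knn_fit %>% extract_fit_parsnip()
+ ```
+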
Let's visualize the predictions that this trained K-nearest neighbour model will make on new observations.
Below you will see how to make the coloured prediction map plots from earlier in this chapter.
The basic idea is to create a grid of synthetic new observations using the `expand.grid` function,