
Commit d41f056

addressing mlee's comments on the tidymodels PR
1 parent d67b080 commit d41f056

16 files changed: +1419 -1916 lines changed

05-classification.Rmd

Lines changed: 24 additions & 5 deletions
@@ -25,7 +25,7 @@ predictions from our classifier are, as well as how to improve our classifier
 - Perform K-nearest neighbour classification in R using `tidymodels`
 - Explain why one should center, scale, and balance data in predictive modelling
 - Preprocess data to center, scale, and balance a dataset using a `recipe`
-- Combine preprocessing and model training using a `workflow`
+- Combine preprocessing and model training using a Tidymodels `workflow`
 
 
 ## The classification problem
@@ -483,7 +483,7 @@ many [other models](https://www.tidymodels.org/find/parsnip/)
 that you will encounter in this and future classes. The `tidymodels` collection
 provides tools to help make and use models, such as classifiers. Using the packages
 in this collection will help keep our code simple, readable and accurate; the
-less we have to code ourselves, the less mistakes we are likely to make. We
+less we have to code ourselves, the fewer mistakes we are likely to make. We
 start off by loading `tidymodels`:
 
 ```{r 05-tidymodels}
@@ -504,7 +504,12 @@ head(cancer_train)
 
 Next, we create a *model specification* for K-nearest neighbours classification
 by calling the `nearest_neighbor` function, specifying that we want to use $K = 5$ neighbours
-(we will discuss how to choose $K$ in the next chapter) and the straight-line distance (`weight_func = "rectangular"`).
+(we will discuss how to choose $K$ in the next chapter) and the straight-line
+distance (`weight_func = "rectangular"`). The `weight_func` argument controls
+how neighbours vote when classifying a new observation; by setting it to `"rectangular"`,
+each of the $K$ nearest neighbours gets exactly 1 vote as described above. Other choices,
+which weight each neighbour's vote differently, can be found on
+[the tidymodels website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html).
 We specify the particular computational
 engine (in this case, the `kknn` engine) for training the model with the `set_engine` function.
 Finally we specify that this is a classification problem with the `set_mode` function.
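The 1-vote-per-neighbour behaviour of `weight_func = "rectangular"` that this hunk documents can be sketched in base R; the neighbour labels below are hypothetical, not taken from the commit:

```r
# Hypothetical classes of the K = 5 nearest neighbours of a new observation
neighbour_classes <- c("Benign", "Malignant", "Benign", "Benign", "Malignant")

# With the "rectangular" weight function, every neighbour casts exactly 1 vote
votes <- table(neighbour_classes)

# The predicted class is the one with the most votes (here, 3 vs 2)
predicted <- names(votes)[which.max(votes)]
predicted
```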
@@ -513,6 +518,7 @@ Finally we specify that this is a classification problem with the `set_mode` fun
 knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
   set_engine("kknn") %>%
   set_mode("classification")
+knn_spec
 ```
 
 In order to fit the model on the breast cancer data, we need to pass the model specification
@@ -526,6 +532,15 @@ knn_fit <- knn_spec %>%
   fit(Class ~ ., data = cancer_train)
 knn_fit
 ```
+Here you can see the final trained model summary. It confirms that the computational engine used
+to train the model was `kknn::train.kknn`. It also shows the fraction of errors made by
+the nearest neighbour model, but we will ignore this for now and discuss it in more detail
+in the next chapter.
+Finally it shows (somewhat confusingly) that the "best" weight function
+was "rectangular" and "best" setting of $K$ was 5; but since we specified these earlier,
+R is just repeating those settings to us here. In the next chapter, we will actually
+let R tune the model for us.
+
 Finally, we make the prediction on the new observation by calling the `predict` function,
 passing the fit object we just created. As above when we ran the K-nearest neighbours
 classification algorithm manually, the `knn_fit` object classifies the new observation as
@@ -623,8 +638,7 @@ For example:
 
 You can find [a full set of all the steps and variable selection functions](https://tidymodels.github.io/recipes/reference/index.html)
 on the recipes home page.
-We now use the `prep` function to create an object that represents how to apply the recipe
-to our `unscaled_cancer` dataframe, and then the `bake` function to apply the recipe.
+We finally use the `bake` function to apply the recipe.
 ```{r 05-scaling-4}
 scaled_cancer <- bake(uc_recipe, unscaled_cancer)
 head(scaled_cancer)
@@ -908,6 +922,11 @@ knn_fit <- workflow() %>%
   fit(data = unscaled_cancer)
 knn_fit
 ```
+As before, the fit object lists the function that trains the model as well as the "best" settings
+for the number of neighbours and weight function (for now, these are just the values we chose
+manually when we created `knn_spec` above). But now the fit object also includes information about
+the overall workflow, including the centering and scaling preprocessing steps.
+
 Let's visualize the predictions that this trained K-nearest neighbour model will make on new observations.
 Below you will see how to make the coloured prediction map plots from earlier in this chapter.
 The basic idea is to create a grid of synthetic new observations using the `expand.grid` function,
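The `expand.grid` idea this context line refers to can be sketched in base R; the predictor names and grid ranges below are illustrative, not taken from the commit:

```r
# Create a grid of synthetic new observations over two (scaled) predictors;
# expand.grid returns one row per combination of the supplied values
pred_grid <- expand.grid(
  Symmetry = seq(-2, 2, length.out = 50),
  Radius   = seq(-2, 2, length.out = 50)
)

# 50 x 50 = 2500 synthetic observations, ready to pass to predict()
nrow(pred_grid)
```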

06-classification_continued.Rmd

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-NN`{r 06-setup, include=FALSE}
+```{r 06-setup, include=FALSE}
 knitr::opts_chunk$set(message = FALSE)
 ```
 

07-regression1.Rmd

Lines changed: 1 addition & 1 deletion
@@ -544,7 +544,7 @@ zvals <- knn_mult_fit %>%
   predict(crossing(xvals, yvals) %>% mutate(sqft = xvals, beds = yvals)) %>%
   pull(.pred)
 
-zvalsm <- matrix(zvals, nrow=length(sqft))
+zvalsm <- matrix(zvals, nrow=length(xvals))
 
 plot_ly() %>%
   add_markers(data = sacramento_train,
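The fix in this hunk sizes the prediction-surface matrix by the grid, not by the full `sqft` training column. A base-R sketch of the dimension arithmetic, with made-up sizes and random stand-in predictions:

```r
# Suppose the prediction grid has 10 sqft values and 4 bedroom counts
# (illustrative sizes, not taken from the commit)
xvals <- seq(500, 3000, length.out = 10)
yvals <- 1:4

# The grid produces one prediction per combination of the two variables
zvals <- runif(length(xvals) * length(yvals))  # stand-in for model predictions

# The surface matrix therefore needs length(xvals) rows; length(sqft), the
# number of rows in the whole training set, would generally not match
zvalsm <- matrix(zvals, nrow = length(xvals))
dim(zvalsm)
```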
