Commit 744d177

fixing based on Trevor's comments
1 parent c474f25 commit 744d177

File tree

classification1.Rmd
classification2.Rmd

2 files changed: 24 additions & 18 deletions


classification1.Rmd

Lines changed: 19 additions & 12 deletions
@@ -610,7 +610,7 @@ especially if we want to handle multiple classes, more than two variables,
 or predicting the class for multiple new observations. Thankfully, in R,
 the $K$-nearest neighbors algorithm is implemented in the `parsnip` package
 included in the
-[`tidymodels` meta package](https://www.tidymodels.org/), along with
+[`tidymodels` package](https://www.tidymodels.org/), along with
 many [other models](https://www.tidymodels.org/find/parsnip/)
 that you will encounter in this and future chapters of the book. The `tidymodels` collection
 provides tools to help make and use models, such as classifiers. Using the packages
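
For reference (not shown in this diff), a parsnip specification for the $K = 5$ classifier described here might look like the sketch below; the `kknn` engine and `weight_func = "rectangular"` are typical choices rather than confirmed book code:

```r
library(tidymodels)  # attaches parsnip, recipes, and the rest of the collection

# illustrative K-nearest neighbors specification with K = 5
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")
```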
@@ -627,7 +627,7 @@ We will use the `cancer` data set from above, with
 perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then
 we will use the classifier to predict the diagnosis label for a new observation with
 perimeter 0, concavity 3.5, and an unknown diagnosis label. Let's pick out our two desired
-predictor variables and class label and store it as a new dataset named `cancer_train`:
+predictor variables and class label and store them as a new data set named `cancer_train`:

 ```{r 05-tidymodels-2}
 cancer_train <- cancer |>
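
The chunk is cut off by the diff context here; based on the sentence above, it presumably selects the class label and the two predictors. A plausible (assumed, not verbatim) completion:

```r
# keep only the class label and the two predictor columns named above
cancer_train <- cancer |>
  select(Class, Perimeter, Concavity)
```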
@@ -655,7 +655,7 @@ knn_spec
 ```

 In order to fit the model on the breast cancer data, we need to pass the model specification
-and the dataset to the `fit` function. We also need to specify what variables to use as predictors
+and the data set to the `fit` function. We also need to specify what variables to use as predictors
 and what variable to use as the target. Below, the `Class ~ Perimeter + Concavity` argument specifies
 that `Class` is the target variable (the one we want to predict),
 and both `Perimeter` and `Concavity` are to be used as the predictors.
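
Putting the pieces together, the fit and predict steps described in this hunk and the next might look roughly like this sketch; the `new_obs` values come from the prose above (perimeter 0, concavity 3.5), and none of it is verbatim book code:

```r
# fit the specification: Class is the target, Perimeter and Concavity the predictors
knn_fit <- knn_spec |>
  fit(Class ~ Perimeter + Concavity, data = cancer_train)

# predict the label for the new observation described earlier
new_obs <- tibble(Perimeter = 0, Concavity = 3.5)
predict(knn_fit, new_obs)
```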
@@ -698,8 +698,8 @@ predict(knn_fit, new_obs)

 Is this predicted malignant label the true class for this observation?
 Well, we don't know because we do not have this
-observation's diagnosis&mdash; that is what we were trying to predict.
-In the next chapter, we will
+observation's diagnosis&mdash; that is what we were trying to predict! The
+classifier's prediction is not necessarily correct, but in the next chapter, we will
 learn ways to quantify how accurate we think our predictions are.

 ## Data preprocessing with `tidymodels`
@@ -731,7 +731,8 @@ $K$-nearest neighbor classification algorithm, this large shift can change the
 outcome of using many other predictive models.

 To scale and center our data, we need to find
-our variables' mean and *standard deviation* (a number quantifying how spread out values are).
+our variables' *mean* (the average, which quantifies the "central" value of a
+set of numbers) and *standard deviation* (a number quantifying how spread out values are).
 For each observed value of the variable, we subtract the mean (center the variable)
 and divide by the standard deviation (scale the variable). When we do this, the data
 is said to be *standardized*, and all variables in a data set will have a mean of 0
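
The standardization arithmetic described here is easy to verify on a toy vector (purely illustrative):

```r
# standardize by hand: subtract the mean, then divide by the standard deviation
x <- c(2, 4, 6, 8)
x_std <- (x - mean(x)) / sd(x)
mean(x_std)  # 0 (up to floating point error)
sd(x_std)    # 1
```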
@@ -795,7 +796,7 @@ For example:
 You can find [a full set of all the steps and variable selection functions](https://tidymodels.github.io/recipes/reference/index.html)
 on the recipes home page.

-Here we have calculated the required statistics based on the data input into the
+At this point, we have calculated the required statistics based on the data input into the
 recipe, but the data are not yet scaled and centred. To actually scale and center
 the data, we need to apply the bake function to the unscaled data.

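For context, a recipe pipeline of the kind this passage describes typically looks like the sketch below; the object and column names (`unscaled_cancer`, `Area`, `Smoothness`) are taken from the surrounding hunks and may not match the book's exact code:

```r
# declare the preprocessing steps, estimate the means and standard deviations
# from the data with prep(), then apply them with bake()
uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors()) |>
  prep()

scaled_cancer <- bake(uc_recipe, unscaled_cancer)
```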
@@ -805,10 +806,10 @@ scaled_cancer
 ```

 It may seem redundant that we had to both `bake` *and* `prep` to scale and center the data.
-However, we do this in two steps so we could specify a different data set in the `bake` step
-if desired, say, new data you want to predict, which were not part of the training set.
+However, we do this in two steps so we can specify a different data set in the `bake` step,
+for instance, new data that were not part of the training set.

-At this point, you may wonder why we are doing so much work just to center and
+You may wonder why we are doing so much work just to center and
 scale our variables. Can't we just manually scale and center the `Area` and
 `Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well,
 technically *yes*; but doing so is error-prone. In particular, we might
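
The manual alternative that this paragraph warns against would look something like the following; it works, but it is easy to forget a variable or to standardize new data with different statistics than the training data:

```r
# error-prone manual standardization of the two predictors
unscaled_cancer |>
  mutate(Area = (Area - mean(Area)) / sd(Area),
         Smoothness = (Smoothness - mean(Smoothness)) / sd(Smoothness))
```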
@@ -951,7 +952,10 @@ ggplot(unscaled_cancer, aes(x = Area, y = Smoothness, group = Class, color = Cla
 xend = unlist(neighbors[3, attrs[1]]),
 yend = unlist(neighbors[3, attrs[2]])
 ), color = "black") + theme_light() +
-facet_zoom( xlim= c(399.7, 401.6), ylim = c(0.08, 0.14), zoom.size = 2)
+# facet_zoom(xlim = c(399.7, 401.6), ylim = c(0.08, 0.14), zoom.size = 2) +
+facet_zoom(x = (Area > 380 & Area < 420),
+           y = (Smoothness > 0.08 & Smoothness < 0.14), zoom.size = 2) +
+theme_bw()
 ```
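
One note on the chunk above: `facet_zoom()` is not part of `ggplot2` itself; it comes from the `ggforce` package, which is presumably attached earlier in the chapter. If not, the chunk would also need:

```r
library(ggforce)  # provides facet_zoom(), used for the zoomed-in panel above
```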

 ### Balancing
@@ -1000,7 +1004,10 @@ rare_plot
 > process, which then guarantees the same result, i.e., the same choice of 3
 > observations, each time the code is run. In general, when your code involves
 > random numbers, if you want *the same result* each time, you should use
-> `set.seed`; if you want a *different result* each time, you should not.
+> `set.seed`; if you want a *different result* each time, you should not.
+> You only need to `set.seed` once at the beginning of your analysis, so the
+> rest of the analysis uses seemingly random numbers.
+

 Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification.
 With only 3 observations of malignant tumors, the classifier
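
A tiny illustration of the behaviour the note above describes: one `set.seed()` call at the top of an analysis makes every subsequent random draw reproducible.

```r
set.seed(123)     # call once, at the start of the analysis
sample(1:10, 3)   # the same 3 numbers every time the script is run
sample(1:10, 3)   # a different draw than the line above, but also reproducible
```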

classification2.Rmd

Lines changed: 5 additions & 6 deletions
@@ -53,7 +53,7 @@ the observations in the test set? One way we can do this is to calculate the
 classifier made the correct prediction. To calculate this we divide the number
 of correct predictions by the number of predictions made.

-$$prediction \; accuracy = \frac{number \; of \; correct \; predictions}{total \; number \; of \; predictions}$$
+$$\mathrm{prediction \; accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}}$$


 The process for assessing if our predictions match the true labels in the
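
A toy check of the accuracy formula above, with made-up labels:

```r
# accuracy = number of correct predictions / total number of predictions
predicted <- c("M", "B", "B", "M", "B")
truth     <- c("M", "B", "M", "M", "B")
mean(predicted == truth)  # 4 correct out of 5 predictions, so accuracy = 0.8
```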
@@ -73,7 +73,7 @@ We start by loading the necessary packages, reading in the breast cancer data
 from the previous chapter, and making a quick scatter plot visualization of
 tumor cell concavity versus smoothness colored by diagnosis in Figure \@ref(fig:06-precode).

-```{r 06-precode, fig.height = 4, fig.width = 5, fig.cap="Scatterplot of tumor cell concavity versus smoothness coloured by diagnosis label", message = F, warning = F}
+```{r 06-precode, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of tumor cell concavity versus smoothness coloured by diagnosis label", message = F, warning = F}
 # load packages
 library(tidyverse)
 library(tidymodels)
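
The rest of that chunk is cut off by the diff; based on the prose, it presumably reads the breast cancer data and draws the concavity-versus-smoothness scatter plot. A rough sketch (the file path and column names are assumptions):

```r
# read the breast cancer data from the previous chapter (path assumed)
cancer <- read_csv("data/unscaled_wdbc.csv") |>
  mutate(Class = as_factor(Class))

# scatter plot of concavity versus smoothness, colored by diagnosis
cancer |>
  ggplot(aes(x = Smoothness, y = Concavity, color = Class)) +
  geom_point(alpha = 0.5) +
  labs(color = "Diagnosis")
```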
@@ -180,7 +180,7 @@ our test data does not influence any aspect of our model training. Once we have
 created the standardization preprocessor, we can then apply it separately to both the
 training and test data sets.

-Fortunately, the `recipe` framework from `tidymodels` makes it simple to handle
+Fortunately, the `recipe` framework from `tidymodels` helps us handle
 this properly. Below we construct and prepare the recipe using only the training
 data (due to `data = cancer_train` in the first line).

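The recipe this paragraph refers to would be built on the training split only, along the lines of this sketch (predictor names taken from the surrounding prose, not from the diff):

```r
# prepped on cancer_train only, so the test set cannot leak into the
# centering and scaling statistics
cancer_recipe <- recipe(Class ~ Smoothness + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
```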
@@ -218,7 +218,7 @@ knn_fit
 > if there is a tie for the majority neighbor class, the winner is randomly selected. Although there is no chance
 > of a tie when $K$ is odd (here $K=3$), it is possible that the code may be changed in the future to have an even value of $K$.
 > Thus, to prevent potential issues with reproducibility, we have set the seed. Note that in your own code,
-> you only have to set the seed once at the beginning of your analysis.
+> you should only have to set the seed once at the beginning of your analysis.

 ### Predict the labels in the test set

@@ -591,7 +591,6 @@ variable that contains the sequence of values of $K$ to try; below we create the
 data frame with the `neighbors` variable containing each value from $K=1$ to $K=15$ using
 the `seq` function.
 Then we pass that data frame to the `grid` argument of `tune_grid`.
-We set the seed prior to tuning to ensure results are reproducible:
 ```{r 06-range-cross-val-2}
 set.seed(1)
 k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 1))
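
For context, `k_vals` is typically handed to `tune_grid` through a workflow roughly like the sketch below; the specification, recipe, and resample names are assumptions rather than code from this commit:

```r
# K-nearest neighbors specification with the number of neighbors left to tune
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training data, stratified by the class label
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)

# evaluate every K in k_vals and collect the cross-validation metrics
knn_results <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_vals) |>
  collect_metrics()
```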
@@ -946,7 +945,7 @@ is less obvious, as all seem like reasonable candidates. It
 is not clear which subset of them will create the best classifier. One could use visualizations and
 other exploratory analyses to try to help understand which variables are potentially relevant, but
 this process is both time-consuming and error-prone when there are many variables to consider.
-We therefore, need a more systematic and programmatic way of choosing variables.
+Therefore, we need a more systematic and programmatic way of choosing variables.
 This is a very difficult problem to solve in
 general, and there are a number of methods that have been developed that apply
 in particular cases of interest. Here we will discuss two basic
