Commit c26d3c5

minor spelling, grammar fixes, changing to american spelling, minor writing changes
1 parent 744d177 commit c26d3c5

File tree

2 files changed: +40 −40 lines changed

classification1.Rmd

Lines changed: 37 additions & 37 deletions
@@ -13,7 +13,7 @@ focus on *classification*, i.e., using one or more
 variables to predict the value of a categorical variable of interest. This chapter
 will cover the basics of classification, how to preprocess data to make it
 suitable for use in a classifier, and how to use our observed data to make
-predictions. The next will focus on how to evaluate how accurate the
+predictions. The next chapter will focus on how to evaluate how accurate the
 predictions from our classifier are, as well as how to improve our classifier
 (where possible) to maximize its accuracy.

@@ -161,8 +161,8 @@ can verify the levels of the `Class` column by using the `levels` function.
 This function should return the name of each category in that column. Given
 that we only have two different values in our `Class` column (B for benign and M
 for malignant), we only expect to get two names back. Note that the `levels` function requires a *vector* argument;
-so we use the `pull` function to convert the `Class`
-column into a vector and pass that into the `levels` function to see the categories
+so we use the `pull` function to extract a single column (`Class`) and
+pass that into the `levels` function to see the categories
 in the `Class` column.

 ```{r 05-levels}
@@ -176,7 +176,7 @@ cancer |>
 Before we start doing any modelling, let's explore our data set. Below we use
 the `group_by`, `summarize` and `n` functions to find the number and percentage
 of benign and malignant tumor observations in our data set. The `n` function within
-`summarize` counts the number of observations in each `Class` group.
+`summarize` when paired with `group_by` counts the number of observations in each `Class` group.
 Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations.
 ```{r 05-tally}
 num_obs <- nrow(cancer)
@@ -190,13 +190,13 @@ cancer |>

 Next, let's draw a scatter plot to visualize the relationship between the
 perimeter and concavity variables. Rather than use `ggplot's` default palette,
-we select our own colourblind-friendly colors&mdash;`"orange2"`
+we select our own colorblind-friendly colors&mdash;`"orange2"`
 for light orange and `"steelblue2"` for light blue&mdash;and
 pass them as the `values` argument to the `scale_color_manual` function.
 We also make the category labels ("B" and "M") more readable by
 changing them to "Benign" and "Malignant" using the `labels` argument.

-```{r 05-scatter, fig.height = 4, fig.width = 5, fig.cap= "Scatter plot of concavity versus perimeter coloured by diagnosis label"}
+```{r 05-scatter, fig.height = 4, fig.width = 5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label"}
 perim_concav <- cancer %>%
 ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
 geom_point(alpha = 0.6) +
@@ -215,7 +215,7 @@ measured *except* the label (i.e., an image without the physician's diagnosis
 for the tumor class). We could compute the standardized perimeter and concavity values,
 resulting in values of, say, 1 and 1. Could we use this information to classify
 that observation as benign or malignant? Based on the scatter plot, how might
-you classify that new observation? If the standardized concavity and perimeter values are 1 and 1, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. Based on our visualization, it seems like the *prediction of an unobserved label* might be possible.
+you classify that new observation? If the standardized concavity and perimeter values are 1 and 1 respectively, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. Based on our visualization, it seems like the *prediction of an unobserved label* might be possible.

 ## Classification with $K$-nearest neighbors

@@ -261,7 +261,7 @@ $K$ for us. We will cover how to choose $K$ ourselves in the next chapter.

 To illustrate the concept of $K$-nearest neighbors classification, we
 will walk through an example. Suppose we have a
-new observation, with perimeter of `r new_point[1]` and concavity of `r new_point[2]`, whose
+new observation, with standardized perimeter of `r new_point[1]` and standardized concavity of `r new_point[2]`, whose
 diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in
 Figure \@ref(fig:05-knn-1).

@@ -291,7 +291,7 @@ then the perimeter and concavity values are similar, and so we may expect that
 they would have the same diagnosis.


-```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the malignant nearest neighbor."}
+```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
 perim_concav_with_new_point +
 geom_segment(aes(
 x = new_point[1],
@@ -317,7 +317,7 @@ Does this seem like the right prediction to make for this observation? Probably
 not, if you consider the other nearby points...


-```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the benign nearest neighbor."}
+```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}

 perim_concav_with_new_point2 <- bind_rows(cancer, tibble(Perimeter = new_point[1], Concavity = new_point[2], Class = "unknown")) %>%
 ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) +
@@ -383,7 +383,7 @@ We decide which points are the $K$ "nearest" to our new observation
 using the *straight-line distance* (we will often just refer to this as *distance*).
 Suppose we have two observations $a$ and $b$, each having two predictor variables, $x$ and $y$.
 Denote $a_x$ and $a_y$ to be the values of variables $x$ and $y$ for observation $a$;
-$b_x$ and $b_y$ have similar definitions for observaiton $b$.
+$b_x$ and $b_y$ have similar definitions for observation $b$.
 Then the straight-line distance between observation $a$ and $b$ on the x-y plane can
 be computed using the following formula:
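The straight-line distance discussed in the hunk above can be checked by hand. Here is a small worked example in LaTeX, assuming the standard two-variable Euclidean form and using made-up coordinates (these values are illustrative, not taken from the data set):

```latex
% Hypothetical observations: a = (a_x, a_y) = (0, 1) and b = (b_x, b_y) = (3, 5).
\mathrm{Distance}(a, b) = \sqrt{(a_x - b_x)^2 + (a_y - b_y)^2}
                        = \sqrt{(0 - 3)^2 + (1 - 5)^2}
                        = \sqrt{9 + 16}
                        = 5
```

In the $K$-nearest neighbors procedure the text describes, this quantity is computed from the new observation to every training observation, and the $K$ smallest values identify the neighbors.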

@@ -396,7 +396,7 @@ To find the $K$ nearest neighbors to our new observation, we compute the distanc
 from that new observation to each observation in our training data, and select the $K$ observations corresponding to the
 $K$ *smallest* distance values. For example, suppose we want to use $K=5$ neighbors to classify a new
 observation with perimeter of `r new_point[1]` and
-concavity of `r new_point[2]`, shown as a red, diamond in Figure \@ref(fig:05-multiknn-1). Let's calculate the distances
+concavity of `r new_point[2]`, shown as a red diamond in Figure \@ref(fig:05-multiknn-1). Let's calculate the distances
 between our new point and each of the observations in the training set to find
 the $K=5$ neighbors that are nearest to our new point.
 You will see in the `mutate` step below, we compute the straight-line
@@ -486,7 +486,7 @@ perim_concav + annotate("path",


 Although the above description is directed toward two predictor variables,
 exactly the same $K$-nearest neighbors algorithm applies when you
-have a higher number of predictor variable. Each predictor variable may give us new
+have a higher number of predictor variables. Each predictor variable may give us new
 information to help create our classifier. The only difference is the formula
 for the distance between points. Suppose we have $m$ predictor
 variables for two observations $a$ and $b$, i.e.,
@@ -607,7 +607,7 @@ In order to classify a new observation using a $K$-nearest neighbor classifier,

 Coding the $K$-nearest neighbors algorithm in R ourselves can get complicated,
 especially if we want to handle multiple classes, more than two variables,
-or predicting the class for multiple new observations. Thankfully, in R,
+or predict the class for multiple new observations. Thankfully, in R,
 the $K$-nearest neighbors algorithm is implemented in the `parsnip` package
 included in the
 [`tidymodels` package](https://www.tidymodels.org/), along with
@@ -642,9 +642,9 @@ distance (`weight_func = "rectangular"`). The `weight_func` argument controls
 how neighbors vote when classifying a new observation; by setting it to `"rectangular"`,
 each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices,
 which weigh each neighbor's vote differently, can be found on
-[the tidymodels website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html).
+[the `tidymodels` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html).
 In the `set_engine` argument, we specify which package or system will be used for training
-the model. In this case, `kknn` is an R package for performing $K$-nearest neighbors classification.
+the model. Here `kknn` is the R package we will use for performing $K$-nearest neighbors classification.
 Finally, we specify that this is a classification problem with the `set_mode` function.

 ```{r 05-tidymodels-3}
@@ -655,7 +655,7 @@ knn_spec
 ```

 In order to fit the model on the breast cancer data, we need to pass the model specification
-and the data setto the `fit` function. We also need to specify what variables to use as predictors
+and the data set to the `fit` function. We also need to specify what variables to use as predictors
 and what variable to use as the target. Below, the `Class ~ Perimeter + Concavity` argument specifies
 that `Class` is the target variable (the one we want to predict),
 and both `Perimeter` and `Concavity` are to be used as the predictors.
@@ -682,7 +682,7 @@ in the next chapter.
 Finally it shows (somewhat confusingly) that the "best" weight function
 was "rectangular" and "best" setting of $K$ was 5; but since we specified these earlier,
 R is just repeating those settings to us here. In the next chapter, we will actually
-let R find the $K$ value for us.
+let R find the value of $K$ for us.

 Finally, we make the prediction on the new observation by calling the `predict` function,
 passing both the fit object we just created and the new observation itself. As above
@@ -733,8 +733,8 @@ outcome of using many other predictive models.
 To scale and center our data, we need to find
 our variables' *mean* (the average, which quantifies the "central" value of a
 set of numbers) and *standard deviation* (a number quantifying how spread out values are).
-For each observed value of the variable, we subtract the mean (center the variable)
-and divide by the standard deviation (scale the variable). When we do this, the data
+For each observed value of the variable, we subtract the mean (i.e., center the variable)
+and divide by the standard deviation (i.e., scale the variable). When we do this, the data
 is said to be *standardized*, and all variables in a data set will have a mean of 0
 and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest
 neighbor algorithm, we will read in the original, unstandardized Wisconsin breast
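The centering and scaling operation described in the hunk above can be written as a single formula. A brief sketch in LaTeX, using illustrative numbers that are not taken from the breast cancer data:

```latex
% Standardized value z of an observed value x, given the variable's
% mean \mu and standard deviation \sigma:
z = \frac{x - \mu}{\sigma}
% e.g., with hypothetical x = 150, \mu = 100, \sigma = 25:
z = \frac{150 - 100}{25} = 2
```

After this transformation each variable has mean 0 and standard deviation 1, which is exactly the property the text relies on when comparing variables measured on very different scales, such as `Area` and `Smoothness`.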
@@ -753,7 +753,7 @@ Looking at the unscaled and uncentered data above, you can see that the differen
 between the values for area measurements are much larger than those for
 smoothness. Will this affect
 predictions? In order to find out, we will create a scatter plot of these two
-predictors (coloured by diagnosis) for both the unstandardized data we just
+predictors (colored by diagnosis) for both the unstandardized data we just
 loaded, and the standardized version of that same data. But first, we need to
 standardize the `unscaled_cancer` data set with `tidymodels`.

@@ -794,23 +794,23 @@ For example:
 - `-Class`: specify everything except the `Class` variable

 You can find [a full set of all the steps and variable selection functions](https://tidymodels.github.io/recipes/reference/index.html)
-on the recipes home page.
+on the `recipes` home page.

 At this point, we have calculated the required statistics based on the data input into the
 recipe, but the data are not yet scaled and centred. To actually scale and center
-the data, we need to apply the bake function to the unscaled data.
+the data, we need to apply the `bake` function to the unscaled data.

 ```{r 05-scaling-4}
 scaled_cancer <- bake(uc_recipe, unscaled_cancer)
 scaled_cancer
 ```

 It may seem redundant that we had to both `bake` *and* `prep` to scale and center the data.
-However, we do this in two steps so we can specify a different data set in the `bake` step,
-for instance, new data that were not part of the training set.
+However, we do this in two steps so we can specify a different data set in the `bake` step if we want.
+For example, we may want to specify new data that were not part of the training set.

 You may wonder why we are doing so much work just to center and
-scale our variables. Can't we just manually scale and center the `Area` and
+scale our variables. Can't we just manually scale and center the `Area` and
 `Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well,
 technically *yes*; but doing so is error-prone. In particular, we might
 accidentally forget to apply the same centering / scaling when making
@@ -931,7 +931,7 @@ ggplot(unscaled_cancer, aes(x = Area, y = Smoothness, group = Class, color = Cla
 labels = c("Benign", "Malignant", "Unknown"),
 values = c("steelblue2", "orange2", "red")) +
 scale_shape_manual(name = "Diagnosis",
-labels = c("Benign", "Malignant", "Unknown"),
+labels = c("Benign", "Malignant", "Unknown"),
 values= c(16, 16, 18)) +
 scale_size_manual(name = "Diagnosis",
 labels = c("Benign", "Malignant", "Unknown"),
@@ -1063,12 +1063,12 @@ rare_plot + geom_point(aes(x = new_point[1], y = new_point[2]),
 ```
 </center>

-Figure \@ref(fig:05-upsample-2) shows what happens if we set the background colour of
+Figure \@ref(fig:05-upsample-2) shows what happens if we set the background color of
 each area of the plot to the predictions the $K$-nearest neighbor
 classifier would make. We can see that the decision is
-always "benign," corresponding to the blue colour.
+always "benign," corresponding to the blue color.

-```{r 05-upsample-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with background colour indicating the decision of the classifier and the points represent the labelled data"}
+```{r 05-upsample-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with background color indicating the decision of the classifier and the points represent the labelled data"}

 knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
 set_engine("kknn") |>
@@ -1120,13 +1120,13 @@ upsampled_cancer |>
 summarize(n = n())
 ```
 Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data.
-Figure \@ref(fig:05-upsample-plot) shows what happens now when we set the background colour
+Figure \@ref(fig:05-upsample-plot) shows what happens now when we set the background color
 of each area of our scatter plot to the decision the $K$-nearest neighbor
 classifier would make. We can see that the decision is more reasonable; when the points are close
 to those labelled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.

-```{r 05-upsample-plot, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Upsampled data with background colour indicating the decision of the classifier"}
+```{r 05-upsample-plot, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Upsampled data with background color indicating the decision of the classifier"}

 knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
 set_engine("kknn") |>
@@ -1208,11 +1208,11 @@ prediction
 The classifier predicts that the first observation is benign ("B"), while the second is
 malignant ("M"). Figure \@ref(fig:05-workflow-plot-show) visualizes the predictions that this
 trained $K$-nearest neighbor model will make on a large range of new observations.
-Although you have seen coloured prediction map visualizations like this a few times now,
+Although you have seen colored prediction map visualizations like this a few times now,
 we have not included the code to generate them, as it is a little bit complicated.
 For the interested reader who wants a learning challenge, we now include it below.
 The basic idea is to create a grid of synthetic new observations using the `expand.grid` function,
-predict the label of each, and visualize the predictions with a coloured scatter having a very high transparency
+predict the label of each, and visualize the predictions with a colored scatter having a very high transparency
 (low `alpha` value) and large point radius. See if you can figure out what each line is doing!

 > *Understanding this code is not required for the remainder of the textbook. It is included
@@ -1235,8 +1235,8 @@ knnPredGrid <- predict(knn_fit, asgrid)
 prediction_table <- bind_cols(knnPredGrid, asgrid) |> rename(Class = .pred_class)

 # plot:
-# 1. the coloured scatter of the original data
-# 2. the faded coloured scatter for the grid points
+# 1. the colored scatter of the original data
+# 2. the faded colored scatter for the grid points
 wkflw_plot <-
 ggplot() +
 geom_point(data = unscaled_cancer,
@@ -1247,6 +1247,6 @@ wkflw_plot <-
 scale_color_manual(labels = c("Malignant", "Benign"), values = c("orange2", "steelblue2"))
 ```

-```{r 05-workflow-plot-show, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of smoothness versus area where background colour indicates the decision of the classifier"}
+```{r 05-workflow-plot-show, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier"}
 wkflw_plot
 ```

classification2.Rmd

Lines changed: 3 additions & 3 deletions
@@ -73,7 +73,7 @@ We start by loading the necessary packages, reading in the breast cancer data
 from the previous chapter, and making a quick scatter plot visualization of
 tumor cell concavity versus smoothness colored by diagnosis in Figure \@ref(fig:06-precode).

-```{r 06-precode, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of tumor cell concavity versus smoothness coloured by diagnosis label", message = F, warning = F}
+```{r 06-precode, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label", message = F, warning = F}
 # load packages
 library(tidyverse)
 library(tidymodels)
@@ -214,11 +214,11 @@ knn_fit <- workflow() |>
 knn_fit
 ```

-> Note: Here again you see the `set.seed` function. In the $K$-nearest neighbors algorithm,
+> Note: Here again you see the `set.seed` function because in the $K$-nearest neighbors algorithm,
 > if there is a tie for the majority neighbor class, the winner is randomly selected. Although there is no chance
 > of a tie when $K$ is odd (here $K=3$), it is possible that the code may be changed in the future to have an even value of $K$.
 > Thus, to prevent potential issues with reproducibility, we have set the seed. Note that in your own code,
-> you should have to set the seed once at the beginning of your analysis.
+> you should only set the seed once at the beginning of your analysis.

 ### Predict the labels in the test set
