
Commit c22ffd2

Merge pull request #389 from UBC-DSCI/classification1_edit
Classification1 edit
2 parents 79b6286 + 8ee24ee commit c22ffd2

File tree: 1 file changed (+21 -22 lines changed)


classification1.Rmd

Lines changed: 21 additions & 22 deletions
@@ -48,16 +48,16 @@ predictions from our classifier are, as well as how to improve our classifier
 
 ## Chapter learning objectives
 
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:
 
-- Recognize situations where a classifier would be appropriate for making predictions
-- Describe what a training data set is and how it is used in classification
-- Interpret the output of a classifier
-- Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables
-- Explain the $K$-nearest neighbor classification algorithm
-- Perform $K$-nearest neighbor classification in R using `tidymodels`
-- Use a `recipe` to preprocess data to be centered, scaled, and balanced
-- Combine preprocessing and model training using a `workflow`
+- Recognize situations where a classifier would be appropriate for making predictions.
+- Describe what a training data set is and how it is used in classification.
+- Interpret the output of a classifier.
+- Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables.
+- Explain the $K$-nearest neighbor classification algorithm.
+- Perform $K$-nearest neighbor classification in R using `tidymodels`.
+- Use a `recipe` to preprocess data to be centered, scaled, and balanced.
+- Combine preprocessing and model training using a `workflow`.
 
 
 ## The classification problem
@@ -188,7 +188,7 @@ cancer <- cancer |>
 glimpse(cancer)
 ```
 
-Recall factors have what are called "levels", which you can think of as categories. We
+Recall that factors have what are called "levels", which you can think of as categories. We
 can verify the levels of the `Class` column by using the `levels` \index{levels}\index{factor!levels} function.
 This function should return the name of each category in that column. Given
 that we only have two different values in our `Class` column (B for benign and M
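
As a minimal illustration of the `levels` behavior described above (a toy factor, not the chapter's data):

```r
# A toy factor with the same two categories used in the chapter
class_example <- factor(c("B", "M", "B", "B", "M"))

# levels() returns the name of each category in the column
levels(class_example)
#> [1] "B" "M"
```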
@@ -208,7 +208,7 @@ cancer |>
 Before we start doing any modeling, let's explore our data set. Below we use
 the `group_by`, `summarize` and `n` \index{group\_by}\index{summarize} functions to find the number and percentage
 of benign and malignant tumor observations in our data set. The `n` function within
-`summarize` when paired with `group_by` counts the number of observations in each `Class` group.
+`summarize`, when paired with `group_by`, counts the number of observations in each `Class` group.
 Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations.
 ```{r 05-tally}
 num_obs <- nrow(cancer)
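
A minimal sketch of the counting pattern described above; the chunk in this hunk is truncated, so this is an illustration rather than the chapter's exact code (`cancer`, `Class`, and `num_obs` follow the hunk):

```r
num_obs <- nrow(cancer)

# n() counts the rows in each Class group; dividing by the total gives percentages
cancer |>
  group_by(Class) |>
  summarize(count = n(),
            percentage = n() / num_obs * 100)
```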
@@ -358,7 +358,7 @@ concavity of `r new_point[2]`. Looking at the scatter plot in Figure \@ref(fig:0
 classify this red, diamond observation? The nearest neighbor to this new point is a
 **benign** observation at (`r round(neighbors[1, c(attrs[1], attrs[2])], 1)`).
 Does this seem like the right prediction to make for this observation? Probably
-not, if you consider the other nearby points...
+not, if you consider the other nearby points.
 
 
 ```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
@@ -561,8 +561,8 @@ $$\mathrm{Distance} = \sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m
 
 This formula still corresponds to a straight-line distance, just in a space
 with more dimensions. Suppose we want to calculate the distance between a new
-observation with a perimeter of 0, concavity of 3.5 and symmetry of 1 and
-another observation with a perimeter, concavity and symmetry of 0.417, 2.31 and
+observation with a perimeter of 0, concavity of 3.5, and symmetry of 1, and
+another observation with a perimeter, concavity, and symmetry of 0.417, 2.31, and
 0.837 respectively. We have two observations with three predictor variables:
 perimeter, concavity, and symmetry. Previously, when we had two variables, we
 added up the squared difference between each of our (two) variables, and then
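
As a worked check of the three-variable distance described above (an illustration, not code from the chapter):

```r
# New observation: perimeter 0, concavity 3.5, symmetry 1
point_a <- c(0, 3.5, 1)
# Other observation: perimeter 0.417, concavity 2.31, symmetry 0.837
point_b <- c(0.417, 2.31, 0.837)

# Sum the squared differences across all three variables, then take the square root
sqrt(sum((point_a - point_b)^2))
#> [1] 1.271439
```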
@@ -672,7 +672,7 @@ if(!is_latex_output()){
 
 ### Summary of $K$-nearest neighbors algorithm
 
-In order to classify a new observation using a $K$-nearest neighbor classifier, we have to:
+In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following:
 
 1. Compute the distance between the new observation and each observation in the training set.
 2. Sort the data table in ascending order according to the distances.
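
A hedged sketch of those steps done manually with `dplyr` (the new observation's values, the predictor names, and $K = 5$ are assumptions for illustration, not the chapter's chunk):

```r
library(dplyr)

# Illustrative new observation and K (not the chapter's values)
new_obs_perimeter <- 0
new_obs_concavity <- 3.5
k <- 5

cancer |>
  mutate(dist_from_new = sqrt((Perimeter - new_obs_perimeter)^2 +
                              (Concavity - new_obs_concavity)^2)) |>  # step 1: compute distances
  arrange(dist_from_new) |>                                           # step 2: sort by distance
  slice_head(n = k) |>                                                # keep the K nearest rows
  count(Class, sort = TRUE)                                           # tally their classes; the majority wins
```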
@@ -763,13 +763,13 @@ Here you can see the final trained model summary. It confirms that the computati
 to train the model was `kknn::train.kknn`. It also shows the fraction of errors made by
 the nearest neighbor model, but we will ignore this for now and discuss it in more detail
 in the next chapter.
-Finally it shows (somewhat confusingly) that the "best" weight function
+Finally, it shows (somewhat confusingly) that the "best" weight function
 was "rectangular" and "best" setting of $K$ was 5; but since we specified these earlier,
 R is just repeating those settings to us here. In the next chapter, we will actually
 let R find the value of $K$ for us.
 
 Finally, we make the prediction on the new observation by calling the `predict` \index{tidymodels!predict} function,
-passing both the fit object we just created and the new observation itself. As above
+passing both the fit object we just created and the new observation itself. As above,
 when we ran the $K$-nearest neighbors
 classification algorithm manually, the `knn_fit` object classifies the new observation as
 malignant ("M"). Note that the `predict` function outputs a data frame with a single
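
For reference, a minimal sketch of the fit-and-predict pattern discussed above (the model specification and the new observation's values are assumptions for illustration; only `kknn`, the rectangular weight function, $K = 5$, `knn_fit`, and `predict` appear in the text):

```r
library(tidymodels)

# A K-nearest neighbors specification with the settings mentioned in the text
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# Fit on the training data, then predict the class of one new observation
knn_fit <- knn_spec |>
  fit(Class ~ Perimeter + Concavity, data = cancer)

new_observation <- tibble(Perimeter = 0, Concavity = 3.5)
predict(knn_fit, new_observation)
# returns a data frame with a single .pred_class column
```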
@@ -806,7 +806,7 @@ difference of \$1000 in yearly salary!
 In many other predictive models, the *center* of each variable (e.g., its mean)
 matters as well. For example, if we had a data set with a temperature variable
 measured in degrees Kelvin, and the same data set with temperature measured in
-degrees Celcius, the two variables would differ by a constant shift of 273
+degrees Celsius, the two variables would differ by a constant shift of 273
 (even though they contain exactly the same information). Likewise, in our
 hypothetical job classification example, we would likely see that the center of
 the salary variable is in the tens of thousands, while the center of the years
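
A small illustration of the constant-shift point (made-up temperatures, not from the chapter): once a variable is centered and scaled, its Kelvin and Celsius versions carry identical values.

```r
celsius <- c(10, 20, 30)
kelvin  <- celsius + 273    # same information, shifted by a constant

as.numeric(scale(celsius))  # scale() centers and scales by default
#> [1] -1  0  1
as.numeric(scale(kelvin))   # the constant shift disappears
#> [1] -1  0  1
```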
@@ -885,7 +885,7 @@ You can find [a full set of all the steps and variable selection functions](http
 on the `recipes` home page.
 
 At this point, we have calculated the required statistics based on the data input into the
-recipe, but the data are not yet scaled and centred. To actually scale and center
+recipe, but the data are not yet scaled and centered. To actually scale and center
 the data, we need to apply the `bake` \index{tidymodels!bake} \index{bake|see{tidymodels}} function to the unscaled data.
 
 ```{r 05-scaling-4}
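
A hedged sketch of the prep-then-bake pattern described above; the recipe definition here is an assumption for illustration (only `bake` itself appears in this passage):

```r
library(tidymodels)

# prep() computes the centering and scaling statistics from the data in the recipe
uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors()) |>
  prep()

# bake() actually applies the scaling and centering to the data
scaled_cancer <- bake(uc_recipe, unscaled_cancer)
```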
@@ -1021,7 +1021,7 @@ ggarrange(unscaled, scaled, ncol = 2, common.legend = TRUE, legend = "bottom")
 
 ```
 
-```{r 05-scaling-plt-zoomed, fig.height = 4.5, fig.width = 9, echo = FALSE, fig.cap = "Close up of three nearest neighbors for unstandardized data."}
+```{r 05-scaling-plt-zoomed, fig.height = 4.5, fig.width = 9, echo = FALSE, fig.cap = "Close-up of three nearest neighbors for unstandardized data."}
 library(ggforce)
 ggplot(unscaled_cancer, aes(x = Area,
                             y = Smoothness,
@@ -1284,8 +1284,7 @@ upsampled_plot
 
 ## Putting it together in a `workflow` {#puttingittogetherworkflow}
 
-The `tidymodels` package collection also provides the `workflow`, a way to chain \index{tidymodels!workflow} \index{workflow|see{tidymodels}}
-together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
+The `tidymodels` package collection also provides the `workflow`, a way to chain\index{tidymodels!workflow}\index{workflow|see{tidymodels}} together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
 To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data.
 First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:
 
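A minimal sketch of the kind of chaining the passage describes, assuming a recipe and model specification like the ones sketched earlier (these objects stand in for the chapter's, and are not its exact code):

```r
library(tidymodels)

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# A workflow bundles the preprocessing recipe and the model into a single object,
# so one fit() call handles the intermediate prep/bake/train steps
knn_workflow <- workflow() |>
  add_recipe(uc_recipe) |>
  add_model(knn_spec) |>
  fit(data = unscaled_cancer)

knn_workflow
```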