
Commit c22ffd2

Merge pull request #389 from UBC-DSCI/classification1_edit
Classification1 edit
2 parents 79b6286 + 8ee24ee commit c22ffd2

File tree: 1 file changed (+21 -22 lines changed)


classification1.Rmd

Lines changed: 21 additions & 22 deletions
@@ -48,16 +48,16 @@ predictions from our classifier are, as well as how to improve our classifier
 
 ## Chapter learning objectives
 
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:
 
-- Recognize situations where a classifier would be appropriate for making predictions
-- Describe what a training data set is and how it is used in classification
-- Interpret the output of a classifier
-- Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables
-- Explain the $K$-nearest neighbor classification algorithm
-- Perform $K$-nearest neighbor classification in R using `tidymodels`
-- Use a `recipe` to preprocess data to be centered, scaled, and balanced
-- Combine preprocessing and model training using a `workflow`
+- Recognize situations where a classifier would be appropriate for making predictions.
+- Describe what a training data set is and how it is used in classification.
+- Interpret the output of a classifier.
+- Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables.
+- Explain the $K$-nearest neighbor classification algorithm.
+- Perform $K$-nearest neighbor classification in R using `tidymodels`.
+- Use a `recipe` to preprocess data to be centered, scaled, and balanced.
+- Combine preprocessing and model training using a `workflow`.
 
 
 ## The classification problem
@@ -188,7 +188,7 @@ cancer <- cancer |>
 glimpse(cancer)
 ```
 
-Recall factors have what are called "levels", which you can think of as categories. We
+Recall that factors have what are called "levels", which you can think of as categories. We
 can verify the levels of the `Class` column by using the `levels` \index{levels}\index{factor!levels} function.
 This function should return the name of each category in that column. Given
 that we only have two different values in our `Class` column (B for benign and M
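
As a minimal illustration of the `levels` behavior described above (a toy factor, not the chapter's data):

```r
# A toy factor with the same two categories used in the chapter
class_example <- factor(c("B", "M", "B", "B", "M"))

# levels() returns the name of each category in the column
levels(class_example)
#> [1] "B" "M"
```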
@@ -208,7 +208,7 @@ cancer |>
 Before we start doing any modeling, let's explore our data set. Below we use
 the `group_by`, `summarize` and `n` \index{group\_by}\index{summarize} functions to find the number and percentage
 of benign and malignant tumor observations in our data set. The `n` function within
-`summarize` when paired with `group_by` counts the number of observations in each `Class` group.
+`summarize`, when paired with `group_by`, counts the number of observations in each `Class` group.
 Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations.
 ```{r 05-tally}
 num_obs <- nrow(cancer)
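
A minimal sketch of the counting pattern described above; the chunk in this hunk is truncated, so this is an illustration rather than the chapter's exact code (`cancer`, `Class`, and `num_obs` follow the hunk):

```r
num_obs <- nrow(cancer)

# n() counts the rows in each Class group; dividing by the total gives percentages
cancer |>
  group_by(Class) |>
  summarize(count = n(),
            percentage = n() / num_obs * 100)
```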
@@ -358,7 +358,7 @@ concavity of `r new_point[2]`. Looking at the scatter plot in Figure \@ref(fig:0
 classify this red, diamond observation? The nearest neighbor to this new point is a
 **benign** observation at (`r round(neighbors[1, c(attrs[1], attrs[2])], 1)`).
 Does this seem like the right prediction to make for this observation? Probably
-not, if you consider the other nearby points...
+not, if you consider the other nearby points.
 
 
 ```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
@@ -561,8 +561,8 @@ $$\mathrm{Distance} = \sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m
 
 This formula still corresponds to a straight-line distance, just in a space
 with more dimensions. Suppose we want to calculate the distance between a new
-observation with a perimeter of 0, concavity of 3.5 and symmetry of 1 and
-another observation with a perimeter, concavity and symmetry of 0.417, 2.31 and
+observation with a perimeter of 0, concavity of 3.5, and symmetry of 1, and
+another observation with a perimeter, concavity, and symmetry of 0.417, 2.31, and
 0.837 respectively. We have two observations with three predictor variables:
 perimeter, concavity, and symmetry. Previously, when we had two variables, we
 added up the squared difference between each of our (two) variables, and then
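
As a worked check of the three-variable distance described above (an illustration, not code from the chapter):

```r
# New observation: perimeter 0, concavity 3.5, symmetry 1
point_a <- c(0, 3.5, 1)
# Other observation: perimeter 0.417, concavity 2.31, symmetry 0.837
point_b <- c(0.417, 2.31, 0.837)

# Sum the squared differences across all three variables, then take the square root
sqrt(sum((point_a - point_b)^2))
#> [1] 1.271439
```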
@@ -672,7 +672,7 @@ if(!is_latex_output()){
 
 ### Summary of $K$-nearest neighbors algorithm
 
-In order to classify a new observation using a $K$-nearest neighbor classifier, we have to:
+In order to classify a new observation using a $K$-nearest neighbor classifier, we have to do the following:
 
 1. Compute the distance between the new observation and each observation in the training set.
 2. Sort the data table in ascending order according to the distances.
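
A hedged sketch of those steps done manually with `dplyr` (the new observation's values, the predictor names, and $K = 5$ are assumptions for illustration, not the chapter's chunk):

```r
library(dplyr)

# Illustrative new observation and K (not the chapter's values)
new_obs_perimeter <- 0
new_obs_concavity <- 3.5
k <- 5

cancer |>
  mutate(dist_from_new = sqrt((Perimeter - new_obs_perimeter)^2 +
                              (Concavity - new_obs_concavity)^2)) |>  # step 1: compute distances
  arrange(dist_from_new) |>                                           # step 2: sort by distance
  slice_head(n = k) |>                                                # keep the K nearest rows
  count(Class, sort = TRUE)                                           # tally their classes; the majority wins
```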
@@ -763,13 +763,13 @@ Here you can see the final trained model summary. It confirms that the computati
 to train the model was `kknn::train.kknn`. It also shows the fraction of errors made by
 the nearest neighbor model, but we will ignore this for now and discuss it in more detail
 in the next chapter.
-Finally it shows (somewhat confusingly) that the "best" weight function
+Finally, it shows (somewhat confusingly) that the "best" weight function
 was "rectangular" and "best" setting of $K$ was 5; but since we specified these earlier,
 R is just repeating those settings to us here. In the next chapter, we will actually
 let R find the value of $K$ for us.
 
 Finally, we make the prediction on the new observation by calling the `predict` \index{tidymodels!predict} function,
-passing both the fit object we just created and the new observation itself. As above
+passing both the fit object we just created and the new observation itself. As above,
 when we ran the $K$-nearest neighbors
 classification algorithm manually, the `knn_fit` object classifies the new observation as
 malignant ("M"). Note that the `predict` function outputs a data frame with a single
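
For reference, a minimal sketch of the fit-and-predict pattern discussed above (the model specification and the new observation's values are assumptions for illustration; only `kknn`, the rectangular weight function, $K = 5$, `knn_fit`, and `predict` appear in the text):

```r
library(tidymodels)

# A K-nearest neighbors specification with the settings mentioned in the text
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# Fit on the training data, then predict the class of one new observation
knn_fit <- knn_spec |>
  fit(Class ~ Perimeter + Concavity, data = cancer)

new_observation <- tibble(Perimeter = 0, Concavity = 3.5)
predict(knn_fit, new_observation)
# returns a data frame with a single .pred_class column
```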
@@ -806,7 +806,7 @@ difference of \$1000 in yearly salary!
 In many other predictive models, the *center* of each variable (e.g., its mean)
 matters as well. For example, if we had a data set with a temperature variable
 measured in degrees Kelvin, and the same data set with temperature measured in
-degrees Celcius, the two variables would differ by a constant shift of 273
+degrees Celsius, the two variables would differ by a constant shift of 273
 (even though they contain exactly the same information). Likewise, in our
 hypothetical job classification example, we would likely see that the center of
 the salary variable is in the tens of thousands, while the center of the years
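
A small illustration of the constant-shift point (made-up temperatures, not from the chapter): once a variable is centered and scaled, its Kelvin and Celsius versions carry identical values.

```r
celsius <- c(10, 20, 30)
kelvin  <- celsius + 273    # same information, shifted by a constant

as.numeric(scale(celsius))  # scale() centers and scales by default
#> [1] -1  0  1
as.numeric(scale(kelvin))   # the constant shift disappears
#> [1] -1  0  1
```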
@@ -885,7 +885,7 @@ You can find [a full set of all the steps and variable selection functions](http
 on the `recipes` home page.
 
 At this point, we have calculated the required statistics based on the data input into the
-recipe, but the data are not yet scaled and centred. To actually scale and center
+recipe, but the data are not yet scaled and centered. To actually scale and center
 the data, we need to apply the `bake` \index{tidymodels!bake} \index{bake|see{tidymodels}} function to the unscaled data.
 
 ```{r 05-scaling-4}
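
A hedged sketch of the prep-then-bake pattern described above; the recipe definition here is an assumption for illustration (only `bake` itself appears in this passage):

```r
library(tidymodels)

# prep() computes the centering and scaling statistics from the data in the recipe
uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors()) |>
  prep()

# bake() actually applies the scaling and centering to the data
scaled_cancer <- bake(uc_recipe, unscaled_cancer)
```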
@@ -1021,7 +1021,7 @@ ggarrange(unscaled, scaled, ncol = 2, common.legend = TRUE, legend = "bottom")
 
 ```
 
-```{r 05-scaling-plt-zoomed, fig.height = 4.5, fig.width = 9, echo = FALSE, fig.cap = "Close up of three nearest neighbors for unstandardized data."}
+```{r 05-scaling-plt-zoomed, fig.height = 4.5, fig.width = 9, echo = FALSE, fig.cap = "Close-up of three nearest neighbors for unstandardized data."}
 library(ggforce)
 ggplot(unscaled_cancer, aes(x = Area,
                             y = Smoothness,
@@ -1284,8 +1284,7 @@ upsampled_plot
 
 ## Putting it together in a `workflow` {#puttingittogetherworkflow}
 
-The `tidymodels` package collection also provides the `workflow`, a way to chain \index{tidymodels!workflow} \index{workflow|see{tidymodels}}
-together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
+The `tidymodels` package collection also provides the `workflow`, a way to chain\index{tidymodels!workflow}\index{workflow|see{tidymodels}} together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
 To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data.
 First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:
 
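A minimal sketch of the kind of chaining the passage describes, assuming a recipe and model specification like the ones sketched earlier (these objects stand in for the chapter's, and are not its exact code):

```r
library(tidymodels)

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# A workflow bundles the preprocessing recipe and the model into a single object,
# so one fit() call handles the intermediate prep/bake/train steps
knn_workflow <- workflow() |>
  add_recipe(uc_recipe) |>
  add_model(knn_spec) |>
  fit(data = unscaled_cancer)

knn_workflow
```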