Commit c474f25

committed: minor edits based on Trevor's comments
1 parent 4eddf44

1 file changed: classification1.Rmd (14 additions, 25 deletions)
@@ -160,15 +160,14 @@ Recall factors have what are called "levels", which you can think of as categori
 can verify the levels of the `Class` column by using the `levels` function.
 This function should return the name of each category in that column. Given
 that we only have two different values in our `Class` column (B for benign and M
-for malignant), we only expect to get two names back. Note that the `levels` function requires
-a *vector* argument (while the `select` function outputs a *data frame*);
+for malignant), we only expect to get two names back. Note that the `levels` function requires a *vector* argument;
 so we use the `pull` function to convert the `Class`
 column into a vector and pass that into the `levels` function to see the categories
 in the `Class` column.
 
 ```{r 05-levels}
 cancer |>
-  pull(Class) |> # turns a data frame into a vector
+  pull(Class) |>
   levels()
 ```
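For reference, the data frame versus vector distinction that the edited sentence relies on can be checked directly in R. A minimal sketch, assuming the chapter's `cancer` data frame is loaded and `Class` is a factor with levels `B` and `M`:

```r
library(tidyverse)

cancer |> select(Class) |> class()  # a data frame: "tbl_df" "tbl" "data.frame"
cancer |> pull(Class) |> class()    # a vector: "factor"
cancer |> pull(Class) |> levels()   # "B" "M"
```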

@@ -216,10 +215,7 @@ measured *except* the label (i.e., an image without the physician's diagnosis
 for the tumor class). We could compute the standardized perimeter and concavity values,
 resulting in values of, say, 1 and 1. Could we use this information to classify
 that observation as benign or malignant? Based on the scatter plot, how might
-you classify that new observation? How would you classify a new observation with
-a standardized perimeter value of -1 and a concavity value of -0.5? What about 0 and 1?
-Based on our visualization, it seems
-like the *prediction of an unobserved label* might be possible.
+you classify that new observation? If the standardized concavity and perimeter values are 1 and 1, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. Based on our visualization, it seems like the *prediction of an unobserved label* might be possible.
 
 ## Classification with $K$-nearest neighbors
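The scatter-plot reasoning in the hunk above (a point at standardized perimeter and concavity of 1 and 1 falling among the malignant points) corresponds to a plot like the ones built later in the diff. A rough sketch, assuming `Perimeter` and `Concavity` in `cancer` are already standardized:

```r
library(tidyverse)

# colour the standardized perimeter-concavity scatter plot by diagnosis
ggplot(cancer, aes(x = Perimeter, y = Concavity, color = Class)) +
  geom_point(alpha = 0.6)
```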

@@ -270,7 +266,7 @@ diagnosis "Class" is unknown. This new observation is depicted by the red, diamo
 Figure \@ref(fig:05-knn-1).
 
 
-```{r 05-knn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation labelled in red"}
+```{r 05-knn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond"}
 perim_concav_with_new_point <- bind_rows(cancer, tibble(Perimeter = new_point[1], Concavity = new_point[2], Class = "unknown")) %>%
   ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) +
   geom_point(alpha = 0.6) +
@@ -295,7 +291,7 @@ then the perimeter and concavity values are similar, and so we may expect that
 they would have the same diagnosis.
 
 
-```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter, with malignant nearest neighbor to a new observation highlighted"}
+```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the malignant nearest neighbor."}
 perim_concav_with_new_point +
   geom_segment(aes(
     x = new_point[1],
@@ -321,7 +317,7 @@ Does this seem like the right prediction to make for this observation? Probably
 not, if you consider the other nearby points...
 
 
-```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter, with benign nearest neighbor to a new observation highlighted"}
+```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the benign nearest neighbor."}
 
 perim_concav_with_new_point2 <- bind_rows(cancer, tibble(Perimeter = new_point[1], Concavity = new_point[2], Class = "unknown")) %>%
   ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) +
@@ -407,7 +403,7 @@ You will see in the `mutate` step below, we compute the straight-line
 distance using the formula above: we square the differences between the two observations' perimeter
 and concavity coordinates, add the squared differences, and then take the square root.
 
-```{r 05-multiknn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation labelled in red"}
+```{r 05-multiknn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond"}
 perim_concav <- bind_rows(cancer, tibble(Perimeter = new_point[1], Concavity = new_point[2], Class = "unknown")) |>
   ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) +
   geom_point(aes(x = new_point[1], y = new_point[2]), color = "red", size = 2.5, pch = 18) +
@@ -438,7 +434,7 @@ cancer |>
   mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
                               (Concavity - new_obs_Concavity)^2)) |>
   arrange(dist_from_new) |>
-  slice(1:5) # subset the first 5 rows
+  slice(1:5) # take the first 5 rows
 ```
 
 In Table \@ref(tab:05-multiknn-mathtable) we show in mathematical detail how the `mutate` step was used to compute the `dist_from_new`
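The distance computed in the `mutate` step above can be verified by hand for a single pair of points. A small sketch with illustrative coordinates (borrowed from the three-predictor example later in the diff, restricted to perimeter and concavity):

```r
# hypothetical new observation and training point, for illustration only
new_obs <- c(Perimeter = 0, Concavity = 3.5)
old_obs <- c(Perimeter = 0.417, Concavity = 2.31)

# straight-line (Euclidean) distance, matching the mutate() formula above
sqrt(sum((new_obs - old_obs)^2))
#> approximately 1.26
```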
@@ -490,12 +486,9 @@ perim_concav + annotate("path",
 
 Although the above description is directed toward two predictor variables,
 exactly the same $K$-nearest neighbors algorithm applies when you
-have a higher number of predictor variables (i.e., a higher-dimensional
-predictor space). Each predictor variable may give us new
+have a higher number of predictor variables. Each predictor variable may give us new
 information to help create our classifier. The only difference is the formula
-for the distance between points.
-
-Suppose we have $m$ predictor
+for the distance between points. Suppose we have $m$ predictor
 variables for two observations $a$ and $b$, i.e.,
 $a = (a_{1}, a_{2}, \dots, a_{m})$ and
 $b = (b_{1}, b_{2}, \dots, b_{m})$.
@@ -505,17 +498,13 @@ The distance formula becomes
 $$\mathrm{Distance} = \sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m} - b_{m})^2}.$$
 
 This formula still corresponds to a straight-line distance, just in a space with
-more dimensions. Suppose we want to calculate the distance between a new observation
-that we want to classify with a perimeter of 0, concavity of 3.5 and symmetry of 1 and
-another observation with a perimeter, concavity and symmetry of 0.417, 2.31 and 0.837 respectively.
-We have three predictor variables, perimeter, concavity, and symmetry (so, a 3-dimensional space) for two observations where
-$\mathrm{observation}_{old} = (0.417, 2.31, 0.837)$ and $\mathrm{observation}_{new} = (0, 3.5, 1)$.
-Previously, when we had two variables, we added up the squared difference between each of our (two) variables,
+more dimensions. Suppose we want to calculate the distance between a new observation with a perimeter of 0, concavity of 3.5 and symmetry of 1 and
+another observation with a perimeter, concavity and symmetry of 0.417, 2.31 and 0.837 respectively. We have two observations with three predictor variables: perimeter, concavity, and symmetry. Previously, when we had two variables, we added up the squared difference between each of our (two) variables,
 and then took the square root. Now we will do the same, except for our
 three variables. We calculate the distance as follows
 
 $$\mathrm{Distance} =\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2 + (1 - 0.837)^2} = 1.27.$$
-In this case, the formula above is just the straight line distance in this 3-dimensional space.
+
 Let's calculate the distances between our new observation and each of the observations in the training set to find the $K=5$ neighbors when we have these three predictors.
 ```{r}
 new_obs_Perimeter <- 0
@@ -528,7 +517,7 @@ cancer |>
     (Concavity - new_obs_Concavity)^2 +
     (Symmetry - new_obs_Symmetry)^2)) |>
   arrange(dist_from_new) |>
-  slice(1:5) # subset the first 5 rows
+  slice(1:5) # take the first 5 rows
 ```
 Based on $K=5$ nearest neighbors with these three predictors we would classify the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class.
 Figure \@ref(fig:05-more) shows what the data look like when we visualize them
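The final step, predicting by majority vote among the $K=5$ neighbors, is not shown in the hunk. A minimal sketch using the 4-malignant/1-benign split the text reports (the ordering of the classes here is illustrative, not taken from the actual output):

```r
# classes of the 5 nearest neighbors as reported in the text: 4 M, 1 B
neighbor_classes <- c("M", "M", "B", "M", "M")

# the most frequent class among the neighbors is the prediction
names(which.max(table(neighbor_classes)))
#> "M"
```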
