Commit c474f25

committed: minor edits based on Trevor's comments
1 parent 4eddf44

1 file changed: classification1.Rmd (14 additions, 25 deletions)
@@ -160,15 +160,14 @@ Recall factors have what are called "levels", which you can think of as categori
 can verify the levels of the `Class` column by using the `levels` function.
 This function should return the name of each category in that column. Given
 that we only have two different values in our `Class` column (B for benign and M
-for malignant), we only expect to get two names back. Note that the `levels` function requires
-a *vector* argument (while the `select` function outputs a *data frame*);
+for malignant), we only expect to get two names back. Note that the `levels` function requires a *vector* argument;
 so we use the `pull` function to convert the `Class`
 column into a vector and pass that into the `levels` function to see the categories
 in the `Class` column.
 
 ```{r 05-levels}
 cancer |>
-  pull(Class) |> # turns a data frame into a vector
+  pull(Class) |>
   levels()
 ```
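For reference, the data frame versus vector distinction that the edited sentence relies on can be checked directly in R. A minimal sketch, assuming the chapter's `cancer` data frame is loaded and `Class` is a factor with levels `B` and `M`:

```r
library(tidyverse)

cancer |> select(Class) |> class()  # a data frame: "tbl_df" "tbl" "data.frame"
cancer |> pull(Class) |> class()    # a vector: "factor"
cancer |> pull(Class) |> levels()   # "B" "M"
```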

@@ -216,10 +215,7 @@ measured *except* the label (i.e., an image without the physician's diagnosis
 for the tumor class). We could compute the standardized perimeter and concavity values,
 resulting in values of, say, 1 and 1. Could we use this information to classify
 that observation as benign or malignant? Based on the scatter plot, how might
-you classify that new observation? How would you classify a new observation with
-a standardized perimeter value of -1 and a concavity value of -0.5? What about 0 and 1?
-Based on our visualization, it seems
-like the *prediction of an unobserved label* might be possible.
+you classify that new observation? If the standardized concavity and perimeter values are 1 and 1, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. Based on our visualization, it seems like the *prediction of an unobserved label* might be possible.
 
 ## Classification with $K$-nearest neighbors
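The scatter-plot reasoning in the hunk above (a point at standardized perimeter and concavity of 1 and 1 falling among the malignant points) corresponds to a plot like the ones built later in the diff. A rough sketch, assuming `Perimeter` and `Concavity` in `cancer` are already standardized:

```r
library(tidyverse)

# colour the standardized perimeter-concavity scatter plot by diagnosis
ggplot(cancer, aes(x = Perimeter, y = Concavity, color = Class)) +
  geom_point(alpha = 0.6)
```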

@@ -270,7 +266,7 @@ diagnosis "Class" is unknown. This new observation is depicted by the red, diamo
 Figure \@ref(fig:05-knn-1).
 
 
-```{r 05-knn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation labelled in red"}
+```{r 05-knn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond"}
 perim_concav_with_new_point <- bind_rows(cancer, tibble(Perimeter = new_point[1], Concavity = new_point[2], Class = "unknown")) %>%
   ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) +
   geom_point(alpha = 0.6) +
@@ -295,7 +291,7 @@ then the perimeter and concavity values are similar, and so we may expect that
 they would have the same diagnosis.
 
 
-```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter, with malignant nearest neighbor to a new observation highlighted"}
+```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the malignant nearest neighbor."}
 perim_concav_with_new_point +
   geom_segment(aes(
     x = new_point[1],
@@ -321,7 +317,7 @@ Does this seem like the right prediction to make for this observation? Probably
 not, if you consider the other nearby points...
 
 
-```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter, with benign nearest neighbor to a new observation highlighted"}
+```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the benign nearest neighbor."}
 
 perim_concav_with_new_point2 <- bind_rows(cancer, tibble(Perimeter = new_point[1], Concavity = new_point[2], Class = "unknown")) %>%
   ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) +
@@ -407,7 +403,7 @@ You will see in the `mutate` step below, we compute the straight-line
 distance using the formula above: we square the differences between the two observations' perimeter
 and concavity coordinates, add the squared differences, and then take the square root.
 
-```{r 05-multiknn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation labelled in red"}
+```{r 05-multiknn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond"}
 perim_concav <- bind_rows(cancer, tibble(Perimeter = new_point[1], Concavity = new_point[2], Class = "unknown")) |>
   ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) +
   geom_point(aes(x = new_point[1], y = new_point[2]), color = "red", size = 2.5, pch = 18) +
@@ -438,7 +434,7 @@ cancer |>
   mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
                               (Concavity - new_obs_Concavity)^2)) |>
   arrange(dist_from_new) |>
-  slice(1:5) # subset the first 5 rows
+  slice(1:5) # take the first 5 rows
 ```
 
 In Table \@ref(tab:05-multiknn-mathtable) we show in mathematical detail how the `mutate` step was used to compute the `dist_from_new`
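The distance computed in the `mutate` step above can be verified by hand for a single pair of points. A small sketch with illustrative coordinates (borrowed from the three-predictor example later in the diff, restricted to perimeter and concavity):

```r
# hypothetical new observation and training point, for illustration only
new_obs <- c(Perimeter = 0, Concavity = 3.5)
old_obs <- c(Perimeter = 0.417, Concavity = 2.31)

# straight-line (Euclidean) distance, matching the mutate() formula above
sqrt(sum((new_obs - old_obs)^2))
#> approximately 1.26
```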
@@ -490,12 +486,9 @@ perim_concav + annotate("path",
 
 Although the above description is directed toward two predictor variables,
 exactly the same $K$-nearest neighbors algorithm applies when you
-have a higher number of predictor variables (i.e., a higher-dimensional
-predictor space). Each predictor variable may give us new
+have a higher number of predictor variables. Each predictor variable may give us new
 information to help create our classifier. The only difference is the formula
-for the distance between points.
-
-Suppose we have $m$ predictor
+for the distance between points. Suppose we have $m$ predictor
 variables for two observations $a$ and $b$, i.e.,
 $a = (a_{1}, a_{2}, \dots, a_{m})$ and
 $b = (b_{1}, b_{2}, \dots, b_{m})$.
@@ -505,17 +498,13 @@ The distance formula becomes
 $$\mathrm{Distance} = \sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m} - b_{m})^2}.$$
 
 This formula still corresponds to a straight-line distance, just in a space with
-more dimensions. Suppose we want to calculate the distance between a new observation
-that we want to classify with a perimeter of 0, concavity of 3.5 and symmetry of 1 and
-another observation with a perimeter, concavity and symmetry of 0.417, 2.31 and 0.837 respectively.
-We have three predictor variables, perimeter, concavity, and symmetry (so, a 3-dimensional space) for two observations where
-$\mathrm{observation}_{old} = (0.417, 2.31, 0.837)$ and $\mathrm{observation}_{new} = (0, 3.5, 1)$.
-Previously, when we had two variables, we added up the squared difference between each of our (two) variables,
+more dimensions. Suppose we want to calculate the distance between a new observation with a perimeter of 0, concavity of 3.5 and symmetry of 1 and
+another observation with a perimeter, concavity and symmetry of 0.417, 2.31 and 0.837 respectively. We have two observations with three predictor variables: perimeter, concavity, and symmetry. Previously, when we had two variables, we added up the squared difference between each of our (two) variables,
 and then took the square root. Now we will do the same, except for our
 three variables. We calculate the distance as follows
 
 $$\mathrm{Distance} =\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2 + (1 - 0.837)^2} = 1.27.$$
-In this case, the formula above is just the straight line distance in this 3-dimensional space.
+
 Let's calculate the distances between our new observation and each of the observations in the training set to find the $K=5$ neighbors when we have these three predictors.
 ```{r}
 new_obs_Perimeter <- 0
@@ -528,7 +517,7 @@ cancer |>
     (Concavity - new_obs_Concavity)^2 +
     (Symmetry - new_obs_Symmetry)^2)) |>
   arrange(dist_from_new) |>
-  slice(1:5) # subset the first 5 rows
+  slice(1:5) # take the first 5 rows
 ```
 Based on $K=5$ nearest neighbors with these three predictors we would classify the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class.
 Figure \@ref(fig:05-more) shows what the data look like when we visualize them
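The final step, predicting by majority vote among the $K=5$ neighbors, is not shown in the hunk. A minimal sketch using the 4-malignant/1-benign split the text reports (the ordering of the classes here is illustrative, not taken from the actual output):

```r
# classes of the 5 nearest neighbors as reported in the text: 4 M, 1 B
neighbor_classes <- c("M", "M", "B", "M", "M")

# the most frequent class among the neighbors is the prediction
names(which.max(table(neighbor_classes)))
#> "M"
```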
