classification1.Rmd
@@ -160,15 +160,14 @@ Recall factors have what are called "levels", which you can think of as categori
can verify the levels of the `Class` column by using the `levels` function.
This function should return the name of each category in that column. Given
that we only have two different values in our `Class` column (B for benign and M
for malignant), we only expect to get two names back. Note that the `levels` function requires a *vector* argument;
so we use the `pull` function to convert the `Class`
column into a vector and pass that into the `levels` function to see the categories
in the `Class` column.

```{r 05-levels}
cancer |>
  pull(Class) |>
  levels()
```
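To see why `pull` is needed here, consider a small toy data frame (a hypothetical stand-in for the `cancer` data; the values are made up for illustration):

```r
library(dplyr)

# A toy stand-in for the cancer data frame (hypothetical values)
toy <- tibble(Class = factor(c("B", "M", "B")))

levels(pull(toy, Class))    # pull() returns the factor vector, so levels() works
levels(select(toy, Class))  # select() returns a data frame, so levels() gives NULL
```

The first call returns the category names `"B"` and `"M"`, while the second returns `NULL`, since a data frame has no levels attribute.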
@@ -216,10 +215,7 @@ measured *except* the label (i.e., an image without the physician's diagnosis
for the tumor class). We could compute the standardized perimeter and concavity values,
resulting in values of, say, 1 and 1. Could we use this information to classify
that observation as benign or malignant? Based on the scatter plot, how might
you classify that new observation? If the standardized concavity and perimeter values are 1 and 1, the point would lie in the middle of the orange cloud of malignant points, and thus we could probably classify it as malignant. Based on our visualization, it seems like the *prediction of an unobserved label* might be possible.

## Classification with $K$-nearest neighbors
@@ -270,7 +266,7 @@ diagnosis "Class" is unknown. This new observation is depicted by the red, diamo
Figure \@ref(fig:05-knn-1).

```{r 05-knn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond"}
ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) +
  geom_point(alpha = 0.6) +
@@ -295,7 +291,7 @@ then the perimeter and concavity values are similar, and so we may expect that
they would have the same diagnosis.

```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the malignant nearest neighbor."}
perim_concav_with_new_point +
  geom_segment(aes(
    x = new_point[1],
@@ -321,7 +317,7 @@ Does this seem like the right prediction to make for this observation? Probably
not, if you consider the other nearby points...

```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the benign nearest neighbor."}
ggplot(aes(x = Perimeter, y = Concavity, color = Class, shape = Class, size = Class)) +
@@ -407,7 +403,7 @@ You will see in the `mutate` step below, we compute the straight-line
distance using the formula above: we square the differences between the two observations' perimeter
and concavity coordinates, add the squared differences, and then take the square root.

```{r 05-multiknn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond"}
more dimensions. Suppose we want to calculate the distance between a new observation with a perimeter of 0, a concavity of 3.5, and a symmetry of 1, and
another observation with a perimeter, concavity, and symmetry of 0.417, 2.31, and 0.837, respectively. We have two observations with three predictor variables: perimeter, concavity, and symmetry. Previously, when we had two variables, we added up the squared difference between each of our (two) variables,
and then took the square root. Now we will do the same, except for our
three variables. We calculate the distance as follows:

$$\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2 + (1 - 0.837)^2} = 1.27.$$

In this case, the formula above is just the straight-line distance in this 3-dimensional space.
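This arithmetic is easy to check directly in R, using the two observations' coordinates from the text above:

```r
# Coordinates of the two observations (values from the text)
new_obs <- c(Perimeter = 0,     Concavity = 3.5,  Symmetry = 1)
other   <- c(Perimeter = 0.417, Concavity = 2.31, Symmetry = 0.837)

# Straight-line (Euclidean) distance in 3 dimensions
sqrt(sum((new_obs - other)^2))  # approximately 1.27
```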

Let's calculate the distances between our new observation and each of the observations in the training set to find the $K=5$ neighbors when we have these three predictors.

```{r}
new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
new_obs_Symmetry <- 1
@@ -528,7 +517,7 @@ cancer |>
    (Concavity - new_obs_Concavity)^2 +
    (Symmetry - new_obs_Symmetry)^2)) |>
  arrange(dist_from_new) |>
  slice(1:5) # take the first 5 rows
```
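The classification step itself is just a majority vote among these nearest neighbors. A minimal sketch of that vote, using hypothetical neighbor labels (4 malignant, 1 benign) rather than the actual rows returned above:

```r
# Hypothetical class labels of the 5 nearest neighbors
neighbor_classes <- c("M", "M", "B", "M", "M")

# Majority vote: predict the most frequent class among the neighbors
names(which.max(table(neighbor_classes)))  # "M"
```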

Based on $K=5$ nearest neighbors with these three predictors, we would classify the new observation as malignant, since 4 out of 5 of the nearest neighbors are from the malignant class.
Figure \@ref(fig:05-more) shows what the data look like when we visualize them