classification1.Rmd: 37 additions & 37 deletions
@@ -13,7 +13,7 @@ focus on *classification*, i.e., using one or more
 variables to predict the value of a categorical variable of interest. This chapter
 will cover the basics of classification, how to preprocess data to make it
 suitable for use in a classifier, and how to use our observed data to make
-predictions. The next will focus on how to evaluate how accurate the
+predictions. The next chapter will focus on how to evaluate how accurate the
 predictions from our classifier are, as well as how to improve our classifier
 (where possible) to maximize its accuracy.
@@ -161,8 +161,8 @@ can verify the levels of the `Class` column by using the `levels` function.
 This function should return the name of each category in that column. Given
 that we only have two different values in our `Class` column (B for benign and M
 for malignant), we only expect to get two names back. Note that the `levels` function requires a *vector* argument;
-so we use the `pull` function to convert the `Class`
-column into a vector and pass that into the `levels` function to see the categories
+so we use the `pull` function to extract a single column (`Class`) and
+pass that into the `levels` function to see the categories
 in the `Class` column.
 
 ```{r 05-levels}
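The reworded lines above describe the `pull`-then-`levels` idiom; a minimal runnable sketch of that pattern, using a toy tibble rather than the book's cancer data (the `toy` data frame here is purely hypothetical):

```r
library(dplyr)

# toy stand-in for the cancer data; Class is a factor with two levels
toy <- tibble::tibble(Class = factor(c("B", "M", "B"), levels = c("B", "M")))

toy |>
  pull(Class) |>  # extract the Class column as a plain vector
  levels()        # names of the categories: "B" "M"
```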
@@ -176,7 +176,7 @@ cancer |>
 Before we start doing any modelling, let's explore our data set. Below we use
 the `group_by`, `summarize` and `n` functions to find the number and percentage
 of benign and malignant tumor observations in our data set. The `n` function within
-`summarize` counts the number of observations in each `Class` group.
+`summarize`, when paired with `group_by`, counts the number of observations in each `Class` group.
 Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations.
 ```{r 05-tally}
 num_obs <- nrow(cancer)
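The counting idiom this hunk rewords can be sketched on toy data (hypothetical counts, not the real 357/212 split):

```r
library(dplyr)

# toy stand-in: three benign and one malignant observation
toy <- tibble::tibble(Class = c("B", "B", "B", "M"))
num_obs <- nrow(toy)

toy |>
  group_by(Class) |>
  summarize(count = n(),                       # n() counts rows per Class group
            percentage = n() / num_obs * 100)  # share of all observations
```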
@@ -190,13 +190,13 @@ cancer |>
 
 Next, let's draw a scatter plot to visualize the relationship between the
 perimeter and concavity variables. Rather than use `ggplot's` default palette,
-we select our own colourblind-friendly colors—`"orange2"`
+we select our own colorblind-friendly colors—`"orange2"`
 for light orange and `"steelblue2"` for light blue—and
 pass them as the `values` argument to the `scale_color_manual` function.
 We also make the category labels ("B" and "M") more readable by
 changing them to "Benign" and "Malignant" using the `labels` argument.
 
-```{r 05-scatter, fig.height = 4, fig.width = 5, fig.cap= "Scatter plot of concavity versus perimeter coloured by diagnosis label"}
+```{r 05-scatter, fig.height = 4, fig.width = 5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label"}
 perim_concav <- cancer %>%
 ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
 geom_point(alpha = 0.6) +
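For reference, the `scale_color_manual` usage this hunk describes looks roughly like the sketch below. The data frame is a hypothetical stand-in, and the pairing of colors to levels (benign as blue, malignant as orange) is an assumption based on the surrounding prose, since the hunk does not show the full call:

```r
library(ggplot2)

# hypothetical stand-in for the cancer data
df <- data.frame(Perimeter = rnorm(20), Concavity = rnorm(20),
                 Class = factor(rep(c("B", "M"), 10)))

p <- ggplot(df, aes(x = Perimeter, y = Concavity, color = Class)) +
  geom_point(alpha = 0.6) +
  # custom colorblind-friendly palette and readable labels;
  # assumes level "B" maps to blue and "M" to orange
  scale_color_manual(labels = c("Benign", "Malignant"),
                     values = c("steelblue2", "orange2"))
p
```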
@@ -215,7 +215,7 @@ measured *except* the label (i.e., an image without the physician's diagnosis
 for the tumor class). We could compute the standardized perimeter and concavity values,
 resulting in values of, say, 1 and 1. Could we use this information to classify
 that observation as benign or malignant? Based on the scatter plot, how might
-you classify that new observation? If the standardized concavity and perimeter values are 1 and 1, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. Based on our visualization, it seems like the *prediction of an unobserved label* might be possible.
+you classify that new observation? If the standardized concavity and perimeter values are 1 and 1 respectively, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. Based on our visualization, it seems like the *prediction of an unobserved label* might be possible.
 
 ## Classification with $K$-nearest neighbors
 
@@ -261,7 +261,7 @@ $K$ for us. We will cover how to choose $K$ ourselves in the next chapter.
 
 To illustrate the concept of $K$-nearest neighbors classification, we
 will walk through an example. Suppose we have a
-new observation, with perimeter of `r new_point[1]` and concavity of `r new_point[2]`, whose
+new observation, with standardized perimeter of `r new_point[1]` and standardized concavity of `r new_point[2]`, whose
 diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in
 Figure \@ref(fig:05-knn-1).
 
@@ -291,7 +291,7 @@ then the perimeter and concavity values are similar, and so we may expect that
 they would have the same diagnosis.
 
 
-```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the malignant nearest neighbor."}
+```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
 perim_concav_with_new_point +
 geom_segment(aes(
 x = new_point[1],
@@ -317,7 +317,7 @@ Does this seem like the right prediction to make for this observation? Probably
 not, if you consider the other nearby points...
 
 
-```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the benign nearest neighbor."}
+```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
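The nearest-neighbor idea these hunks discuss can be sketched in a few lines of base R. All data below are toy values on a standardized scale, and `new_point` is a hypothetical stand-in for the book's `new_point` object:

```r
# toy training data on a standardized scale
train <- data.frame(Perimeter = c(0.2, 1.1, 0.9, -0.5),
                    Concavity = c(0.1, 1.2, 0.8, -0.4),
                    Class     = c("B", "M", "M", "B"))
new_point <- c(1, 1)  # the unlabeled observation to classify

# Euclidean distance from the new point to each training observation
dists <- sqrt((train$Perimeter - new_point[1])^2 +
              (train$Concavity - new_point[2])^2)

# with K = 1, predict the label of the single closest point
train$Class[which.min(dists)]
```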

-Figure \@ref(fig:05-upsample-2) shows what happens if we set the background colour of
+Figure \@ref(fig:05-upsample-2) shows what happens if we set the background color of
 each area of the plot to the predictions the $K$-nearest neighbor
 classifier would make. We can see that the decision is
-always "benign," corresponding to the blue colour.
+always "benign," corresponding to the blue color.
 
-```{r 05-upsample-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with background colour indicating the decision of the classifier and the points represent the labelled data"}
+```{r 05-upsample-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with background color indicating the decision of the classifier and points representing the labelled data"}

 Now suppose we train our $K$-nearest neighbor classifier with $K=7$ on this *balanced* data.
-Figure \@ref(fig:05-upsample-plot) shows what happens now when we set the background colour
+Figure \@ref(fig:05-upsample-plot) shows what happens now when we set the background color
 of each area of our scatter plot to the decision the $K$-nearest neighbor
 classifier would make. We can see that the decision is more reasonable; when the points are close
 to those labelled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.
 
-```{r 05-upsample-plot, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Upsampled data with background colour indicating the decision of the classifier"}
+```{r 05-upsample-plot, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Upsampled data with background color indicating the decision of the classifier"}

-```{r 05-workflow-plot-show, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of smoothness versus area where background colour indicates the decision of the classifier"}
+```{r 05-workflow-plot-show, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier"}