classification1.Rmd
+21 −22 (21 additions, 22 deletions)
@@ -48,16 +48,16 @@ predictions from our classifier are, as well as how to improve our classifier
 
 ## Chapter learning objectives
 
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:
 
-- Recognize situations where a classifier would be appropriate for making predictions
-- Describe what a training data set is and how it is used in classification
-- Interpret the output of a classifier
-- Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables
-- Explain the $K$-nearest neighbor classification algorithm
-- Perform $K$-nearest neighbor classification in R using `tidymodels`
-- Use a `recipe` to preprocess data to be centered, scaled, and balanced
-- Combine preprocessing and model training using a `workflow`
+- Recognize situations where a classifier would be appropriate for making predictions.
+- Describe what a training data set is and how it is used in classification.
+- Interpret the output of a classifier.
+- Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables.
+- Explain the $K$-nearest neighbor classification algorithm.
+- Perform $K$-nearest neighbor classification in R using `tidymodels`.
+- Use a `recipe` to preprocess data to be centered, scaled, and balanced.
+- Combine preprocessing and model training using a `workflow`.
 
 
 ## The classification problem
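The Euclidean-distance learning objective above can be sketched in a few lines of base R. This is a hypothetical illustration with made-up coordinates, not values from the chapter's data set:

```r
# Straight-line (Euclidean) distance between two observations that each
# have two predictor variables: sqrt((x1 - x2)^2 + (y1 - y2)^2).
point_a <- c(perimeter = 0, concavity = 0)  # made-up coordinates
point_b <- c(perimeter = 3, concavity = 4)

euclidean_dist <- sqrt(sum((point_a - point_b)^2))
euclidean_dist  # 5 for this 3-4-5 right triangle
```

The same formula extends to any number of predictors, since `sum` adds the squared differences across every coordinate.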
@@ -188,7 +188,7 @@ cancer <- cancer |>
 glimpse(cancer)
 ```
 
-Recall factors have what are called "levels", which you can think of as categories. We
+Recall that factors have what are called "levels", which you can think of as categories. We
 can verify the levels of the `Class` column by using the `levels` \index{levels}\index{factor!levels} function.
 This function should return the name of each category in that column. Given
 that we only have two different values in our `Class` column (B for benign and M
@@ -208,7 +208,7 @@ cancer |>
 Before we start doing any modeling, let's explore our data set. Below we use
 the `group_by`, `summarize` and `n` \index{group\_by}\index{summarize} functions to find the number and percentage
 of benign and malignant tumor observations in our data set. The `n` function within
-`summarize` when paired with `group_by` counts the number of observations in each `Class` group.
+`summarize`, when paired with `group_by`, counts the number of observations in each `Class` group.
 Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63\%) benign and 212 (37\%) malignant tumor observations.
 ```{r 05-tally}
 num_obs <- nrow(cancer)
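The counting pattern this hunk describes can be sketched with a toy data frame (hypothetical values; the chapter's real data set has 357 benign and 212 malignant observations):

```r
library(dplyr)

# Toy stand-in for the cancer data frame; "B" = benign, "M" = malignant.
cancer_toy <- tibble(Class = c("B", "B", "B", "M", "M"))

num_obs <- nrow(cancer_toy)
cancer_toy |>
  group_by(Class) |>
  summarize(
    count = n(),                      # n() counts rows in each Class group
    percentage = n() / num_obs * 100  # share of all observations
  )
```

With this toy input the result has one row per level of `Class`, with counts 3 and 2 and percentages 60 and 40.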
@@ -358,7 +358,7 @@ concavity of `r new_point[2]`. Looking at the scatter plot in Figure \@ref(fig:0
 classify this red, diamond observation? The nearest neighbor to this new point is a
 **benign** observation at (`r round(neighbors[1, c(attrs[1], attrs[2])], 1)`).
 Does this seem like the right prediction to make for this observation? Probably
-not, if you consider the other nearby points...
+not, if you consider the other nearby points.
 
 
 ```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
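The $K = 1$ idea in the hunk above (classify a new point by the label of its single nearest neighbor) can be sketched in base R; the coordinates and labels here are made up for illustration, not taken from the chapter's data:

```r
# Three labeled training points with two predictors (made-up values).
train <- data.frame(
  perimeter = c(0.2, 0.9, 1.1),
  concavity = c(0.3, 1.0, 1.2),
  label     = c("benign", "malignant", "malignant")
)

# A new, unlabeled observation.
new_point <- c(perimeter = 0.25, concavity = 0.35)

# Euclidean distance from the new point to every training point.
dists <- sqrt((train$perimeter - new_point["perimeter"])^2 +
              (train$concavity - new_point["concavity"])^2)

# The nearest neighbor's label is the prediction.
predicted <- train$label[which.min(dists)]
predicted  # "benign": the first training point is closest
```

As the hunk notes, a single neighbor can be misleading; using a larger $K$ and taking a majority vote among the $K$ nearest labels makes the prediction less sensitive to one unusual point.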
-```{r 05-scaling-plt-zoomed, fig.height = 4.5, fig.width = 9, echo = FALSE, fig.cap = "Closeup of three nearest neighbors for unstandardized data."}
+```{r 05-scaling-plt-zoomed, fig.height = 4.5, fig.width = 9, echo = FALSE, fig.cap = "Close-up of three nearest neighbors for unstandardized data."}
 library(ggforce)
 ggplot(unscaled_cancer, aes(x = Area,
 y = Smoothness,
@@ -1284,8 +1284,7 @@ upsampled_plot
 
 ## Putting it together in a `workflow` {#puttingittogetherworkflow}
 
-The `tidymodels` package collection also provides the `workflow`, a way to chain \index{tidymodels!workflow} \index{workflow|see{tidymodels}}
-together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
+The `tidymodels` package collection also provides the `workflow`, a way to chain \index{tidymodels!workflow} \index{workflow|see{tidymodels}} together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
 To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data.
 First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:
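As a sketch of the workflow idea this hunk describes, a recipe, a $K$-nearest neighbor model specification, and a `workflow` chaining them together might look like the following. This is an illustrative sketch, not the chapter's exact code: it assumes the `tidymodels` and `kknn` packages, `cancer_train` is a hypothetical stand-in for the chapter's training data, and the particular steps and parameter values are assumptions:

```r
library(tidymodels)

# Hypothetical training data standing in for the chapter's cancer data.
cancer_train <- tibble(
  Class = factor(c("B", "B", "M", "M")),
  Perimeter = c(0.1, 0.3, 0.9, 1.1),
  Concavity = c(0.2, 0.4, 1.0, 1.2)
)

# Recipe: center and scale all predictors.
knn_recipe <- recipe(Class ~ ., data = cancer_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# K-nearest neighbor model specification (3 neighbors is an assumption).
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

# Chain preprocessing and model training together in one workflow object.
knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)
```

The benefit of the `workflow` is that a single `fit()` call then runs the preprocessing and model training in order, with no intermediate objects to manage by hand.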