
Commit e18b208

Merge pull request #398 from UBC-DSCI/class2-edits
Copyediting for classification 2
2 parents c22ffd2 + 292881f commit e18b208

File tree

1 file changed: +45 -45 lines changed


classification2.Rmd

Lines changed: 45 additions & 45 deletions
@@ -34,15 +34,15 @@ a classifier, as well as how to improve the classifier (where possible)
 to maximize its accuracy.
 
 ## Chapter learning objectives
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:
 
-- Describe what training, validation, and test data sets are and how they are used in classification
-- Split data into training, validation, and test data sets
-- Describe what a random seed is and its importance in reproducible data analysis
-- Set the random seed in R using the `set.seed` function
-- Evaluate classification accuracy in R using a validation data set and appropriate metrics
-- Execute cross-validation in R to choose the number of neighbors in a $K$-nearest neighbors classifier
-- Describe advantages and disadvantages of the $K$-nearest neighbors classification algorithm
+- Describe what training, validation, and test data sets are and how they are used in classification.
+- Split data into training, validation, and test data sets.
+- Describe what a random seed is and its importance in reproducible data analysis.
+- Set the random seed in R using the `set.seed` function.
+- Evaluate classification accuracy in R using a validation data set and appropriate metrics.
+- Execute cross-validation in R to choose the number of neighbors in a $K$-nearest neighbors classifier.
+- Describe the advantages and disadvantages of the $K$-nearest neighbors classification algorithm.
 
 ## Evaluating accuracy
 
@@ -61,7 +61,7 @@ tumor images?
 
 The trick is to split the data into a **training set** \index{training set} and **test set** \index{test set} (Figure \@ref(fig:06-training-test))
 and use only the **training set** when building the classifier.
-Then to evaluate the accuracy of the classifier, we first set aside the true labels from the **test set**,
+Then, to evaluate the accuracy of the classifier, we first set aside the true labels from the **test set**,
 and then use the classifier to predict the labels in the **test set**. If our predictions match the true
 labels for the observations in the **test set**, then we have some
 confidence that our classifier might also accurately predict the class
@@ -80,7 +80,7 @@ knitr::include_graphics("img/training_test.jpeg")
 How exactly can we assess how well our predictions match the true labels for
 the observations in the test set? One way we can do this is to calculate the
 **prediction accuracy**. \index{prediction accuracy|see{accuracy}}\index{accuracy} This is the fraction of examples for which the
-classifier made the correct prediction. To calculate this we divide the number
+classifier made the correct prediction. To calculate this, we divide the number
 of correct predictions by the number of predictions made.
 
 $$\mathrm{prediction \; accuracy} = \frac{\mathrm{number \; of \; correct \; predictions}}{\mathrm{total \; number \; of \; predictions}}$$
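
The formula in this hunk is simple enough to compute by hand. A minimal R sketch, using hypothetical `predicted` and `true` label vectors rather than the chapter's data:

```r
# Hypothetical predicted and true labels for six test observations
predicted <- c("B", "M", "B", "B", "M", "B")
true      <- c("B", "M", "M", "B", "M", "B")

# prediction accuracy = number of correct predictions / total number of predictions
accuracy <- sum(predicted == true) / length(true)
accuracy  # 5 correct out of 6 predictions, roughly 0.83
```
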
@@ -236,8 +236,8 @@ perim_concav
 
 Once we have decided on a predictive question to answer and done some
 preliminary exploration, the very next thing to do is to split the data into
-the training and test sets. Typically, the training set is between 50 - 95% of
-the data, while the test set is the remaining 5 - 50%; the intuition is that
+the training and test sets. Typically, the training set is between 50% and 95% of
+the data, while the test set is the remaining 5% to 50%; the intuition is that
 you want to trade off between training an accurate model (by using a larger
 training data set) and getting an accurate evaluation of its performance (by
 using a larger test data set). Here, we will use 75% of the data for training,
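
For reference, the 75%/25% split described in this hunk is typically done in `tidymodels` with `initial_split`. The sketch below assumes the full data frame is called `cancer` and the label column `Class`, as in the chapter, but it is an illustration rather than the file's exact code:

```r
library(tidymodels)
set.seed(1)  # the split is random, so fix the seed for reproducibility

# Stratified 75%/25% train/test split on the class label
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
cancer_train <- training(cancer_split)
cancer_test  <- testing(cancer_split)
```

Setting `strata` keeps the benign/malignant proportions roughly the same in both halves, which is exactly what the proportion check later in the diff verifies.
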
@@ -297,7 +297,7 @@ train_prop <- cancer_train |>
 We can use `group_by` and `summarize` to \index{group\_by}\index{summarize} find the percentage of malignant and benign classes
 in `cancer_train` and we see about `r round(filter(train_prop, Class == "B")$proportion, 2)*100`% of the training
 data are benign and `r round(filter(train_prop, Class == "M")$proportion, 2)*100`%
-are malignant indicating that our class proportions were roughly preserved when we split the data.
+are malignant, indicating that our class proportions were roughly preserved when we split the data.
 
 ```{r 06-train-proportion}
 cancer_proportions <- cancer_train |>
@@ -378,7 +378,7 @@ cancer_test_predictions
 
 ### Compute the accuracy
 
-Finally we can assess our classifier's accuracy. To do this we use the `metrics` function \index{tidymodels!metrics}
+Finally, we can assess our classifier's accuracy. To do this we use the `metrics` function \index{tidymodels!metrics}
 from `tidymodels` to get the statistics about the quality of our model, specifying
 the `truth` and `estimate` arguments:
 
@@ -394,7 +394,7 @@ cancer_acc_1 <- cancer_test_predictions |>
 filter(.metric == 'accuracy')
 ```
 
-In the metrics data frame we filtered the `.metric` column since we are
+In the metrics data frame, we filtered the `.metric` column since we are
 interested in the `accuracy` row. Other entries involve more advanced metrics that
 are beyond the scope of this book. Looking at the value of the `.estimate` variable
 shows that the estimated accuracy of the classifier on the test data
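
For context on the `metrics` call this hunk edits around, here is a hedged sketch of how test-set accuracy and a confusion matrix are usually obtained with `tidymodels`; the column names (`Class`, `.pred_class`) follow the chapter, but the snippet is illustrative only.

```r
# Overall metrics on the test-set predictions; keep only the accuracy row
cancer_test_predictions |>
  metrics(truth = Class, estimate = .pred_class) |>
  filter(.metric == "accuracy")

# Confusion matrix: counts of correct and incorrect predictions per class
cancer_test_predictions |>
  conf_mat(truth = Class, estimate = .pred_class)
```
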
@@ -428,7 +428,7 @@ and `r confu12` observations as malignant when they were truly benign.
 ### Critically analyze performance
 
 We now know that the classifier was `r round(100*cancer_acc_1$.estimate,0)`% accurate
-on the test data set. That sounds pretty good!... Wait, *is* it good?
+on the test data set. That sounds pretty good! Wait, *is* it good?
 Or do we need something higher?
 
 In general, what a *good* value for accuracy \index{accuracy!assessment} is depends on the application.
@@ -483,7 +483,7 @@ the $K$-nearest neighbors classifier improved quite a bit on the basic
 majority classifier. Hooray! But we still need to be cautious; in
 this application, it is likely very important not to misdiagnose any malignant tumors to avoid missing
 patients who actually need medical care. The confusion matrix above shows
-that the classifier does indeed misdiagnose a significant number of malignant tumors as benign (`r confu21`
+that the classifier does, indeed, misdiagnose a significant number of malignant tumors as benign (`r confu21`
 out of `r confu11+confu21` malignant tumors, or `r round(100*(confu21)/(confu11+confu21))`%!).
 Therefore, even though the accuracy improved upon the majority classifier,
 our critical analysis suggests that this classifier may not have appropriate performance
@@ -628,17 +628,17 @@ classifier's accuracy; this has the effect of reducing the influence of any one
 
 In practice, we don't use random splits, but rather use a more structured
 splitting procedure so that each observation in the data set is used in a
-validation set only a single time. The name for this strategy is called
+validation set only a single time. The name for this strategy is
 **cross-validation**. In **cross-validation**, \index{cross-validation} we split our **overall training
-data** into $C$ evenly-sized chunks. Then, iteratively use $1$ chunk as the
+data** into $C$ evenly sized chunks. Then, iteratively use $1$ chunk as the
 **validation set** and combine the remaining $C-1$ chunks
 as the **training set**.
 This procedure is shown in Figure \@ref(fig:06-cv-image).
 Here, $C=5$ different chunks of the data set are used,
 resulting in 5 different choices for the **validation set**; we call this
 *5-fold* cross-validation.
 
-```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross validation.", fig.retina = 2, out.width = "100%"}
+```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.retina = 2, out.width = "100%"}
 knitr::include_graphics("img/cv.png")
 ```
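
To ground the cross-validation description above, here is a small sketch of how 5-fold cross-validation is usually set up and scored in `tidymodels`. The `knn_recipe` and `knn_spec` objects are assumed to already exist (a preprocessing recipe and a K-NN model specification with a fixed number of neighbors); the code is illustrative, not the chapter's exact implementation.

```r
set.seed(1)  # fold assignment is random, so fix the seed

# Split the overall training data into C = 5 folds, stratified by class
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)

# Fit and evaluate the workflow once per fold
knn_cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit_resamples(resamples = cancer_vfold)

# Mean accuracy and its standard error across the 5 folds
collect_metrics(knn_cv_results)
```
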

@@ -703,9 +703,9 @@ by computational power: the
 more folds we choose, the more computation it takes, and hence the more time
 it takes to run the analysis. So when you do cross-validation, you need to
 consider the size of the data, and the speed of the algorithm (e.g., $K$-nearest
-neighbor) and the speed of your computer. In practice, this is a trial and
-error process, but typically $C$ is chosen to be either 5 or 10. Here we show
-how the standard error decreases when we use 10-fold cross validation rather
+neighbor) and the speed of your computer. In practice, this is a
+trial-and-error process, but typically $C$ is chosen to be either 5 or 10. Here we show
+how the standard error decreases when we use 10-fold cross-validation rather
 than 5-fold:
 
 ```{r 06-10-fold}
@@ -800,9 +800,9 @@ that doesn't mean the classifier is actually more accurate with this parameter
 value! Generally, when selecting $K$ (and other parameters for other predictive
 models), we are looking for a value where:
 
-- we get roughly optimal accuracy, so that our model will likely be accurate
-- changing the value to a nearby one (e.g., adding or subtracting a small number) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty
-- the cost of training the model is not prohibitive (e.g., in our situation, if $K$ is too large, predicting becomes expensive!)
+- we get roughly optimal accuracy, so that our model will likely be accurate;
+- changing the value to a nearby one (e.g., adding or subtracting a small number) doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty;
+- the cost of training the model is not prohibitive (e.g., in our situation, if $K$ is too large, predicting becomes expensive!).
 
 We know that $K =$ `r (accuracies |> arrange(desc(mean)) |> head(1))$neighbors`
 provides the highest estimated accuracy. Further, Figure \@ref(fig:06-find-k) shows that the estimated accuracy
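
The inline R code in this hunk ranks an `accuracies` data frame by mean accuracy; one plausible way such a table is built from a `tune_grid` result is sketched below (the object name `knn_results` is an assumption for illustration).

```r
# One row per candidate K, with accuracy averaged over the folds
accuracies <- knn_results |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# The K value with the highest estimated accuracy
accuracies |>
  arrange(desc(mean)) |>
  head(1)
```
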
@@ -949,7 +949,7 @@ The overall workflow for performing $K$-nearest neighbors classification using `
 \index{tidymodels}\index{recipe}\index{cross-validation}\index{K-nearest neighbors!classification}\index{classification}
 
 1. Use the `initial_split` function to split the data into a training and test set. Set the `strata` argument to the class label variable. Put the test set aside for now.
-2. Use the `vfold_cv` function to split up the training data for cross validation.
+2. Use the `vfold_cv` function to split up the training data for cross-validation.
 3. Create a `recipe` that specifies the class label and predictors, as well as preprocessing steps for all variables. Pass the training data as the `data` argument of the recipe.
 4. Create a `nearest_neighbors` model specification, with `neighbors = tune()`.
 5. Add the recipe and model specification to a `workflow()`, and use the `tune_grid` function on the train/validation splits to estimate the classifier accuracy for a range of $K$ values.
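
The workflow summarized in this hunk maps onto `tidymodels` code roughly as follows. This is a hedged outline using the chapter's naming conventions (`cancer`, `Class`, `Perimeter`, `Concavity`), not the file's exact code; note also that the `parsnip` model function is spelled `nearest_neighbor()`.

```r
library(tidymodels)
set.seed(1)

# 1. Train/test split, stratified by the class label; set the test set aside
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
cancer_train <- training(cancer_split)
cancer_test  <- testing(cancer_split)

# 2. Folds for cross-validation on the training data
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)

# 3. Recipe: class label, predictors, and standardization steps
knn_recipe <- recipe(Class ~ Perimeter + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# 4. K-NN model specification with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5. Workflow + tune_grid over a range of K values
k_vals <- tibble(neighbors = seq(1, 15, by = 2))
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_vals)

# Finally, pick the best K, refit on the training set, and evaluate on the test set
best_k <- knn_results |>
  collect_metrics() |>
  filter(.metric == "accuracy") |>
  arrange(desc(mean)) |>
  slice(1) |>
  pull(neighbors)

best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(best_spec) |>
  fit(data = cancer_train)

predict(final_fit, cancer_test) |>
  bind_cols(cancer_test) |>
  metrics(truth = Class, estimate = .pred_class)
```
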
@@ -964,15 +964,15 @@ the $K$-NN here.
 
 **Strengths:** $K$-nearest neighbors classification
 
-1. is a simple, intuitive algorithm
-2. requires few assumptions about what the data must look like
-3. works for binary (two-class) and multi-class (more than 2 classes) classification problems
+1. is a simple, intuitive algorithm,
+2. requires few assumptions about what the data must look like, and
+3. works for binary (two-class) and multi-class (more than 2 classes) classification problems.
 
 **Weaknesses:** $K$-nearest neighbors classification
 
-1. becomes very slow as the training data gets larger
-2. may not perform well with a large number of predictors
-3. may not perform well when classes are imbalanced
+1. becomes very slow as the training data gets larger,
+2. may not perform well with a large number of predictors, and
+3. may not perform well when classes are imbalanced.
 
 ## Predictor variable selection
 
@@ -1168,9 +1168,9 @@ This procedure is indeed a well-known variable selection method referred to
 as *best subset selection*. \index{variable selection!best subset}\index{predictor selection|see{variable selection}}
 In particular, you
 
-1. create a separate model for every possible subset of predictors
-2. tune each one using cross validation
-3. pick the subset of predictors that gives you the highest cross-validation accuracy
+1. create a separate model for every possible subset of predictors,
+2. tune each one using cross-validation, and
+3. pick the subset of predictors that gives you the highest cross-validation accuracy.
 
 Best subset selection is applicable to any classification method ($K$-NN or otherwise).
 However, it becomes very slow when you have even a moderate
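
The "very slow" warning in this hunk follows from the combinatorics of subsets: with $m$ predictors there are $2^m - 1$ non-empty subsets, each needing its own tuned model. A quick, generic R illustration (the numbers are not taken from the chapter):

```r
# Number of candidate models in best subset selection for m predictors
m <- c(2, 5, 10, 20)
data.frame(predictors = m, candidate_models = 2^m - 1)
#>   predictors candidate_models
#> 1          2                3
#> 2          5               31
#> 3         10             1023
#> 4         20          1048575
```
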
@@ -1190,12 +1190,12 @@ Another idea is to iteratively build up a model by adding one predictor variable
 at a time. This method&mdash;known as *forward selection*&mdash;is also widely \index{variable selection!forward}
 applicable and fairly straightforward. It involves the following steps:
 
-1. start with a model having no predictors
-2. run the following 3 steps until you run out of predictors:
-    1. for each unused predictor, add it to the model to form a *candidate model*
-    2. tune all of the candidate models
-    3. update the model to be the candidate model with the highest cross-validation accuracy
-3. select the model that provides the best trade-off between accuracy and simplicity
+1. Start with a model having no predictors.
+2. Run the following 3 steps until you run out of predictors:
+    1. For each unused predictor, add it to the model to form a *candidate model*.
+    2. Tune all of the candidate models.
+    3. Update the model to be the candidate model with the highest cross-validation accuracy.
+3. Select the model that provides the best trade-off between accuracy and simplicity.
 
 Say you have $m$ total predictors to work with. In the first iteration, you have to make
 $m$ candidate models, each with 1 predictor. Then in the second iteration, you have
@@ -1266,7 +1266,7 @@ Finally, we need to write some code that performs the task of sequentially
 finding the best predictor to add to the model.
 If you recall the end of the wrangling chapter, we mentioned
 that sometimes one needs more flexible forms of iteration than what
-we have used earlier, and in these cases one typically resorts to
+we have used earlier, and in these cases, one typically resorts to
 [a for loop](https://r4ds.had.co.nz/iteration.html#iteration).
 This is one of those cases! Here we will use two for loops:
 one over increasing predictor set sizes
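
As a rough companion to the two-for-loop approach this hunk introduces, here is a skeleton of forward selection for a $K$-NN classifier in `tidymodels`. It is a simplified sketch: the helper `estimate_cv_accuracy`, the trimmed predictor list, and the assumption that `knn_spec` (a fixed-K model spec), `cancer_vfold`, and `cancer_train` already exist are all illustrative choices, and the chapter's own implementation differs in detail.

```r
# Hypothetical helper: cross-validation accuracy for a given model formula
estimate_cv_accuracy <- function(model_formula, train_data) {
  rec <- recipe(model_formula, data = train_data) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
  workflow() |>
    add_recipe(rec) |>
    add_model(knn_spec) |>
    fit_resamples(resamples = cancer_vfold) |>
    collect_metrics() |>
    filter(.metric == "accuracy") |>
    pull(mean)
}

names_remaining <- c("Perimeter", "Concavity", "Smoothness")  # unused predictors
selected <- c()           # predictors chosen so far
accuracy_per_size <- c()  # best CV accuracy at each model size

# Outer loop: grow the predictor set by one variable per iteration
for (size in seq_along(names_remaining)) {
  candidate_accuracies <- c()
  # Inner loop: try adding each remaining predictor to the current model
  for (pred in names_remaining) {
    form <- as.formula(paste("Class ~", paste(c(selected, pred), collapse = " + ")))
    candidate_accuracies[pred] <- estimate_cv_accuracy(form, cancer_train)
  }
  # Keep the predictor whose candidate model had the highest CV accuracy
  best <- names(which.max(candidate_accuracies))
  selected <- c(selected, best)
  names_remaining <- setdiff(names_remaining, best)
  accuracy_per_size[size] <- max(candidate_accuracies)
}
```

After the loops finish, plotting `accuracy_per_size` against model size is what lets you look for the "elbow" discussed later in the diff.
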
@@ -1358,7 +1358,7 @@ in Figure \@ref(fig:06-fwdsel-3), i.e., the place on the plot where the accuracy
 levels off or begins to decrease. The elbow in Figure \@ref(fig:06-fwdsel-3) appears to occur at the model with
 3 predictors; after that point the accuracy levels off. So here the right trade-off of accuracy and number of predictors
 occurs with 3 variables: `Class ~ Perimeter + Concavity + Smoothness`. In other words, we have successfully removed irrelevant
-predictors from the model! It is always worth remembering, however, that what cross validation gives you
+predictors from the model! It is always worth remembering, however, that what cross-validation gives you
 is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide
 where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.
 
@@ -1388,4 +1388,4 @@ found in Chapter \@ref(move-to-your-own-machine).
 
 ## Additional resources
 - The [`tidymodels` website](https://tidymodels.org/packages) is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a [nice beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list of more advanced examples](https://www.tidymodels.org/learn/) that you can use to continue learning beyond the scope of this book. It's worth noting that the `tidymodels` package does a lot more than just classification, and so the examples on the website similarly go beyond classification as well. In the next two chapters, you'll learn about another kind of predictive modeling setting, so it might be worth visiting the website only after reading through those chapters.
-- [An Introduction to Statistical Learning](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail about cross-validation. Chapters 8 and 9 cover decision trees and support vector machines, two very popular but more advanced classification methods. Finally, Chapter 6 covers a number of methods for selecting predictor variables. Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we require.
+- [*An Introduction to Statistical Learning*](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail about cross-validation. Chapters 8 and 9 cover decision trees and support vector machines, two very popular but more advanced classification methods. Finally, Chapter 6 covers a number of methods for selecting predictor variables. Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we require.
