common in classification problems. In such cases, we typically refer to the label we are more
interested in identifying as the *positive* label, and the other as the
*negative* label. In the tumor example, we would refer to malignant
@@ -166,7 +167,7 @@ therefore, 100% accuracy). However, classifiers in practice will almost always
make some errors. So you should think about which kinds of error are most
important in your application, and use the confusion matrix to quantify and
report them. Two commonly used metrics that we can compute using the confusion
-matrix are the **precision** and **recall** of the classifier. These are often
+matrix are the **precision** and **recall** of the classifier.\index{precision}\index{recall} These are often
reported together with accuracy. *Precision* quantifies how many of the
positive predictions the classifier made were actually positive. Intuitively,
we would like a classifier to have a *high* precision: for a classifier with
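
For reference, these two metrics have standard formulas in terms of the confusion matrix counts, where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives (a restatement of the definitions above, not a quotation from the chapter):

$$\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \qquad \text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$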
@@ -582,7 +583,7 @@ We now know that the classifier was `r round(100*cancer_acc_1$.estimate, 0)`% ac
on the test data set, and had a precision of `r round(100*cancer_prec_1$.estimate, 0)`% and a recall of `r round(100*cancer_rec_1$.estimate, 0)`%.
That sounds pretty good! Wait, *is* it good? Or do we need something higher?

-In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}
+In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}\index{precision!assessment}\index{recall!assessment}
depends on the application; you must critically analyze your accuracy in the context of the problem
you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99%
of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!).
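
A minimal R sketch (not from the chapter; the simulated `labels` vector is hypothetical) makes this baseline concrete:

```r
# If 99% of cases are benign, the trivial "always guess benign" classifier
# is about 99% accurate while catching zero malignant tumors.
set.seed(1)
labels <- sample(c("benign", "malignant"), size = 10000,
                 replace = TRUE, prob = c(0.99, 0.01))
mean(labels == "benign")  # accuracy of always predicting benign, ~0.99
```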
@@ -845,7 +846,7 @@ The `collect_metrics`\index{tidymodels!collect\_metrics}\index{cross-validation!
of the classifier's validation accuracy across the folds. You will find results
related to the accuracy in the row with `accuracy` listed under the `.metric` column.
You should consider the mean (`mean`) to be the estimated accuracy, while the standard
-error (`std_err`) is a measure of how uncertain we are in the mean value. A detailed treatment of this
+error (`std_err`) is\index{standard error}\index{sem|see{standard error}} a measure of how uncertain we are in the mean value. A detailed treatment of this
is beyond the scope of this chapter; but roughly, if your estimated mean is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2)` and standard
error is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2)`, you can expect the *true* average accuracy of the
classifier to be somewhere roughly between `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) - round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% and `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) + round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% (although it may
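
As a quick arithmetic illustration of that interval (placeholder numbers, not the chapter's computed results):

```r
# Rough uncertainty interval: estimated mean accuracy +/- one standard error.
mean_acc <- 0.89  # placeholder estimated mean accuracy
se_acc   <- 0.02  # placeholder standard error
c(lower = mean_acc - se_acc, upper = mean_acc + se_acc)  # roughly 0.87 to 0.91
```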
@@ -859,7 +860,7 @@ knn_fit |>
collect_metrics()
```

-We can choose any number of folds, and typically the more we use the better our
+We can choose any number of folds,\index{cross-validation!folds} and typically the more we use the better our
accuracy estimate will be (lower standard error). However, we are limited
by computational power: the
more folds we choose, the more computation it takes, and hence the more time
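
In tidymodels, the number of folds is set through the `v` argument of `vfold_cv`. A hedged sketch, assuming a training set named `cancer_train` with a `Class` label as in this chapter:

```r
library(tidymodels)

# 5 folds: cheaper to compute, but a noisier accuracy estimate;
# 10 folds: more computation, typically a lower standard error.
cancer_vfold_5  <- vfold_cv(cancer_train, v = 5,  strata = Class)
cancer_vfold_10 <- vfold_cv(cancer_train, v = 10, strata = Class)
```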
@@ -1180,6 +1181,7 @@ knn_fit
Then to make predictions and assess the estimated accuracy of the best model on the test data, we use the
`predict` and `metrics` functions as we did earlier in the chapter. We can then pass those predictions to
the `precision`, `recall`, and `conf_mat` functions to assess the estimated precision and recall, and print a confusion matrix.
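
A sketch of that workflow under the chapter's naming conventions (`knn_fit`, `cancer_test`, `Class`); the book's exact code may differ:

```r
library(tidymodels)

# Predict on the test set and attach the true labels.
cancer_test_predictions <- predict(knn_fit, cancer_test) |>
  bind_cols(cancer_test)

# Estimated accuracy, precision, recall, and the confusion matrix.
cancer_test_predictions |> metrics(truth = Class, estimate = .pred_class)
cancer_test_predictions |> precision(truth = Class, estimate = .pred_class)
cancer_test_predictions |> recall(truth = Class, estimate = .pred_class)
cancer_test_predictions |> conf_mat(truth = Class, estimate = .pred_class)
```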
source/clustering.Rmd (+4 -16)
@@ -164,7 +164,7 @@ library(tidyverse)
set.seed(1)
```

-Now we can load and preview the `penguins` data.
+Now we can load and preview the `penguins` data.\index{read function!read\_csv}

```{r message = FALSE, warning = FALSE}
penguins <- read_csv("data/penguins.csv")
@@ -295,7 +295,7 @@ improves it by making adjustments to the assignment of data
to clusters until it cannot improve any further. But how do we measure
the "quality" of a clustering, and what does it mean to improve it?
In K-means clustering, we measure the quality of a cluster
-by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
+by its\index{within-cluster sum of squared distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
Computing this involves two steps.
First, we find the cluster centers by computing the mean of each variable
over data points in the cluster. For example, suppose we have a
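
A minimal R sketch of those two steps (hypothetical standardized variables `x` and `y` with assignments in a `cluster` column; not the chapter's own code):

```r
library(tidyverse)

# Step 1: cluster centers are the per-cluster means of each variable.
centers <- clustered_data |>
  group_by(cluster) |>
  summarize(x_center = mean(x), y_center = mean(y))

# Step 2: WSSD sums the squared distances from each point to its center.
wssd_per_cluster <- clustered_data |>
  left_join(centers, by = "cluster") |>
  mutate(dist_sq = (x - x_center)^2 + (y - y_center)^2) |>
  group_by(cluster) |>
  summarize(wssd = sum(dist_sq))
```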
@@ -639,7 +639,7 @@ in the fourth iteration; both the centers and labels will remain the same from t
### Random restarts

-Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
+Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart} can get "stuck" in a bad solution.
For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
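
One common remedy, sketched here with base R's `kmeans` (object names assumed; the chapter's own code may differ), is to run the algorithm from several random initializations and keep the best result:

```r
# nstart = 10 runs K-means from 10 random starts and keeps the run with
# the lowest total within-cluster sum of squared distances.
kmeans_fit <- kmeans(standardized_data, centers = 3, nstart = 10)
kmeans_fit$tot.withinss  # total WSSD of the best run
```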