
Commit d73ade7

Merge pull request #564 from UBC-DSCI/index-update
Index update
2 parents 8ee0821 + 4e49b01 commit d73ade7

13 files changed: +72 -122 lines changed

source/classification1.Rmd

Lines changed: 4 additions & 3 deletions
@@ -1295,7 +1295,7 @@ upsampled_plot
 
 ### Missing data
 
-One of the most common issues in real data sets in the wild is *missing data*,
+One of the most common issues in real data sets in the wild is *missing data*,\index{missing data}
 i.e., observations where the values of some of the variables were not recorded.
 Unfortunately, as common as it is, handling missing data properly is very
 challenging and generally relies on expert knowledge about the data, setting,
@@ -1329,7 +1329,7 @@ data. So how can we perform K-nearest neighbors classification in the presence
 of missing data? Well, since there are not too many observations with missing
 entries, one option is to simply remove those observations prior to building
 the K-nearest neighbors classifier. We can accomplish this by using the
-`drop_na` function from `tidyverse` prior to working with the data.
+`drop_na` function from `tidyverse` prior to working with the data.\index{missing data!drop\_na}
 
 ```{r 05-naomit}
 no_missing_cancer <- missing_cancer |> drop_na()
@@ -1342,7 +1342,8 @@ possible approach is to *impute* the missing entries, i.e., fill in synthetic
 values based on the other observations in the data set. One reasonable choice
 is to perform *mean imputation*, where missing entries are filled in using the
 mean of the present entries in each variable. To perform mean imputation, we
-add the `step_impute_mean` step to the `tidymodels` preprocessing recipe.
+add the `step_impute_mean` \index{recipe!step\_impute\_mean}\index{missing data!mean imputation}
+step to the `tidymodels` preprocessing recipe.
 ```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE}
 impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
   step_impute_mean(all_predictors()) |>
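
As context for the passage this hunk indexes, a minimal sketch of the two strategies it covers, row removal with `drop_na` and mean imputation with `step_impute_mean`, assuming a `missing_cancer` data frame with a `Class` column and predictors containing `NA`s as in the chapter:

```r
library(tidyverse)
library(tidymodels)

# Option 1: drop every row that contains a missing value
no_missing_cancer <- missing_cancer |> drop_na()

# Option 2: fill in missing predictor values with the column means via a recipe
impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
  step_impute_mean(all_predictors())

# prep() estimates the per-variable means; bake() applies the imputation
imputed_cancer <- impute_missing_recipe |>
  prep() |>
  bake(new_data = NULL)
```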

source/classification2.Rmd

Lines changed: 9 additions & 23 deletions
@@ -117,7 +117,7 @@ a single number. But prediction accuracy by itself does not tell the whole
 story. In particular, accuracy alone only tells us how often the classifier
 makes mistakes in general, but does not tell us anything about the *kinds* of
 mistakes the classifier makes. A more comprehensive view of performance can be
-obtained by additionally examining the **confusion matrix**. The confusion
+obtained by additionally examining the **confusion matrix**. The confusion\index{confusion matrix}
 matrix shows how many test set labels of each type are predicted correctly and
 incorrectly, which gives us more detail about the kinds of mistakes the
 classifier tends to make. Table \@ref(tab:confusion-matrix) shows an example
@@ -148,7 +148,8 @@ disastrous error, since it may lead to a patient who requires treatment not rece
 Since we are particularly interested in identifying malignant cases, this
 classifier would likely be unacceptable even with an accuracy of 89%.
 
-Focusing more on one label than the other is
+Focusing more on one label than the other
+is\index{positive label}\index{negative label}\index{true positive}\index{false positive}\index{true negative}\index{false negative}
 common in classification problems. In such cases, we typically refer to the label we are more
 interested in identifying as the *positive* label, and the other as the
 *negative* label. In the tumor example, we would refer to malignant
@@ -166,7 +167,7 @@ therefore, 100% accuracy). However, classifiers in practice will almost always
 make some errors. So you should think about which kinds of error are most
 important in your application, and use the confusion matrix to quantify and
 report them. Two commonly used metrics that we can compute using the confusion
-matrix are the **precision** and **recall** of the classifier. These are often
+matrix are the **precision** and **recall** of the classifier.\index{precision}\index{recall} These are often
 reported together with accuracy. *Precision* quantifies how many of the
 positive predictions the classifier made were actually positive. Intuitively,
 we would like a classifier to have a *high* precision: for a classifier with
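
As context for the newly indexed precision and recall passage, a minimal sketch of how these metrics are computed with `yardstick`, assuming a `cancer_test_predictions` data frame holding the true `Class` labels and a `.pred_class` column, with the malignant class coded as the first factor level (the object name and label coding are assumptions):

```r
library(tidymodels)

# precision: of the observations predicted positive, how many were truly positive
cancer_test_predictions |>
  precision(truth = Class, estimate = .pred_class, event_level = "first")

# recall: of the truly positive observations, how many were predicted positive
cancer_test_predictions |>
  recall(truth = Class, estimate = .pred_class, event_level = "first")

# the full confusion matrix behind both metrics
cancer_test_predictions |>
  conf_mat(truth = Class, estimate = .pred_class)
```
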
@@ -582,7 +583,7 @@ We now know that the classifier was `r round(100*cancer_acc_1$.estimate, 0)`% ac
 on the test data set, and had a precision of `r round(100*cancer_prec_1$.estimate, 0)`% and a recall of `r round(100*cancer_rec_1$.estimate, 0)`%.
 That sounds pretty good! Wait, *is* it good? Or do we need something higher?
 
-In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}
+In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}\index{precision!assessment}\index{recall!assessment}
 depends on the application; you must critically analyze your accuracy in the context of the problem
 you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99%
 of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!).
@@ -845,7 +846,7 @@ The `collect_metrics`\index{tidymodels!collect\_metrics}\index{cross-validation!
 of the classifier's validation accuracy across the folds. You will find results
 related to the accuracy in the row with `accuracy` listed under the `.metric` column.
 You should consider the mean (`mean`) to be the estimated accuracy, while the standard
-error (`std_err`) is a measure of how uncertain we are in the mean value. A detailed treatment of this
+error (`std_err`) is\index{standard error}\index{sem|see{standard error}} a measure of how uncertain we are in the mean value. A detailed treatment of this
 is beyond the scope of this chapter; but roughly, if your estimated mean is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2)` and standard
 error is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2)`, you can expect the *true* average accuracy of the
 classifier to be somewhere roughly between `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) - round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% and `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) + round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% (although it may
@@ -859,7 +860,7 @@ knn_fit |>
 collect_metrics()
 ```
 
-We can choose any number of folds, and typically the more we use the better our
+We can choose any number of folds,\index{cross-validation!folds} and typically the more we use the better our
 accuracy estimate will be (lower standard error). However, we are limited
 by computational power: the
 more folds we choose, the more computation it takes, and hence the more time
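
A rough sketch of the cross-validation estimate that the `std_err` and folds hunks refer to, assuming a `cancer_train` data frame with a `Class` column and a K-NN workflow `knn_wkflw` built as elsewhere in the chapter (both object names are assumptions):

```r
library(tidymodels)

# split the training data into 5 folds, stratified on the class label
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)

# fit and evaluate the workflow on each fold
knn_fit <- knn_wkflw |>
  fit_resamples(resamples = cancer_vfold)

# `mean` is the estimated accuracy; `std_err` quantifies uncertainty in that mean
knn_fit |>
  collect_metrics()
```
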
@@ -1180,6 +1181,7 @@ knn_fit
 Then to make predictions and assess the estimated accuracy of the best model on the test data, we use the
 `predict` and `metrics` functions as we did earlier in the chapter. We can then pass those predictions to
 the `precision`, `recall`, and `conf_mat` functions to assess the estimated precision and recall, and print a confusion matrix.
+\index{predict}\index{precision}\index{recall}\index{conf\_mat}
 
 ```{r 06-predictions-after-tuning, message = FALSE, warning = FALSE}
 cancer_test_predictions <- predict(knn_fit, cancer_test) |>
@@ -1393,24 +1395,8 @@ accs <- accs |> unlist()
 nghbrs <- nghbrs |> unlist()
 fixedaccs <- fixedaccs |> unlist()
 
-## get accuracy if we always just guess the most frequent label
-#base_acc <- cancer_irrelevant |>
-# group_by(Class) |>
-# summarize(n = n()) |>
-# mutate(frac = n/sum(n)) |>
-# summarize(mx = max(frac)) |>
-# select(mx)
-#base_acc <- base_acc$mx |> unlist()
-
 # plot
 res <- tibble(ks = ks, accs = accs, fixedaccs = fixedaccs, nghbrs = nghbrs)
-#res <- res |> mutate(base_acc = base_acc)
-#plt_irrelevant_accuracies <- res |>
-# ggplot() +
-# geom_line(mapping = aes(x=ks, y=accs, linetype="Tuned K-NN")) +
-# geom_hline(data=res, mapping=aes(yintercept=base_acc, linetype="Always Predict Benign")) +
-# labs(x = "Number of Irrelevant Predictors", y = "Model Accuracy Estimate") +
-# scale_linetype_manual(name="Method", values = c("dashed", "solid"))
 
 plt_irrelevant_accuracies <- ggplot(res) +
   geom_line(mapping = aes(x=ks, y=accs)) +
@@ -1533,7 +1519,7 @@ Therefore we will continue the rest of this section using forward selection.
 
 ### Forward selection in R
 
-We now turn to implementing forward selection in R.
+We now turn to implementing forward selection in R.\index{variable selection!implementation}
 Unfortunately there is no built-in way to do this using the `tidymodels` framework,
 so we will have to code it ourselves. First we will use the `select` function to extract a smaller set of predictors
 to work with in this illustrative example&mdash;`Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`&mdash;as
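
Since the indexed passage notes that `tidymodels` has no built-in forward selection, here is a rough skeleton of the greedy loop the chapter goes on to build, assuming a `cancer_subset` data frame containing `Class` and the six predictors named above (all object and helper names are assumptions for illustration, not the book's exact code):

```r
library(tidyverse)
library(tidymodels)

# helper: estimate cross-validation accuracy of a K-NN model for one formula
cv_accuracy <- function(model_formula, data) {
  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
    set_engine("kknn") |>
    set_mode("classification")
  preprocess_recipe <- recipe(model_formula, data = data) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
  workflow() |>
    add_recipe(preprocess_recipe) |>
    add_model(knn_spec) |>
    fit_resamples(resamples = vfold_cv(data, v = 5)) |>
    collect_metrics() |>
    filter(.metric == "accuracy") |>
    pull(mean)
}

# greedy forward selection: at each step, add the predictor that most improves
# the estimated accuracy of the model built so far
candidates <- c("Smoothness", "Concavity", "Perimeter",
                "Irrelevant1", "Irrelevant2", "Irrelevant3")
selected <- c()
accuracies <- tibble(n_predictors = integer(), predictors = character(), accuracy = numeric())

for (i in seq_along(candidates)) {
  remaining <- setdiff(candidates, selected)
  accs <- map_dbl(remaining, function(pred) {
    form <- as.formula(paste("Class ~", paste(c(selected, pred), collapse = " + ")))
    cv_accuracy(form, cancer_subset)
  })
  selected <- c(selected, remaining[which.max(accs)])
  accuracies <- add_row(accuracies, n_predictors = i,
                        predictors = paste(selected, collapse = ", "),
                        accuracy = max(accs))
}
accuracies
```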

source/clustering.Rmd

Lines changed: 4 additions & 16 deletions
@@ -164,7 +164,7 @@ library(tidyverse)
 set.seed(1)
 ```
 
-Now we can load and preview the `penguins` data.
+Now we can load and preview the `penguins` data.\index{read function!read\_csv}
 
 ```{r message = FALSE, warning = FALSE}
 penguins <- read_csv("data/penguins.csv")
@@ -295,7 +295,7 @@ improves it by making adjustments to the assignment of data
 to clusters until it cannot improve any further. But how do we measure
 the "quality" of a clustering, and what does it mean to improve it?
 In K-means clustering, we measure the quality of a cluster
-by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
+by its\index{within-cluster sum of squared distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
 Computing this involves two steps.
 First, we find the cluster centers by computing the mean of each variable
 over data points in the cluster. For example, suppose we have a
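
To make the indexed WSSD definition concrete, a tiny worked example on made-up data: the cluster center is the mean of each variable within the cluster, and the WSSD adds up the squared distances from each point to its center.

```r
library(tidyverse)

# made-up 2D points with made-up cluster labels, purely for illustration
toy <- tibble(
  x       = c(1, 2, 1.5, 8, 9, 8.5),
  y       = c(1, 1.5, 2, 8, 8.5, 9),
  cluster = c(1, 1, 1, 2, 2, 2)
)

# per-cluster WSSD: squared distance of each point to its cluster center
toy |>
  group_by(cluster) |>
  mutate(center_x = mean(x), center_y = mean(y)) |>
  summarize(wssd = sum((x - center_x)^2 + (y - center_y)^2))
```
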
@@ -639,7 +639,7 @@ in the fourth iteration; both the centers and labels will remain the same from t
 
 ### Random restarts
 
-Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
+Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart} can get "stuck" in a bad solution.
 For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
 
 ```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.25, fig.width = 3.75, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Random initialization of labels."}
@@ -910,7 +910,7 @@ set.seed(1)
 
 We can perform K-means clustering in R using a `tidymodels` workflow similar
 to those in the earlier classification and regression chapters.
-We will begin by loading the `tidyclust`\index{tidyclust} library, which contains the necessary
+We will begin by loading the `tidyclust`\index{K-means}\index{tidyclust} library, which contains the necessary
 functionality.
 ```{r, echo = TRUE, warning = FALSE, message = FALSE}
 library(tidyclust)
@@ -993,18 +993,6 @@ clustered_data <- kmeans_fit |>
 clustered_data
 ```
 
-<!--
-If for some reason we need access to just the cluster assignments,
-we can extract those from the fit as a data frame using
-the `extract_cluster_assignment` function. Note that in this case,
-the cluster assignments variable is named `.cluster`, while the `augment`
-function earlier creates a variable named `.pred_cluster`.
-
-```{r 10-kmeans-extract-clusterasgn}
-extract_cluster_assignment(kmeans_fit)
-```
--->
-
 Now that we have the cluster assignments included in the `clustered_data` tidy data frame, we can
 visualize them as shown in Figure \@ref(fig:10-plot-clusters-2).
 Note that we are plotting the *un-standardized* data here; if we for some reason wanted to
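
A rough sketch of the `tidyclust` K-means workflow these hunks index, assuming the `penguins` data loaded above with `bill_length_mm` and `flipper_length_mm` columns (the column choices, `num_clusters = 3`, and passing `nstart` through the engine are assumptions for illustration):

```r
library(tidyverse)
library(tidymodels)
library(tidyclust)

# model specification: K-means with 3 clusters; nstart (an assumption here)
# asks the stats engine for several random restarts to avoid a bad initialization
kmeans_spec <- k_means(num_clusters = 3) |>
  set_engine("stats", nstart = 10)

# standardize the two predictors before clustering
kmeans_recipe <- recipe(~ bill_length_mm + flipper_length_mm, data = penguins) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

kmeans_fit <- workflow() |>
  add_recipe(kmeans_recipe) |>
  add_model(kmeans_spec) |>
  fit(data = penguins)

# attach the cluster assignments to the original data for plotting
clustered_data <- kmeans_fit |>
  augment(penguins)
```
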

source/inference.Rmd

Lines changed: 3 additions & 2 deletions
@@ -270,7 +270,7 @@ We first group the data by the `replicate` variable&mdash;to group the
 set of listings in each sample together&mdash;and then use `summarize`
 to compute the proportion in each sample.
 We print both the first and last few entries of the resulting data frame
-below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples.
+below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples.\index{group\_by}\index{summarize}
 
 ```{r 11-example-proportions6, echo = TRUE, message = FALSE, warning = FALSE}
 sample_estimates <- samples |>
@@ -381,7 +381,7 @@ one_sample <- airbnb |>
 
 We can create a histogram to visualize the distribution of observations in the
 sample (Figure \@ref(fig:11-example-means-sample-hist)), and calculate the mean
-of our sample.
+of our sample.\index{ggplot!geom\_histogram}
 
 ```{r 11-example-means-sample-hist, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Distribution of price per night (dollars) for sample of 40 Airbnb listings.", fig.height = 3.5, fig.width = 4.5}
 sample_distribution <- ggplot(one_sample, aes(price)) +
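
As context for the indexed `group_by`/`summarize` and `geom_histogram` lines, a brief sketch assuming a `samples` data frame of 20,000 replicated Airbnb samples (with `replicate` and `room_type` columns) and a `one_sample` data frame of 40 listings with a `price` column; column names beyond `replicate` and `price` are assumptions:

```r
library(tidyverse)

# one point estimate (a proportion) per replicate
sample_estimates <- samples |>
  group_by(replicate) |>
  summarize(sample_proportion = mean(room_type == "Entire home/apt"))

# distribution of price per night in the single sample of 40 listings
sample_distribution <- ggplot(one_sample, aes(price)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey", bins = 12) +
  labs(x = "Price per night (dollars)", y = "Count")
sample_distribution
```
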
@@ -1116,6 +1116,7 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol
 
 To do this in R, we can use the `quantile()` function. Quantiles are expressed in proportions rather than
 percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively.
+\index{percentile}
 \index{quantile}
 \index{pull}
 \index{select}
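
A short sketch of the percentile bootstrap interval the new index entries point at, assuming a `boot20000_means` data frame of bootstrap sample means with a `mean` column (the object and column names are assumptions):

```r
library(tidyverse)

# the 2.5% and 97.5% quantiles of the bootstrap distribution give a
# 95% percentile bootstrap confidence interval
bounds <- boot20000_means |>
  select(mean) |>
  pull() |>
  quantile(c(0.025, 0.975))
bounds
```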

source/intro.Rmd

Lines changed: 2 additions & 2 deletions
@@ -388,7 +388,7 @@ filtering the rows. A logical statement evaluates to either `TRUE` or `FALSE`;
 `filter` keeps only those rows for which the logical statement evaluates to `TRUE`.
 For example, in our analysis, we are interested in keeping only languages in the
 "Aboriginal languages" higher-level category. We can use
-the *equivalency operator* `==` \index{logical statement!equivalency operator} to compare the values
+the *equivalency operator* `==` \index{logical operator!equivalency} to compare the values
 of the `category` column with the value `"Aboriginal languages"`; you will learn about
 many other kinds of logical statements in Chapter \@ref(wrangling). Similar to
 when we loaded the data file and put quotes around the file name, here we need
@@ -590,7 +590,7 @@ Canadian Residents)" would be much more informative.
 Adding additional layers \index{plot!layers} to our visualizations that we create in `ggplot` is
 one common and easy way to improve and refine our data visualizations. New
 layers are added to `ggplot` objects using the `+` symbol. For example, we can
-use the `xlab` (short for x axis label) and `ylab` (short for y axis label) functions
+use the `xlab` (short for x axis label) \index{ggplot!xlab} and `ylab` (short for y axis label) \index{ggplot!ylab} functions
 to add layers where we specify meaningful
 and informative labels for the x and y axes. \index{plot!axis labels} Again, since we are specifying
 words (e.g. `"Mother Tongue (Number of Canadian Residents)"`) as arguments to
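
To ground the indexed `==`, `xlab`, and `ylab` entries, a small sketch assuming a `can_lang` data frame with `category`, `language`, and `mother_tongue` columns as in the chapter (the data frame name and columns other than `category` are assumptions):

```r
library(tidyverse)

# keep only the rows in the "Aboriginal languages" higher-level category
aboriginal_lang <- can_lang |>
  filter(category == "Aboriginal languages")

# add informative axis labels as extra layers with +
ggplot(aboriginal_lang, aes(x = language, y = mother_tongue)) +
  geom_bar(stat = "identity") +
  xlab("Language") +
  ylab("Mother Tongue (Number of Canadian Residents)")
```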

source/jupyter.Rmd

Lines changed: 1 addition & 1 deletion
@@ -377,7 +377,7 @@ right-clicking on the file's name in the Jupyter file explorer, selecting
 **Open with**, and then selecting **Editor** (Figure \@ref(fig:open-data-w-editor-1)).
 Suppose you do not specify to open
 the data file with an editor. In that case, Jupyter will render a nice table
-for you, and you will not be able to see the column delimiters, and therefore
+for you, and you will not be able to see the column delimiters, \index{delimiter} and therefore
 you will not know which function to use, nor which arguments to use and values
 to specify for them.
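
Where the indexed sentence notes that Jupyter's table view hides the delimiters, a tiny sketch of peeking at the raw lines instead (the file path is hypothetical):

```r
library(readr)

# print the first few raw lines so the delimiter (comma, tab, semicolon, ...)
# is visible before choosing a read_* function
read_lines("data/some_file.txt", n_max = 3)
```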
