
Commit d73ade7

Merge pull request #564 from UBC-DSCI/index-update
Index update
2 parents 8ee0821 + 4e49b01 commit d73ade7

13 files changed: +72 -122 lines changed

source/classification1.Rmd

Lines changed: 4 additions & 3 deletions
@@ -1295,7 +1295,7 @@ upsampled_plot
 
 ### Missing data
 
-One of the most common issues in real data sets in the wild is *missing data*,
+One of the most common issues in real data sets in the wild is *missing data*,\index{missing data}
 i.e., observations where the values of some of the variables were not recorded.
 Unfortunately, as common as it is, handling missing data properly is very
 challenging and generally relies on expert knowledge about the data, setting,
@@ -1329,7 +1329,7 @@ data. So how can we perform K-nearest neighbors classification in the presence
 of missing data? Well, since there are not too many observations with missing
 entries, one option is to simply remove those observations prior to building
 the K-nearest neighbors classifier. We can accomplish this by using the
-`drop_na` function from `tidyverse` prior to working with the data.
+`drop_na` function from `tidyverse` prior to working with the data.\index{missing data!drop\_na}
 
 ```{r 05-naomit}
 no_missing_cancer <- missing_cancer |> drop_na()
@@ -1342,7 +1342,8 @@ possible approach is to *impute* the missing entries, i.e., fill in synthetic
 values based on the other observations in the data set. One reasonable choice
 is to perform *mean imputation*, where missing entries are filled in using the
 mean of the present entries in each variable. To perform mean imputation, we
-add the `step_impute_mean` step to the `tidymodels` preprocessing recipe.
+add the `step_impute_mean` \index{recipe!step\_impute\_mean}\index{missing data!mean imputation}
+step to the `tidymodels` preprocessing recipe.
 ```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE}
 impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
   step_impute_mean(all_predictors()) |>
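
As context for the passage this hunk indexes, a minimal sketch of the two strategies it covers, row removal with `drop_na` and mean imputation with `step_impute_mean`, assuming a `missing_cancer` data frame with a `Class` column and predictors containing `NA`s as in the chapter:

```r
library(tidyverse)
library(tidymodels)

# Option 1: drop every row that contains a missing value
no_missing_cancer <- missing_cancer |> drop_na()

# Option 2: fill in missing predictor values with the column means via a recipe
impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
  step_impute_mean(all_predictors())

# prep() estimates the per-variable means; bake() applies the imputation
imputed_cancer <- impute_missing_recipe |>
  prep() |>
  bake(new_data = NULL)
```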

source/classification2.Rmd

Lines changed: 9 additions & 23 deletions
@@ -117,7 +117,7 @@ a single number. But prediction accuracy by itself does not tell the whole
 story. In particular, accuracy alone only tells us how often the classifier
 makes mistakes in general, but does not tell us anything about the *kinds* of
 mistakes the classifier makes. A more comprehensive view of performance can be
-obtained by additionally examining the **confusion matrix**. The confusion
+obtained by additionally examining the **confusion matrix**. The confusion\index{confusion matrix}
 matrix shows how many test set labels of each type are predicted correctly and
 incorrectly, which gives us more detail about the kinds of mistakes the
 classifier tends to make. Table \@ref(tab:confusion-matrix) shows an example
@@ -148,7 +148,8 @@ disastrous error, since it may lead to a patient who requires treatment not rece
 Since we are particularly interested in identifying malignant cases, this
 classifier would likely be unacceptable even with an accuracy of 89%.
 
-Focusing more on one label than the other is
+Focusing more on one label than the other
+is\index{positive label}\index{negative label}\index{true positive}\index{false positive}\index{true negative}\index{false negative}
 common in classification problems. In such cases, we typically refer to the label we are more
 interested in identifying as the *positive* label, and the other as the
 *negative* label. In the tumor example, we would refer to malignant
@@ -166,7 +167,7 @@ therefore, 100% accuracy). However, classifiers in practice will almost always
 make some errors. So you should think about which kinds of error are most
 important in your application, and use the confusion matrix to quantify and
 report them. Two commonly used metrics that we can compute using the confusion
-matrix are the **precision** and **recall** of the classifier. These are often
+matrix are the **precision** and **recall** of the classifier.\index{precision}\index{recall} These are often
 reported together with accuracy. *Precision* quantifies how many of the
 positive predictions the classifier made were actually positive. Intuitively,
 we would like a classifier to have a *high* precision: for a classifier with
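
As context for the newly indexed precision and recall passage, a minimal sketch of how these metrics are computed with `yardstick`, assuming a `cancer_test_predictions` data frame holding the true `Class` labels and a `.pred_class` column, with the malignant class coded as the first factor level (the object name and label coding are assumptions):

```r
library(tidymodels)

# precision: of the observations predicted positive, how many were truly positive
cancer_test_predictions |>
  precision(truth = Class, estimate = .pred_class, event_level = "first")

# recall: of the truly positive observations, how many were predicted positive
cancer_test_predictions |>
  recall(truth = Class, estimate = .pred_class, event_level = "first")

# the full confusion matrix behind both metrics
cancer_test_predictions |>
  conf_mat(truth = Class, estimate = .pred_class)
```
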
@@ -582,7 +583,7 @@ We now know that the classifier was `r round(100*cancer_acc_1$.estimate, 0)`% ac
 on the test data set, and had a precision of `r round(100*cancer_prec_1$.estimate, 0)`% and a recall of `r round(100*cancer_rec_1$.estimate, 0)`%.
 That sounds pretty good! Wait, *is* it good? Or do we need something higher?
 
-In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}
+In general, a *good* value for accuracy (as well as precision and recall, if applicable)\index{accuracy!assessment}\index{precision!assessment}\index{recall!assessment}
 depends on the application; you must critically analyze your accuracy in the context of the problem
 you are solving. For example, if we were building a classifier for a kind of tumor that is benign 99%
 of the time, a classifier with 99% accuracy is not terribly impressive (just always guess benign!).
@@ -845,7 +846,7 @@ The `collect_metrics`\index{tidymodels!collect\_metrics}\index{cross-validation!
 of the classifier's validation accuracy across the folds. You will find results
 related to the accuracy in the row with `accuracy` listed under the `.metric` column.
 You should consider the mean (`mean`) to be the estimated accuracy, while the standard
-error (`std_err`) is a measure of how uncertain we are in the mean value. A detailed treatment of this
+error (`std_err`) is\index{standard error}\index{sem|see{standard error}} a measure of how uncertain we are in the mean value. A detailed treatment of this
 is beyond the scope of this chapter; but roughly, if your estimated mean is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2)` and standard
 error is `r round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2)`, you can expect the *true* average accuracy of the
 classifier to be somewhere roughly between `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) - round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% and `r (round(filter(collect_metrics(knn_fit), .metric == "accuracy")$mean,2) + round(filter(collect_metrics(knn_fit), .metric == "accuracy")$std_err,2))*100`% (although it may
@@ -859,7 +860,7 @@ knn_fit |>
 collect_metrics()
 ```
 
-We can choose any number of folds, and typically the more we use the better our
+We can choose any number of folds,\index{cross-validation!folds} and typically the more we use the better our
 accuracy estimate will be (lower standard error). However, we are limited
 by computational power: the
 more folds we choose, the more computation it takes, and hence the more time
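
A rough sketch of the cross-validation estimate that the `std_err` and folds hunks refer to, assuming a `cancer_train` data frame with a `Class` column and a K-NN workflow `knn_wkflw` built as elsewhere in the chapter (both object names are assumptions):

```r
library(tidymodels)

# split the training data into 5 folds, stratified on the class label
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)

# fit and evaluate the workflow on each fold
knn_fit <- knn_wkflw |>
  fit_resamples(resamples = cancer_vfold)

# `mean` is the estimated accuracy; `std_err` quantifies uncertainty in that mean
knn_fit |>
  collect_metrics()
```
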
@@ -1180,6 +1181,7 @@ knn_fit
 Then to make predictions and assess the estimated accuracy of the best model on the test data, we use the
 `predict` and `metrics` functions as we did earlier in the chapter. We can then pass those predictions to
 the `precision`, `recall`, and `conf_mat` functions to assess the estimated precision and recall, and print a confusion matrix.
+\index{predict}\index{precision}\index{recall}\index{conf\_mat}
 
 ```{r 06-predictions-after-tuning, message = FALSE, warning = FALSE}
 cancer_test_predictions <- predict(knn_fit, cancer_test) |>
@@ -1393,24 +1395,8 @@ accs <- accs |> unlist()
 nghbrs <- nghbrs |> unlist()
 fixedaccs <- fixedaccs |> unlist()
 
-## get accuracy if we always just guess the most frequent label
-#base_acc <- cancer_irrelevant |>
-# group_by(Class) |>
-# summarize(n = n()) |>
-# mutate(frac = n/sum(n)) |>
-# summarize(mx = max(frac)) |>
-# select(mx)
-#base_acc <- base_acc$mx |> unlist()
-
 # plot
 res <- tibble(ks = ks, accs = accs, fixedaccs = fixedaccs, nghbrs = nghbrs)
-#res <- res |> mutate(base_acc = base_acc)
-#plt_irrelevant_accuracies <- res |>
-# ggplot() +
-# geom_line(mapping = aes(x=ks, y=accs, linetype="Tuned K-NN")) +
-# geom_hline(data=res, mapping=aes(yintercept=base_acc, linetype="Always Predict Benign")) +
-# labs(x = "Number of Irrelevant Predictors", y = "Model Accuracy Estimate") +
-# scale_linetype_manual(name="Method", values = c("dashed", "solid"))
 
 plt_irrelevant_accuracies <- ggplot(res) +
   geom_line(mapping = aes(x=ks, y=accs)) +
@@ -1533,7 +1519,7 @@ Therefore we will continue the rest of this section using forward selection.
 
 ### Forward selection in R
 
-We now turn to implementing forward selection in R.
+We now turn to implementing forward selection in R.\index{variable selection!implementation}
 Unfortunately there is no built-in way to do this using the `tidymodels` framework,
 so we will have to code it ourselves. First we will use the `select` function to extract a smaller set of predictors
 to work with in this illustrative example&mdash;`Smoothness`, `Concavity`, `Perimeter`, `Irrelevant1`, `Irrelevant2`, and `Irrelevant3`&mdash;as
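
Since the indexed passage notes that `tidymodels` has no built-in forward selection, here is a rough skeleton of the greedy loop the chapter goes on to build, assuming a `cancer_subset` data frame containing `Class` and the six predictors named above (all object and helper names are assumptions for illustration, not the book's exact code):

```r
library(tidyverse)
library(tidymodels)

# helper: estimate cross-validation accuracy of a K-NN model for one formula
cv_accuracy <- function(model_formula, data) {
  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
    set_engine("kknn") |>
    set_mode("classification")
  preprocess_recipe <- recipe(model_formula, data = data) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
  workflow() |>
    add_recipe(preprocess_recipe) |>
    add_model(knn_spec) |>
    fit_resamples(resamples = vfold_cv(data, v = 5)) |>
    collect_metrics() |>
    filter(.metric == "accuracy") |>
    pull(mean)
}

# greedy forward selection: at each step, add the predictor that most improves
# the estimated accuracy of the model built so far
candidates <- c("Smoothness", "Concavity", "Perimeter",
                "Irrelevant1", "Irrelevant2", "Irrelevant3")
selected <- c()
accuracies <- tibble(n_predictors = integer(), predictors = character(), accuracy = numeric())

for (i in seq_along(candidates)) {
  remaining <- setdiff(candidates, selected)
  accs <- map_dbl(remaining, function(pred) {
    form <- as.formula(paste("Class ~", paste(c(selected, pred), collapse = " + ")))
    cv_accuracy(form, cancer_subset)
  })
  selected <- c(selected, remaining[which.max(accs)])
  accuracies <- add_row(accuracies, n_predictors = i,
                        predictors = paste(selected, collapse = ", "),
                        accuracy = max(accs))
}
accuracies
```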

source/clustering.Rmd

Lines changed: 4 additions & 16 deletions
@@ -164,7 +164,7 @@ library(tidyverse)
 set.seed(1)
 ```
 
-Now we can load and preview the `penguins` data.
+Now we can load and preview the `penguins` data.\index{read function!read\_csv}
 
 ```{r message = FALSE, warning = FALSE}
 penguins <- read_csv("data/penguins.csv")
@@ -295,7 +295,7 @@ improves it by making adjustments to the assignment of data
 to clusters until it cannot improve any further. But how do we measure
 the "quality" of a clustering, and what does it mean to improve it?
 In K-means clustering, we measure the quality of a cluster
-by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
+by its\index{within-cluster sum of squared distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
 Computing this involves two steps.
 First, we find the cluster centers by computing the mean of each variable
 over data points in the cluster. For example, suppose we have a
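
To make the indexed WSSD definition concrete, a tiny worked example on made-up data: the cluster center is the mean of each variable within the cluster, and the WSSD adds up the squared distances from each point to its center.

```r
library(tidyverse)

# made-up 2D points with made-up cluster labels, purely for illustration
toy <- tibble(
  x       = c(1, 2, 1.5, 8, 9, 8.5),
  y       = c(1, 1.5, 2, 8, 8.5, 9),
  cluster = c(1, 1, 1, 2, 2, 2)
)

# per-cluster WSSD: squared distance of each point to its cluster center
toy |>
  group_by(cluster) |>
  mutate(center_x = mean(x), center_y = mean(y)) |>
  summarize(wssd = sum((x - center_x)^2 + (y - center_y)^2))
```
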
@@ -639,7 +639,7 @@ in the fourth iteration; both the centers and labels will remain the same from t
 
 ### Random restarts
 
-Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
+Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart} can get "stuck" in a bad solution.
 For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
 
 ```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.25, fig.width = 3.75, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Random initialization of labels."}
@@ -910,7 +910,7 @@ set.seed(1)
 
 We can perform K-means clustering in R using a `tidymodels` workflow similar
 to those in the earlier classification and regression chapters.
-We will begin by loading the `tidyclust`\index{tidyclust} library, which contains the necessary
+We will begin by loading the `tidyclust`\index{K-means}\index{tidyclust} library, which contains the necessary
 functionality.
 ```{r, echo = TRUE, warning = FALSE, message = FALSE}
 library(tidyclust)
@@ -993,18 +993,6 @@ clustered_data <- kmeans_fit |>
 clustered_data
 ```
 
-<!--
-If for some reason we need access to just the cluster assignments,
-we can extract those from the fit as a data frame using
-the `extract_cluster_assignment` function. Note that in this case,
-the cluster assignments variable is named `.cluster`, while the `augment`
-function earlier creates a variable named `.pred_cluster`.
-
-```{r 10-kmeans-extract-clusterasgn}
-extract_cluster_assignment(kmeans_fit)
-```
--->
-
 Now that we have the cluster assignments included in the `clustered_data` tidy data frame, we can
 visualize them as shown in Figure \@ref(fig:10-plot-clusters-2).
 Note that we are plotting the *un-standardized* data here; if we for some reason wanted to
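
A rough sketch of the `tidyclust` K-means workflow these hunks index, assuming the `penguins` data loaded above with `bill_length_mm` and `flipper_length_mm` columns (the column choices, `num_clusters = 3`, and passing `nstart` through the engine are assumptions for illustration):

```r
library(tidyverse)
library(tidymodels)
library(tidyclust)

# model specification: K-means with 3 clusters; nstart (an assumption here)
# asks the stats engine for several random restarts to avoid a bad initialization
kmeans_spec <- k_means(num_clusters = 3) |>
  set_engine("stats", nstart = 10)

# standardize the two predictors before clustering
kmeans_recipe <- recipe(~ bill_length_mm + flipper_length_mm, data = penguins) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

kmeans_fit <- workflow() |>
  add_recipe(kmeans_recipe) |>
  add_model(kmeans_spec) |>
  fit(data = penguins)

# attach the cluster assignments to the original data for plotting
clustered_data <- kmeans_fit |>
  augment(penguins)
```
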

source/inference.Rmd

Lines changed: 3 additions & 2 deletions
@@ -270,7 +270,7 @@ We first group the data by the `replicate` variable&mdash;to group the
 set of listings in each sample together&mdash;and then use `summarize`
 to compute the proportion in each sample.
 We print both the first and last few entries of the resulting data frame
-below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples.
+below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples.\index{group\_by}\index{summarize}
 
 ```{r 11-example-proportions6, echo = TRUE, message = FALSE, warning = FALSE}
 sample_estimates <- samples |>
@@ -381,7 +381,7 @@ one_sample <- airbnb |>
 
 We can create a histogram to visualize the distribution of observations in the
 sample (Figure \@ref(fig:11-example-means-sample-hist)), and calculate the mean
-of our sample.
+of our sample.\index{ggplot!geom\_histogram}
 
 ```{r 11-example-means-sample-hist, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Distribution of price per night (dollars) for sample of 40 Airbnb listings.", fig.height = 3.5, fig.width = 4.5}
 sample_distribution <- ggplot(one_sample, aes(price)) +
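
As context for the indexed `group_by`/`summarize` and `geom_histogram` lines, a brief sketch assuming a `samples` data frame of 20,000 replicated Airbnb samples (with `replicate` and `room_type` columns) and a `one_sample` data frame of 40 listings with a `price` column; column names beyond `replicate` and `price` are assumptions:

```r
library(tidyverse)

# one point estimate (a proportion) per replicate
sample_estimates <- samples |>
  group_by(replicate) |>
  summarize(sample_proportion = mean(room_type == "Entire home/apt"))

# distribution of price per night in the single sample of 40 listings
sample_distribution <- ggplot(one_sample, aes(price)) +
  geom_histogram(fill = "dodgerblue3", color = "lightgrey", bins = 12) +
  labs(x = "Price per night (dollars)", y = "Count")
sample_distribution
```
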
@@ -1116,6 +1116,7 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol
 
 To do this in R, we can use the `quantile()` function. Quantiles are expressed in proportions rather than
 percentages, so the 2.5th and 97.5th percentiles would be the 0.025 and 0.975 quantiles, respectively.
+\index{percentile}
 \index{quantile}
 \index{pull}
 \index{select}
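
A short sketch of the percentile bootstrap interval the new index entries point at, assuming a `boot20000_means` data frame of bootstrap sample means with a `mean` column (the object and column names are assumptions):

```r
library(tidyverse)

# the 2.5% and 97.5% quantiles of the bootstrap distribution give a
# 95% percentile bootstrap confidence interval
bounds <- boot20000_means |>
  select(mean) |>
  pull() |>
  quantile(c(0.025, 0.975))
bounds
```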

source/intro.Rmd

Lines changed: 2 additions & 2 deletions
@@ -388,7 +388,7 @@ filtering the rows. A logical statement evaluates to either `TRUE` or `FALSE`;
 `filter` keeps only those rows for which the logical statement evaluates to `TRUE`.
 For example, in our analysis, we are interested in keeping only languages in the
 "Aboriginal languages" higher-level category. We can use
-the *equivalency operator* `==` \index{logical statement!equivalency operator} to compare the values
+the *equivalency operator* `==` \index{logical operator!equivalency} to compare the values
 of the `category` column with the value `"Aboriginal languages"`; you will learn about
 many other kinds of logical statements in Chapter \@ref(wrangling). Similar to
 when we loaded the data file and put quotes around the file name, here we need
@@ -590,7 +590,7 @@ Canadian Residents)" would be much more informative.
 Adding additional layers \index{plot!layers} to our visualizations that we create in `ggplot` is
 one common and easy way to improve and refine our data visualizations. New
 layers are added to `ggplot` objects using the `+` symbol. For example, we can
-use the `xlab` (short for x axis label) and `ylab` (short for y axis label) functions
+use the `xlab` (short for x axis label) \index{ggplot!xlab} and `ylab` (short for y axis label) \index{ggplot!ylab} functions
 to add layers where we specify meaningful
 and informative labels for the x and y axes. \index{plot!axis labels} Again, since we are specifying
 words (e.g. `"Mother Tongue (Number of Canadian Residents)"`) as arguments to
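
To ground the indexed `==`, `xlab`, and `ylab` entries, a small sketch assuming a `can_lang` data frame with `category`, `language`, and `mother_tongue` columns as in the chapter (the data frame name and columns other than `category` are assumptions):

```r
library(tidyverse)

# keep only the rows in the "Aboriginal languages" higher-level category
aboriginal_lang <- can_lang |>
  filter(category == "Aboriginal languages")

# add informative axis labels as extra layers with +
ggplot(aboriginal_lang, aes(x = language, y = mother_tongue)) +
  geom_bar(stat = "identity") +
  xlab("Language") +
  ylab("Mother Tongue (Number of Canadian Residents)")
```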

source/jupyter.Rmd

Lines changed: 1 addition & 1 deletion
@@ -377,7 +377,7 @@ right-clicking on the file's name in the Jupyter file explorer, selecting
 **Open with**, and then selecting **Editor** (Figure \@ref(fig:open-data-w-editor-1)).
 Suppose you do not specify to open
 the data file with an editor. In that case, Jupyter will render a nice table
-for you, and you will not be able to see the column delimiters, and therefore
+for you, and you will not be able to see the column delimiters, \index{delimiter} and therefore
 you will not know which function to use, nor which arguments to use and values
 to specify for them.
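
Where the indexed sentence notes that Jupyter's table view hides the delimiters, a tiny sketch of peeking at the raw lines instead (the file path is hypothetical):

```r
library(readr)

# print the first few raw lines so the delimiter (comma, tab, semicolon, ...)
# is visible before choosing a read_* function
read_lines("data/some_file.txt", n_max = 3)
```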
