
Commit 13a2af1

Merge pull request #409 from UBC-DSCI/footnotes
URLs, citations, and footnotes
2 parents d59ce4c + 2474961 commit 13a2af1

13 files changed: +498 -181 lines changed

classification1.Rmd

Lines changed: 18 additions & 15 deletions
@@ -85,7 +85,7 @@ the classifier to make predictions on new data for which we do not know the clas
 
 There are many possible methods that we could use to predict
 a categorical class/label for an observation. In this book, we will
-focus on the widely used **$K$-nearest neighbors** \index{K-nearest neighbors} algorithm.
+focus on the widely used **$K$-nearest neighbors** \index{K-nearest neighbors} algorithm [@knnfix; @knncover].
 In your future studies, you might encounter decision trees, support vector machines (SVMs),
 logistic regression, neural networks, and more; see the additional resources
 section at the end of the next chapter for where to begin learning more about
@@ -99,9 +99,9 @@ categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common col
 ## Exploring a data set
 
 In this chapter and the next, we will study a data set of
-[digitized breast cancer image features](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29),
-created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at
-the University of Wisconsin, Madison. \index{breast cancer} Each row in the data set represents an
+[digitized breast cancer image features](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29),
+created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian [@streetbreastcancer]. \index{breast cancer}
+Each row in the data set represents an
 image of a tumor sample, including the diagnosis (benign or malignant) and
 several other measurements (nucleus texture, perimeter, area, and more).
 Diagnosis for each image was conducted by physicians.
@@ -117,7 +117,7 @@ the diagnosing physician is. Furthermore, benign tumors are not normally
 dangerous; the cells stay in the same place, and the tumor stops growing before
 it gets very large. By contrast, in malignant tumors, the cells invade the
 surrounding tissue and spread into nearby organs, where they can cause serious
-damage (@stanfordhealthcare).
+damage [@stanfordhealthcare].
 Thus, it is important to quickly and accurately diagnose the tumor type to
 guide patient treatment.
 
@@ -689,9 +689,9 @@ In order to classify a new observation using a $K$-nearest neighbor classifier,
 Coding the $K$-nearest neighbors algorithm in R ourselves can get complicated,
 especially if we want to handle multiple classes, more than two variables,
 or predict the class for multiple new observations. Thankfully, in R,
-the $K$-nearest neighbors algorithm is implemented in the `parsnip` package
-included in the
-[`tidymodels` package](https://www.tidymodels.org/), along with
+the $K$-nearest neighbors algorithm is
+implemented in [the `parsnip` R package](https://parsnip.tidymodels.org/) [@parsnip]
+included in `tidymodels`, along with
 many [other models](https://www.tidymodels.org/find/parsnip/) \index{tidymodels}\index{parsnip}
 that you will encounter in this and future chapters of the book. The `tidymodels` collection
 provides tools to help make and use models, such as classifiers. Using the packages
@@ -723,7 +723,7 @@ distance (`weight_func = "rectangular"`). The `weight_func` argument controls
 how neighbors vote when classifying a new observation; by setting it to `"rectangular"`,
 each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices,
 which weigh each neighbor's vote differently, can be found on
-[the `tidymodels` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html).
+[the `parsnip` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html).
 In the `set_engine` \index{tidymodels!engine} argument, we specify which package or system will be used for training
 the model. Here `kknn` is the R package we will use for performing $K$-nearest neighbors classification.
 Finally, we specify that this is a classification problem with the `set_mode` function.
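
The model specification these lines walk through amounts to a few chained `parsnip` calls. The sketch below is illustrative only: the `neighbors` value and the object name `knn_spec` are placeholders rather than code taken from the commit.

```r
library(tidymodels)

# Illustrative K-nearest neighbors specification:
# "rectangular" weighting gives each of the K neighbors exactly one vote,
# the kknn engine performs the computation, and the mode is classification.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_spec
```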
@@ -846,8 +846,9 @@ loaded, and the standardized version of that same data. But first, we need to
 standardize the `unscaled_cancer` data set with `tidymodels`.
 
 In the `tidymodels` framework, all data preprocessing happens
-using a [`recipe`](https://tidymodels.github.io/recipes/reference/index.html).
-Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for the `unscaled_cancer` data above, specifying
+using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes].
+Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for
+the `unscaled_cancer` data above, specifying
 that the `Class` variable is the target, and all other variables are predictors:
 
 ```{r 05-scaling-2}
@@ -856,7 +857,9 @@ print(uc_recipe)
 ```
 
 So far, there is not much in the recipe; just a statement about the number of targets
-and predictors. Let's add scaling (`step_scale`) \index{recipe!step\_scale} and centering (`step_center`) \index{recipe!step\_center} steps for
+and predictors. Let's add
+scaling (`step_scale`) \index{recipe!step\_scale} and
+centering (`step_center`) \index{recipe!step\_center} steps for
 all of the predictors so that they each have a mean of 0 and standard deviation of 1.
 Note that `tidyverse` actually provides `step_normalize`, which does both centering and scaling in
 a single recipe step; in this book we will keep `step_scale` and `step_center` separate
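
Taken together, the recipe these hunks describe would look roughly like the sketch below, assuming the `unscaled_cancer` data frame from the chapter is already loaded; the exact code is not shown in the commit.

```r
library(tidymodels)

# Illustrative recipe: `Class` is the target, every other column is a predictor,
# and each predictor is scaled (standard deviation 1) and centered (mean 0).
uc_recipe <- recipe(Class ~ ., data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

print(uc_recipe)
```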
@@ -885,8 +888,8 @@ For example:
 - `Area, Smoothness`: specify both the `Area` and `Smoothness` variable
 - `-Class`: specify everything except the `Class` variable
 
-You can find [a full set of all the steps and variable selection functions](https://tidymodels.github.io/recipes/reference/index.html)
-on the `recipes` home page.
+You can find a full set of all the steps and variable selection functions
+on the [`recipes` reference page](https://recipes.tidymodels.org/reference/index.html).
 
 At this point, we have calculated the required statistics based on the data input into the
 recipe, but the data are not yet scaled and centered. To actually scale and center
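
For the step this context line trails off into, actually applying the scaling and centering, the standard `recipes` workflow is `prep()` followed by `bake()`. A minimal sketch of that generic API, continuing from the recipe above; it is not necessarily the chapter's exact code.

```r
# Estimate the scaling and centering statistics from the data in the recipe ...
uc_recipe_prepped <- prep(uc_recipe)

# ... then apply them; new_data = NULL returns the processed training data.
scaled_cancer <- bake(uc_recipe_prepped, new_data = NULL)

scaled_cancer
```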
@@ -1412,7 +1415,7 @@ wkflw_plot
 ## Exercises
 
 Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_06/worksheet_06.ipynb).
+can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_classification1/worksheet_classification1.ipynb).
 The worksheet tries to provide automated feedback
 and help guide you through the problems.
 To make sure this functionality works as intended,

classification2.Rmd

Lines changed: 28 additions & 9 deletions
@@ -54,7 +54,7 @@ Sometimes our classifier might make the wrong prediction. A classifier does not
 need to be right 100\% of the time to be useful, though we don't want the
 classifier to make too many wrong predictions. How do we measure how "good" our
 classifier is? Let's revisit the \index{breast cancer}
-[breast cancer images example](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)
+[breast cancer images data](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) [@streetbreastcancer]
 and think about how our classifier will be used in practice. A biopsy will be
 performed on a *new* patient's tumor, the resulting image will be analyzed,
 and the classifier will be asked to decide whether the tumor is benign or
@@ -1172,7 +1172,7 @@ this chapter to find out where you can learn more about variable selection, incl
 The first idea you might think of for a systematic way to select predictors
 is to try all possible subsets of predictors and then pick the set that results in the "best" classifier.
 This procedure is indeed a well-known variable selection method referred to
-as *best subset selection*. \index{variable selection!best subset}\index{predictor selection|see{variable selection}}
+as *best subset selection* [@bealesubset; @hockingsubset]. \index{variable selection!best subset}\index{predictor selection|see{variable selection}}
 In particular, you
 
 1. create a separate model for every possible subset of predictors,
@@ -1194,7 +1194,7 @@ So although it is a simple method, best subset selection is usually too computat
 expensive to use in practice.
 
 Another idea is to iteratively build up a model by adding one predictor variable
-at a time. This method—known as *forward selection*—is also widely \index{variable selection!forward}
+at a time. This method—known as *forward selection* [@forwardefroymson; @forwarddraper]—is also widely \index{variable selection!forward}
 applicable and fairly straightforward. It involves the following steps:
 
 1. Start with a model having no predictors.
@@ -1273,9 +1273,9 @@ Finally, we need to write some code that performs the task of sequentially
 finding the best predictor to add to the model.
 If you recall the end of the wrangling chapter, we mentioned
 that sometimes one needs more flexible forms of iteration than what
-we have used earlier, and in these cases, one typically resorts to
-[a for loop](https://r4ds.had.co.nz/iteration.html#iteration).
-This is one of those cases! Here we will use two for loops:
+we have used earlier, and in these cases one typically resorts to
+a *for loop*; see [the chapter on iteration](https://r4ds.had.co.nz/iteration.html) in *R for Data Science* [@wickham2016r].
+Here we will use two for loops:
 one over increasing predictor set sizes
 (where you see `for (i in 1:length(names))` below),
 and another to check which predictor to add in each round (where you see `for (j in 1:length(names))` below).
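
The two nested for loops referenced in these context lines follow a standard forward-selection skeleton, sketched below. Everything in the sketch is illustrative: the predictor names are made up and `estimate_accuracy()` is a stand-in for the cross-validated accuracy of a classifier built on a candidate predictor set, not code from the commit.

```r
library(tibble)

# Hypothetical candidate predictors and a stand-in accuracy function.
names <- c("Perimeter", "Concavity", "Smoothness", "Area")
estimate_accuracy <- function(predictors) runif(1)  # placeholder accuracy

selected <- c()
accuracies <- tibble(size = integer(), predictors = character(), accuracy = numeric())

for (i in 1:length(names)) {            # grow the model by one predictor per round
  best_acc <- -Inf
  best_j <- 0
  for (j in 1:length(names)) {          # try each remaining predictor as the next addition
    acc <- estimate_accuracy(c(selected, names[[j]]))
    if (acc > best_acc) {
      best_acc <- acc
      best_j <- j
    }
  }
  selected <- c(selected, names[[best_j]])  # keep the best addition this round
  names <- names[-best_j]                   # remove it from the candidate pool
  accuracies <- add_row(accuracies, size = i,
                        predictors = paste(selected, collapse = " + "),
                        accuracy = best_acc)
}

accuracies
```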
@@ -1386,13 +1386,32 @@ fwd_sel_accuracies_plot
 ## Exercises
 
 Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_07/worksheet_07.ipynb).
+can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_classification2/worksheet_classification2.ipynb).
 The worksheet tries to provide automated feedback
 and help guide you through the problems.
 To make sure this functionality works as intended,
 please follow the instructions for computer setup needed to run the worksheets
 found in Chapter \@ref(move-to-your-own-machine).
 
 ## Additional resources
-- The [`tidymodels` website](https://tidymodels.org/packages) is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a [nice beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list of more advanced examples](https://www.tidymodels.org/learn/) that you can use to continue learning beyond the scope of this book. It's worth noting that the `tidymodels` package does a lot more than just classification, and so the examples on the website similarly go beyond classification as well. In the next two chapters, you'll learn about another kind of predictive modeling setting, so it might be worth visiting the website only after reading through those chapters.
-- [*An Introduction to Statistical Learning*](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail about cross-validation. Chapters 8 and 9 cover decision trees and support vector machines, two very popular but more advanced classification methods. Finally, Chapter 6 covers a number of methods for selecting predictor variables. Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we require.
+- The [`tidymodels` website](https://tidymodels.org/packages) is an excellent
+reference for more details on, and advanced usage of, the functions and
+packages in the past two chapters. Aside from that, it also has a [nice
+beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list
+of more advanced examples](https://www.tidymodels.org/learn/) that you can use
+to continue learning beyond the scope of this book. It's worth noting that the
+`tidymodels` package does a lot more than just classification, and so the
+examples on the website similarly go beyond classification as well. In the next
+two chapters, you'll learn about another kind of predictive modeling setting,
+so it might be worth visiting the website only after reading through those
+chapters.
+- *An Introduction to Statistical Learning* [@james2013introduction] provides
+a great next stop in the process of
+learning about classification. Chapter 4 discusses additional basic techniques
+for classification that we do not cover, such as logistic regression, linear
+discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail
+about cross-validation. Chapters 8 and 9 cover decision trees and support
+vector machines, two very popular but more advanced classification methods.
+Finally, Chapter 6 covers a number of methods for selecting predictor
+variables. Note that while this book is still a very accessible introductory
+text, it requires a bit more mathematical background than we require.

clustering.Rmd

Lines changed: 16 additions & 8 deletions
@@ -85,7 +85,7 @@ courses.
 As in the case of classification,
 there are many possible methods that we could use to cluster our observations
 to look for subgroups.
-In this book, we will focus on the widely used K-means \index{K-means} algorithm.
+In this book, we will focus on the widely used K-means \index{K-means} algorithm [@kmeans].
 In your future studies, you might encounter hierarchical clustering,
 principal component analysis, multidimensional scaling, and more;
 see the additional resources section at the end of this chapter
@@ -103,11 +103,11 @@ for where to begin learning more about these other methods.
 
 **An illustrative example**
 
-Here we will present an illustrative example using a data set \index{Palmer penguins} from the
-[{palmerpenguins} R data package](https://allisonhorst.github.io/palmerpenguins/). This data set was
-collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and
-the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/), and includes
-measurements for adult penguins found near there [@palmerpenguins]. We have
+Here we will present an illustrative example using a data set \index{Palmer penguins} from
+[the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/) [@palmerpenguins]. This
+data set was collected by Dr. Kristen Gorman and
+the Palmer Station, Antarctica Long Term Ecological Research Site, and includes
+measurements for adult penguins found near there [@penguinpaper]. We have
 modified the data set for use in this chapter. Here we will focus on using two
 variables—penguin bill and flipper length, both in millimeters—to determine whether
 there are distinct types of penguins in our data.
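
For readers following the diff, the analysis this passage sets up boils down to running K-means on two standardized columns of the `palmerpenguins` data. Below is a minimal sketch using base R's `kmeans()`; the choice of three clusters and the preprocessing details are assumptions, not the chapter's code.

```r
library(palmerpenguins)
library(tidyverse)

set.seed(1)  # kmeans uses random starting assignments

# Keep the two variables discussed above, drop missing values, and standardize.
standardized_penguins <- penguins |>
  select(bill_length_mm, flipper_length_mm) |>
  drop_na() |>
  mutate(across(everything(), ~ as.numeric(scale(.x))))

# Cluster into an assumed three groups and inspect the cluster sizes.
penguin_clusters <- kmeans(standardized_penguins, centers = 3)
table(penguin_clusters$cluster)
```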
@@ -1098,12 +1098,20 @@ elbow_plot
 ## Exercises
 
 Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_10/worksheet_10.ipynb).
+can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_clustering/worksheet_clustering.ipynb).
 The worksheet tries to provide automated feedback
 and help guide you through the problems.
 To make sure this functionality works as intended,
 please follow the instructions for computer setup needed to run the worksheets
 found in Chapter \@ref(move-to-your-own-machine).
 
 ## Additional resources
-- Chapter 10 of [*An Introduction to Statistical Learning*](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc., in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique for reducing the number of predictors in a dataset.
+- Chapter 10 of *An Introduction to Statistical
+Learning* [@james2013introduction] provides a
+great next stop in the process of learning about clustering and unsupervised
+learning in general. In the realm of clustering specifically, it provides a
+great companion introduction to K-means, but also covers *hierarchical*
+clustering for when you expect there to be subgroups, and then subgroups within
+subgroups, etc., in your data. In the realm of more general unsupervised
+learning, it covers *principal components analysis (PCA)*, which is a very
+popular technique for reducing the number of predictors in a dataset.
