
Commit 13a2af1

Merge pull request #409 from UBC-DSCI/footnotes
URLs, citations, and footnotes
2 parents d59ce4c + 2474961 commit 13a2af1

13 files changed: +498 -181 lines changed

classification1.Rmd

Lines changed: 18 additions & 15 deletions
@@ -85,7 +85,7 @@ the classifier to make predictions on new data for which we do not know the clas
 
 There are many possible methods that we could use to predict
 a categorical class/label for an observation. In this book, we will
-focus on the widely used **$K$-nearest neighbors** \index{K-nearest neighbors} algorithm.
+focus on the widely used **$K$-nearest neighbors** \index{K-nearest neighbors} algorithm [@knnfix; @knncover].
 In your future studies, you might encounter decision trees, support vector machines (SVMs),
 logistic regression, neural networks, and more; see the additional resources
 section at the end of the next chapter for where to begin learning more about
@@ -99,9 +99,9 @@ categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common col
 ## Exploring a data set
 
 In this chapter and the next, we will study a data set of
-[digitized breast cancer image features](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29),
-created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at
-the University of Wisconsin, Madison. \index{breast cancer} Each row in the data set represents an
+[digitized breast cancer image features](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29),
+created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian [@streetbreastcancer]. \index{breast cancer}
+Each row in the data set represents an
 image of a tumor sample, including the diagnosis (benign or malignant) and
 several other measurements (nucleus texture, perimeter, area, and more).
 Diagnosis for each image was conducted by physicians.
@@ -117,7 +117,7 @@ the diagnosing physician is. Furthermore, benign tumors are not normally
 dangerous; the cells stay in the same place, and the tumor stops growing before
 it gets very large. By contrast, in malignant tumors, the cells invade the
 surrounding tissue and spread into nearby organs, where they can cause serious
-damage (@stanfordhealthcare).
+damage [@stanfordhealthcare].
 Thus, it is important to quickly and accurately diagnose the tumor type to
 guide patient treatment.
 
@@ -689,9 +689,9 @@ In order to classify a new observation using a $K$-nearest neighbor classifier,
 Coding the $K$-nearest neighbors algorithm in R ourselves can get complicated,
 especially if we want to handle multiple classes, more than two variables,
 or predict the class for multiple new observations. Thankfully, in R,
-the $K$-nearest neighbors algorithm is implemented in the `parsnip` package
-included in the
-[`tidymodels` package](https://www.tidymodels.org/), along with
+the $K$-nearest neighbors algorithm is
+implemented in [the `parsnip` R package](https://parsnip.tidymodels.org/) [@parsnip]
+included in `tidymodels`, along with
 many [other models](https://www.tidymodels.org/find/parsnip/) \index{tidymodels}\index{parsnip}
 that you will encounter in this and future chapters of the book. The `tidymodels` collection
 provides tools to help make and use models, such as classifiers. Using the packages
@@ -723,7 +723,7 @@ distance (`weight_func = "rectangular"`). The `weight_func` argument controls
 how neighbors vote when classifying a new observation; by setting it to `"rectangular"`,
 each of the $K$ nearest neighbors gets exactly 1 vote as described above. Other choices,
 which weigh each neighbor's vote differently, can be found on
-[the `tidymodels` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html).
+[the `parsnip` website](https://parsnip.tidymodels.org/reference/nearest_neighbor.html).
 In the `set_engine` \index{tidymodels!engine} argument, we specify which package or system will be used for training
 the model. Here `kknn` is the R package we will use for performing $K$-nearest neighbors classification.
 Finally, we specify that this is a classification problem with the `set_mode` function.
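
The model specification these lines walk through amounts to a few chained `parsnip` calls. The sketch below is illustrative only: the `neighbors` value and the object name `knn_spec` are placeholders rather than code taken from the commit.

```r
library(tidymodels)

# Illustrative K-nearest neighbors specification:
# "rectangular" weighting gives each of the K neighbors exactly one vote,
# the kknn engine performs the computation, and the mode is classification.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_spec
```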
@@ -846,8 +846,9 @@ loaded, and the standardized version of that same data. But first, we need to
 standardize the `unscaled_cancer` data set with `tidymodels`.
 
 In the `tidymodels` framework, all data preprocessing happens
-using a [`recipe`](https://tidymodels.github.io/recipes/reference/index.html).
-Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for the `unscaled_cancer` data above, specifying
+using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes].
+Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for
+the `unscaled_cancer` data above, specifying
 that the `Class` variable is the target, and all other variables are predictors:
 
 ```{r 05-scaling-2}
@@ -856,7 +857,9 @@ print(uc_recipe)
 ```
 
 So far, there is not much in the recipe; just a statement about the number of targets
-and predictors. Let's add scaling (`step_scale`) \index{recipe!step\_scale} and centering (`step_center`) \index{recipe!step\_center} steps for
+and predictors. Let's add
+scaling (`step_scale`) \index{recipe!step\_scale} and
+centering (`step_center`) \index{recipe!step\_center} steps for
 all of the predictors so that they each have a mean of 0 and standard deviation of 1.
 Note that `tidyverse` actually provides `step_normalize`, which does both centering and scaling in
 a single recipe step; in this book we will keep `step_scale` and `step_center` separate
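
Taken together, the recipe these hunks describe would look roughly like the sketch below, assuming the `unscaled_cancer` data frame from the chapter is already loaded; the exact code is not shown in the commit.

```r
library(tidymodels)

# Illustrative recipe: `Class` is the target, every other column is a predictor,
# and each predictor is scaled (standard deviation 1) and centered (mean 0).
uc_recipe <- recipe(Class ~ ., data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

print(uc_recipe)
```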
@@ -885,8 +888,8 @@ For example:
 - `Area, Smoothness`: specify both the `Area` and `Smoothness` variable
 - `-Class`: specify everything except the `Class` variable
 
-You can find [a full set of all the steps and variable selection functions](https://tidymodels.github.io/recipes/reference/index.html)
-on the `recipes` home page.
+You can find a full set of all the steps and variable selection functions
+on the [`recipes` reference page](https://recipes.tidymodels.org/reference/index.html).
 
 At this point, we have calculated the required statistics based on the data input into the
 recipe, but the data are not yet scaled and centered. To actually scale and center
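
For the step this context line trails off into, actually applying the scaling and centering, the standard `recipes` workflow is `prep()` followed by `bake()`. A minimal sketch of that generic API, continuing from the recipe above; it is not necessarily the chapter's exact code.

```r
# Estimate the scaling and centering statistics from the data in the recipe ...
uc_recipe_prepped <- prep(uc_recipe)

# ... then apply them; new_data = NULL returns the processed training data.
scaled_cancer <- bake(uc_recipe_prepped, new_data = NULL)

scaled_cancer
```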
@@ -1412,7 +1415,7 @@ wkflw_plot
 ## Exercises
 
 Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_06/worksheet_06.ipynb).
+can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_classification1/worksheet_classification1.ipynb).
 The worksheet tries to provide automated feedback
 and help guide you through the problems.
 To make sure this functionality works as intended,

classification2.Rmd

Lines changed: 28 additions & 9 deletions
@@ -54,7 +54,7 @@ Sometimes our classifier might make the wrong prediction. A classifier does not
 need to be right 100\% of the time to be useful, though we don't want the
 classifier to make too many wrong predictions. How do we measure how "good" our
 classifier is? Let's revisit the \index{breast cancer}
-[breast cancer images example](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)
+[breast cancer images data](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) [@streetbreastcancer]
 and think about how our classifier will be used in practice. A biopsy will be
 performed on a *new* patient's tumor, the resulting image will be analyzed,
 and the classifier will be asked to decide whether the tumor is benign or
@@ -1172,7 +1172,7 @@ this chapter to find out where you can learn more about variable selection, incl
 The first idea you might think of for a systematic way to select predictors
 is to try all possible subsets of predictors and then pick the set that results in the "best" classifier.
 This procedure is indeed a well-known variable selection method referred to
-as *best subset selection*. \index{variable selection!best subset}\index{predictor selection|see{variable selection}}
+as *best subset selection* [@bealesubset; @hockingsubset]. \index{variable selection!best subset}\index{predictor selection|see{variable selection}}
 In particular, you
 
 1. create a separate model for every possible subset of predictors,
@@ -1194,7 +1194,7 @@ So although it is a simple method, best subset selection is usually too computat
 expensive to use in practice.
 
 Another idea is to iteratively build up a model by adding one predictor variable
-at a time. This method—known as *forward selection*—is also widely \index{variable selection!forward}
+at a time. This method—known as *forward selection* [@forwardefroymson; @forwarddraper]—is also widely \index{variable selection!forward}
 applicable and fairly straightforward. It involves the following steps:
 
 1. Start with a model having no predictors.
@@ -1273,9 +1273,9 @@ Finally, we need to write some code that performs the task of sequentially
 finding the best predictor to add to the model.
 If you recall the end of the wrangling chapter, we mentioned
 that sometimes one needs more flexible forms of iteration than what
-we have used earlier, and in these cases, one typically resorts to
-[a for loop](https://r4ds.had.co.nz/iteration.html#iteration).
-This is one of those cases! Here we will use two for loops:
+we have used earlier, and in these cases one typically resorts to
+a *for loop*; see [the chapter on iteration](https://r4ds.had.co.nz/iteration.html) in *R for Data Science* [@wickham2016r].
+Here we will use two for loops:
 one over increasing predictor set sizes
 (where you see `for (i in 1:length(names))` below),
 and another to check which predictor to add in each round (where you see `for (j in 1:length(names))` below).
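
The two nested for loops referenced in these context lines follow a standard forward-selection skeleton, sketched below. Everything in the sketch is illustrative: the predictor names are made up and `estimate_accuracy()` is a stand-in for the cross-validated accuracy of a classifier built on a candidate predictor set, not code from the commit.

```r
library(tibble)

# Hypothetical candidate predictors and a stand-in accuracy function.
names <- c("Perimeter", "Concavity", "Smoothness", "Area")
estimate_accuracy <- function(predictors) runif(1)  # placeholder accuracy

selected <- c()
accuracies <- tibble(size = integer(), predictors = character(), accuracy = numeric())

for (i in 1:length(names)) {            # grow the model by one predictor per round
  best_acc <- -Inf
  best_j <- 0
  for (j in 1:length(names)) {          # try each remaining predictor as the next addition
    acc <- estimate_accuracy(c(selected, names[[j]]))
    if (acc > best_acc) {
      best_acc <- acc
      best_j <- j
    }
  }
  selected <- c(selected, names[[best_j]])  # keep the best addition this round
  names <- names[-best_j]                   # remove it from the candidate pool
  accuracies <- add_row(accuracies, size = i,
                        predictors = paste(selected, collapse = " + "),
                        accuracy = best_acc)
}

accuracies
```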
@@ -1386,13 +1386,32 @@ fwd_sel_accuracies_plot
 ## Exercises
 
 Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_07/worksheet_07.ipynb).
+can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_classification2/worksheet_classification2.ipynb).
 The worksheet tries to provide automated feedback
 and help guide you through the problems.
 To make sure this functionality works as intended,
 please follow the instructions for computer setup needed to run the worksheets
 found in Chapter \@ref(move-to-your-own-machine).
 
 ## Additional resources
-- The [`tidymodels` website](https://tidymodels.org/packages) is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a [nice beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list of more advanced examples](https://www.tidymodels.org/learn/) that you can use to continue learning beyond the scope of this book. It's worth noting that the `tidymodels` package does a lot more than just classification, and so the examples on the website similarly go beyond classification as well. In the next two chapters, you'll learn about another kind of predictive modeling setting, so it might be worth visiting the website only after reading through those chapters.
-- [*An Introduction to Statistical Learning*](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail about cross-validation. Chapters 8 and 9 cover decision trees and support vector machines, two very popular but more advanced classification methods. Finally, Chapter 6 covers a number of methods for selecting predictor variables. Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we require.
+- The [`tidymodels` website](https://tidymodels.org/packages) is an excellent
+reference for more details on, and advanced usage of, the functions and
+packages in the past two chapters. Aside from that, it also has a [nice
+beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list
+of more advanced examples](https://www.tidymodels.org/learn/) that you can use
+to continue learning beyond the scope of this book. It's worth noting that the
+`tidymodels` package does a lot more than just classification, and so the
+examples on the website similarly go beyond classification as well. In the next
+two chapters, you'll learn about another kind of predictive modeling setting,
+so it might be worth visiting the website only after reading through those
+chapters.
+- *An Introduction to Statistical Learning* [@james2013introduction] provides
+a great next stop in the process of
+learning about classification. Chapter 4 discusses additional basic techniques
+for classification that we do not cover, such as logistic regression, linear
+discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail
+about cross-validation. Chapters 8 and 9 cover decision trees and support
+vector machines, two very popular but more advanced classification methods.
+Finally, Chapter 6 covers a number of methods for selecting predictor
+variables. Note that while this book is still a very accessible introductory
+text, it requires a bit more mathematical background than we require.

clustering.Rmd

Lines changed: 16 additions & 8 deletions
@@ -85,7 +85,7 @@ courses.
 As in the case of classification,
 there are many possible methods that we could use to cluster our observations
 to look for subgroups.
-In this book, we will focus on the widely used K-means \index{K-means} algorithm.
+In this book, we will focus on the widely used K-means \index{K-means} algorithm [@kmeans].
 In your future studies, you might encounter hierarchical clustering,
 principal component analysis, multidimensional scaling, and more;
 see the additional resources section at the end of this chapter
@@ -103,11 +103,11 @@ for where to begin learning more about these other methods.
 
 **An illustrative example**
 
-Here we will present an illustrative example using a data set \index{Palmer penguins} from the
-[{palmerpenguins} R data package](https://allisonhorst.github.io/palmerpenguins/). This data set was
-collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and
-the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/), and includes
-measurements for adult penguins found near there [@palmerpenguins]. We have
+Here we will present an illustrative example using a data set \index{Palmer penguins} from
+[the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/) [@palmerpenguins]. This
+data set was collected by Dr. Kristen Gorman and
+the Palmer Station, Antarctica Long Term Ecological Research Site, and includes
+measurements for adult penguins found near there [@penguinpaper]. We have
 modified the data set for use in this chapter. Here we will focus on using two
 variables—penguin bill and flipper length, both in millimeters—to determine whether
 there are distinct types of penguins in our data.
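
For readers following the diff, the analysis this passage sets up boils down to running K-means on two standardized columns of the `palmerpenguins` data. Below is a minimal sketch using base R's `kmeans()`; the choice of three clusters and the preprocessing details are assumptions, not the chapter's code.

```r
library(palmerpenguins)
library(tidyverse)

set.seed(1)  # kmeans uses random starting assignments

# Keep the two variables discussed above, drop missing values, and standardize.
standardized_penguins <- penguins |>
  select(bill_length_mm, flipper_length_mm) |>
  drop_na() |>
  mutate(across(everything(), ~ as.numeric(scale(.x))))

# Cluster into an assumed three groups and inspect the cluster sizes.
penguin_clusters <- kmeans(standardized_penguins, centers = 3)
table(penguin_clusters$cluster)
```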
@@ -1098,12 +1098,20 @@ elbow_plot
 ## Exercises
 
 Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_10/worksheet_10.ipynb).
+can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_clustering/worksheet_clustering.ipynb).
 The worksheet tries to provide automated feedback
 and help guide you through the problems.
 To make sure this functionality works as intended,
 please follow the instructions for computer setup needed to run the worksheets
 found in Chapter \@ref(move-to-your-own-machine).
 
 ## Additional resources
-- Chapter 10 of [*An Introduction to Statistical Learning*](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc., in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique for reducing the number of predictors in a dataset.
+- Chapter 10 of *An Introduction to Statistical
+Learning* [@james2013introduction] provides a
+great next stop in the process of learning about clustering and unsupervised
+learning in general. In the realm of clustering specifically, it provides a
+great companion introduction to K-means, but also covers *hierarchical*
+clustering for when you expect there to be subgroups, and then subgroups within
+subgroups, etc., in your data. In the realm of more general unsupervised
+learning, it covers *principal components analysis (PCA)*, which is a very
+popular technique for reducing the number of predictors in a dataset.
