In the `set_engine` \index{tidymodels!engine} argument, we specify which package or system will be used for training
the model. Here `kknn` is the R package we will use for performing $K$-nearest neighbors classification.
Finally, we specify that this is a classification problem with the `set_mode` function.
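As an illustrative sketch (not necessarily the chapter's exact code), the three pieces described above might be combined into a single model specification as follows; the `neighbors = 5` and `weight_func = "rectangular"` values are assumptions for the example, not the chapter's chosen settings.

```r
library(tidymodels)

# Illustrative K-nearest neighbors specification: the engine is kknn and the mode is classification.
# The neighbors and weight_func values are placeholders, not the chapter's chosen values.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_spec
```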
@@ -846,8 +846,9 @@ loaded, and the standardized version of that same data. But first, we need to
standardize the `unscaled_cancer` data set with `tidymodels`.
In the `tidymodels` framework, all data preprocessing happens
-using a [`recipe`](https://tidymodels.github.io/recipes/reference/index.html).
-Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for the `unscaled_cancer` data above, specifying
+using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes].
+Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for
+the `unscaled_cancer` data above, specifying
that the `Class` variable is the target, and all other variables are predictors:
```{r 05-scaling-2}
@@ -856,7 +857,9 @@ print(uc_recipe)
```
So far, there is not much in the recipe; just a statement about the number of targets
-and predictors. Let's add scaling (`step_scale`) \index{recipe!step\_scale} and centering (`step_center`) \index{recipe!step\_center} steps for
+and predictors. Let's add
+scaling (`step_scale`) \index{recipe!step\_scale} and
+centering (`step_center`) \index{recipe!step\_center} steps for
all of the predictors so that they each have a mean of 0 and standard deviation of 1.
Note that `tidymodels` actually provides `step_normalize`, which does both centering and scaling in
a single recipe step; in this book we will keep `step_scale` and `step_center` separate
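As a rough sketch of what the added steps might look like (the `uc_recipe` object name and the use of `all_predictors()` are assumptions for illustration, and this assumes `unscaled_cancer` has been read in as in the chapter):

```r
library(tidymodels)

# Assumed sketch: add scaling and centering steps for every predictor in the recipe.
uc_recipe <- recipe(Class ~ ., data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

uc_recipe
```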
@@ -885,8 +888,8 @@ For example:
-`Area, Smoothness`: specify both the `Area` and `Smoothness` variables
-`-Class`: specify everything except the `Class` variable
-You can find [a full set of all the steps and variable selection functions](https://tidymodels.github.io/recipes/reference/index.html)
-on the `recipes`home page.
+You can find a full set of all the steps and variable selection functions
+on the [`recipes` reference page](https://recipes.tidymodels.org/reference/index.html).
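For instance, a hedged sketch of how those selectors could appear inside recipe steps (the column names come from the breast cancer data, and this particular combination of steps is illustrative only, assuming all columns other than `Class` are numeric):

```r
library(tidymodels)

# Illustrative only: center just two named columns, then scale everything except Class.
recipe(Class ~ ., data = unscaled_cancer) |>
  step_center(Area, Smoothness) |>
  step_scale(-Class)
```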
At this point, we have calculated the required statistics based on the data input into the
recipe, but the data are not yet scaled and centered. To actually scale and center
@@ -1412,7 +1415,7 @@ wkflw_plot
## Exercises
Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_06/worksheet_06.ipynb).
+can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_classification1/worksheet_classification1.ipynb).
The worksheet tries to provide automated feedback
and help guide you through the problems.
To make sure this functionality works as intended,
classification2.Rmd (28 additions, 9 deletions)
@@ -54,7 +54,7 @@ Sometimes our classifier might make the wrong prediction. A classifier does not
need to be right 100\% of the time to be useful, though we don't want the
classifier to make too many wrong predictions. How do we measure how "good" our
classifier is? Let's revisit the \index{breast cancer}
-[breast cancer images example](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)
+[breast cancer images data](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)[@streetbreastcancer]
and think about how our classifier will be used in practice. A biopsy will be
performed on a *new* patient's tumor, the resulting image will be analyzed,
and the classifier will be asked to decide whether the tumor is benign or
@@ -1172,7 +1172,7 @@ this chapter to find out where you can learn more about variable selection, incl
The first idea you might think of for a systematic way to select predictors
is to try all possible subsets of predictors and then pick the set that results in the "best" classifier.
This procedure is indeed a well-known variable selection method referred to
-as *best subset selection*. \index{variable selection!best subset}\index{predictor selection|see{variable selection}}
+as *best subset selection*[@bealesubset; @hockingsubset]. \index{variable selection!best subset}\index{predictor selection|see{variable selection}}
In particular, you
1. create a separate model for every possible subset of predictors,
@@ -1194,7 +1194,7 @@ So although it is a simple method, best subset selection is usually too computat
expensive to use in practice.
Another idea is to iteratively build up a model by adding one predictor variable
-at a time. This method—known as *forward selection*—is also widely \index{variable selection!forward}
+at a time. This method—known as *forward selection*[@forwardefroymson; @forwarddraper]—is also widely \index{variable selection!forward}
applicable and fairly straightforward. It involves the following steps:
1. Start with a model having no predictors.
@@ -1273,9 +1273,9 @@ Finally, we need to write some code that performs the task of sequentially
finding the best predictor to add to the model.
If you recall the end of the wrangling chapter, we mentioned
that sometimes one needs more flexible forms of iteration than what
-we have used earlier, and in these cases, one typically resorts to
-[a for loop](https://r4ds.had.co.nz/iteration.html#iteration).
-This is one of those cases! Here we will use two for loops:
+we have used earlier, and in these cases one typically resorts to
+a *for loop*; see [the chapter on iteration](https://r4ds.had.co.nz/iteration.html) in *R for Data Science*[@wickham2016r].
+Here we will use two for loops:
one over increasing predictor set sizes
(where you see `for (i in 1:length(names))` below),
and another to check which predictor to add in each round (where you see `for (j in 1:length(names))` below).
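A compact sketch of that nested-loop structure is below. It is not the chapter's code: the candidate predictor names are assumed, and `estimate_accuracy()` is a hypothetical stand-in for the cross-validation accuracy computation the text describes.

```r
# Hypothetical stand-in for cross-validation accuracy of a model using the given predictors;
# in practice this would fit and evaluate a K-NN workflow rather than return a random number.
estimate_accuracy <- function(predictors) {
  runif(1)  # placeholder value so the sketch runs end to end
}

names <- c("Smoothness", "Concavity", "Perimeter")  # assumed candidate predictors
selected <- c()
accuracies <- c()
for (i in 1:length(names)) {                 # outer loop: add one predictor per iteration
  accs <- numeric(length(names))
  for (j in 1:length(names)) {               # inner loop: try each remaining candidate
    if (names[j] %in% selected) {
      accs[j] <- -Inf                        # skip predictors that are already selected
    } else {
      accs[j] <- estimate_accuracy(c(selected, names[j]))
    }
  }
  best <- which.max(accs)
  selected <- c(selected, names[best])       # greedily keep the best addition this round
  accuracies <- c(accuracies, accs[best])
}
```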
@@ -1386,13 +1386,32 @@ fwd_sel_accuracies_plot
## Exercises
Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_07/worksheet_07.ipynb).
+can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_classification2/worksheet_classification2.ipynb).
The worksheet tries to provide automated feedback
and help guide you through the problems.
To make sure this functionality works as intended,
please follow the instructions for computer setup needed to run the worksheets
found in Chapter \@ref(move-to-your-own-machine).
## Additional resources
-- The [`tidymodels` website](https://tidymodels.org/packages) is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a [nice beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list of more advanced examples](https://www.tidymodels.org/learn/) that you can use to continue learning beyond the scope of this book. It's worth noting that the `tidymodels` package does a lot more than just classification, and so the examples on the website similarly go beyond classification as well. In the next two chapters, you'll learn about another kind of predictive modeling setting, so it might be worth visiting the website only after reading through those chapters.
--[*An Introduction to Statistical Learning*](https://www.statlearning.com/)[-@james2013introduction] provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail about cross-validation. Chapters 8 and 9 cover decision trees and support vector machines, two very popular but more advanced classification methods. Finally, Chapter 6 covers a number of methods for selecting predictor variables. Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we require.
+- The [`tidymodels` website](https://tidymodels.org/packages) is an excellent
+reference for more details on, and advanced usage of, the functions and
+packages in the past two chapters. Aside from that, it also has a [nice
+beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list
+of more advanced examples](https://www.tidymodels.org/learn/) that you can use
+to continue learning beyond the scope of this book. It's worth noting that the
+`tidymodels` package does a lot more than just classification, and so the
+examples on the website similarly go beyond classification as well. In the next
+two chapters, you'll learn about another kind of predictive modeling setting,
+so it might be worth visiting the website only after reading through those
+chapters.
+- *An Introduction to Statistical Learning*[@james2013introduction] provides
+a great next stop in the process of
+learning about classification. Chapter 4 discusses additional basic techniques
+for classification that we do not cover, such as logistic regression, linear
+discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail
+about cross-validation. Chapters 8 and 9 cover decision trees and support
+vector machines, two very popular but more advanced classification methods.
+Finally, Chapter 6 covers a number of methods for selecting predictor
+variables. Note that while this book is still a very accessible introductory
+text, it requires a bit more mathematical background than we require.
clustering.Rmd (16 additions, 8 deletions)
@@ -85,7 +85,7 @@ courses.
As in the case of classification,
there are many possible methods that we could use to cluster our observations
to look for subgroups.
-In this book, we will focus on the widely used K-means \index{K-means} algorithm.
+In this book, we will focus on the widely used K-means \index{K-means} algorithm[@kmeans].
In your future studies, you might encounter hierarchical clustering,
principal component analysis, multidimensional scaling, and more;
see the additional resources section at the end of this chapter
@@ -103,11 +103,11 @@ for where to begin learning more about these other methods.
**An illustrative example**
-Here we will present an illustrative example using a data set \index{Palmer penguins} from the
-[{palmerpenguins} R data package](https://allisonhorst.github.io/palmerpenguins/). This data set was
-collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and
-the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/), and includes
-measurements for adult penguins found near there [@palmerpenguins]. We have
+Here we will present an illustrative example using a data set \index{Palmer penguins} from
+[the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/)[@palmerpenguins]. This
+data set was collected by Dr. Kristen Gorman and
+the Palmer Station, Antarctica Long Term Ecological Research Site, and includes
+measurements for adult penguins found near there [@penguinpaper]. We have
modified the data set for use in this chapter. Here we will focus on using two
variables—penguin bill and flipper length, both in millimeters—to determine whether
there are distinct types of penguins in our data.
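As a hedged sketch of what a K-means fit on those two variables could look like, the code below uses the raw `palmerpenguins` data rather than the chapter's modified data set; the choice of three clusters and the standardization step are assumptions for illustration.

```r
library(tidyverse)
library(palmerpenguins)

# Sketch only: cluster penguins on two standardized measurements with K-means.
penguin_data <- penguins |>
  select(bill_length_mm, flipper_length_mm) |>
  drop_na()

standardized_data <- scale(penguin_data)                     # give both variables mean 0 and sd 1
penguin_clusters <- kmeans(standardized_data, centers = 3)   # centers = 3 is an assumed choice
penguin_clusters$cluster[1:10]                               # cluster assignments for the first few penguins
```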
@@ -1098,12 +1098,20 @@ elbow_plot
## Exercises
Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_10/worksheet_10.ipynb).
+can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_clustering/worksheet_clustering.ipynb).
The worksheet tries to provide automated feedback
and help guide you through the problems.
To make sure this functionality works as intended,
please follow the instructions for computer setup needed to run the worksheets
found in Chapter \@ref(move-to-your-own-machine).
## Additional resources
-- Chapter 10 of [*An Introduction to Statistical Learning*](https://www.statlearning.com/)[-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc., in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique for reducing the number of predictors in a dataset.
+- Chapter 10 of *An Introduction to Statistical
+Learning*[@james2013introduction] provides a
+great next stop in the process of learning about clustering and unsupervised
+learning in general. In the realm of clustering specifically, it provides a
+great companion introduction to K-means, but also covers *hierarchical*
+clustering for when you expect there to be subgroups, and then subgroups within
+subgroups, etc., in your data. In the realm of more general unsupervised
+learning, it covers *principal components analysis (PCA)*, which is a very
+popular technique for reducing the number of predictors in a dataset.