classification1.Rmd: 15 additions & 1 deletion
@@ -4,6 +4,7 @@
library(formatR)
library(plotly)
library(knitr)
+library(kableExtra)

knitr::opts_chunk$set(echo = TRUE,
                      fig.align = "center")
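A side note on the setup chunk above, for readers less familiar with knitr: `knitr::opts_chunk$set()` defines document-wide defaults, and an option given in an individual chunk header overrides that default for that chunk only. A minimal sketch of the mechanism (the chunk label and override values below are hypothetical, not taken from this diff):

```r
# Document-wide defaults applied to all subsequent chunks
knitr::opts_chunk$set(echo = TRUE, fig.align = "center")

# An individual chunk can still override a default in its own header, e.g.:
# ```{r my-plot, fig.align = "default", fig.width = 5}
# plot(1:10)
# ```
```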
@@ -503,7 +504,10 @@ math_table <- math_table %>%
```

```{r 05-multiknn-mathtable, echo = FALSE}
-knitr::kable(math_table, booktabs = TRUE, caption = "Evaluating the distances from the new observation to each of its 5 nearest neighbors", escape = FALSE)
+kable(math_table, booktabs = TRUE,
+      caption = "Evaluating the distances from the new observation to each of its 5 nearest neighbors",
+      escape = FALSE) |>
+  kable_styling(latex_options = "hold_position")
```

The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are
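For context on the change above: the new code pipes the `kable()` table into `kableExtra::kable_styling()`, whose `hold_position` LaTeX option keeps the table near the code that produced it in PDF output. A standalone sketch of the same pattern, using a small made-up data frame in place of the book's `math_table`:

```r
library(knitr)
library(kableExtra)

# Made-up stand-in for math_table (a few neighbors and their distances)
example_table <- data.frame(neighbor = 1:3,
                            distance = c(0.7, 1.2, 1.9))

kable(example_table, booktabs = TRUE,
      caption = "Example caption",
      escape = FALSE) |>
  kable_styling(latex_options = "hold_position")  # keep the table from floating away in PDF output
```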
@@ -1369,3 +1373,13 @@ wkflw_plot <-
wkflw_plot
```
+
+## Exercises
+
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_06/worksheet_06.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_07/worksheet_07.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
+
## Additional resources

- The [`tidymodels` website](https://tidymodels.org/packages) is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a [nice beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list of more advanced examples](https://www.tidymodels.org/learn/) that you can use to continue learning beyond the scope of this book. It's worth noting that the `tidymodels` package does a lot more than just classification, and so the examples on the website similarly go beyond classification as well. In the next two chapters, you'll learn about another kind of predictive modeling setting, so it might be worth visiting the website only after reading through those chapters.

- [An Introduction to Statistical Learning](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail about cross-validation. Chapters 8 and 9 cover decision trees and support vector machines, two very popular but more advanced classification methods. Finally, Chapter 6 covers a number of methods for selecting predictor variables. Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we assume here.
clustering.Rmd: 30 additions & 10 deletions
@@ -164,7 +164,7 @@ penguin_data
Next, we can create a scatter plot using this data set
to see if we can detect subtypes or groups in our data set.

-```{r 10-toy-example-plot, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
+```{r 10-toy-example-plot, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
ggplot(data, aes(x = flipper_length_standardized,
                 y = bill_length_standardized)) +
  geom_point() +
@@ -198,7 +198,7 @@ This procedure will separate the data into groups;
Figure \@ref(fig:10-toy-example-clustering) shows these groups
denoted by colored scatter points.

-```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
+```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
ggplot(data, aes(y = bill_length_standardized,
                 x = flipper_length_standardized, color = cluster)) +
  geom_point() +
@@ -254,7 +254,9 @@ In the first cluster from the example, there are `r nrow(clus1)` data points. Th
(`r paste("flipper_length_standardized =", round(mean(clus1$flipper_length_standardized),2))` and `r paste("bill_length_standardized =", round(mean(clus1$bill_length_standardized),2))`) highlighted
in Figure \@ref(fig:10-toy-example-clus1-center).

-```{r 10-toy-example-clus1-center, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red."}
+(ref:10-toy-example-clus1-center) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red.
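For readers unfamiliar with the `(ref:...)` lines being added in these hunks: they are bookdown *text references*, which let a long caption be written once as its own paragraph and then pulled into a chunk's `fig.cap`. The chunk header that consumes the reference is not visible in this excerpt, so the sketch below is a generic illustration of the pattern with a hypothetical label and figure, not the exact lines from `clustering.Rmd`:

````markdown
(ref:my-fig-cap) A long caption defined once as a text reference; it can contain markdown and inline `code`.

```{r my-fig, echo = FALSE, fig.align = "center", fig.cap = "(ref:my-fig-cap)"}
plot(faithful)
```
````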
These distances are denoted by lines in Figure \@ref(fig:10-toy-example-clus1-dists) for the first cluster of the penguin data example.
-```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines."}
+(ref:10-toy-example-clus1-dists) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines.
@@ -336,7 +340,9 @@ this means adding up all the squared distances for the 18 observations.
These distances are denoted by black lines in
Figure \@ref(fig:10-toy-example-all-clus-dists).

-```{r 10-toy-example-all-clus-dists, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.cap = "All clusters from the `penguin_data` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines."}
+(ref:10-toy-example-all-clus-dists) All clusters from the `penguin_data` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines.
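As a quick illustration of the quantity these captions describe: the within-cluster sum of squared distances (WSSD) for one cluster is the sum of squared Euclidean distances from each observation to the cluster center. A small self-contained sketch with made-up standardized values (not the book's data):

```r
# Three made-up observations assigned to one cluster (standardized units)
cluster_points <- matrix(c(-0.5,  0.1,
                            0.2, -0.3,
                            0.4,  0.6),
                         ncol = 2, byrow = TRUE,
                         dimnames = list(NULL, c("flipper_length_standardized",
                                                 "bill_length_standardized")))

center <- colMeans(cluster_points)                          # cluster center: mean of each variable
wssd <- sum(rowSums(sweep(cluster_points, 2, center)^2))    # sum of squared distances to the center
wssd
```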
@@ -431,7 +437,9 @@ There each row corresponds to an iteration,
where the left column depicts the center update,
and the right column depicts the reassignment of data to clusters.

-```{r 10-toy-kmeans-iter, echo = FALSE, warning = FALSE, fig.height = 16, fig.width = 8, fig.cap = "First four iterations of K-means clustering on the `penguin_data` example data set. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black."}
+(ref:10-toy-kmeans-iter) First four iterations of K-means clustering on the `penguin_data` example data set. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
@@ -538,7 +546,7 @@ These, however, are beyond the scope of this book.
Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart,nstart} can get "stuck" in a bad solution.
For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
Figure \@ref(fig:10-toy-kmeans-bad-iter) shows what the iterations of K-means would look like with the unlucky random initialization shown in Figure \@ref(fig:10-toy-kmeans-bad-init).
-```{r 10-toy-kmeans-bad-iter, echo = FALSE, warning = FALSE, fig.height = 20, fig.width = 8, fig.cap = "First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black."}
+(ref:10-toy-kmeans-bad-iter) First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
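The captions above describe K-means getting stuck after an unlucky random initialization; the usual remedy, referred to later in this diff as "10 restarts", is to run the algorithm from several random starts and keep the best result. A hedged sketch using base R's `kmeans()` with made-up data (the book's own code may differ):

```r
set.seed(1234)  # make the random initializations reproducible

# Made-up standardized data standing in for the penguin example
standardized_data <- data.frame(flipper_length_standardized = rnorm(18),
                                bill_length_standardized    = rnorm(18))

# nstart = 10 runs K-means from 10 random initializations and keeps the run
# with the lowest total within-cluster sum of squared distances
penguin_clust <- kmeans(standardized_data, centers = 3, nstart = 10)
penguin_clust$tot.withinss
```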
@@ -949,7 +959,7 @@ but there is a trade-off that doing many clusterings
could take a long time.
So this is something that needs to be balanced.

-```{r 10-choose-k-nstart, fig.height = 4, fig.width = 4.35, message= F, warning = F, fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
+```{r 10-choose-k-nstart, fig.height = 4, fig.width = 4.35, message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_10/worksheet_10.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
+
## Additional resources

- Chapter 10 of [An Introduction to Statistical Learning](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc. in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique in scientific applications for reducing the number of predictors in a dataset.
+and [second worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_12/worksheet_12.ipynb)).
+The worksheets try to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
+
## Additional resources

- Chapters 7 to 10 of [Modern Dive](https://moderndive.com/) provide a great next step in learning about inference. In particular, Chapters 7 and 8 cover sampling and bootstrapping using `tidyverse` and `infer` in a slightly more in-depth manner than the present chapter. Chapters 9 and 10 take the next step beyond the scope of this chapter and begin to provide some of the initial mathematical underpinnings of inference and more advanced applications of the concept of inference in testing hypotheses and performing regression. This material offers a great starting point for getting more into the technical side of statistics.
intro.Rmd: 10 additions & 0 deletions
@@ -686,3 +686,13 @@ you about the different arguments and usage of functions that you have already l
```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.", fig.retina = 2, out.width="100%"}
knitr::include_graphics("img/help-filter.png")
```
+
+## Exercises
+
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_01/worksheet_01.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).