classification1.Rmd: 15 additions & 1 deletion
@@ -4,6 +4,7 @@
library(formatR)
library(plotly)
library(knitr)
+library(kableExtra)

knitr::opts_chunk$set(echo = TRUE,
                      fig.align = "center")
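A side note on the setup chunk above, for readers less familiar with knitr: `knitr::opts_chunk$set()` defines document-wide defaults, and an option given in an individual chunk header overrides that default for that chunk only. A minimal sketch of the mechanism (the chunk label and override values below are hypothetical, not taken from this diff):

```r
# Document-wide defaults applied to all subsequent chunks
knitr::opts_chunk$set(echo = TRUE, fig.align = "center")

# An individual chunk can still override a default in its own header, e.g.:
# ```{r my-plot, fig.align = "default", fig.width = 5}
# plot(1:10)
# ```
```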
@@ -503,7 +504,10 @@ math_table <- math_table %>%
```

```{r 05-multiknn-mathtable, echo = FALSE}
-knitr::kable(math_table, booktabs = TRUE, caption = "Evaluating the distances from the new observation to each of its 5 nearest neighbors", escape = FALSE)
+kable(math_table, booktabs = TRUE,
+      caption = "Evaluating the distances from the new observation to each of its 5 nearest neighbors",
+      escape = FALSE) |>
+  kable_styling(latex_options = "hold_position")
```

The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are
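For context on the change above: the new code pipes the `kable()` table into `kableExtra::kable_styling()`, whose `hold_position` LaTeX option keeps the table near the code that produced it in PDF output. A standalone sketch of the same pattern, using a small made-up data frame in place of the book's `math_table`:

```r
library(knitr)
library(kableExtra)

# Made-up stand-in for math_table (a few neighbors and their distances)
example_table <- data.frame(neighbor = 1:3,
                            distance = c(0.7, 1.2, 1.9))

kable(example_table, booktabs = TRUE,
      caption = "Example caption",
      escape = FALSE) |>
  kable_styling(latex_options = "hold_position")  # keep the table from floating away in PDF output
```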
@@ -1369,3 +1373,13 @@ wkflw_plot <-
wkflw_plot
```
+
+## Exercises
+
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_06/worksheet_06.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_07/worksheet_07.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
+
## Additional resources

- The [`tidymodels` website](https://tidymodels.org/packages) is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a [nice beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list of more advanced examples](https://www.tidymodels.org/learn/) that you can use to continue learning beyond the scope of this book. It's worth noting that the `tidymodels` package does a lot more than just classification, and so the examples on the website similarly go beyond classification as well. In the next two chapters, you'll learn about another kind of predictive modeling setting, so it might be worth visiting the website only after reading through those chapters.

- [An Introduction to Statistical Learning](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail about cross-validation. Chapters 8 and 9 cover decision trees and support vector machines, two very popular but more advanced classification methods. Finally, Chapter 6 covers a number of methods for selecting predictor variables. Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we assume here.
clustering.Rmd: 30 additions & 10 deletions
@@ -164,7 +164,7 @@ penguin_data
Next, we can create a scatter plot using this data set
to see if we can detect subtypes or groups in our data set.

-```{r 10-toy-example-plot, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
+```{r 10-toy-example-plot, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
ggplot(data, aes(x = flipper_length_standardized,
                 y = bill_length_standardized)) +
  geom_point() +
@@ -198,7 +198,7 @@ This procedure will separate the data into groups;
Figure \@ref(fig:10-toy-example-clustering) shows these groups
denoted by colored scatter points.

-```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
+```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
ggplot(data, aes(y = bill_length_standardized,
                 x = flipper_length_standardized, color = cluster)) +
  geom_point() +
@@ -254,7 +254,9 @@ In the first cluster from the example, there are `r nrow(clus1)` data points. Th
(`r paste("flipper_length_standardized =", round(mean(clus1$flipper_length_standardized),2))` and `r paste("bill_length_standardized =", round(mean(clus1$bill_length_standardized),2))`) highlighted
in Figure \@ref(fig:10-toy-example-clus1-center).

-```{r 10-toy-example-clus1-center, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red."}
+(ref:10-toy-example-clus1-center) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red.
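For readers unfamiliar with the `(ref:...)` lines being added in these hunks: they are bookdown *text references*, which let a long caption be written once as its own paragraph and then pulled into a chunk's `fig.cap`. The chunk header that consumes the reference is not visible in this excerpt, so the sketch below is a generic illustration of the pattern with a hypothetical label and figure, not the exact lines from `clustering.Rmd`:

````markdown
(ref:my-fig-cap) A long caption defined once as a text reference; it can contain markdown and inline `code`.

```{r my-fig, echo = FALSE, fig.align = "center", fig.cap = "(ref:my-fig-cap)"}
plot(faithful)
```
````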
These distances are denoted by lines in Figure \@ref(fig:10-toy-example-clus1-dists) for the first cluster of the penguin data example.
-```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines."}
+(ref:10-toy-example-clus1-dists) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines.
@@ -336,7 +340,9 @@ this means adding up all the squared distances for the 18 observations.
These distances are denoted by black lines in
Figure \@ref(fig:10-toy-example-all-clus-dists).

-```{r 10-toy-example-all-clus-dists, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.cap = "All clusters from the `penguin_data` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines."}
+(ref:10-toy-example-all-clus-dists) All clusters from the `penguin_data` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines.
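As a quick illustration of the quantity these captions describe: the within-cluster sum of squared distances (WSSD) for one cluster is the sum of squared Euclidean distances from each observation to the cluster center. A small self-contained sketch with made-up standardized values (not the book's data):

```r
# Three made-up observations assigned to one cluster (standardized units)
cluster_points <- matrix(c(-0.5,  0.1,
                            0.2, -0.3,
                            0.4,  0.6),
                         ncol = 2, byrow = TRUE,
                         dimnames = list(NULL, c("flipper_length_standardized",
                                                 "bill_length_standardized")))

center <- colMeans(cluster_points)                          # cluster center: mean of each variable
wssd <- sum(rowSums(sweep(cluster_points, 2, center)^2))    # sum of squared distances to the center
wssd
```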
@@ -431,7 +437,9 @@ There each row corresponds to an iteration,
where the left column depicts the center update,
and the right column depicts the reassignment of data to clusters.

-```{r 10-toy-kmeans-iter, echo = FALSE, warning = FALSE, fig.height = 16, fig.width = 8, fig.cap = "First four iterations of K-means clustering on the `penguin_data` example data set. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black."}
+(ref:10-toy-kmeans-iter) First four iterations of K-means clustering on the `penguin_data` example data set. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
@@ -538,7 +546,7 @@ These, however, are beyond the scope of this book.
Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart,nstart} can get "stuck" in a bad solution.
For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
Figure \@ref(fig:10-toy-kmeans-bad-iter) shows what the iterations of K-means would look like with the unlucky random initialization shown in Figure \@ref(fig:10-toy-kmeans-bad-init).
-```{r 10-toy-kmeans-bad-iter, echo = FALSE, warning = FALSE, fig.height = 20, fig.width = 8, fig.cap = "First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black."}
+(ref:10-toy-kmeans-bad-iter) First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
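The captions above describe K-means getting stuck after an unlucky random initialization; the usual remedy, referred to later in this diff as "10 restarts", is to run the algorithm from several random starts and keep the best result. A hedged sketch using base R's `kmeans()` with made-up data (the book's own code may differ):

```r
set.seed(1234)  # make the random initializations reproducible

# Made-up standardized data standing in for the penguin example
standardized_data <- data.frame(flipper_length_standardized = rnorm(18),
                                bill_length_standardized    = rnorm(18))

# nstart = 10 runs K-means from 10 random initializations and keeps the run
# with the lowest total within-cluster sum of squared distances
penguin_clust <- kmeans(standardized_data, centers = 3, nstart = 10)
penguin_clust$tot.withinss
```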
@@ -949,7 +959,7 @@ but there is a trade-off that doing many clusterings
could take a long time.
So this is something that needs to be balanced.

-```{r 10-choose-k-nstart, fig.height = 4, fig.width = 4.35, message= F, warning = F, fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
+```{r 10-choose-k-nstart, fig.height = 4, fig.width = 4.35, message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_10/worksheet_10.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
+
## Additional resources

- Chapter 10 of [An Introduction to Statistical Learning](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc. in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique in scientific applications for reducing the number of predictors in a dataset.
+and [second worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_12/worksheet_12.ipynb)).
+The worksheets try to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
+
## Additional resources

- Chapters 7 to 10 of [Modern Dive](https://moderndive.com/) provide a great next step in learning about inference. In particular, Chapters 7 and 8 cover sampling and bootstrapping using `tidyverse` and `infer` in a slightly more in-depth manner than the present chapter. Chapters 9 and 10 take the next step beyond the scope of this chapter and begin to provide some of the initial mathematical underpinnings of inference and more advanced applications of the concept of inference in testing hypotheses and performing regression. This material offers a great starting point for getting more into the technical side of statistics.
intro.Rmd: 10 additions & 0 deletions
@@ -686,3 +686,13 @@ you about the different arguments and usage of functions that you have already l
```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.", fig.retina = 2, out.width="100%"}
knitr::include_graphics("img/help-filter.png")
```
+
+## Exercises
+
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_01/worksheet_01.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).