Commit 14dac93

Merge pull request #350 from UBC-DSCI/patch-fig-floating
Patch fig floating
2 parents: dad07d8 + ca1562b

17 files changed: 177 additions, 15 deletions

build_html.sh

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 # Script to generate HTML book
-docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.16.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience; Rscript _build_html.r"
+docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.18.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience; Rscript _build_html.r"

build_pdf.sh

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ cp -r data/ pdf/data
 cp -r img/ pdf/img
 
 # Build the book with bookdown
-docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.16.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience/pdf; Rscript _build_pdf.r"
+docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.18.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience/pdf; Rscript _build_pdf.r"
 
 # clean files in pdf dir
 rm -rf pdf/references.bib

classification1.Rmd

Lines changed: 15 additions & 1 deletion
@@ -4,6 +4,7 @@
 library(formatR)
 library(plotly)
 library(knitr)
+library(kableExtra)
 
 knitr::opts_chunk$set(echo = TRUE,
                       fig.align = "center")

@@ -503,7 +504,10 @@ math_table <- math_table %>%
 ```
 
 ```{r 05-multiknn-mathtable, echo = FALSE}
-knitr::kable(math_table, booktabs = TRUE, caption = "Evaluating the distances from the new observation to each of its 5 nearest neighbors", escape = FALSE)
+kable(math_table, booktabs = TRUE,
+      caption = "Evaluating the distances from the new observation to each of its 5 nearest neighbors",
+      escape = FALSE) |>
+  kable_styling(latex_options = "hold_position")
 ```
 
 The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are

@@ -1369,3 +1373,13 @@ wkflw_plot <-
 
 wkflw_plot
 ```
+
+## Exercises
+
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_06/worksheet_06.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
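
Note on the table change above: `kable()` alone emits a floating LaTeX `table` environment, so in PDF output the table can drift away from the surrounding text. Piping through `kableExtra::kable_styling(latex_options = "hold_position")` adds a `[!h]` placement hint that holds the table near its source position. A minimal sketch of the same pattern on toy data (the data frame below is illustrative, not from the book):

```r
library(knitr)
library(kableExtra)

# Toy stand-in for the book's math_table
toy_table <- data.frame(neighbor = 1:3, distance = c(0.5, 1.2, 2.0))

kable(toy_table, booktabs = TRUE,
      caption = "Distances to the three nearest neighbors (toy data)") |>
  kable_styling(latex_options = "hold_position")  # adds [!h] so the table holds position in PDF
```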

classification2.Rmd

Lines changed: 10 additions & 0 deletions
@@ -1342,6 +1342,16 @@ fwd_sel_accuracies_plot <- accuracies |>
 fwd_sel_accuracies_plot
 ```
 
+## Exercises
+
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_07/worksheet_07.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
+
 ## Additional resources
 - The [`tidymodels` website](https://tidymodels.org/packages) is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a [nice beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list of more advanced examples](https://www.tidymodels.org/learn/) that you can use to continue learning beyond the scope of this book. It's worth noting that the `tidymodels` package does a lot more than just classification, and so the examples on the website similarly go beyond classification as well. In the next two chapters, you'll learn about another kind of predictive modeling setting, so it might be worth visiting the website only after reading through those chapters.
 - [An Introduction to Statistical Learning](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about classification. Chapter 4 discusses additional basic techniques for classification that we do not cover, such as logistic regression, linear discriminant analysis, and naive Bayes. Chapter 5 goes into much more detail about cross-validation. Chapters 8 and 9 cover decision trees and support vector machines, two very popular but more advanced classification methods. Finally, Chapter 6 covers a number of methods for selecting predictor variables. Note that while this book is still a very accessible introductory text, it requires a bit more mathematical background than we require.

clustering.Rmd

Lines changed: 30 additions & 10 deletions
@@ -164,7 +164,7 @@ penguin_data
 Next, we can create a scatter plot using this data set
 to see if we can detect subtypes or groups in our data set.
 
-```{r 10-toy-example-plot, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
+```{r 10-toy-example-plot, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length."}
 ggplot(data, aes(x = flipper_length_standardized,
                  y = bill_length_standardized)) +
   geom_point() +

@@ -198,7 +198,7 @@ This procedure will separate the data into groups;
 Figure \@ref(fig:10-toy-example-clustering) shows these groups
 denoted by colored scatter points.
 
-```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
+```{r 10-toy-example-clustering, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.align = "center", fig.cap = "Scatter plot of standardized bill length versus standardized flipper length with colored groups."}
 ggplot(data, aes(y = bill_length_standardized,
                  x = flipper_length_standardized, color = cluster)) +
   geom_point() +

@@ -254,7 +254,9 @@ In the first cluster from the example, there are `r nrow(clus1)` data points. Th
 (`r paste("flipper_length_standardized =", round(mean(clus1$flipper_length_standardized),2))` and `r paste("bill_length_standardized =", round(mean(clus1$bill_length_standardized),2))`) highlighted
 in Figure \@ref(fig:10-toy-example-clus1-center).
 
-```{r 10-toy-example-clus1-center, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red."}
+(ref:10-toy-example-clus1-center) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red.
+
+```{r 10-toy-example-clus1-center, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-center)"}
 base <- ggplot(data, aes(x = flipper_length_standardized, y = bill_length_standardized)) +
   geom_point() +
   xlab("Flipper Length (standardized)") +

@@ -299,7 +301,9 @@ S^2 = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (
 
 These distances are denoted by lines in Figure \@ref(fig:10-toy-example-clus1-dists) for the first cluster of the penguin data example.
 
-```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines."}
+(ref:10-toy-example-clus1-dists) Cluster 1 from the `penguin_data` data set example. Observations are in blue, with the cluster center highlighted in red. The distances from the observations to the cluster center are represented as black lines.
+
+```{r 10-toy-example-clus1-dists, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.align = "center", fig.cap = "(ref:10-toy-example-clus1-dists)"}
 base <- ggplot(clus1) +
   geom_point(aes(y = bill_length_standardized,
                  x = flipper_length_standardized),

@@ -336,7 +340,9 @@ this means adding up all the squared distances for the 18 observations.
 These distances are denoted by black lines in
 Figure \@ref(fig:10-toy-example-all-clus-dists).
 
-```{r 10-toy-example-all-clus-dists, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.cap = "All clusters from the `penguin_data` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines."}
+(ref:10-toy-example-all-clus-dists) All clusters from the `penguin_data` data set example. Observations are in orange, blue, and yellow with the cluster center highlighted in red. The distances from the observations to each of the respective cluster centers are represented as black lines.
+
+```{r 10-toy-example-all-clus-dists, echo = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.align = "center", fig.cap = "(ref:10-toy-example-all-clus-dists)"}
 
 all_clusters_base <- data |>

@@ -402,7 +408,7 @@ and randomly assigning a roughly equal number of observations
 to each of the K clusters.
 An example random initialization is shown in Figure \@ref(fig:10-toy-kmeans-init).
 
-```{r 10-toy-kmeans-init, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Random initialization of labels."}
+```{r 10-toy-kmeans-init, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 4.35, fig.align = "center", fig.cap = "Random initialization of labels."}
 set.seed(14)
 penguin_data["label"] <- factor(sample(1:3, nrow(penguin_data), replace = TRUE))

@@ -431,7 +437,9 @@ There each row corresponds to an iteration,
 where the left column depicts the center update,
 and the right column depicts the reassignment of data to clusters.
 
-```{r 10-toy-kmeans-iter, echo = FALSE, warning = FALSE, fig.height = 16, fig.width = 8, fig.cap = "First four iterations of K-means clustering on the `penguin_data` example data set. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black."}
+(ref:10-toy-kmeans-iter) First four iterations of K-means clustering on the `penguin_data` example data set. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
+
+```{r 10-toy-kmeans-iter, echo = FALSE, warning = FALSE, fig.height = 16, fig.width = 8, fig.align = "center", fig.cap = "(ref:10-toy-kmeans-iter)"}
 list_plot_cntrs <- vector(mode = "list", length = 4)
 list_plot_lbls <- vector(mode = "list", length = 4)

@@ -538,7 +546,7 @@ These, however, are beyond the scope of this book.
 Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart,nstart} can get "stuck" in a bad solution.
 For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
 
-```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 4, fig.width = 4.35, fig.cap = "Random initialization of labels."}
+```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 4, fig.width = 4.35, fig.align = "center", fig.cap = "Random initialization of labels."}
 penguin_data <- penguin_data |>
   mutate(label = as_factor(c(3L, 3L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
                              1L, 3L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)))

@@ -557,7 +565,9 @@ plt_lbl
 
 Figure \@ref(fig:10-toy-kmeans-bad-iter) shows what the iterations of K-means would look like with the unlucky random initialization shown in Figure \@ref(fig:10-toy-kmeans-bad-init).
 
-```{r 10-toy-kmeans-bad-iter, echo = FALSE, warning = FALSE, fig.height = 20, fig.width = 8, fig.cap = "First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black."}
+(ref:10-toy-kmeans-bad-iter) First five iterations of K-means clustering on the `penguin_data` example data set with a poor random initialization. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black.
+
+```{r 10-toy-kmeans-bad-iter, echo = FALSE, warning = FALSE, fig.height = 20, fig.width = 8, fig.align = "center", fig.cap = "(ref:10-toy-kmeans-bad-iter)"}
 list_plot_cntrs <- vector(mode = "list", length = 5)
 list_plot_lbls <- vector(mode = "list", length = 5)

@@ -949,7 +959,7 @@ but there is a trade-off that doing many clusterings
 could take a long time.
 So this is something that needs to be balanced.
 
-```{r 10-choose-k-nstart, fig.height = 4, fig.width = 4.35, message= F, warning = F, fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
+```{r 10-choose-k-nstart, fig.height = 4, fig.width = 4.35, message= FALSE, warning = FALSE, fig.align = "center", fig.cap = "A plot showing the total WSSD versus the number of clusters when K-means is run with 10 restarts."}
 penguin_clust_ks <- tibble(k = 1:9) |>
   rowwise() |>
   mutate(penguin_clusts = list(kmeans(standardized_data, nstart = 10, k)),

@@ -968,5 +978,15 @@ elbow_plot <- ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
 elbow_plot
 ```
 
+## Exercises
+
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_10/worksheet_10.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
+
 ## Additional resources
 - Chapter 10 of [An Introduction to Statistical Learning](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc. in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique in scientific applications for reducing the number of predictors in a dataset.
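
The recurring edit in this file swaps each long `fig.cap` string for a bookdown *text reference*: a standalone `(ref:label)` paragraph defines the caption once, and `fig.cap = "(ref:label)"` points at it. This lets the caption carry markdown, such as the backticked `penguin_data`, that would not render reliably inside a quoted chunk option. A minimal sketch of the pattern, with a hypothetical label and placeholder plot:

````markdown
(ref:toy-fig) A caption with *markdown* and `inline code`, defined once and reused below.

```{r toy-fig, echo = FALSE, fig.align = "center", fig.cap = "(ref:toy-fig)"}
plot(cars)  # placeholder figure; any plot-producing code works here
```
````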

docker-compose.yml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 services:
   book-env:
-    image: ubcdsci/intro-to-ds:v0.16.0
+    image: ubcdsci/intro-to-ds:v0.18.0
     ports:
       - "8787:8787"
     volumes:

inference.Rmd

Lines changed: 12 additions & 0 deletions
@@ -1158,6 +1158,18 @@ more. We have just scratched the surface of statistical inference; however, the
 material presented here will serve as the foundation for more advanced
 statistical techniques you may learn about in the future!
 
+## Exercises
+
+Practice exercises for the material covered in this chapter
+can be found in the two accompanying worksheets
+([first worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_11/worksheet_11.ipynb)
+and [second worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_12/worksheet_12.ipynb)).
+The worksheets try to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).
+
 ## Additional resources
 
 - Chapters 7 to 10 of [Modern Dive](https://moderndive.com/) provide a great next step in learning about inference. In particular, Chapters 7 and 8 cover sampling and bootstrapping using `tidyverse` and `infer` in a slightly more in-depth manner than the present chapter. Chapters 9 and 10 take the next step beyond the scope of this chapter and begin to provide some of the initial mathematical underpinnings of inference and more advanced applications of the concept of inference in testing hypotheses and performing regression. This material offers a great starting point for getting more into the technical side of statistics.

intro.Rmd

Lines changed: 10 additions & 0 deletions
@@ -686,3 +686,13 @@ you about the different arguments and usage of functions that you have already l
 ```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.", fig.retina = 2, out.width="100%"}
 knitr::include_graphics("img/help-filter.png")
 ```
+
+## Exercises
+
+Practice exercises for the material covered in this chapter
+can be found in the accompanying [worksheet](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets/worksheet_01/worksheet_01.ipynb).
+The worksheet tries to provide automated feedback
+and help guide you through the problems.
+To make sure this functionality works as intended,
+please follow the instructions for computer setup needed to run the worksheets
+found in Chapter \@ref(move-to-your-own-machine).

pdf/index.Rmd

Lines changed: 8 additions & 0 deletions
@@ -6,6 +6,8 @@ knit: "bookdown::render_book"
 documentclass: krantz
 classoption:
 - krantz2
+header-includes:
+- \usepackage{float}
 bibliography: [references.bib]
 biblio-style: plainnat
 link-citations: yes

@@ -20,5 +22,11 @@ github-repo: UBC-DSCI/introduction-to-datascience
 cover-image: img/chapter_overview.jpg
 ---
 
+```{r setup-pdf, include=FALSE}
+knitr::opts_chunk$set(fig.pos = "H",
+                      out.extra = "")
+
+```
+
 ```{r preface, child="preface-text.Rmd"}
 ```
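
For context on this pair of changes: knitr's `fig.pos` option sets the placement specifier on the `figure` environments it generates, and the `H` specifier ("put the float exactly here") is defined only by the LaTeX `float` package, hence the `header-includes` line; setting `out.extra = ""` is the standard companion that makes knitr wrap every figure in a `figure` environment so the option applies throughout. A minimal sketch of the LaTeX this produces, with a placeholder box standing in for a real image:

```latex
\documentclass{article}
\usepackage{float}    % provides the "H" placement specifier that fig.pos = "H" requests
\usepackage{graphicx}

\begin{document}
Text before the figure.

\begin{figure}[H] % "H" = render the figure exactly here; never float it elsewhere
  \centering
  \rule{0.6\linewidth}{4cm} % placeholder standing in for \includegraphics{...}
  \caption{A figure pinned to its source position.}
\end{figure}

Text after the figure.
\end{document}
```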

preface-text.Rmd

Lines changed: 14 additions & 0 deletions
@@ -47,3 +47,17 @@ try out the example code that we include throughout the book.
 ```{r img-chapter-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Where are we going?", out.width="100%", fig.retina = 2, fig.align = "center"}
 knitr::include_graphics("img/chapter_overview.jpeg")
 ```
+
+Each chapter in the book has an accompanying worksheet that provides exercises
+to help you practice the concepts you will learn. We strongly recommend that you
+work through the worksheet when you finish reading each chapter
+before moving on to the next chapter. All of the worksheets
+are available at
+[https://ubc-dsci.github.io/data-science-a-first-intro-worksheets](https://ubc-dsci.github.io/data-science-a-first-intro-worksheets);
+the "Exercises" section at the end of each chapter points you to the right worksheet for that chapter.
+The worksheets are designed to provide automated feedback and help guide you through the problems.
+To make sure that functionality works as intended, make sure to follow the setup directions
+in Chapter \@ref(move-to-your-own-machine) regarding downloading the worksheets.
+
+
+