
Commit 5e066ac

Merge branch 'dev' into html-landing-page
2 parents: 66e810a + 8baeda9


102 files changed: 1676 additions & 872 deletions


CNAME

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+datasciencebook.ca

Dockerfile

Lines changed: 7 additions & 0 deletions
@@ -39,6 +39,7 @@ RUN Rscript -e "reticulate::install_miniconda()"
 RUN Rscript -e "reticulate::conda_install('r-reticulate', 'python-kaleido')"
 RUN Rscript -e "reticulate::conda_install('r-reticulate', 'plotly', channel = 'plotly')"
 
+RUN Rscript -e "devtools::install_github('mountainMath/cancensus@5a5d61759d477986d40dd87fa9a6532ff6037efe')"
 RUN Rscript -e "devtools::install_github('ttimbers/[email protected]')"
 
 # install LaTeX packages
@@ -100,3 +101,9 @@ RUN tlmgr install amsmath \
 RUN sed -i 's/256MiB/4GiB/' /etc/ImageMagick-6/policy.xml
 RUN sed -i 's/512MiB/4GiB/' /etc/ImageMagick-6/policy.xml
 RUN sed -i 's/1GiB/4GiB/' /etc/ImageMagick-6/policy.xml
+
+# install version of tinytex with fixed index double-compile (no release for this yet, so install from commit hash)
+RUN Rscript -e "remove.packages('xfun')"
+RUN Rscript -e "devtools::install_github('yihui/[email protected]')"
+RUN Rscript -e "remove.packages('tinytex')"
+RUN Rscript -e "devtools::install_github('yihui/tinytex@5d211d43944d322fca49e5f0d97f34b9c46ff9ab')"

acknowledgements.Rmd

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 # Acknowledgments {-}
 
 We'd like to thank everyone that has contributed to the development of
-[*Data Science: A First Introduction*](https://ubc-dsci.github.io/introduction-to-datascience/).
+[*Data Science: A First Introduction*](https://datasciencebook.ca).
 This is an open source textbook that began as a collection of course readings
 for DSCI 100, a new introductory data science course
 at the University of British Columbia (UBC).
@@ -19,7 +19,7 @@ Rohan Alexander, Isabella Ghement, Virgilio Gómez Rubio, Albert Kim, Adam Loy,
 The book was improved substantially by their insights.
 We would like to give special thanks to Jim Zidek
 for his support and encouragement throughout the process, and to
-Roger Peng for graciously offering to write the foreword.
+Roger Peng for graciously offering to write the Foreword.
 
 Finally, we owe a debt of gratitude to all of the students of DSCI 100 over the past
 few years. They provided invaluable feedback on the book and worksheets;

build_html.sh

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 # Script to generate HTML book
-docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.21.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience; Rscript _build_html.r"
+docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.23.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience; Rscript _build_html.r"

build_pdf.sh

Lines changed: 3 additions & 1 deletion
@@ -3,6 +3,7 @@
 # Copy files
 cp references.bib pdf/
 cp authors.Rmd pdf/
+cp foreword-text.Rmd pdf/
 cp preface-text.Rmd pdf/
 cp acknowledgements.Rmd pdf/
 cp intro.Rmd pdf/
@@ -24,11 +25,12 @@ cp -r data/ pdf/data
 cp -r img/ pdf/img
 
 # Build the book with bookdown
-docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.21.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience/pdf; Rscript _build_pdf.r"
+docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.23.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience/pdf; Rscript _build_pdf.r"
 
 # clean files in pdf dir
 rm -rf pdf/references.bib
 rm -rf pdf/authors.Rmd
+rm -rf pdf/foreword-text.Rmd
 rm -rf pdf/preface-text.Rmd
 rm -rf pdf/acknowledgements.Rmd
 rm -rf pdf/intro.Rmd

classification1.Rmd

Lines changed: 14 additions & 9 deletions
@@ -455,7 +455,7 @@ You will see in the `mutate` \index{mutate} step below, we compute the straight-
 distance using the formula above: we square the differences between the two observations' perimeter
 and concavity coordinates, add the squared differences, and then take the square root.
 
-```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav <- bind_rows(cancer,
 tibble(Perimeter = new_point[1],
 Concavity = new_point[2],
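
The straight-line distance computation described in the context lines above boils down to a single `mutate` call. Below is a minimal sketch, assuming the `cancer` data frame (with `Perimeter` and `Concavity` columns) and a `new_point` vector as in the chapter; the values of `new_point` here are hypothetical and this is not the book's exact code.

```r
# Illustrative sketch: straight-line (Euclidean) distance from each observation
# to a new point, using the squared-difference / sum / square-root steps
# described above. `cancer` is assumed to exist as in the chapter.
library(tidyverse)

new_point <- c(2, 4)  # hypothetical (Perimeter, Concavity) of the new observation

cancer |>
  mutate(dist_from_new = sqrt((Perimeter - new_point[1])^2 +
                              (Concavity - new_point[2])^2)) |>
  arrange(dist_from_new)
```
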
@@ -1096,7 +1096,7 @@ The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced).
 set.seed(3)
 ```
 
-```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data."}
+```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Imbalanced data."}
 rare_cancer <- bind_rows(
 filter(cancer, Class == "B"),
 cancer |> filter(Class == "M") |> slice_head(n = 3)
@@ -1255,7 +1255,7 @@ classifier would make. We can see that the decision is more reasonable; when the
 to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.
 
-```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
+```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
 knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
 set_engine("kknn") |>
 set_mode("classification")
@@ -1415,9 +1415,14 @@ wkflw_plot
 ## Exercises
 
 Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_classification1/worksheet_classification1.ipynb).
-The worksheet tries to provide automated feedback
-and help guide you through the problems.
-To make sure this functionality works as intended,
-please follow the instructions for computer setup needed to run the worksheets
-found in Chapter \@ref(move-to-your-own-machine).
+can be found in the accompanying
+[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme)
+in the "Classification I: training and predicting" row.
+You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
+You can also preview a non-interactive version of the worksheet by clicking "view worksheet."
+If you instead decide to download the worksheet and run it on your own machine,
+make sure to follow the instructions for computer setup
+found in Chapter \@ref(move-to-your-own-machine). This will ensure that the automated feedback
+and guidance that the worksheets provide will function as intended.
+
+

classification2.Rmd

Lines changed: 31 additions & 26 deletions
@@ -643,7 +643,7 @@ Here, $C=5$ different chunks of the data set are used,
 resulting in 5 different choices for the **validation set**; we call this
 *5-fold* cross-validation.
 
-```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.retina = 2, out.width = "100%"}
+```{r 06-cv-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "5-fold cross-validation.", fig.pos = "H", out.extra="", fig.retina = 2, out.width = "100%"}
 knitr::include_graphics("img/cv.png")
 ```
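
The 5-fold cross-validation described in the context above maps onto `tidymodels` roughly as in the minimal sketch below. It assumes a training data frame `cancer_train` with a `Class` label and a K-NN workflow `knn_wkflw` defined elsewhere in the chapter; these names are placeholders, not the book's exact code.

```r
# Minimal sketch of 5-fold cross-validation with tidymodels (illustrative only).
# `cancer_train` and `knn_wkflw` are assumed placeholders for objects built
# elsewhere in the chapter.
library(tidymodels)

set.seed(1)
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)  # C = 5 chunks

knn_wkflw |>
  fit_resamples(resamples = cancer_vfold) |>
  collect_metrics()
```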

@@ -863,24 +863,7 @@ regardless of what the new observation looks like. In general, if the model
 *isn't influenced enough* by the training data, it is said to **underfit** the
 data.
 
-**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
-individual data point has a stronger and stronger vote regarding nearby points.
-Since the data themselves are noisy, this causes a more "jagged" boundary
-corresponding to a *less simple* model. If you take this case to the extreme,
-setting $K = 1$, then the classifier is essentially just matching each new
-observation to its closest neighbor in the training data set. This is just as
-problematic as the large $K$ case, because the classifier becomes unreliable on
-new data: if we had a different training set, the predictions would be
-completely different. In general, if the model *is influenced too much* by the
-training data, it is said to **overfit** the data.
-
-Both overfitting and underfitting are problematic and will lead to a model
-that does not generalize well to new data. When fitting a model, we need to strike
-a balance between the two. You can see these two effects in Figure
-\@ref(fig:06-decision-grid-K), which shows how the classifier changes as
-we set the number of neighbors $K$ to 1, 7, 20, and 300.
-
-```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.cap = "Effect of K in overfitting and underfitting."}
+```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.pos = "H", out.extra="", fig.cap = "Effect of K in overfitting and underfitting."}
 ks <- c(1, 7, 20, 300)
 plots <- list()
@@ -935,6 +918,23 @@ p_grid <- plot_grid(plotlist = p_no_legend, ncol = 2)
 plot_grid(p_grid, legend, ncol = 1, rel_heights = c(1, 0.2))
 ```
 
+**Overfitting:** \index{overfitting!classification} In contrast, when we decrease the number of neighbors, each
+individual data point has a stronger and stronger vote regarding nearby points.
+Since the data themselves are noisy, this causes a more "jagged" boundary
+corresponding to a *less simple* model. If you take this case to the extreme,
+setting $K = 1$, then the classifier is essentially just matching each new
+observation to its closest neighbor in the training data set. This is just as
+problematic as the large $K$ case, because the classifier becomes unreliable on
+new data: if we had a different training set, the predictions would be
+completely different. In general, if the model *is influenced too much* by the
+training data, it is said to **overfit** the data.
+
+Both overfitting and underfitting are problematic and will lead to a model
+that does not generalize well to new data. When fitting a model, we need to strike
+a balance between the two. You can see these two effects in Figure
+\@ref(fig:06-decision-grid-K), which shows how the classifier changes as
+we set the number of neighbors $K$ to 1, 7, 20, and 300.
+
 ## Summary
 
 Classification algorithms use one or more quantitative variables to predict the
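
The overfitting/underfitting text moved in the hunk above turns on the choice of $K$. A minimal sketch of refitting the classifier at the four values it mentions is given below, reusing the `nearest_neighbor` specification visible in the diff; the `cancer_train` data frame and the `Class ~ Perimeter + Concavity` recipe are assumed placeholders rather than the book's exact code.

```r
# Minimal sketch (illustrative only): refit the K-NN classifier at K = 1, 7, 20, 300
# to see the under/overfitting behavior described above. `cancer_train` and the
# recipe formula are assumed placeholders for objects defined in the chapter.
library(tidymodels)

ks <- c(1, 7, 20, 300)
fits <- list()
for (i in seq_along(ks)) {
  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = ks[i]) |>
    set_engine("kknn") |>
    set_mode("classification")

  fits[[i]] <- workflow() |>
    add_recipe(recipe(Class ~ Perimeter + Concavity, data = cancer_train)) |>
    add_model(knn_spec) |>
    fit(data = cancer_train)
}
```
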
@@ -948,7 +948,7 @@ can tune the classifier (e.g., select the number of neighbors $K$ in $K$-NN)
 by maximizing estimated accuracy via cross-validation. The overall
 process is summarized in Figure \@ref(fig:06-overview).
 
-```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of KNN classification.", fig.retina = 2, out.width = "100%"}
+```{r 06-overview, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Overview of KNN classification.", fig.retina = 2, out.width = "100%"}
 knitr::include_graphics("img/train-test-overview.jpeg")
 ```
 
@@ -1386,12 +1386,17 @@ fwd_sel_accuracies_plot
 ## Exercises
 
 Practice exercises for the material covered in this chapter
-can be found in the accompanying [worksheet](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets/blob/main/worksheet_classification2/worksheet_classification2.ipynb).
-The worksheet tries to provide automated feedback
-and help guide you through the problems.
-To make sure this functionality works as intended,
-please follow the instructions for computer setup needed to run the worksheets
-found in Chapter \@ref(move-to-your-own-machine).
+can be found in the accompanying
+[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme)
+in the "Classification II: evaluation and tuning" row.
+You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
+You can also preview a non-interactive version of the worksheet by clicking "view worksheet."
+If you instead decide to download the worksheet and run it on your own machine,
+make sure to follow the instructions for computer setup
+found in Chapter \@ref(move-to-your-own-machine). This will ensure that the automated feedback
+and guidance that the worksheets provide will function as intended.
+
+
 
 ## Additional resources
 - The [`tidymodels` website](https://tidymodels.org/packages) is an excellent
