
Commit 32d7843

Merge pull request #302 from UBC-DSCI/bugfixing1
Initial bug fix pass
2 parents (27ab8a6 + 85f6332) · commit 32d7843

15 files changed: +236 -186 lines

authors.Rmd

Lines changed: 1 addition & 1 deletion
@@ -6,4 +6,4 @@ Tiffany Timbers is an Assistant Professor of Teaching in the Department of Stati
 Trevor Campbell is an Assistant Professor in the Department of Statistics at the University of British Columbia. His research focuses on automated, scalable Bayesian inference algorithms, Bayesian nonparametrics, streaming data, and Bayesian theory. He was previously a postdoctoral associate advised by Tamara Broderick in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and Institute for Data, Systems, and Society (IDSS) at MIT, a Ph.D. candidate under Jonathan How in the Laboratory for Information and Decision Systems (LIDS) at MIT, and before that he was in the Engineering Science program at the University of Toronto.
 
 
-Melissa Lee is an Assistant Professor of Teaching in the Department of Statistics at the University of British Columbia. With a focus on teaching, she develops curriculum for undergraduate statistics and data science courses. She enjoys using student-centered approaches, developing and assessing open educational resources, and promoting equity, diversity, and inclusion initiatives.
+Melissa Lee is an Assistant Professor of Teaching in the Department of Statistics at the University of British Columbia. She teaches and develops curriculum for undergraduate statistics and data science courses. Her work focuses on student-centered approaches to teaching, developing and assessing open educational resources, and promoting equity, diversity, and inclusion initiatives.

build_pdf.sh

Lines changed: 2 additions & 2 deletions
@@ -18,8 +18,8 @@ cp version-control.Rmd pdf/
 cp setup.Rmd pdf/
 cp references.Rmd pdf/
 cp printindex.tex pdf/
-cp -r data/ pdf/
-cp -r img/ pdf/
+cp -r data/ pdf/data
+cp -r img/ pdf/img
 
 # Build the book with bookdown
 docker run --rm -m 4g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.12.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience/pdf; Rscript _build_pdf.r"

classification1.Rmd

Lines changed: 4 additions & 6 deletions
@@ -7,6 +7,9 @@ library(knitr)
 
 knitr::opts_chunk$set(echo = TRUE,
                       fig.align = "center")
+options(knitr.table.format = function() {
+  if (knitr::is_latex_output()) 'latex' else 'pandoc'
+})
 ```
 
 ## Overview
@@ -565,7 +568,7 @@ Based on $K=5$ nearest neighbors with these three predictors we would classify t
 Figure \@ref(fig:05-more) shows what the data look like when we visualize them
 as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors.
 
-```{r 05-more, echo = FALSE, message = FALSE, fig.cap = "3D scatter plot of the standardized symmetry, concavity, and perimeter variables.", fig.retina=2}
+```{r 05-more, echo = FALSE, message = FALSE, fig.cap = "3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes.", fig.retina=2, out.width="80%"}
 attrs <- c("Perimeter", "Concavity", "Symmetry")
 
 # create new scaled obs and get NNs
@@ -638,11 +641,6 @@ if(!is_latex_output()){
 }
 ```
 
-*Click and drag the plot above to rotate it, and scroll to zoom. Note that in
-general we recommend against using 3D visualizations; here we show the data in
-3D only to illustrate what "higher dimensions" and "nearest neighbors" look like,
-for learning purposes.*
-
 ### Summary of $K$-nearest neighbors algorithm
 
 In order to classify a new observation using a $K$-nearest neighbor classifier, we have to:
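
The `options()` call added in the first hunk controls which format `kable` uses when it renders a table, switching automatically between LaTeX (for the PDF build) and plain pandoc tables otherwise. A minimal sketch of how that option behaves, using the built-in `mtcars` data purely as a stand-in table:

library(knitr)

# When knitr.table.format is a function, kable() calls it each time a table
# is rendered, so the chosen format can depend on the current output target.
options(knitr.table.format = function() {
  if (knitr::is_latex_output()) "latex" else "pandoc"
})

# Stand-in table: rendered as a LaTeX tabular under PDF output,
# and as a plain-text (pandoc) table otherwise.
kable(head(mtcars[, 1:3]))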

classification2.Rmd

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ labels for the observations in the **test set**, then we have some
 confidence that our classifier might also accurately predict the class
 labels for new observations without known class labels.
 
-> Note: if there were a golden rule of machine learning, \index{golden rule of machine learning} it might be this:
+> **Note:** if there were a golden rule of machine learning, \index{golden rule of machine learning} it might be this:
 > *you cannot use the test data to build the model!* If you do, the model gets to
 > "see" the test data in advance, making it look more accurate than it really
 > is. Imagine how bad it would be to overestimate your classifier's accuracy
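
The reasoning in this note is that the test set must stay untouched until the final accuracy estimate. A minimal sketch of that workflow (not taken from the book), assuming the tidymodels `initial_split`/`training`/`testing` helpers and using the built-in `iris` data as a stand-in:

library(tidymodels)

set.seed(1234)
# Hold out 25% of the rows; everything that builds the model may only
# ever see the training portion.
iris_split <- initial_split(iris, prop = 0.75, strata = Species)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)

# ... preprocessing, tuning, and fitting use iris_train only ...

# The held-out iris_test is used exactly once, at the very end,
# to estimate how well the final model predicts unseen observations.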

clustering.Rmd

Lines changed: 12 additions & 6 deletions
@@ -87,7 +87,7 @@ principal component analysis, multidimensional scaling, and more;
 see the additional resources section at the end of this chapter
 for where to begin learning more about these other methods.
 
-> There are also so-called *semisupervised* tasks, \index{semisupervised}
+> **Note:** There are also so-called *semisupervised* tasks, \index{semisupervised}
 > where only some of the data come with response variable labels/values,
 > but the vast majority don't.
 > The goal is to try to uncover underlying structure in the data
@@ -110,7 +110,7 @@ there are distinct types of penguins in our data.
 Understanding this might help us with species discovery and classification in a data-driven
 way.
 
-```{r 09-penguins, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 3, fig.width = 4, fig.cap = "Gentoo penguin.", fig.retina = 2}
+```{r 09-penguins, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Gentoo penguin.", out.width="60%", fig.align = "center", fig.retina = 2}
 # image source: https://commons.wikimedia.org/wiki/File:Gentoo_Penguin._(8671680772).jpg
 knitr::include_graphics("img/gentoo.jpg")
 ```
@@ -142,6 +142,7 @@ data <- read_csv("data/toy_penguins.csv") |>
 
 penguin_data <- data |> select(flipper_length_standardized,
                                bill_length_standardized)
+
 write_csv(penguin_data, "data/penguins_standardized.csv")
 ```
 
@@ -431,6 +432,7 @@ where the left column depicts the center update,
 and the right column depicts the reassignment of data to clusters.
 
 **Center Update** &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp;**Label Update**
+
 ```{r 10-toy-kmeans-iter, echo = FALSE, warning = FALSE, fig.height = 16, fig.width = 8, fig.cap = "First four iterations of K-means clustering on the `penguin_data` example data set. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black."}
 list_plot_cntrs <- vector(mode = "list", length = 4)
 list_plot_lbls <- vector(mode = "list", length = 4)
@@ -527,7 +529,6 @@ plt_lbl <- ggplot(penguin_data, aes(y = bill_length_standardized,
 plt_lbl
 ```
 
-
 Figure \@ref(fig:10-toy-kmeans-bad-iter) shows what the iterations of K-means would look like with the unlucky random initialization shown in Figure \@ref(fig:10-toy-kmeans-bad-init).
 
 **Center Update** &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp;**Label Update**
@@ -838,12 +839,17 @@ penguin_clust_ks
 
 If we wanted to get one of the clusterings out
 of the list column in the data frame,
-we can use our familiar friends `slice` and `pull`.
+we can a familiar friend: `pull`.
+`pull` will return to us a data frame column as a simpler data structure,
+here that would be a list.
+And then to extract the first item of the list,
+we can use the `pluck` function;
+passing it the index for the element we would like to extract (here 1).
 
 ```{r}
 penguin_clust_ks |>
-  slice(1) |>
-  pull(penguin_clusts)
+  pull(penguin_clusts) |>
+  pluck(1)
 ```
 
 Next, we use `mutate` again to apply `glance` \index{glance}
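
The new prose in the last hunk explains the replacement code: `pull()` extracts the list column as an ordinary list, and `pluck(1)` then takes its first element. A small self-contained sketch of the same pattern on a toy tibble (the `toy_ks` data frame and its `fits` column are hypothetical stand-ins for `penguin_clust_ks` and `penguin_clusts`):

library(dplyr)
library(purrr)

# Toy stand-in: one row per choice of k, with a list column of fitted objects.
toy_ks <- tibble(
  k = 1:3,
  fits = list("clustering for k = 1", "clustering for k = 2", "clustering for k = 3")
)

toy_ks |>
  pull(fits) |> # pull() returns the list column as a plain list
  pluck(1)      # pluck(1) extracts the first element of that list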

inference.Rmd

Lines changed: 8 additions & 8 deletions
@@ -1,4 +1,4 @@
-# Introduction to statistical inference {#inference}
+# Statistical inference {#inference}
 
 ```{r inference-setup, include = FALSE}
 knitr::opts_chunk$set(warning = FALSE, fig.align = "center")
@@ -512,30 +512,30 @@ sample_estimates_500 <- rep_sample_n(airbnb, size = 500, reps = 20000) |>
 sampling_distribution_20 <- ggplot(sample_estimates_20, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night(Canadian dollars)") +
+  xlab("Sample mean price per night\n(Canadian dollars)") +
   ggtitle("n = 20")
 
 ## Sampling distribution n = 50
 sampling_distribution_50 <- ggplot(sample_estimates_50, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night(Canadian dollars)") +
+  xlab("Sample mean price per night\n(Canadian dollars)") +
   ggtitle("n = 50") +
   xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20))
 
 ## Sampling distribution n = 100
 sampling_distribution_100 <- ggplot(sample_estimates_100, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night (Canadian dollars)") +
+  xlab("Sample mean price per night\n(Canadian dollars)") +
   ggtitle("n = 100") +
   xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20))
 
 ## Sampling distribution n = 500
 sampling_distribution_500 <- ggplot(sample_estimates_500, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night (Canadian dollars)") +
+  xlab("Sample mean price per night\n(Canadian dollars)") +
   ggtitle("n = 500") +
   xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20))
 ```
@@ -739,7 +739,7 @@ called **the bootstrap**. Note that by taking many samples from our single, obs
 sample, we do not obtain the true sampling distribution, but rather an
 approximation that we call **the bootstrap distribution**. \index{bootstrap!distribution}
 
-> Note that we must sample *with* replacement when using the bootstrap.
+> **Note:** we must sample *with* replacement when using the bootstrap.
 > Otherwise, if we had a sample of size $n$, and obtained a sample from it of
 > size $n$ *without* replacement, it would just return our original sample!
 
@@ -876,7 +876,7 @@ tail(boot20000_means)
 boot_est_dist <- ggplot(boot20000_means, aes(x = mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night (Canadian dollars)")
+  xlab("Sample mean price per night \n (Canadian dollars)")
 
 boot_est_dist
 ```
@@ -894,7 +894,7 @@ sample_estimates <- samples |>
 sampling_dist <- ggplot(sample_estimates, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night (Canadian dollars)")
+  xlab("Sample mean price per night \n (Canadian dollars)")
 
 annotated_sampling_dist <- sampling_dist +
   xlim(min_x(sampling_dist), max_x(sampling_dist)) +
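
The reworded bootstrap note rests on one fact: drawing $n$ items *without* replacement from a sample of size $n$ merely reshuffles the original sample, so its mean never changes, whereas sampling *with* replacement produces genuinely different bootstrap samples. A tiny base-R sketch of that point (the numbers are made up for illustration):

set.seed(4321)
original_sample <- c(3, 7, 7, 10, 12) # hypothetical observed sample
n <- length(original_sample)

# Without replacement: every "resample" is a permutation of the original,
# so its mean is identical every time.
mean(sample(original_sample, size = n, replace = FALSE))

# With replacement: values can repeat or be left out, so the mean varies
# from one bootstrap sample to the next.
mean(sample(original_sample, size = n, replace = TRUE))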

intro.Rmd

Lines changed: 25 additions & 14 deletions
@@ -94,8 +94,8 @@ the analysis as well as the selection of appropriate tools.\index{question!data
 
 Table: (\#tab:questions-table) Types of data analysis question [@leek2015question; @peng2015art].
 
-| Question type | Description | Example |
-|---------------|-------------|---------|
+|Question type| Description | Example |
+|-------------|------------------------|--------------------|
 | Descriptive | A question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact). | How many people live in each province and territory in Canada? |
 | Exploratory | A question asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? |
 | Predictive | A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. | What political party will someone vote for in the next Canadian election? |
@@ -253,7 +253,9 @@ file satisfies everything else that the `read_csv` function expects in the defau
 use-case. Figure \@ref(fig:img-read-csv) describes how we use the `read_csv`
 to read data into R.
 
-``` {r img-read-csv, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Syntax for the read_csv function.", fig.retina = 2}
+(ref:img-read-csv) Syntax for the `read_csv` function.
+
+``` {r img-read-csv, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:img-read-csv)", fig.retina = 2, out.width="100%"}
 knitr::include_graphics("img/read_csv_function.jpeg")
 ```
 
@@ -279,7 +281,7 @@ to the data frame that `read_csv` outputs,
 so that we can refer to it later for analysis and visualization.
 
 The way to assign a name to a value in R is via the *assignment symbol* `<-`.
-\index{assignsymb@\texttt{<-}|see{assignment symbol}}\index{assignment symbol}
+\index{aaaassignsymb@\texttt{<-}|see{assignment symbol}}\index{assignment symbol}
 On the left side of the assignment symbol you put the name that you want
 to use, and on the right side of the assignment symbol
 you put the value that you want the name to refer to.
@@ -298,7 +300,7 @@ we do not need to surround the name we are creating with quotes. This is
 because we are formally telling R that this special word denotes
 the value of whatever is on the right hand side.
 Only characters and words that act as *values* on the right hand side of the assignment
-symbol&mdash;e.g., the file name `"data/can_lang.csv"` that we specified before, or `"Alice"` above&mdash;&mdash;need
+symbol&mdash;e.g., the file name `"data/can_lang.csv"` that we specified before, or `"Alice"` above&mdash;need
 to be surrounded by quotes.
 
 After making the assignment, we can use the special name words we have created in
@@ -388,7 +390,9 @@ is a string *value* \index{string} and not one of the special words that make up
 programming language, nor one of the names we have given to data frames in the
 code we have already written.
 
-``` {r img-filter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Syntax for the filter function.", out.width="100%", fig.retina = 2}
+(ref:img-filter) Syntax for the `filter` function.
+
+```{r img-filter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:img-filter)", fig.retina = 2, out.width="100%"}
 knitr::include_graphics("img/filter_function.jpeg")
 ```
 
@@ -421,7 +425,9 @@ able to name things in R is useful: you can see that we are using the
 result of our earlier `filter` step (which we named `aboriginal_lang`) here
 in the next step of the analysis!
 
-``` {r img-select, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Syntax for the select function.", out.width="100%", fig.retina = 2}
+(ref:img-select) Syntax for the `select` function.
+
+``` {r img-select, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:img-select)", out.width="100%", fig.retina = 2}
 knitr::include_graphics("img/select_function.jpeg")
 ```
 
@@ -430,6 +436,7 @@ knitr::include_graphics("img/select_function.jpeg")
 selected_lang <- select(aboriginal_lang, language, mother_tongue)
 selected_lang
 ```
+
 ### Using `arrange` to order and `slice` to select rows by index number
 
 We have used `filter` and `select` to obtain a table with only the Aboriginal
@@ -449,7 +456,9 @@ language, we will use the `arrange` function to order the rows in our
 arrange the rows in descending order (from largest to smallest),
 so we pass the column to the `desc` function before using it as an argument.
 
-``` {r img-arrange, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Syntax for the arrange function.", out.width="100%", fig.retina = 2}
+(ref:img-arrange) Syntax for the `arrange` function.
+
+``` {r img-arrange, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:img-arrange)", out.width="100%", fig.retina = 2}
 knitr::include_graphics("img/arrange_function.jpeg")
 ```
 
@@ -503,7 +512,9 @@ function and its common usage is illustrated in Figure \@ref(fig:img-ggplot).
 Figure \@ref(fig:barplot-mother-tongue) shows the resulting bar plot
 generated by following the instructions in Figure \@ref(fig:img-ggplot).
 
-```{r img-ggplot, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Creating a bar plot with the ggplot function.", out.width="100%", fig.retina = 2}
+(ref:img-ggplot) Creating a bar plot with the `ggplot` function.
+
+```{r img-ggplot, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:img-ggplot)", out.width="100%", fig.retina = 2}
 knitr::include_graphics("img/ggplot_function.jpeg")
 ```
 
@@ -516,7 +527,7 @@ ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
 > time, a single expression in R must be contained in a single line of code.
 > However, there *are* a small number of situations in which you can have a
 > single R expression span multiple lines. Above is one such case: here, R knows that a line cannot
-> end with a `+` symbol, \index{plussymb@$+$} and so it keeps reading the next line to figure out
+> end with a `+` symbol, \index{aaaplussymb@$+$|see{ggplot (add layer)}} and so it keeps reading the next line to figure out
 > what the right hand side of the `+` symbol should be. We could, of course,
 > put all of the added layers on one line of code, but splitting them across
 > multiple lines helps a lot with code readability. \index{multi-line expression}
@@ -591,7 +602,7 @@ were, according to the 2016 Candian census, and how many people speak each of th
 instance, we can see that the Aboriginal language most often reported was Cree
 n.o.s. with over 60,000 Canadian residents reporting it as their mother tongue.
 
-> "n.o.s." means "not otherwise specified", so Cree n.o.s. refers to
+> **Note:** "n.o.s." means "not otherwise specified", so Cree n.o.s. refers to
 > individuals who reported Cree as their mother tongue. In this data set, the
 > Cree languages include the following categories: Cree n.o.s., Swampy Cree,
 > Plains Cree, Woods Cree, and a 'Cree not included elsewhere' category (which
@@ -609,7 +620,7 @@ grey to white to improve the contrast. We have also actually skipped the
 in the `ggplot` function, you don't actually need to `select` the columns in advance
 when creating a visualization. And finally, we provided *comments* next to
 many of the lines of code below using the
-hash symbol `#`. When R sees a `#` sign, \index{comment} \index{commentsymb@\#|see{comment}} it
+hash symbol `#`. When R sees a `#` sign, \index{comment} \index{aaacommentsymb@\#|see{comment}} it
 will ignore all of the text that
 comes after the symbol on that line. So you can use comments to explain lines
 of code for others, and perhaps more importantly, your future self!
@@ -650,7 +661,7 @@ There are many R functions in the `tidyverse` package (and beyond!), and
 nobody can be expected to remember what every one of them does
 nor all of the arguments we have to give them. Fortunately R provides
 the `?` symbol, which
-\index{questionmark@? symbol|see{documentation}}
+\index{aaaquestionmark@?|see{documentation}}
 \index{help|see{documentation}}
 \index{documentation} provides an easy way to pull up the documentation for
 most functions quickly. To use the `?` symbol to access documentation, you
@@ -672,6 +683,6 @@ documentation like that shown in Figure \@ref(fig:01-help). But do keep in mind
 is not written to *teach* you about a function; it is just there as a reference to *remind*
 you about the different arguments and usage of functions that you have already learned about elsewhere.
 
-```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.", fig.retina = 2}
+```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.", fig.retina = 2, out.width="100%"}
 knitr::include_graphics("img/help-filter.png")
 ```
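
Several hunks above replace literal `fig.cap` strings with bookdown text references, which allow markdown (such as backtick code font) inside figure captions: the `(ref:label)` line defines the caption text, and the chunk points at it with `fig.cap = "(ref:label)"`. A minimal sketch of the pattern, with a hypothetical chunk label and image path:

(ref:my-figure) A caption that can safely use `code font` and other markdown.

```{r my-figure, echo = FALSE, fig.cap = "(ref:my-figure)", out.width = "100%", fig.retina = 2}
knitr::include_graphics("img/my_figure.jpeg")
```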

jupyter.Rmd

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ to make a conscious effort to perform data analysis in a reproducible manner.
 An example of what a Jupyter notebook looks like is shown in
 Figure \@ref(fig:img-jupyter).
 
-```{r img-jupyter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "A screenshot of a Jupyter Notebook.", fig.retina = 2}
+```{r img-jupyter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "A screenshot of a Jupyter Notebook.", fig.retina = 2, out.width="100%"}
 knitr::include_graphics("img/jupyter.png")
 ```
