
Commit 32d7843

Merge pull request #302 from UBC-DSCI/bugfixing1
Initial bug fix pass
2 parents (27ab8a6 + 85f6332) · commit 32d7843

15 files changed: +236 -186 lines

authors.Rmd

Lines changed: 1 addition & 1 deletion
@@ -6,4 +6,4 @@ Tiffany Timbers is an Assistant Professor of Teaching in the Department of Stati
 Trevor Campbell is an Assistant Professor in the Department of Statistics at the University of British Columbia. His research focuses on automated, scalable Bayesian inference algorithms, Bayesian nonparametrics, streaming data, and Bayesian theory. He was previously a postdoctoral associate advised by Tamara Broderick in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and Institute for Data, Systems, and Society (IDSS) at MIT, a Ph.D. candidate under Jonathan How in the Laboratory for Information and Decision Systems (LIDS) at MIT, and before that he was in the Engineering Science program at the University of Toronto.
 
 
-Melissa Lee is an Assistant Professor of Teaching in the Department of Statistics at the University of British Columbia. With a focus on teaching, she develops curriculum for undergraduate statistics and data science courses. She enjoys using student-centered approaches, developing and assessing open educational resources, and promoting equity, diversity, and inclusion initiatives.
+Melissa Lee is an Assistant Professor of Teaching in the Department of Statistics at the University of British Columbia. She teaches and develops curriculum for undergraduate statistics and data science courses. Her work focuses on student-centered approaches to teaching, developing and assessing open educational resources, and promoting equity, diversity, and inclusion initiatives.

build_pdf.sh

Lines changed: 2 additions & 2 deletions
@@ -18,8 +18,8 @@ cp version-control.Rmd pdf/
 cp setup.Rmd pdf/
 cp references.Rmd pdf/
 cp printindex.tex pdf/
-cp -r data/ pdf/
-cp -r img/ pdf/
+cp -r data/ pdf/data
+cp -r img/ pdf/img
 
 # Build the book with bookdown
 docker run --rm -m 4g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.12.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience/pdf; Rscript _build_pdf.r"

classification1.Rmd

Lines changed: 4 additions & 6 deletions
@@ -7,6 +7,9 @@ library(knitr)
 
 knitr::opts_chunk$set(echo = TRUE,
                       fig.align = "center")
+options(knitr.table.format = function() {
+  if (knitr::is_latex_output()) 'latex' else 'pandoc'
+})
 ```
 
 ## Overview
@@ -565,7 +568,7 @@ Based on $K=5$ nearest neighbors with these three predictors we would classify t
 Figure \@ref(fig:05-more) shows what the data look like when we visualize them
 as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors.
 
-```{r 05-more, echo = FALSE, message = FALSE, fig.cap = "3D scatter plot of the standardized symmetry, concavity, and perimeter variables.", fig.retina=2}
+```{r 05-more, echo = FALSE, message = FALSE, fig.cap = "3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes.", fig.retina=2, out.width="80%"}
 attrs <- c("Perimeter", "Concavity", "Symmetry")
 
 # create new scaled obs and get NNs
@@ -638,11 +641,6 @@ if(!is_latex_output()){
 }
 ```
 
-*Click and drag the plot above to rotate it, and scroll to zoom. Note that in
-general we recommend against using 3D visualizations; here we show the data in
-3D only to illustrate what "higher dimensions" and "nearest neighbors" look like,
-for learning purposes.*
-
 ### Summary of $K$-nearest neighbors algorithm
 
 In order to classify a new observation using a $K$-nearest neighbor classifier, we have to:
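
The `options()` call added in the first hunk controls which format `kable` uses when it renders a table, switching automatically between LaTeX (for the PDF build) and plain pandoc tables otherwise. A minimal sketch of how that option behaves, using the built-in `mtcars` data purely as a stand-in table:

library(knitr)

# When knitr.table.format is a function, kable() calls it each time a table
# is rendered, so the chosen format can depend on the current output target.
options(knitr.table.format = function() {
  if (knitr::is_latex_output()) "latex" else "pandoc"
})

# Stand-in table: rendered as a LaTeX tabular under PDF output,
# and as a plain-text (pandoc) table otherwise.
kable(head(mtcars[, 1:3]))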

classification2.Rmd

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ labels for the observations in the **test set**, then we have some
 confidence that our classifier might also accurately predict the class
 labels for new observations without known class labels.
 
-> Note: if there were a golden rule of machine learning, \index{golden rule of machine learning} it might be this:
+> **Note:** if there were a golden rule of machine learning, \index{golden rule of machine learning} it might be this:
 > *you cannot use the test data to build the model!* If you do, the model gets to
 > "see" the test data in advance, making it look more accurate than it really
 > is. Imagine how bad it would be to overestimate your classifier's accuracy
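
The reasoning in this note is that the test set must stay untouched until the final accuracy estimate. A minimal sketch of that workflow (not taken from the book), assuming the tidymodels `initial_split`/`training`/`testing` helpers and using the built-in `iris` data as a stand-in:

library(tidymodels)

set.seed(1234)
# Hold out 25% of the rows; everything that builds the model may only
# ever see the training portion.
iris_split <- initial_split(iris, prop = 0.75, strata = Species)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)

# ... preprocessing, tuning, and fitting use iris_train only ...

# The held-out iris_test is used exactly once, at the very end,
# to estimate how well the final model predicts unseen observations.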

clustering.Rmd

Lines changed: 12 additions & 6 deletions
@@ -87,7 +87,7 @@ principal component analysis, multidimensional scaling, and more;
 see the additional resources section at the end of this chapter
 for where to begin learning more about these other methods.
 
-> There are also so-called *semisupervised* tasks, \index{semisupervised}
+> **Note:** There are also so-called *semisupervised* tasks, \index{semisupervised}
 > where only some of the data come with response variable labels/values,
 > but the vast majority don't.
 > The goal is to try to uncover underlying structure in the data
@@ -110,7 +110,7 @@ there are distinct types of penguins in our data.
 Understanding this might help us with species discovery and classification in a data-driven
 way.
 
-```{r 09-penguins, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 3, fig.width = 4, fig.cap = "Gentoo penguin.", fig.retina = 2}
+```{r 09-penguins, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Gentoo penguin.", out.width="60%", fig.align = "center", fig.retina = 2}
 # image source: https://commons.wikimedia.org/wiki/File:Gentoo_Penguin._(8671680772).jpg
 knitr::include_graphics("img/gentoo.jpg")
 ```
@@ -142,6 +142,7 @@ data <- read_csv("data/toy_penguins.csv") |>
 
 penguin_data <- data |> select(flipper_length_standardized,
                                bill_length_standardized)
+
 write_csv(penguin_data, "data/penguins_standardized.csv")
 ```
 
@@ -431,6 +432,7 @@ where the left column depicts the center update,
 and the right column depicts the reassignment of data to clusters.
 
 **Center Update** &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp;**Label Update**
+
 ```{r 10-toy-kmeans-iter, echo = FALSE, warning = FALSE, fig.height = 16, fig.width = 8, fig.cap = "First four iterations of K-means clustering on the `penguin_data` example data set. Each row corresponds to an iteration, where the left column depicts the center update, and the right column depicts the reassignment of data to clusters. Cluster centers are indicated by larger points that are outlined in black."}
 list_plot_cntrs <- vector(mode = "list", length = 4)
 list_plot_lbls <- vector(mode = "list", length = 4)
@@ -527,7 +529,6 @@ plt_lbl <- ggplot(penguin_data, aes(y = bill_length_standardized,
 plt_lbl
 ```
 
-
 Figure \@ref(fig:10-toy-kmeans-bad-iter) shows what the iterations of K-means would look like with the unlucky random initialization shown in Figure \@ref(fig:10-toy-kmeans-bad-init).
 
 **Center Update** &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp;**Label Update**
@@ -838,12 +839,17 @@ penguin_clust_ks
 
 If we wanted to get one of the clusterings out
 of the list column in the data frame,
-we can use our familiar friends `slice` and `pull`.
+we can a familiar friend: `pull`.
+`pull` will return to us a data frame column as a simpler data structure,
+here that would be a list.
+And then to extract the first item of the list,
+we can use the `pluck` function;
+passing it the index for the element we would like to extract (here 1).
 
 ```{r}
 penguin_clust_ks |>
-  slice(1) |>
-  pull(penguin_clusts)
+  pull(penguin_clusts) |>
+  pluck(1)
 ```
 
 Next, we use `mutate` again to apply `glance` \index{glance}
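
The new prose in the last hunk explains the replacement code: `pull()` extracts the list column as an ordinary list, and `pluck(1)` then takes its first element. A small self-contained sketch of the same pattern on a toy tibble (the `toy_ks` data frame and its `fits` column are hypothetical stand-ins for `penguin_clust_ks` and `penguin_clusts`):

library(dplyr)
library(purrr)

# Toy stand-in: one row per choice of k, with a list column of fitted objects.
toy_ks <- tibble(
  k = 1:3,
  fits = list("clustering for k = 1", "clustering for k = 2", "clustering for k = 3")
)

toy_ks |>
  pull(fits) |> # pull() returns the list column as a plain list
  pluck(1)      # pluck(1) extracts the first element of that list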

inference.Rmd

Lines changed: 8 additions & 8 deletions
@@ -1,4 +1,4 @@
-# Introduction to statistical inference {#inference}
+# Statistical inference {#inference}
 
 ```{r inference-setup, include = FALSE}
 knitr::opts_chunk$set(warning = FALSE, fig.align = "center")
@@ -512,30 +512,30 @@ sample_estimates_500 <- rep_sample_n(airbnb, size = 500, reps = 20000) |>
 sampling_distribution_20 <- ggplot(sample_estimates_20, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night(Canadian dollars)") +
+  xlab("Sample mean price per night\n(Canadian dollars)") +
   ggtitle("n = 20")
 
 ## Sampling distribution n = 50
 sampling_distribution_50 <- ggplot(sample_estimates_50, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night(Canadian dollars)") +
+  xlab("Sample mean price per night\n(Canadian dollars)") +
   ggtitle("n = 50") +
   xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20))
 
 ## Sampling distribution n = 100
 sampling_distribution_100 <- ggplot(sample_estimates_100, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night (Canadian dollars)") +
+  xlab("Sample mean price per night\n(Canadian dollars)") +
   ggtitle("n = 100") +
   xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20))
 
 ## Sampling distribution n = 500
 sampling_distribution_500 <- ggplot(sample_estimates_500, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night (Canadian dollars)") +
+  xlab("Sample mean price per night\n(Canadian dollars)") +
   ggtitle("n = 500") +
   xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20))
 ```
@@ -739,7 +739,7 @@ called **the bootstrap**. Note that by taking many samples from our single, obs
 sample, we do not obtain the true sampling distribution, but rather an
 approximation that we call **the bootstrap distribution**. \index{bootstrap!distribution}
 
-> Note that we must sample *with* replacement when using the bootstrap.
+> **Note:** we must sample *with* replacement when using the bootstrap.
 > Otherwise, if we had a sample of size $n$, and obtained a sample from it of
 > size $n$ *without* replacement, it would just return our original sample!
 
@@ -876,7 +876,7 @@ tail(boot20000_means)
 boot_est_dist <- ggplot(boot20000_means, aes(x = mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night (Canadian dollars)")
+  xlab("Sample mean price per night \n (Canadian dollars)")
 
 boot_est_dist
 ```
@@ -894,7 +894,7 @@ sample_estimates <- samples |>
 sampling_dist <- ggplot(sample_estimates, aes(x = sample_mean)) +
   geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
-  xlab("Sample mean price per night (Canadian dollars)")
+  xlab("Sample mean price per night \n (Canadian dollars)")
 
 annotated_sampling_dist <- sampling_dist +
   xlim(min_x(sampling_dist), max_x(sampling_dist)) +
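
The reworded bootstrap note rests on one fact: drawing $n$ items *without* replacement from a sample of size $n$ merely reshuffles the original sample, so its mean never changes, whereas sampling *with* replacement produces genuinely different bootstrap samples. A tiny base-R sketch of that point (the numbers are made up for illustration):

set.seed(4321)
original_sample <- c(3, 7, 7, 10, 12) # hypothetical observed sample
n <- length(original_sample)

# Without replacement: every "resample" is a permutation of the original,
# so its mean is identical every time.
mean(sample(original_sample, size = n, replace = FALSE))

# With replacement: values can repeat or be left out, so the mean varies
# from one bootstrap sample to the next.
mean(sample(original_sample, size = n, replace = TRUE))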

intro.Rmd

Lines changed: 25 additions & 14 deletions
@@ -94,8 +94,8 @@ the analysis as well as the selection of appropriate tools.\index{question!data
 
 Table: (\#tab:questions-table) Types of data analysis question [@leek2015question; @peng2015art].
 
-| Question type | Description | Example |
-|---------------|-------------|---------|
+|Question type| Description | Example |
+|-------------|------------------------|--------------------|
 | Descriptive | A question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact). | How many people live in each province and territory in Canada? |
 | Exploratory | A question asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? |
 | Predictive | A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. | What political party will someone vote for in the next Canadian election? |
@@ -253,7 +253,9 @@ file satisfies everything else that the `read_csv` function expects in the defau
 use-case. Figure \@ref(fig:img-read-csv) describes how we use the `read_csv`
 to read data into R.
 
-``` {r img-read-csv, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Syntax for the read_csv function.", fig.retina = 2}
+(ref:img-read-csv) Syntax for the `read_csv` function.
+
+``` {r img-read-csv, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:img-read-csv)", fig.retina = 2, out.width="100%"}
 knitr::include_graphics("img/read_csv_function.jpeg")
 ```
 
@@ -279,7 +281,7 @@ to the data frame that `read_csv` outputs,
 so that we can refer to it later for analysis and visualization.
 
 The way to assign a name to a value in R is via the *assignment symbol* `<-`.
-\index{assignsymb@\texttt{<-}|see{assignment symbol}}\index{assignment symbol}
+\index{aaaassignsymb@\texttt{<-}|see{assignment symbol}}\index{assignment symbol}
 On the left side of the assignment symbol you put the name that you want
 to use, and on the right side of the assignment symbol
 you put the value that you want the name to refer to.
@@ -298,7 +300,7 @@ we do not need to surround the name we are creating with quotes. This is
 because we are formally telling R that this special word denotes
 the value of whatever is on the right hand side.
 Only characters and words that act as *values* on the right hand side of the assignment
-symbol&mdash;e.g., the file name `"data/can_lang.csv"` that we specified before, or `"Alice"` above&mdash;&mdash;need
+symbol&mdash;e.g., the file name `"data/can_lang.csv"` that we specified before, or `"Alice"` above&mdash;need
 to be surrounded by quotes.
 
 After making the assignment, we can use the special name words we have created in
@@ -388,7 +390,9 @@ is a string *value* \index{string} and not one of the special words that make up
 programming language, nor one of the names we have given to data frames in the
 code we have already written.
 
-``` {r img-filter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Syntax for the filter function.", out.width="100%", fig.retina = 2}
+(ref:img-filter) Syntax for the `filter` function.
+
+```{r img-filter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:img-filter)", fig.retina = 2, out.width="100%"}
 knitr::include_graphics("img/filter_function.jpeg")
 ```
 
@@ -421,7 +425,9 @@ able to name things in R is useful: you can see that we are using the
 result of our earlier `filter` step (which we named `aboriginal_lang`) here
 in the next step of the analysis!
 
-``` {r img-select, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Syntax for the select function.", out.width="100%", fig.retina = 2}
+(ref:img-select) Syntax for the `select` function.
+
+``` {r img-select, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:img-select)", out.width="100%", fig.retina = 2}
 knitr::include_graphics("img/select_function.jpeg")
 ```
 
@@ -430,6 +436,7 @@ knitr::include_graphics("img/select_function.jpeg")
 selected_lang <- select(aboriginal_lang, language, mother_tongue)
 selected_lang
 ```
+
 ### Using `arrange` to order and `slice` to select rows by index number
 
 We have used `filter` and `select` to obtain a table with only the Aboriginal
@@ -449,7 +456,9 @@ language, we will use the `arrange` function to order the rows in our
 arrange the rows in descending order (from largest to smallest),
 so we pass the column to the `desc` function before using it as an argument.
 
-``` {r img-arrange, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Syntax for the arrange function.", out.width="100%", fig.retina = 2}
+(ref:img-arrange) Syntax for the `arrange` function.
+
+``` {r img-arrange, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:img-arrange)", out.width="100%", fig.retina = 2}
 knitr::include_graphics("img/arrange_function.jpeg")
 ```
 
@@ -503,7 +512,9 @@ function and its common usage is illustrated in Figure \@ref(fig:img-ggplot).
 Figure \@ref(fig:barplot-mother-tongue) shows the resulting bar plot
 generated by following the instructions in Figure \@ref(fig:img-ggplot).
 
-```{r img-ggplot, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Creating a bar plot with the ggplot function.", out.width="100%", fig.retina = 2}
+(ref:img-ggplot) Creating a bar plot with the `ggplot` function.
+
+```{r img-ggplot, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "(ref:img-ggplot)", out.width="100%", fig.retina = 2}
 knitr::include_graphics("img/ggplot_function.jpeg")
 ```
 
@@ -516,7 +527,7 @@ ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
 > time, a single expression in R must be contained in a single line of code.
 > However, there *are* a small number of situations in which you can have a
 > single R expression span multiple lines. Above is one such case: here, R knows that a line cannot
-> end with a `+` symbol, \index{plussymb@$+$} and so it keeps reading the next line to figure out
+> end with a `+` symbol, \index{aaaplussymb@$+$|see{ggplot (add layer)}} and so it keeps reading the next line to figure out
 > what the right hand side of the `+` symbol should be. We could, of course,
 > put all of the added layers on one line of code, but splitting them across
 > multiple lines helps a lot with code readability. \index{multi-line expression}
@@ -591,7 +602,7 @@ were, according to the 2016 Candian census, and how many people speak each of th
 instance, we can see that the Aboriginal language most often reported was Cree
 n.o.s. with over 60,000 Canadian residents reporting it as their mother tongue.
 
-> "n.o.s." means "not otherwise specified", so Cree n.o.s. refers to
+> **Note:** "n.o.s." means "not otherwise specified", so Cree n.o.s. refers to
 > individuals who reported Cree as their mother tongue. In this data set, the
 > Cree languages include the following categories: Cree n.o.s., Swampy Cree,
 > Plains Cree, Woods Cree, and a 'Cree not included elsewhere' category (which
@@ -609,7 +620,7 @@ grey to white to improve the contrast. We have also actually skipped the
 in the `ggplot` function, you don't actually need to `select` the columns in advance
 when creating a visualization. And finally, we provided *comments* next to
 many of the lines of code below using the
-hash symbol `#`. When R sees a `#` sign, \index{comment} \index{commentsymb@\#|see{comment}} it
+hash symbol `#`. When R sees a `#` sign, \index{comment} \index{aaacommentsymb@\#|see{comment}} it
 will ignore all of the text that
 comes after the symbol on that line. So you can use comments to explain lines
 of code for others, and perhaps more importantly, your future self!
@@ -650,7 +661,7 @@ There are many R functions in the `tidyverse` package (and beyond!), and
 nobody can be expected to remember what every one of them does
 nor all of the arguments we have to give them. Fortunately R provides
 the `?` symbol, which
-\index{questionmark@? symbol|see{documentation}}
+\index{aaaquestionmark@?|see{documentation}}
 \index{help|see{documentation}}
 \index{documentation} provides an easy way to pull up the documentation for
 most functions quickly. To use the `?` symbol to access documentation, you
@@ -672,6 +683,6 @@ documentation like that shown in Figure \@ref(fig:01-help). But do keep in mind
 is not written to *teach* you about a function; it is just there as a reference to *remind*
 you about the different arguments and usage of functions that you have already learned about elsewhere.
 
-```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.", fig.retina = 2}
+```{r 01-help, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "The documentation for the `filter` function, including a high-level description, a list of arguments and their meanings, and more.", fig.retina = 2, out.width="100%"}
 knitr::include_graphics("img/help-filter.png")
 ```
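
Several hunks above replace literal `fig.cap` strings with bookdown text references, which allow markdown (such as backtick code font) inside figure captions: the `(ref:label)` line defines the caption text, and the chunk points at it with `fig.cap = "(ref:label)"`. A minimal sketch of the pattern, with a hypothetical chunk label and image path:

(ref:my-figure) A caption that can safely use `code font` and other markdown.

```{r my-figure, echo = FALSE, fig.cap = "(ref:my-figure)", out.width = "100%", fig.retina = 2}
knitr::include_graphics("img/my_figure.jpeg")
```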

jupyter.Rmd

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ to make a conscious effort to perform data analysis in a reproducible manner.
 An example of what a Jupyter notebook looks like is shown in
 Figure \@ref(fig:img-jupyter).
 
-```{r img-jupyter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "A screenshot of a Jupyter Notebook.", fig.retina = 2}
+```{r img-jupyter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "A screenshot of a Jupyter Notebook.", fig.retina = 2, out.width="100%"}
 knitr::include_graphics("img/jupyter.png")
 ```
