
Commit 57be6d1

Merge branch 'dev' into rohan-edits
2 parents: 599761e + 47d458a

11 files changed (+70, -71 lines changed)

classification1.Rmd

Lines changed: 1 addition & 1 deletion
@@ -147,7 +147,7 @@ Traditionally these procedures were quite invasive; modern methods such as fine
 needle aspiration, used to collect the present data set, extract only a small
 amount of tissue and are less invasive. Based on a digital image of each breast
 tissue sample collected for this data set, ten different variables were measured
-for each cell nucleus in the image (items 3-12 of the list of variables below), and then the mean
+for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean
 for each variable across the nuclei was recorded. As part of the
 data preparation, these values have been *standardized (centered and scaled)*; we will discuss what this
 means and why we do it later in this chapter. Each image additionally was given

classification2.Rmd

Lines changed: 7 additions & 7 deletions
@@ -643,7 +643,7 @@ knitr::include_graphics("img/cv.png")
 ```
 
 To perform 5-fold cross-validation in R with `tidymodels`, we use another
-function: `vfold_cv`. \index{tidymodels!vfold\_cv}\index{cross validation!vfold\_cv} This function splits our training data into `v` folds
+function: `vfold_cv`. \index{tidymodels!vfold\_cv}\index{cross-validation!vfold\_cv} This function splits our training data into `v` folds
 automatically. We set the `strata` argument to the categorical label variable
 (here, `Class`) to ensure that the training and validation subsets contain the
 right proportions of each category of observation.
@@ -653,7 +653,7 @@ cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
 cancer_vfold
 ```
 
-Then, when we create our data analysis workflow, we use the `fit_resamples` function \index{cross validation!fit\_resamples}\index{tidymodels!fit\_resamples}
+Then, when we create our data analysis workflow, we use the `fit_resamples` function \index{cross-validation!fit\_resamples}\index{tidymodels!fit\_resamples}
 instead of the `fit` function for training. This runs cross-validation on each
 train/validation split.
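
For context while reading this hunk: a minimal sketch of the cross-validation fit the changed sentence describes, assuming `cancer_recipe` and `knn_spec` were defined earlier in the chapter and `cancer_vfold` is the 5-fold split created with `vfold_cv` above (illustration only, not code from this commit).

```r
library(tidymodels)

# Fit the K-NN workflow once per fold instead of once on the whole training set.
knn_fit <- workflow() |>
  add_recipe(cancer_recipe) |>   # assumed preprocessing recipe from the chapter
  add_model(knn_spec) |>         # assumed K-NN model specification
  fit_resamples(resamples = cancer_vfold)
```
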
@@ -679,7 +679,7 @@ knn_fit <- workflow() |>
 knn_fit
 ```
 
-The `collect_metrics` \index{tidymodels!collect\_metrics}\index{cross validation!collect\_metrics} function is used to aggregate the *mean* and *standard error*
+The `collect_metrics` \index{tidymodels!collect\_metrics}\index{cross-validation!collect\_metrics} function is used to aggregate the *mean* and *standard error*
 of the classifier's validation accuracy across the folds. You will find results
 related to the accuracy in the row with `accuracy` listed under the `.metric` column.
 You should consider the mean (`mean`) to be the estimated accuracy, while the standard
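
A short illustrative follow-up, assuming `knn_fit` is the `fit_resamples` result sketched above: pulling out the accuracy row that the changed sentence refers to.

```r
library(tidymodels)

# Mean and standard error of validation accuracy across the five folds.
knn_fit |>
  collect_metrics() |>
  filter(.metric == "accuracy")
```
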
@@ -747,7 +747,7 @@ knn_spec <- nearest_neighbor(weight_func = "rectangular",
 set_mode("classification")
 ```
 
-Then instead of using `fit` or `fit_resamples`, we will use the `tune_grid` function \index{cross validation!tune\_grid}\index{tidymodels!tune\_grid}
+Then instead of using `fit` or `fit_resamples`, we will use the `tune_grid` function \index{cross-validation!tune\_grid}\index{tidymodels!tune\_grid}
 to fit the model for each value in a range of parameter values.
 In particular, we first create a data frame with a `neighbors`
 variable that contains the sequence of values of $K$ to try; below we create the `k_vals`
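
A hedged sketch of the tuning step this hunk describes; `cancer_recipe` and `cancer_vfold` are assumed from earlier in the chapter, and the grid values are illustrative only.

```r
library(tidymodels)

# Candidate numbers of neighbors to try.
k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

# Mark K as tunable in the model specification.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validate the model once per value in the grid, then aggregate the metrics.
knn_results <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_vals) |>
  collect_metrics()
```
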
@@ -1176,9 +1176,9 @@ Best subset selection is applicable to any classification method ($K$-NN or othe
 However, it becomes very slow when you have even a moderate
 number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets
 grows very quickly with the number of predictors, and you have to train the model (itself
-a slow process!) for each one. For example, if we have $2$ predictors---let's call
-them A and B---then we have 3 variable sets to try: A alone, B alone, and finally A
-and B together. If we have $3$ predictors---A, B, and C---then we have 7
+a slow process!) for each one. For example, if we have $2$ predictors&mdash;let's call
+them A and B&mdash;then we have 3 variable sets to try: A alone, B alone, and finally A
+and B together. If we have $3$ predictors&mdash;A, B, and C&mdash;then we have 7
 to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models
 we have to train for $m$ predictors is $2^m-1$; in other words, when we
 get to $10$ predictors we have over *one thousand* models to train, and
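
The "over one thousand" claim follows directly from the $2^m-1$ formula; a one-line check in R:

```r
2^10 - 1   # number of candidate predictor subsets for 10 predictors
#> [1] 1023
```
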

clustering.Rmd

Lines changed: 5 additions & 5 deletions
@@ -107,7 +107,7 @@ collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail
 the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/), and includes
 measurements for adult penguins found near there [@palmerpenguins]. We have
 modified the data set for use in this chapter. Here we will focus on using two
-variables---penguin bill and flipper length, both in millimeters---to determine whether
+variables&mdash;penguin bill and flipper length, both in millimeters&mdash;to determine whether
 there are distinct types of penguins in our data.
 Understanding this might help us with species discovery and classification in a data-driven
 way.
@@ -332,7 +332,7 @@ base <- base +
 base
 ```
 
-The larger the value of $S^2$, the more spread-out the cluster is, since large $S^2$ means that points are far from the cluster center.
+The larger the value of $S^2$, the more spread out the cluster is, since large $S^2$ means that points are far from the cluster center.
 Note, however, that "large" is relative to *both* the scale of the variables for clustering *and* the number of points in the cluster. A cluster where points are very close to the center might still have a large $S^2$ if there are many data points in the cluster.
 
 After we have calculated the WSSD for all the clusters,
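
A small self-contained illustration (not from the book) of the within-cluster sum-of-squared-distances $S^2$ for a single toy cluster:

```r
library(tidyverse)

cluster <- tibble(x = c(1, 2, 3), y = c(4, 6, 5))
center  <- summarize(cluster, x = mean(x), y = mean(y))  # cluster center

# Sum of squared distances from each point to the center.
wssd <- sum((cluster$x - center$x)^2 + (cluster$y - center$y)^2)
wssd
#> [1] 4
```
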
@@ -591,7 +591,7 @@ These, however, are beyond the scope of this book.
 
 ### Random restarts
 
-Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart,nstart} can get "stuck" in a bad solution.
+Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
 For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
 
 ```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "Random initialization of labels."}
@@ -859,7 +859,7 @@ each other. Therefore, the *scale* of each of the variables in the data
 will influence which cluster data points end up being assigned.
 Variables with a large scale will have a much larger
 effect on deciding cluster assignment than variables with a small scale.
-To address this problem, we typically standardize \index{standardization!K-means}\index{K-means!stanardization} our data before clustering,
+To address this problem, we typically standardize \index{standardization!K-means}\index{K-means!standardization} our data before clustering,
 which ensures that each variable has a mean of 0 and standard deviation of 1.
 The `scale` function in R can be used to do this.
 We show an example of how to use this function
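
A hedged sketch of that standardization step; it uses the raw `penguins` data from the `palmerpenguins` package rather than the book's modified data set.

```r
library(tidyverse)
library(palmerpenguins)

standardized_data <- penguins |>
  select(bill_length_mm, flipper_length_mm) |>
  drop_na() |>
  mutate(across(everything(), ~ as.numeric(scale(.x))))  # mean 0, sd 1 per column
```
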
@@ -1050,7 +1050,7 @@ But why is there a "bump" in the total WSSD plot here?
 Shouldn't total WSSD always decrease as we add more clusters?
 Technically yes, but remember: K-means can get "stuck" in a bad solution.
 Unfortunately, for K = 8 we had an unlucky initialization
-and found a bad clustering! \index{K-means!restart,nstart}
+and found a bad clustering! \index{K-means!restart, nstart}
 We can help prevent finding a bad clustering
 by trying a few different random initializations
 via the `nstart` argument (Figure \@ref(fig:10-choose-k-nstart)
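
A minimal sketch of the `nstart` argument mentioned here; the data frame name comes from the standardization sketch above, not from this commit.

```r
set.seed(1234)
# nstart = 10 runs ten random initializations and keeps the one with the
# lowest total within-cluster sum of squared distances.
penguin_clust <- kmeans(standardized_data, centers = 8, nstart = 10)
penguin_clust$tot.withinss
```
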

inference.Rmd

Lines changed: 4 additions & 4 deletions
@@ -750,9 +750,9 @@ For a sample of size $n$, you would do the following:
 1. Randomly select an observation from the original sample, which was drawn from the population.
 2. Record the observation's value.
 3. Replace that observation.
-4. Repeat steps 1 - 3 (sampling *with* replacement) until you have $n$ observations, which form a bootstrap sample.
+4. Repeat steps 1&ndash;3 (sampling *with* replacement) until you have $n$ observations, which form a bootstrap sample.
 5. Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the $n$ observations in your bootstrap sample.
-6. Repeat steps (1) - (5) many times to create a distribution of point estimates (the bootstrap distribution).
+6. Repeat steps 1&ndash;5 many times to create a distribution of point estimates (the bootstrap distribution).
 7. Calculate the plausible range of values around our observed point estimate.
 
 ```{r 11-intro-bootstrap-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of the bootstrap process.", fig.retina = 2, out.width="100%"}
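
A base R sketch of steps 1 to 7, assuming `one_sample` is a numeric vector holding the single observed sample (illustration only, not the chapter's implementation):

```r
set.seed(1)
n <- length(one_sample)

boot_means <- replicate(1000, {                                # step 6: repeat many times
  boot_sample <- sample(one_sample, size = n, replace = TRUE)  # steps 1-4: resample with replacement
  mean(boot_sample)                                            # step 5: bootstrap point estimate
})

quantile(boot_means, c(0.025, 0.975))                          # step 7: plausible range of values
```
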
@@ -789,7 +789,7 @@ mean of the sample is \$`r round(estimates$sample_mean, 2)`.
 Remember, in practice, we usually only have this one sample from the population. So
 this sample and estimate are the only data we can work with.
 
-We now perform steps (1) - (5) listed above to generate a single bootstrap
+We now perform steps 1&ndash;5 listed above to generate a single bootstrap
 sample in R and calculate a point estimate from that bootstrap sample. We will
 use the `rep_sample_n` function as we did when we were
 creating our sampling distribution. But critically, note that we now
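
A hedged sketch of that single bootstrap sample; `one_sample` and its `price` column are assumed names based on the surrounding text, and the key detail is `replace = TRUE`.

```r
library(tidyverse)
library(infer)

boot1 <- one_sample |>
  rep_sample_n(size = nrow(one_sample), replace = TRUE, reps = 1)

# Point estimate from this one bootstrap sample.
boot1 |>
  summarize(mean_price = mean(price))
```
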
@@ -1173,4 +1173,4 @@ found in Chapter \@ref(move-to-your-own-machine).
 ## Additional resources
 
 - Chapters 7 to 10 of [*Modern Dive*](https://moderndive.com/) provide a great next step in learning about inference. In particular, Chapters 7 and 8 cover sampling and bootstrapping using `tidyverse` and `infer` in a slightly more in-depth manner than the present chapter. Chapters 9 and 10 take the next step beyond the scope of this chapter and begin to provide some of the initial mathematical underpinnings of inference and more advanced applications of the concept of inference in testing hypotheses and performing regression. This material offers a great starting point for getting more into the technical side of statistics.
-- Chapters 4 to 7 of [*OpenIntro Statistics - Fourth Edition*](https://www.openintro.org/) provide a good next step after *Modern Dive*. Although it is still certainly an introductory text, things get a bit more mathematical here. Depending on your background, you may actually want to start going through Chapters 1 to 3 first, where you will learn some fundamental concepts in probability theory. Although it may seem like a diversion, probability theory is *the language of statistics*; if you have a solid grasp of probability, more advanced statistics will come naturally to you!
+- Chapters 4 to 7 of [*OpenIntro Statistics*](https://www.openintro.org/) provide a good next step after *Modern Dive*. Although it is still certainly an introductory text, things get a bit more mathematical here. Depending on your background, you may actually want to start going through Chapters 1 to 3 first, where you will learn some fundamental concepts in probability theory. Although it may seem like a diversion, probability theory is *the language of statistics*; if you have a solid grasp of probability, more advanced statistics will come naturally to you!

jupyter.Rmd

Lines changed: 5 additions & 5 deletions
@@ -144,7 +144,7 @@ that indicates the status of your kernel. If the circle is empty (`r fa("circle"
 the kernel is idle and ready to execute code. If the circle is filled in (`r fa("circle", fill = "black", stroke = "black", stroke_width = "10px", height = "12px")`),
 the kernel is busy running some code.
 
-You may run into problems where your kernel \index{kernel!interrupt,restart} is stuck for an excessive amount
+You may run into problems where your kernel \index{kernel!interrupt, restart} is stuck for an excessive amount
 of time, your notebook is very slow and unresponsive, or your kernel loses its
 connection. If this happens, try the following steps:
@@ -245,8 +245,8 @@ referenced in another distinct code cell (Figure \@ref(fig:out-of-order-1)).
 Together, this means that you could then write a code cell further above in the
 notebook that references `y` and execute it without error in the current session
 (Figure \@ref(fig:out-of-order-2)). This could also be done successfully in
-future sessions if, and only if, you run the cells in the same non-conventional
-order. However, it is difficult to remember this non-conventional order, and it
+future sessions if, and only if, you run the cells in the same unconventional
+order. However, it is difficult to remember this unconventional order, and it
 is not the order that others would expect your code to be executed in. Thus, in
 the future, this would lead
 to errors when the notebook is run in the conventional
@@ -287,7 +287,7 @@ is an issue. Knowing this sooner rather than later will allow you to
 fix the issue and ensure your notebook can be run linearly from start to finish.
 
 We recommend as a best practice to run the entire notebook in a fresh R session
-at least 2-3 times within any period of work. Note that,
+at least 2&ndash;3 times within any period of work. Note that,
 critically, you *must do this in a fresh R session* by restarting your kernel.
 We recommend using either the **Kernel** >>
 **Restart Kernel and Run All Cells...** command from the menu or the `r fa("fast-forward", height = "11px")`
@@ -328,7 +328,7 @@ their computer to run the analysis successfully.
 1. Write code so that it can be executed in a linear order.
 
 2. As you write code in a Jupyter notebook, run the notebook in a linear order
-and in its entirety often (2-3 times every work session) via the **Kernel** >>
+and in its entirety often (2&ndash;3 times every work session) via the **Kernel** >>
 **Restart Kernel and Run All Cells...** command from the Jupyter menu or the `r fa("fast-forward", height = "11px")`
 button in the toolbar.

reading.Rmd

Lines changed: 10 additions & 10 deletions
@@ -61,7 +61,7 @@ into R, but before we can talk about *how* we read the data into R with these
 functions, we first need to talk about *where* the data lives. When you load a
 data set into R, you first need to tell R where those files live. The file
 could live on your computer (*local*)
-\index{location|see{path}} \index{path!local,remote,relative,absolute}
+\index{location|see{path}} \index{path!local, remote, relative, absolute}
 or somewhere on the internet (*remote*).
 
 The place where the file lives on your computer is called the "path". You can
@@ -810,10 +810,10 @@ directly from what a website displays is called \index{web scraping}
 information manually is a painstaking and error-prone process, especially when
 there is a lot of information to gather. So instead of asking your browser to
 translate the information that the web server provides into something you can
-see, you can collect that data programmatically---in the form of
+see, you can collect that data programmatically&mdash;in the form of
 **h**yper**t**ext **m**arkup **l**anguage
 (HTML) \index{hypertext markup language|see{HTML}}\index{cascading style sheet|see{CSS}}\index{CSS}\index{HTML}
-and **c**ascading **s**tyle **s**heet (CSS) code---and process it
+and **c**ascading **s**tyle **s**heet (CSS) code&mdash;and process it
 to extract useful information. HTML provides the
 basic structure of a site and tells the webpage how to display the content
 (e.g., titles, paragraphs, bullet lists etc.), whereas CSS helps style the
@@ -921,8 +921,8 @@ take a look at another line of the source snippet above:
 
 It's yet another price for an apartment listing, and the tags surrounding it
 have the `"result-price"` class. Wonderful! Now that we know what pattern we
-are looking for---a dollar amount between opening and closing tags that have the
-`"result-price"` class---we should be able to use code to pull out all of the
+are looking for&mdash;a dollar amount between opening and closing tags that have the
+`"result-price"` class&mdash;we should be able to use code to pull out all of the
 matching patterns from the source code to obtain our data. This sort of "pattern"
 is known as a *CSS selector* (where CSS stands for **c**ascading **s**tyle **s**heet).
@@ -1063,7 +1063,7 @@ population_nodes <- html_nodes(page, selectors)
 head(population_nodes)
 ```
 
-Next we extract the meaningful data---in other words, we get rid of the HTML code syntax and tags---from
+Next we extract the meaningful data&mdash;in other words, we get rid of the HTML code syntax and tags&mdash;from
 the nodes using the `html_text`
 function. In the case of the example
 node above, `html_text` function returns `"London"`.
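
A self-contained toy version of the pattern described in these hunks; the HTML snippet is a stand-in, not the chapter's page, while `.result-price` is the CSS selector discussed above.

```r
library(rvest)

page <- read_html('<span class="result-price">$2,000</span>
                   <span class="result-price">$2,150</span>')

page |>
  html_nodes(".result-price") |>   # select nodes matching the CSS selector
  html_text()                      # strip the tags, keep the text
#> [1] "$2,000" "$2,150"
```
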
@@ -1090,7 +1090,7 @@ Rather than posting a data file at a URL for you to download, many websites thes
 provide an API \index{API} that must be accessed through a programming language like R. The benefit of this
 is that data owners have much more control over the data they provide to users. However, unlike
 web scraping, there is no consistent way to access an API across websites. Every website typically
-has its own API designed especially for its own use-case. Therefore we will just provide one example
+has its own API designed especially for its own use case. Therefore we will just provide one example
 of accessing data through an API in this book, with the hope that it gives you enough of a basic
 idea that you can learn how to use another API if needed.
@@ -1120,8 +1120,8 @@ knitr::include_graphics("img/tidyverse_twitter.png")
 When you access an API, you are initiating a transfer of data from a web server
 to your computer. Web servers are expensive to run and do not have infinite resources.
 If you try to ask for *too much data* at once, you can use up a huge amount of the server's bandwidth.
-If you try to ask for data *too frequently*---e.g., if you
-make many requests to the server in quick succession---you can also bog the server down and make
+If you try to ask for data *too frequently*&mdash;e.g., if you
+make many requests to the server in quick succession&mdash;you can also bog the server down and make
 it unable to talk to anyone else. Most servers have mechanisms to revoke your access if you are not
 careful, but you should try to prevent issues from happening in the first place by being extra careful
 with how you write and run your code. You should also keep in mind that when a website owner
@@ -1195,7 +1195,7 @@ tidyverse_tweets
 If you look back up at the image of the Tidyverse Twitter page, you will
 recognize the text of the most recent few tweets in the above data frame. In
 other words, we have successfully created a small data set using the Twitter
-API---neat! This data is also quite different from what we obtained from web scraping;
+API&mdash;neat! This data is also quite different from what we obtained from web scraping;
 it is already well-organized into a `tidyverse` data frame (although not *every* API
 will provide data in such a nice format).
 From this point onward, the `tidyverse_tweets` data frame is stored on your
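
For orientation only: the kind of call that produces a `tidyverse_tweets` data frame, assuming you have already set up your own Twitter API credentials with `rtweet` (without them the request will not be authorized, and the API itself may have changed since the chapter was written).

```r
library(rtweet)

# Fetch the 20 most recent tweets from the @tidyverse account.
tidyverse_tweets <- get_timeline("tidyverse", n = 20)
```
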
