
Commit fdcfc98

Merge branch 'dev' into index-refs-edits
2 parents cd0e28c + a2a078f

10 files changed: +49 additions, -49 deletions

classification1.Rmd

Lines changed: 1 addition & 1 deletion
@@ -147,7 +147,7 @@ Traditionally these procedures were quite invasive; modern methods such as fine
 needle aspiration, used to collect the present data set, extract only a small
 amount of tissue and are less invasive. Based on a digital image of each breast
 tissue sample collected for this data set, ten different variables were measured
-for each cell nucleus in the image (items 3-12 of the list of variables below), and then the mean
+for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean
 for each variable across the nuclei was recorded. As part of the
 data preparation, these values have been *standardized (centered and scaled)*; we will discuss what this
 means and why we do it later in this chapter. Each image additionally was given

classification2.Rmd

Lines changed: 3 additions & 3 deletions
@@ -1176,9 +1176,9 @@ Best subset selection is applicable to any classification method ($K$-NN or othe
 However, it becomes very slow when you have even a moderate
 number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets
 grows very quickly with the number of predictors, and you have to train the model (itself
-a slow process!) for each one. For example, if we have $2$ predictors---let's call
-them A and B---then we have 3 variable sets to try: A alone, B alone, and finally A
-and B together. If we have $3$ predictors---A, B, and C---then we have 7
+a slow process!) for each one. For example, if we have $2$ predictors—let's call
+them A and B—then we have 3 variable sets to try: A alone, B alone, and finally A
+and B together. If we have $3$ predictors—A, B, and C—then we have 7
 to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models
 we have to train for $m$ predictors is $2^m-1$; in other words, when we
 get to $10$ predictors we have over *one thousand* models to train, and
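
Aside: the counting argument in this hunk is easy to check directly. A minimal base R sketch (not part of the commit) that enumerates the non-empty predictor subsets:

```r
predictors <- c("A", "B", "C")
m <- length(predictors)

# All non-empty subsets, collected one subset size at a time
subsets <- unlist(
  lapply(seq_len(m), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)  # 7 candidate models, i.e., 2^3 - 1
2^10 - 1         # over one thousand once we reach 10 predictors
```
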

clustering.Rmd

Lines changed: 8 additions & 8 deletions
@@ -107,7 +107,7 @@ collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail
 the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/), and includes
 measurements for adult penguins found near there [@palmerpenguins]. We have
 modified the data set for use in this chapter. Here we will focus on using two
-variables---penguin bill and flipper length, both in millimeters---to determine whether
+variables—penguin bill and flipper length, both in millimeters—to determine whether
 there are distinct types of penguins in our data.
 Understanding this might help us with species discovery and classification in a data-driven
 way.
@@ -332,7 +332,7 @@ base <- base +
 base
 ```
 
-The larger the value of $S^2$, the more spread-out the cluster is, since large $S^2$ means that points are far from the cluster center.
+The larger the value of $S^2$, the more spread out the cluster is, since large $S^2$ means that points are far from the cluster center.
 Note, however, that "large" is relative to *both* the scale of the variables for clustering *and* the number of points in the cluster. A cluster where points are very close to the center might still have a large $S^2$ if there are many data points in the cluster.
 
 After we have calculated the WSSD for all the clusters,
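
Aside: a minimal base R sketch (not part of the commit) of the $S^2$ quantity this hunk edits, the within-cluster sum of squared distances (WSSD); the toy cluster values below are made up:

```r
# A made-up cluster of standardized measurements
cluster_data <- data.frame(
  flipper_length_standardized = c(-0.5, 0.1, 0.4),
  bill_length_standardized    = c(0.2, -0.3, 0.1)
)

center <- colMeans(cluster_data)                         # cluster center: the mean of each variable
deviations <- sweep(as.matrix(cluster_data), 2, center)  # each point minus the center
S2 <- sum(deviations^2)                                  # WSSD: grows with spread and with the number of points
S2
```
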
@@ -464,7 +464,7 @@ for (i in 1:4) {
 aes(y = bill_length_standardized,
 x = flipper_length_standardized,
 fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
@@ -498,7 +498,7 @@ for (i in 1:4) {
 geom_point(data = centers,
 aes(y = bill_length_standardized,
 x = flipper_length_standardized, fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
@@ -637,7 +637,7 @@ for (i in 1:5) {
 geom_point(data = centers, aes(y = bill_length_standardized,
 x = flipper_length_standardized,
 fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
@@ -670,7 +670,7 @@ for (i in 1:5) {
 geom_point(data = centers, aes(y = bill_length_standardized,
 x = flipper_length_standardized,
 fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
@@ -811,7 +811,7 @@ levels(clusters$k) <- clusters_levels
 
 p1 <- ggplot(assignments, aes(flipper_length_standardized,
 bill_length_standardized)) +
-geom_point(aes(color = .cluster, size = 1)) +
+geom_point(aes(color = .cluster, size = I(2))) +
 facet_wrap(~k) + scale_color_manual(values = cbbPalette) +
 labs(x = "Flipper Length (standardized)",
 y = "Bill Length (standardized)",
@@ -820,7 +820,7 @@ p1 <- ggplot(assignments, aes(flipper_length_standardized,
 geom_point(data = clusters,
 aes(fill = cluster),
 color = "black",
-size = 5,
+size = 4,
 shape = 21,
 stroke = 1) +
 scale_fill_manual(values = cbbPalette)

inference.Rmd

Lines changed: 4 additions & 4 deletions
@@ -750,9 +750,9 @@ For a sample of size $n$, you would do the following:
 1. Randomly select an observation from the original sample, which was drawn from the population.
 2. Record the observation's value.
 3. Replace that observation.
-4. Repeat steps 1 - 3 (sampling *with* replacement) until you have $n$ observations, which form a bootstrap sample.
+4. Repeat steps 1&ndash;3 (sampling *with* replacement) until you have $n$ observations, which form a bootstrap sample.
 5. Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the $n$ observations in your bootstrap sample.
-6. Repeat steps (1) - (5) many times to create a distribution of point estimates (the bootstrap distribution).
+6. Repeat steps 1&ndash;5 many times to create a distribution of point estimates (the bootstrap distribution).
 7. Calculate the plausible range of values around our observed point estimate.
 
 ```{r 11-intro-bootstrap-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of the bootstrap process.", fig.retina = 2, out.width="100%"}
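
Aside: a base R sketch (not part of the commit) of the bootstrap steps in this hunk; the chapter itself uses `rep_sample_n` from `infer`, and `sample_data` below is a made-up stand-in for the one observed sample:

```r
set.seed(1234)
sample_data <- rnorm(40, mean = 17, sd = 8)  # stand-in for the observed sample
n <- length(sample_data)

# Steps 1--4: draw n observations with replacement; step 5: compute the estimate;
# step 6: repeat many times to build the bootstrap distribution
boot_means <- replicate(1000, mean(sample(sample_data, size = n, replace = TRUE)))

# Step 7: a plausible range around the observed estimate (95% percentile interval)
quantile(boot_means, probs = c(0.025, 0.975))
```
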
@@ -789,7 +789,7 @@ mean of the sample is \$`r round(estimates$sample_mean, 2)`.
 Remember, in practice, we usually only have this one sample from the population. So
 this sample and estimate are the only data we can work with.
 
-We now perform steps (1) - (5) listed above to generate a single bootstrap
+We now perform steps 1&ndash;5 listed above to generate a single bootstrap
 sample in R and calculate a point estimate from that bootstrap sample. We will
 use the `rep_sample_n` function as we did when we were
 creating our sampling distribution. But critically, note that we now
@@ -1173,4 +1173,4 @@ found in Chapter \@ref(move-to-your-own-machine).
 ## Additional resources
 
 - Chapters 7 to 10 of [*Modern Dive*](https://moderndive.com/) provide a great next step in learning about inference. In particular, Chapters 7 and 8 cover sampling and bootstrapping using `tidyverse` and `infer` in a slightly more in-depth manner than the present chapter. Chapters 9 and 10 take the next step beyond the scope of this chapter and begin to provide some of the initial mathematical underpinnings of inference and more advanced applications of the concept of inference in testing hypotheses and performing regression. This material offers a great starting point for getting more into the technical side of statistics.
-- Chapters 4 to 7 of [*OpenIntro Statistics - Fourth Edition*](https://www.openintro.org/) provide a good next step after *Modern Dive*. Although it is still certainly an introductory text, things get a bit more mathematical here. Depending on your background, you may actually want to start going through Chapters 1 to 3 first, where you will learn some fundamental concepts in probability theory. Although it may seem like a diversion, probability theory is *the language of statistics*; if you have a solid grasp of probability, more advanced statistics will come naturally to you!
+- Chapters 4 to 7 of [*OpenIntro Statistics*](https://www.openintro.org/) provide a good next step after *Modern Dive*. Although it is still certainly an introductory text, things get a bit more mathematical here. Depending on your background, you may actually want to start going through Chapters 1 to 3 first, where you will learn some fundamental concepts in probability theory. Although it may seem like a diversion, probability theory is *the language of statistics*; if you have a solid grasp of probability, more advanced statistics will come naturally to you!
jupyter.Rmd

Lines changed: 4 additions & 4 deletions
@@ -245,8 +245,8 @@ referenced in another distinct code cell (Figure \@ref(fig:out-of-order-1)).
 Together, this means that you could then write a code cell further above in the
 notebook that references `y` and execute it without error in the current session
 (Figure \@ref(fig:out-of-order-2)). This could also be done successfully in
-future sessions if, and only if, you run the cells in the same non-conventional
-order. However, it is difficult to remember this non-conventional order, and it
+future sessions if, and only if, you run the cells in the same unconventional
+order. However, it is difficult to remember this unconventional order, and it
 is not the order that others would expect your code to be executed in. Thus, in
 the future, this would lead
 to errors when the notebook is run in the conventional
@@ -287,7 +287,7 @@ is an issue. Knowing this sooner rather than later will allow you to
 fix the issue and ensure your notebook can be run linearly from start to finish.
 
 We recommend as a best practice to run the entire notebook in a fresh R session
-at least 2-3 times within any period of work. Note that,
+at least 2&ndash;3 times within any period of work. Note that,
 critically, you *must do this in a fresh R session* by restarting your kernel.
 We recommend using either the **Kernel** >>
 **Restart Kernel and Run All Cells...** command from the menu or the `r fa("fast-forward", height = "11px")`
@@ -328,7 +328,7 @@ their computer to run the analysis successfully.
 1. Write code so that it can be executed in a linear order.
 
 2. As you write code in a Jupyter notebook, run the notebook in a linear order
-and in its entirety often (2-3 times every work session) via the **Kernel** >>
+and in its entirety often (2&ndash;3 times every work session) via the **Kernel** >>
 **Restart Kernel and Run All Cells...** command from the Jupyter menu or the `r fa("fast-forward", height = "11px")`
 button in the toolbar.

reading.Rmd

Lines changed: 9 additions & 9 deletions
@@ -810,10 +810,10 @@ directly from what a website displays is called \index{web scraping}
 information manually is a painstaking and error-prone process, especially when
 there is a lot of information to gather. So instead of asking your browser to
 translate the information that the web server provides into something you can
-see, you can collect that data programmatically---in the form of
+see, you can collect that data programmatically&mdash;in the form of
 **h**yper**t**ext **m**arkup **l**anguage
 (HTML) \index{hypertext markup language|see{HTML}}\index{cascading style sheet|see{CSS}}\index{CSS}\index{HTML}
-and **c**ascading **s**tyle **s**heet (CSS) code---and process it
+and **c**ascading **s**tyle **s**heet (CSS) code&mdash;and process it
 to extract useful information. HTML provides the
 basic structure of a site and tells the webpage how to display the content
 (e.g., titles, paragraphs, bullet lists etc.), whereas CSS helps style the
@@ -921,8 +921,8 @@ take a look at another line of the source snippet above:
 
 It's yet another price for an apartment listing, and the tags surrounding it
 have the `"result-price"` class. Wonderful! Now that we know what pattern we
-are looking for---a dollar amount between opening and closing tags that have the
-`"result-price"` class---we should be able to use code to pull out all of the
+are looking for&mdash;a dollar amount between opening and closing tags that have the
+`"result-price"` class&mdash;we should be able to use code to pull out all of the
 matching patterns from the source code to obtain our data. This sort of "pattern"
 is known as a *CSS selector* (where CSS stands for **c**ascading **s**tyle **s**heet).
 
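
Aside: a minimal sketch (not part of the commit) of the CSS selector idea in this hunk, assuming the `rvest` package; the two-listing page below is made up:

```r
library(rvest)

# A made-up page standing in for real scraped HTML
page <- minimal_html('
  <span class="result-price">$800</span>
  <span class="result-price">$1,265</span>
')

page |>
  html_nodes(".result-price") |>  # select every tag with the result-price class
  html_text()                     # drop the tags, keep the text: "$800" "$1,265"
```
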
@@ -1063,7 +1063,7 @@ population_nodes <- html_nodes(page, selectors)
 head(population_nodes)
 ```
 
-Next we extract the meaningful data---in other words, we get rid of the HTML code syntax and tags---from
+Next we extract the meaningful data&mdash;in other words, we get rid of the HTML code syntax and tags&mdash;from
 the nodes using the `html_text`
 function. In the case of the example
 node above, `html_text` function returns `"London"`.
@@ -1090,7 +1090,7 @@ Rather than posting a data file at a URL for you to download, many websites thes
 provide an API \index{API} that must be accessed through a programming language like R. The benefit of this
 is that data owners have much more control over the data they provide to users. However, unlike
 web scraping, there is no consistent way to access an API across websites. Every website typically
-has its own API designed especially for its own use-case. Therefore we will just provide one example
+has its own API designed especially for its own use case. Therefore we will just provide one example
 of accessing data through an API in this book, with the hope that it gives you enough of a basic
 idea that you can learn how to use another API if needed.
 
@@ -1120,8 +1120,8 @@ knitr::include_graphics("img/tidyverse_twitter.png")
 When you access an API, you are initiating a transfer of data from a web server
 to your computer. Web servers are expensive to run and do not have infinite resources.
 If you try to ask for *too much data* at once, you can use up a huge amount of the server's bandwidth.
-If you try to ask for data *too frequently*---e.g., if you
-make many requests to the server in quick succession---you can also bog the server down and make
+If you try to ask for data *too frequently*&mdash;e.g., if you
+make many requests to the server in quick succession&mdash;you can also bog the server down and make
 it unable to talk to anyone else. Most servers have mechanisms to revoke your access if you are not
 careful, but you should try to prevent issues from happening in the first place by being extra careful
 with how you write and run your code. You should also keep in mind that when a website owner
@@ -1195,7 +1195,7 @@ tidyverse_tweets
 If you look back up at the image of the Tidyverse Twitter page, you will
 recognize the text of the most recent few tweets in the above data frame. In
 other words, we have successfully created a small data set using the Twitter
-API---neat! This data is also quite different from what we obtained from web scraping;
+API&mdash;neat! This data is also quite different from what we obtained from web scraping;
 it is already well-organized into a `tidyverse` data frame (although not *every* API
 will provide data in such a nice format).
 From this point onward, the `tidyverse_tweets` data frame is stored on your
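
Aside: a sketch (not part of the commit) of the kind of call that produces `tidyverse_tweets`, assuming the `rtweet` package and an already-configured Twitter developer token:

```r
library(rtweet)

# Hypothetical call: pull the 50 most recent tweets from the @tidyverse
# account into a data frame (requires valid API credentials)
tidyverse_tweets <- get_timeline("tidyverse", n = 50)
```
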

regression1.Rmd

Lines changed: 4 additions & 4 deletions
@@ -39,7 +39,7 @@ use `tidymodels` workflows, we will use a K-nearest neighbors (KNN)
 approach to make predictions, and we will use cross-validation to choose K.
 Because of how similar these procedures are, make sure to read Chapters
 \@ref(classification) and \@ref(classification2) before reading
-this one---we will move a little bit faster here with the
+this one&mdash;we will move a little bit faster here with the
 concepts that have already been covered.
 This chapter will primarily focus on the case where there is a single predictor,
 but the end of the chapter shows how to perform
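
Aside: a sketch (not part of the commit) of the procedure this hunk refers to, KNN regression with K chosen by cross-validation in `tidymodels`; the data, recipe, and grid below are made up, and the `kknn` engine is assumed to be installed:

```r
library(tidymodels)

set.seed(2023)
# Made-up data: one numerical predictor, one numerical response
dat <- tibble(x = runif(100, 0, 10), y = 2 * x + rnorm(100))

# KNN regression specification with the number of neighbors left to tune
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Standardize the predictor, then evaluate a small grid of K by 5-fold CV
knn_results <- workflow() |>
  add_recipe(recipe(y ~ x, data = dat) |>
               step_center(all_predictors()) |>
               step_scale(all_predictors())) |>
  add_model(knn_spec) |>
  tune_grid(resamples = vfold_cv(dat, v = 5),
            grid = tibble(neighbors = c(1, 5, 10, 25)))

collect_metrics(knn_results)  # RMSE for each K; choose the K with the lowest RMSE
```
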
@@ -69,7 +69,7 @@ The variable that you want to predict is often called the *response variable*. \
 For example, we could try to use the number of hours a person spends on
 exercise each week to predict their race time in the annual Boston marathon. As
 another example, we could try to use the size of a house to
-predict its sale price. Both of these response variables---race time and sale price---are
+predict its sale price. Both of these response variables&mdash;race time and sale price&mdash;are
 numerical, and so predicting them given past data is considered a regression problem.
 
 Just like in the \index{classification!comparison to regression}
@@ -91,8 +91,8 @@ choices of model parameters (e.g., K in a K-nearest neighbors model). The major
 is that we are now predicting numerical variables instead of categorical variables.
 
 > **Note:** You can usually tell whether a \index{categorical variable}\index{numerical variable}
-> variable is numerical or categorical---and therefore whether you
-> need to perform regression or classification---by taking two response variables X and Y from your
+> variable is numerical or categorical&mdash;and therefore whether you
+> need to perform regression or classification&mdash;by taking two response variables X and Y from your
 > data, and asking the question, "is response variable X *more* than response variable Y?"
 > If the variable is categorical, the question will make no sense (Is blue more than red?
 > Is benign more than malignant?). If the variable is numerical, it will make sense
