
Commit 20b09b4

Merge pull request #332 from UBC-DSCI/noteboxes
Note box consistency
2 parents (f212085 + 7a4c9b4); commit 20b09b4

12 files changed: +50 -58 lines changed

classification1.Rmd

Lines changed: 3 additions & 2 deletions
@@ -1317,8 +1317,9 @@ The basic idea is to create a grid of synthetic new observations using the `expa
 predict the label of each, and visualize the predictions with a colored scatter having a very high transparency
 (low `alpha` value) and large point radius. See if you can figure out what each line is doing!

-> *Understanding this code is not required for the remainder of the textbook. It is included
-> for those readers who would like to use similar visualizations in their own data analyses.*
+> **Note:** Understanding this code is not required for the remainder of the
+> textbook. It is included for those readers who would like to use similar
+> visualizations in their own data analyses.

 ```{r 05-workflow-plot-show, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
 # create the grid of area/smoothness vals, and arrange in a data frame
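For context, the hunk above sits just before the textbook's decision-boundary visualization. A minimal sketch of the idea described there (a grid of synthetic observations, a predicted label for each, and a nearly transparent, large-radius colored scatter), assuming a fitted tidymodels classification workflow `knn_fit` and a training data frame `cancer_train` with columns `Area`, `Smoothness`, and `Class`; the object and column names are illustrative, not the textbook's exact code:

```r
library(tidymodels)

# create a fine grid of synthetic Area/Smoothness values covering the training data
are_grid <- seq(min(cancer_train$Area), max(cancer_train$Area), length.out = 100)
smo_grid <- seq(min(cancer_train$Smoothness), max(cancer_train$Smoothness), length.out = 100)
asgrid <- as_tibble(expand.grid(Area = are_grid, Smoothness = smo_grid))

# predict the class label of every synthetic grid point with the fitted workflow
knn_pred_grid <- predict(knn_fit, asgrid)
prediction_table <- bind_cols(knn_pred_grid, asgrid) %>%
  rename(Class = .pred_class)

# overlay the training points on a large-radius, nearly transparent grid of predictions
ggplot() +
  geom_point(data = cancer_train,
             aes(x = Area, y = Smoothness, color = Class),
             alpha = 0.75) +
  geom_point(data = prediction_table,
             aes(x = Area, y = Smoothness, color = Class),
             alpha = 0.02, size = 5) +
  labs(color = "Diagnosis")
```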

classification2.Rmd

Lines changed: 7 additions & 7 deletions
@@ -47,7 +47,7 @@ labels for the observations in the **test set**, then we have some
 confidence that our classifier might also accurately predict the class
 labels for new observations without known class labels.

-> **Note:** if there were a golden rule of machine learning, \index{golden rule of machine learning} it might be this:
+> **Note:** If there were a golden rule of machine learning, \index{golden rule of machine learning} it might be this:
 > *you cannot use the test data to build the model!* If you do, the model gets to
 > "see" the test data in advance, making it look more accurate than it really
 > is. Imagine how bad it would be to overestimate your classifier's accuracy
@@ -486,7 +486,7 @@ use one to train the model, and then use the other to evaluate it.
 In this section, we will cover the details of this procedure, as well as
 how to use it to help you pick a good parameter value for your classifier.

-> **Remember:** *don't touch the test set during the tuning process. Tuning is a part of model training!*
+**And remember:** don't touch the test set during the tuning process. Tuning is a part of model training!

 ### Cross-validation

@@ -946,9 +946,9 @@ the $K$-NN here.

 ## Predictor variable selection

-> *This section is not required reading for the remainder of the textbook. It is included for those readers
-interested in learning how irrelevant variables can influence the performance of a classifier, and how to
-pick a subset of useful variables to include as predictors.*
+> **Note:** This section is not required reading for the remainder of the textbook. It is included for those readers
+> interested in learning how irrelevant variables can influence the performance of a classifier, and how to
+> pick a subset of useful variables to include as predictors.

 Another potentially important part of tuning your classifier is to choose which
 variables from your data will be treated as predictor variables. Technically, you can choose
@@ -1174,7 +1174,7 @@ models that best subset selection requires you to train! For example, while best
 training over 1000 candidate models with $m=10$ predictors, forward selection requires training only 55 candidate models.
 Therefore we will continue the rest of this section using forward selection.

-> One word of caution before we move on. Every additional model that you train
+> **Note:** One word of caution before we move on. Every additional model that you train
 > increases the likelihood that you will get unlucky and stumble
 > on a model that has a high cross-validation accuracy estimate, but a low true
 > accuracy on the test data and other future observations.
@@ -1329,7 +1329,7 @@ predictors from the model! It is always worth remembering, however, that what cr
 is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide
 where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.

-> Remember: since the choice of which variables to include as predictors is
+> **Note:** Since the choice of which variables to include as predictors is
 > part of tuning your classifier, you *cannot use your test data* for this
 > process!

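The model counts quoted in the forward-selection hunk follow from standard counting: best subset selection fits one model per nonempty subset of the $m = 10$ predictors, while forward selection fits $m$ candidate models in the first round, $m - 1$ in the second, and so on down to 1:

$$
\underbrace{2^{10} - 1 = 1023}_{\text{best subset}}
\qquad
\underbrace{10 + 9 + \cdots + 1 = \tfrac{10 \times 11}{2} = 55}_{\text{forward selection}}
$$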

clustering.Rmd

Lines changed: 1 addition & 1 deletion
@@ -489,7 +489,7 @@ plot_grid(plotlist = iter_plot_list, ncol = 2,
 Note that at this point, we can terminate the algorithm since none of the assignments changed
 in the fourth iteration; both the centers and labels will remain the same from this point onward.

-> Is K-means *guaranteed* to stop at some point, or could it iterate forever? As it turns out,
+> **Note:** Is K-means *guaranteed* to stop at some point, or could it iterate forever? As it turns out,
 > thankfully, the answer is that K-means \index{K-means!termination} is guaranteed to stop after *some* number of iterations. For the interested reader, the
 > logic for this has three steps: (1) both the label update and the center update decrease total WSSD in each iteration,
 > (2) the total WSSD is always greater than or equal to 0, and (3) there are only a finite number of possible
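For reference, the quantity that termination argument tracks is the total within-cluster sum of squared distances; a standard form (consistent with the chapter's WSSD, though not quoted in this diff) is

$$
\text{total WSSD} = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert \mathbf{x}_i - \boldsymbol{\mu}_k \rVert^2,
$$

where $C_k$ is the set of points assigned to cluster $k$ and $\boldsymbol{\mu}_k$ is that cluster's center.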

inference.Rmd

Lines changed: 1 addition & 1 deletion
@@ -739,7 +739,7 @@ called **the bootstrap**. Note that by taking many samples from our single, obs
 sample, we do not obtain the true sampling distribution, but rather an
 approximation that we call **the bootstrap distribution**. \index{bootstrap!distribution}

-> **Note:** we must sample *with* replacement when using the bootstrap.
+> **Note:** We must sample *with* replacement when using the bootstrap.
 > Otherwise, if we had a sample of size $n$, and obtained a sample from it of
 > size $n$ *without* replacement, it would just return our original sample!

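A minimal sketch of the resampling step this note describes, assuming the `infer` package and a data frame `one_sample` with a numeric column `age` (illustrative names, not the chapter's exact objects):

```r
library(tidyverse)
library(infer)

# draw 1000 bootstrap samples, each the same size as the original sample;
# replace = TRUE is the crucial argument: sampling n out of n *without*
# replacement would just reproduce the original sample every time
boot_samples <- one_sample %>%
  rep_sample_n(size = nrow(one_sample), replace = TRUE, reps = 1000)

# the bootstrap distribution of the sample mean
boot_means <- boot_samples %>%
  group_by(replicate) %>%
  summarize(mean_age = mean(age))
```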

intro.Rmd

Lines changed: 5 additions & 5 deletions
@@ -65,8 +65,8 @@ then we might ask the following question, which we wish to answer using our data
 *Which ten Aboriginal languages were most often reported in 2016 as mother
 tongues in Canada, and how many people speak each of them?*

-> **A note about the *data* in data science!**
-> Data science\index{data science!good practices} cannot be done without a deep understanding of the data and
+> **Note:** Data science\index{data science!good practices} cannot be done without
+> a deep understanding of the data and
 > problem domain. In this book, we have simplified the data sets used in our
 > examples to concentrate on methods and fundamental concepts. But in real
 > life, you cannot and should not do data science without a domain expert.
@@ -224,7 +224,7 @@ and visualize data.
 library(tidyverse)
 ```

-> **In case you want to know more (optional):** Notice that we got some extra
+> **Note:** You may have noticed that we got some extra
 > output from R saying `Attaching packages` and `Conflicts` below our code
 > line. These are examples of *messages* in R, which give the user more
 > information that might be handy to know. The `Attaching packages` message is
@@ -263,7 +263,7 @@ knitr::include_graphics("img/read_csv_function.jpeg")
 read_csv("data/can_lang.csv")
 ```

-> **In case you want to know more (optional):** There is another function
+> **Note:** There is another function
 > that also loads csv files named `read.csv`. We will *always* use
 > `read_csv` in this book, as it is designed to play nicely with all of the
 > other `tidyverse` functions, which we will use extensively. Be
@@ -523,7 +523,7 @@ ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
 geom_bar(stat = "identity")
 ```

-> **In case you have used R before and are curious:** The vast majority of the
+> **Note:** The vast majority of the
 > time, a single expression in R must be contained in a single line of code.
 > However, there *are* a small number of situations in which you can have a
 > single R expression span multiple lines. Above is one such case: here, R knows that a line cannot
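To make the `read_csv` versus `read.csv` distinction above concrete, a small comparison (assuming the `data/can_lang.csv` file used throughout the chapter):

```r
library(tidyverse)

# readr's read_csv returns a tibble, which prints compactly and plays nicely
# with the rest of the tidyverse
tidy_lang <- read_csv("data/can_lang.csv")
class(tidy_lang)   # includes "tbl_df" (a tibble)

# base R's read.csv returns a plain data.frame; it works, but its printing
# and defaults differ from the tidyverse conventions used in the book
base_lang <- read.csv("data/can_lang.csv")
class(base_lang)   # "data.frame"
```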

reading.Rmd

Lines changed: 3 additions & 3 deletions
@@ -184,7 +184,7 @@ relative path to the file.
 canlang_data <- read_csv("data/can_lang.csv")
 ```

-> **Note:** it is also normal and expected that \index{warning} a message is
+> **Note:** It is also normal and expected that \index{warning} a message is
 > printed out after using
 > the `read_csv` and related functions. This message lets you know the data types
 > of each of the columns that R inferred while reading the data into R. In
@@ -770,9 +770,9 @@ write_csv(no_official_lang_data, "data/no_official_languages.csv")

 ## Obtaining data from the web

-> *This section is not required reading for the remainder of the textbook. It
+> **Note:** This section is not required reading for the remainder of the textbook. It
 > is included for those readers interested in learning a little bit more about
-> how to obtain different types of data from the web.*
+> how to obtain different types of data from the web.

 Data doesn't just magically appear on your computer; you need to get it from
 somewhere. Earlier in the chapter we showed you how to access data stored in a
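Related to the column-type message described in the first hunk above: the message disappears if you state the column types yourself rather than letting `read_csv` guess. A sketch using two columns that appear in `can_lang.csv` (`language` and `mother_tongue`); treat the exact specification as illustrative:

```r
library(tidyverse)

# supplying col_types means read_csv does not need to guess the column types,
# so the column specification message is not printed
canlang_data <- read_csv("data/can_lang.csv",
                         col_types = cols(language = col_character(),
                                          mother_tongue = col_double()))
```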

regression1.Rmd

Lines changed: 14 additions & 15 deletions
@@ -71,7 +71,7 @@ of our method on observations not seen during training. And finally, we can use
 choices of model parameters (e.g., K in a K-nearest neighbors model). The major difference
 is that we are now predicting numerical variables instead of categorical variables.

-> You can usually tell whether a \index{categorical variable}\index{numerical variable}
+> **Note:** You can usually tell whether a \index{categorical variable}\index{numerical variable}
 > variable is numerical or categorical---and therefore whether you
 > need to perform regression or classification---by taking two response variables X and Y from your
 > data, and asking the question, "is response variable X *more* than response variable Y?"
@@ -351,20 +351,19 @@ errors_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) +
 errors_plot
 ```

-> **RMSPE versus RMSE**
-When using many code packages (`tidymodels` included), the evaluation output
-we will get to assess the prediction quality of
-our KNN regression models is labeled "RMSE", or "root mean squared
-error". Why is this so, and why not just RMSPE? \index{RMSPE!comparison with RMSE}
-In statistics, we try to be very precise with our
-language to indicate whether we are calculating the prediction error on the
-training data (*in-sample* prediction) versus on the testing data
-(*out-of-sample* prediction). When predicting and evaluating prediction quality on the training data, we
-say RMSE. By contrast, when predicting and evaluating prediction quality
-on the testing or validation data, we say RMSPE.
-The equation for calculating RMSE and RMSPE is exactly the same; all that changes is whether the $y$s are
-training or testing data. But many people just use RMSE for both,
-and rely on context to denote which data the root mean squared error is being calculated on.
+> **Note:** When using many code packages (`tidymodels` included), the evaluation output
+> we will get to assess the prediction quality of
+> our KNN regression models is labeled "RMSE", or "root mean squared
+> error". Why is this so, and why not RMSPE? \index{RMSPE!comparison with RMSE}
+> In statistics, we try to be very precise with our
+> language to indicate whether we are calculating the prediction error on the
+> training data (*in-sample* prediction) versus on the testing data
+> (*out-of-sample* prediction). When predicting and evaluating prediction quality on the training data, we
+> say RMSE. By contrast, when predicting and evaluating prediction quality
+> on the testing or validation data, we say RMSPE.
+> The equation for calculating RMSE and RMSPE is exactly the same; all that changes is whether the $y$s are
+> training or testing data. But many people just use RMSE for both,
+> and rely on context to denote which data the root mean squared error is being calculated on.

 Now that we know how we can assess how well our model predicts a numerical
 value, let's use R to perform cross-validation and to choose the optimal $K$.
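For reference, the shared equation the note refers to is the usual root mean squared error, where $y_i$ are the observed response values, $\hat{y}_i$ the predictions, and the $n$ observations come from the training set (RMSE) or the test/validation set (RMSPE):

$$
\sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}
$$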

regression2.Rmd

Lines changed: 2 additions & 2 deletions
@@ -47,7 +47,7 @@ over their values for a prediction, in simple linear regression, we create a
 straight line of best fit through the training data and then
 "look up" the prediction using the line.

-> **Note:** although we did not cover it in earlier chapters, there
+> **Note:** Although we did not cover it in earlier chapters, there
 > is another popular method for classification called *logistic
 > regression* (it is used for classification even though the name, somewhat confusingly,
 > has the word "regression" in it). In logistic regression---similar to linear regression---you
@@ -834,7 +834,7 @@ a deep understanding of the problem---as well as the wrangling tools
 from previous chapters---to engineer useful new features that improve
 predictive performance.

-> **Note:** feature engineering
+> **Note:** Feature engineering
 > is *part of tuning your model*, and as such you must not use your test data
 > to evaluate the quality of the features you produce. You are free to use
 > cross-validation, though!
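A minimal sketch of the "line of best fit, then look up the prediction" workflow described in the first hunk above, using `tidymodels` with assumed object and column names (`sacramento_train`, `price`, `sqft`), not the chapter's exact code:

```r
library(tidymodels)

# specify and fit a simple linear regression of price on square footage
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_fit <- workflow() %>%
  add_formula(price ~ sqft) %>%
  add_model(lm_spec) %>%
  fit(data = sacramento_train)

# "look up" predictions on the fitted line for a few new house sizes
predict(lm_fit, tibble(sqft = c(1000, 2000, 3000)))
```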

setup.Rmd

Lines changed: 3 additions & 3 deletions
@@ -86,7 +86,7 @@ commands:
 bash path/to/Miniconda3-latest-Linux-x86_64.sh
 ```

-> Note: most often, this file is downloaded to the Downloads directory,
+> **Note:** Most often, this file is downloaded to the Downloads directory,
 and thus the command will look like this:
 > ```
 > bash Downloads/Miniconda3-latest-Linux-x86_64.sh
@@ -181,7 +181,7 @@ To do this, type the following in the terminal:
 sudo chown -R $(whoami):admin /usr/local/bin
 ```

-> *Note: You might be asked to enter your password during installation.*
+> **Note:** You might be asked to enter your password during installation.

 **All operating systems:**
 To install LaTeX open JupyterLab by typing `jupyter lab`
@@ -223,7 +223,7 @@ Thus, add the lines below to the bottom of your `.bashrc` file
 export PATH="$PATH:~/bin"
 ```

-> Note: If you used `nano` to open your `.bashrc` file,
+> **Note:** If you used `nano` to open your `.bashrc` file,
 follow the keyboard shortcuts at the bottom of the nano text editor
 to save and close the file.


version-control.Rmd

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ In the
 we list many of the common version control systems
 and repository hosting services in use today.

-> **Note:** technically you don't *have to* use a repository hosting service.
+> **Note:** Technically you don't *have to* use a repository hosting service.
 > You can, for example, version control a project
 > that is stored only in a folder on your computer -
 > never sharing it on a repository hosting service.
