
Commit 20b09b4

Merge pull request #332 from UBC-DSCI/noteboxes
Note box consistency
2 parents (f212085 + 7a4c9b4); commit 20b09b4

12 files changed: +50 -58 lines changed

classification1.Rmd

Lines changed: 3 additions & 2 deletions
@@ -1317,8 +1317,9 @@ The basic idea is to create a grid of synthetic new observations using the `expa
 predict the label of each, and visualize the predictions with a colored scatter having a very high transparency
 (low `alpha` value) and large point radius. See if you can figure out what each line is doing!

-> *Understanding this code is not required for the remainder of the textbook. It is included
-> for those readers who would like to use similar visualizations in their own data analyses.*
+> **Note:** Understanding this code is not required for the remainder of the
+> textbook. It is included for those readers who would like to use similar
+> visualizations in their own data analyses.

 ```{r 05-workflow-plot-show, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
 # create the grid of area/smoothness vals, and arrange in a data frame
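For context, the hunk above sits just before the textbook's decision-boundary visualization. A minimal sketch of the idea described there (a grid of synthetic observations, a predicted label for each, and a nearly transparent, large-radius colored scatter), assuming a fitted tidymodels classification workflow `knn_fit` and a training data frame `cancer_train` with columns `Area`, `Smoothness`, and `Class`; the object and column names are illustrative, not the textbook's exact code:

```r
library(tidymodels)

# create a fine grid of synthetic Area/Smoothness values covering the training data
are_grid <- seq(min(cancer_train$Area), max(cancer_train$Area), length.out = 100)
smo_grid <- seq(min(cancer_train$Smoothness), max(cancer_train$Smoothness), length.out = 100)
asgrid <- as_tibble(expand.grid(Area = are_grid, Smoothness = smo_grid))

# predict the class label of every synthetic grid point with the fitted workflow
knn_pred_grid <- predict(knn_fit, asgrid)
prediction_table <- bind_cols(knn_pred_grid, asgrid) %>%
  rename(Class = .pred_class)

# overlay the training points on a large-radius, nearly transparent grid of predictions
ggplot() +
  geom_point(data = cancer_train,
             aes(x = Area, y = Smoothness, color = Class),
             alpha = 0.75) +
  geom_point(data = prediction_table,
             aes(x = Area, y = Smoothness, color = Class),
             alpha = 0.02, size = 5) +
  labs(color = "Diagnosis")
```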

classification2.Rmd

Lines changed: 7 additions & 7 deletions
@@ -47,7 +47,7 @@ labels for the observations in the **test set**, then we have some
 confidence that our classifier might also accurately predict the class
 labels for new observations without known class labels.

-> **Note:** if there were a golden rule of machine learning, \index{golden rule of machine learning} it might be this:
+> **Note:** If there were a golden rule of machine learning, \index{golden rule of machine learning} it might be this:
 > *you cannot use the test data to build the model!* If you do, the model gets to
 > "see" the test data in advance, making it look more accurate than it really
 > is. Imagine how bad it would be to overestimate your classifier's accuracy
@@ -486,7 +486,7 @@ use one to train the model, and then use the other to evaluate it.
 In this section, we will cover the details of this procedure, as well as
 how to use it to help you pick a good parameter value for your classifier.

-> **Remember:** *don't touch the test set during the tuning process. Tuning is a part of model training!*
+**And remember:** don't touch the test set during the tuning process. Tuning is a part of model training!

 ### Cross-validation

@@ -946,9 +946,9 @@ the $K$-NN here.

 ## Predictor variable selection

-> *This section is not required reading for the remainder of the textbook. It is included for those readers
-interested in learning how irrelevant variables can influence the performance of a classifier, and how to
-pick a subset of useful variables to include as predictors.*
+> **Note:** This section is not required reading for the remainder of the textbook. It is included for those readers
+> interested in learning how irrelevant variables can influence the performance of a classifier, and how to
+> pick a subset of useful variables to include as predictors.

 Another potentially important part of tuning your classifier is to choose which
 variables from your data will be treated as predictor variables. Technically, you can choose
@@ -1174,7 +1174,7 @@ models that best subset selection requires you to train! For example, while best
 training over 1000 candidate models with $m=10$ predictors, forward selection requires training only 55 candidate models.
 Therefore we will continue the rest of this section using forward selection.

-> One word of caution before we move on. Every additional model that you train
+> **Note:** One word of caution before we move on. Every additional model that you train
 > increases the likelihood that you will get unlucky and stumble
 > on a model that has a high cross-validation accuracy estimate, but a low true
 > accuracy on the test data and other future observations.
@@ -1329,7 +1329,7 @@ predictors from the model! It is always worth remembering, however, that what cr
 is an *estimate* of the true accuracy; you have to use your judgement when looking at this plot to decide
 where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy.

-> Remember: since the choice of which variables to include as predictors is
+> **Note:** Since the choice of which variables to include as predictors is
 > part of tuning your classifier, you *cannot use your test data* for this
 > process!

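The model counts quoted in the forward-selection hunk follow from standard counting: best subset selection fits one model per nonempty subset of the $m = 10$ predictors, while forward selection fits $m$ candidate models in the first round, $m - 1$ in the second, and so on down to 1:

$$
\underbrace{2^{10} - 1 = 1023}_{\text{best subset}}
\qquad
\underbrace{10 + 9 + \cdots + 1 = \tfrac{10 \times 11}{2} = 55}_{\text{forward selection}}
$$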

clustering.Rmd

Lines changed: 1 addition & 1 deletion
@@ -489,7 +489,7 @@ plot_grid(plotlist = iter_plot_list, ncol = 2,
 Note that at this point, we can terminate the algorithm since none of the assignments changed
 in the fourth iteration; both the centers and labels will remain the same from this point onward.

-> Is K-means *guaranteed* to stop at some point, or could it iterate forever? As it turns out,
+> **Note:** Is K-means *guaranteed* to stop at some point, or could it iterate forever? As it turns out,
 > thankfully, the answer is that K-means \index{K-means!termination} is guaranteed to stop after *some* number of iterations. For the interested reader, the
 > logic for this has three steps: (1) both the label update and the center update decrease total WSSD in each iteration,
 > (2) the total WSSD is always greater than or equal to 0, and (3) there are only a finite number of possible
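For reference, the quantity that termination argument tracks is the total within-cluster sum of squared distances; a standard form (consistent with the chapter's WSSD, though not quoted in this diff) is

$$
\text{total WSSD} = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert \mathbf{x}_i - \boldsymbol{\mu}_k \rVert^2,
$$

where $C_k$ is the set of points assigned to cluster $k$ and $\boldsymbol{\mu}_k$ is that cluster's center.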

inference.Rmd

Lines changed: 1 addition & 1 deletion
@@ -739,7 +739,7 @@ called **the bootstrap**. Note that by taking many samples from our single, obs
 sample, we do not obtain the true sampling distribution, but rather an
 approximation that we call **the bootstrap distribution**. \index{bootstrap!distribution}

-> **Note:** we must sample *with* replacement when using the bootstrap.
+> **Note:** We must sample *with* replacement when using the bootstrap.
 > Otherwise, if we had a sample of size $n$, and obtained a sample from it of
 > size $n$ *without* replacement, it would just return our original sample!

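A minimal sketch of the resampling step this note describes, assuming the `infer` package and a data frame `one_sample` with a numeric column `age` (illustrative names, not the chapter's exact objects):

```r
library(tidyverse)
library(infer)

# draw 1000 bootstrap samples, each the same size as the original sample;
# replace = TRUE is the crucial argument: sampling n out of n *without*
# replacement would just reproduce the original sample every time
boot_samples <- one_sample %>%
  rep_sample_n(size = nrow(one_sample), replace = TRUE, reps = 1000)

# the bootstrap distribution of the sample mean
boot_means <- boot_samples %>%
  group_by(replicate) %>%
  summarize(mean_age = mean(age))
```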

intro.Rmd

Lines changed: 5 additions & 5 deletions
@@ -65,8 +65,8 @@ then we might ask the following question, which we wish to answer using our data
 *Which ten Aboriginal languages were most often reported in 2016 as mother
 tongues in Canada, and how many people speak each of them?*

-> **A note about the *data* in data science!**
-> Data science\index{data science!good practices} cannot be done without a deep understanding of the data and
+> **Note:** Data science\index{data science!good practices} cannot be done without
+> a deep understanding of the data and
 > problem domain. In this book, we have simplified the data sets used in our
 > examples to concentrate on methods and fundamental concepts. But in real
 > life, you cannot and should not do data science without a domain expert.
@@ -224,7 +224,7 @@ and visualize data.
 library(tidyverse)
 ```

-> **In case you want to know more (optional):** Notice that we got some extra
+> **Note:** You may have noticed that we got some extra
 > output from R saying `Attaching packages` and `Conflicts` below our code
 > line. These are examples of *messages* in R, which give the user more
 > information that might be handy to know. The `Attaching packages` message is
@@ -263,7 +263,7 @@ knitr::include_graphics("img/read_csv_function.jpeg")
 read_csv("data/can_lang.csv")
 ```

-> **In case you want to know more (optional):** There is another function
+> **Note:** There is another function
 > that also loads csv files named `read.csv`. We will *always* use
 > `read_csv` in this book, as it is designed to play nicely with all of the
 > other `tidyverse` functions, which we will use extensively. Be
@@ -523,7 +523,7 @@ ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
 geom_bar(stat = "identity")
 ```

-> **In case you have used R before and are curious:** The vast majority of the
+> **Note:** The vast majority of the
 > time, a single expression in R must be contained in a single line of code.
 > However, there *are* a small number of situations in which you can have a
 > single R expression span multiple lines. Above is one such case: here, R knows that a line cannot
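To make the `read_csv` versus `read.csv` distinction above concrete, a small comparison (assuming the `data/can_lang.csv` file used throughout the chapter):

```r
library(tidyverse)

# readr's read_csv returns a tibble, which prints compactly and plays nicely
# with the rest of the tidyverse
tidy_lang <- read_csv("data/can_lang.csv")
class(tidy_lang)   # includes "tbl_df" (a tibble)

# base R's read.csv returns a plain data.frame; it works, but its printing
# and defaults differ from the tidyverse conventions used in the book
base_lang <- read.csv("data/can_lang.csv")
class(base_lang)   # "data.frame"
```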

reading.Rmd

Lines changed: 3 additions & 3 deletions
@@ -184,7 +184,7 @@ relative path to the file.
 canlang_data <- read_csv("data/can_lang.csv")
 ```

-> **Note:** it is also normal and expected that \index{warning} a message is
+> **Note:** It is also normal and expected that \index{warning} a message is
 > printed out after using
 > the `read_csv` and related functions. This message lets you know the data types
 > of each of the columns that R inferred while reading the data into R. In
@@ -770,9 +770,9 @@ write_csv(no_official_lang_data, "data/no_official_languages.csv")

 ## Obtaining data from the web

-> *This section is not required reading for the remainder of the textbook. It
+> **Note:** This section is not required reading for the remainder of the textbook. It
 > is included for those readers interested in learning a little bit more about
-> how to obtain different types of data from the web.*
+> how to obtain different types of data from the web.

 Data doesn't just magically appear on your computer; you need to get it from
 somewhere. Earlier in the chapter we showed you how to access data stored in a
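Related to the column-type message described in the first hunk above: the message disappears if you state the column types yourself rather than letting `read_csv` guess. A sketch using two columns that appear in `can_lang.csv` (`language` and `mother_tongue`); treat the exact specification as illustrative:

```r
library(tidyverse)

# supplying col_types means read_csv does not need to guess the column types,
# so the column specification message is not printed
canlang_data <- read_csv("data/can_lang.csv",
                         col_types = cols(language = col_character(),
                                          mother_tongue = col_double()))
```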

regression1.Rmd

Lines changed: 14 additions & 15 deletions
@@ -71,7 +71,7 @@ of our method on observations not seen during training. And finally, we can use
 choices of model parameters (e.g., K in a K-nearest neighbors model). The major difference
 is that we are now predicting numerical variables instead of categorical variables.

-> You can usually tell whether a \index{categorical variable}\index{numerical variable}
+> **Note:** You can usually tell whether a \index{categorical variable}\index{numerical variable}
 > variable is numerical or categorical---and therefore whether you
 > need to perform regression or classification---by taking two response variables X and Y from your
 > data, and asking the question, "is response variable X *more* than response variable Y?"
@@ -351,20 +351,19 @@ errors_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) +
 errors_plot
 ```

-> **RMSPE versus RMSE**
-When using many code packages (`tidymodels` included), the evaluation output
-we will get to assess the prediction quality of
-our KNN regression models is labeled "RMSE", or "root mean squared
-error". Why is this so, and why not just RMSPE? \index{RMSPE!comparison with RMSE}
-In statistics, we try to be very precise with our
-language to indicate whether we are calculating the prediction error on the
-training data (*in-sample* prediction) versus on the testing data
-(*out-of-sample* prediction). When predicting and evaluating prediction quality on the training data, we
-say RMSE. By contrast, when predicting and evaluating prediction quality
-on the testing or validation data, we say RMSPE.
-The equation for calculating RMSE and RMSPE is exactly the same; all that changes is whether the $y$s are
-training or testing data. But many people just use RMSE for both,
-and rely on context to denote which data the root mean squared error is being calculated on.
+> **Note:** When using many code packages (`tidymodels` included), the evaluation output
+> we will get to assess the prediction quality of
+> our KNN regression models is labeled "RMSE", or "root mean squared
+> error". Why is this so, and why not RMSPE? \index{RMSPE!comparison with RMSE}
+> In statistics, we try to be very precise with our
+> language to indicate whether we are calculating the prediction error on the
+> training data (*in-sample* prediction) versus on the testing data
+> (*out-of-sample* prediction). When predicting and evaluating prediction quality on the training data, we
+> say RMSE. By contrast, when predicting and evaluating prediction quality
+> on the testing or validation data, we say RMSPE.
+> The equation for calculating RMSE and RMSPE is exactly the same; all that changes is whether the $y$s are
+> training or testing data. But many people just use RMSE for both,
+> and rely on context to denote which data the root mean squared error is being calculated on.

 Now that we know how we can assess how well our model predicts a numerical
 value, let's use R to perform cross-validation and to choose the optimal $K$.
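For reference, the shared equation the note refers to is the usual root mean squared error, where $y_i$ are the observed response values, $\hat{y}_i$ the predictions, and the $n$ observations come from the training set (RMSE) or the test/validation set (RMSPE):

$$
\sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}
$$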

regression2.Rmd

Lines changed: 2 additions & 2 deletions
@@ -47,7 +47,7 @@ over their values for a prediction, in simple linear regression, we create a
 straight line of best fit through the training data and then
 "look up" the prediction using the line.

-> **Note:** although we did not cover it in earlier chapters, there
+> **Note:** Although we did not cover it in earlier chapters, there
 > is another popular method for classification called *logistic
 > regression* (it is used for classification even though the name, somewhat confusingly,
 > has the word "regression" in it). In logistic regression---similar to linear regression---you
@@ -834,7 +834,7 @@ a deep understanding of the problem---as well as the wrangling tools
 from previous chapters---to engineer useful new features that improve
 predictive performance.

-> **Note:** feature engineering
+> **Note:** Feature engineering
 > is *part of tuning your model*, and as such you must not use your test data
 > to evaluate the quality of the features you produce. You are free to use
 > cross-validation, though!
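A minimal sketch of the "line of best fit, then look up the prediction" workflow described in the first hunk above, using `tidymodels` with assumed object and column names (`sacramento_train`, `price`, `sqft`), not the chapter's exact code:

```r
library(tidymodels)

# specify and fit a simple linear regression of price on square footage
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_fit <- workflow() %>%
  add_formula(price ~ sqft) %>%
  add_model(lm_spec) %>%
  fit(data = sacramento_train)

# "look up" predictions on the fitted line for a few new house sizes
predict(lm_fit, tibble(sqft = c(1000, 2000, 3000)))
```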

setup.Rmd

Lines changed: 3 additions & 3 deletions
@@ -86,7 +86,7 @@ commands:
 bash path/to/Miniconda3-latest-Linux-x86_64.sh
 ```

-> Note: most often, this file is downloaded to the Downloads directory,
+> **Note:** Most often, this file is downloaded to the Downloads directory,
 and thus the command will look like this:
 > ```
 > bash Downloads/Miniconda3-latest-Linux-x86_64.sh
@@ -181,7 +181,7 @@ To do this, type the following in the terminal:
 sudo chown -R $(whoami):admin /usr/local/bin
 ```

-> *Note: You might be asked to enter your password during installation.*
+> **Note:** You might be asked to enter your password during installation.

 **All operating systems:**
 To install LaTeX open JupyterLab by typing `jupyter lab`
@@ -223,7 +223,7 @@ Thus, add the lines below to the bottom of your `.bashrc` file
 export PATH="$PATH:~/bin"
 ```

-> Note: If you used `nano` to open your `.bashrc` file,
+> **Note:** If you used `nano` to open your `.bashrc` file,
 follow the keyboard shortcuts at the bottom of the nano text editor
 to save and close the file.


version-control.Rmd

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ In the
 we list many of the common version control systems
 and repository hosting services in use today.

-> **Note:** technically you don't *have to* use a repository hosting service.
+> **Note:** Technically you don't *have to* use a repository hosting service.
 > You can, for example, version control a project
 > that is stored only in a folder on your computer -
 > never sharing it on a repository hosting service.
