
Commit fdcfc98

Merge branch 'dev' into index-refs-edits
2 parents cd0e28c + a2a078f

10 files changed: +49 additions, -49 deletions

classification1.Rmd

Lines changed: 1 addition & 1 deletion
@@ -147,7 +147,7 @@ Traditionally these procedures were quite invasive; modern methods such as fine
 needle aspiration, used to collect the present data set, extract only a small
 amount of tissue and are less invasive. Based on a digital image of each breast
 tissue sample collected for this data set, ten different variables were measured
-for each cell nucleus in the image (items 3-12 of the list of variables below), and then the mean
+for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean
 for each variable across the nuclei was recorded. As part of the
 data preparation, these values have been *standardized (centered and scaled)*; we will discuss what this
 means and why we do it later in this chapter. Each image additionally was given

classification2.Rmd

Lines changed: 3 additions & 3 deletions
@@ -1176,9 +1176,9 @@ Best subset selection is applicable to any classification method ($K$-NN or othe
 However, it becomes very slow when you have even a moderate
 number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets
 grows very quickly with the number of predictors, and you have to train the model (itself
-a slow process!) for each one. For example, if we have $2$ predictors---let's call
-them A and B---then we have 3 variable sets to try: A alone, B alone, and finally A
-and B together. If we have $3$ predictors---A, B, and C---then we have 7
+a slow process!) for each one. For example, if we have $2$ predictors—let's call
+them A and B—then we have 3 variable sets to try: A alone, B alone, and finally A
+and B together. If we have $3$ predictors—A, B, and C—then we have 7
 to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models
 we have to train for $m$ predictors is $2^m-1$; in other words, when we
 get to $10$ predictors we have over *one thousand* models to train, and
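
Aside: the counting argument in this hunk is easy to check directly. A minimal base R sketch (not part of the commit) that enumerates the non-empty predictor subsets:

```r
predictors <- c("A", "B", "C")
m <- length(predictors)

# All non-empty subsets, collected one subset size at a time
subsets <- unlist(
  lapply(seq_len(m), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)  # 7 candidate models, i.e., 2^3 - 1
2^10 - 1         # over one thousand once we reach 10 predictors
```
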

clustering.Rmd

Lines changed: 8 additions & 8 deletions
@@ -107,7 +107,7 @@ collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail
 the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/), and includes
 measurements for adult penguins found near there [@palmerpenguins]. We have
 modified the data set for use in this chapter. Here we will focus on using two
-variables---penguin bill and flipper length, both in millimeters---to determine whether
+variables—penguin bill and flipper length, both in millimeters—to determine whether
 there are distinct types of penguins in our data.
 Understanding this might help us with species discovery and classification in a data-driven
 way.
@@ -332,7 +332,7 @@ base <- base +
 base
 ```
 
-The larger the value of $S^2$, the more spread-out the cluster is, since large $S^2$ means that points are far from the cluster center.
+The larger the value of $S^2$, the more spread out the cluster is, since large $S^2$ means that points are far from the cluster center.
 Note, however, that "large" is relative to *both* the scale of the variables for clustering *and* the number of points in the cluster. A cluster where points are very close to the center might still have a large $S^2$ if there are many data points in the cluster.
 
 After we have calculated the WSSD for all the clusters,
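
Aside: a minimal base R sketch (not part of the commit) of the $S^2$ quantity this hunk edits, the within-cluster sum of squared distances (WSSD); the toy cluster values below are made up:

```r
# A made-up cluster of standardized measurements
cluster_data <- data.frame(
  flipper_length_standardized = c(-0.5, 0.1, 0.4),
  bill_length_standardized    = c(0.2, -0.3, 0.1)
)

center <- colMeans(cluster_data)                         # cluster center: the mean of each variable
deviations <- sweep(as.matrix(cluster_data), 2, center)  # each point minus the center
S2 <- sum(deviations^2)                                  # WSSD: grows with spread and with the number of points
S2
```
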
@@ -464,7 +464,7 @@ for (i in 1:4) {
 aes(y = bill_length_standardized,
 x = flipper_length_standardized,
 fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
@@ -498,7 +498,7 @@ for (i in 1:4) {
 geom_point(data = centers,
 aes(y = bill_length_standardized,
 x = flipper_length_standardized, fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
@@ -637,7 +637,7 @@ for (i in 1:5) {
 geom_point(data = centers, aes(y = bill_length_standardized,
 x = flipper_length_standardized,
 fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
@@ -670,7 +670,7 @@ for (i in 1:5) {
 geom_point(data = centers, aes(y = bill_length_standardized,
 x = flipper_length_standardized,
 fill = label),
-size = 3,
+size = 4,
 shape = 21,
 stroke = 1,
 color = "black",
@@ -811,7 +811,7 @@ levels(clusters$k) <- clusters_levels
 
 p1 <- ggplot(assignments, aes(flipper_length_standardized,
 bill_length_standardized)) +
-geom_point(aes(color = .cluster, size = 1)) +
+geom_point(aes(color = .cluster, size = I(2))) +
 facet_wrap(~k) + scale_color_manual(values = cbbPalette) +
 labs(x = "Flipper Length (standardized)",
 y = "Bill Length (standardized)",
@@ -820,7 +820,7 @@ p1 <- ggplot(assignments, aes(flipper_length_standardized,
 geom_point(data = clusters,
 aes(fill = cluster),
 color = "black",
-size = 5,
+size = 4,
 shape = 21,
 stroke = 1) +
 scale_fill_manual(values = cbbPalette)

inference.Rmd

Lines changed: 4 additions & 4 deletions
@@ -750,9 +750,9 @@ For a sample of size $n$, you would do the following:
 1. Randomly select an observation from the original sample, which was drawn from the population.
 2. Record the observation's value.
 3. Replace that observation.
-4. Repeat steps 1 - 3 (sampling *with* replacement) until you have $n$ observations, which form a bootstrap sample.
+4. Repeat steps 1&ndash;3 (sampling *with* replacement) until you have $n$ observations, which form a bootstrap sample.
 5. Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the $n$ observations in your bootstrap sample.
-6. Repeat steps (1) - (5) many times to create a distribution of point estimates (the bootstrap distribution).
+6. Repeat steps 1&ndash;5 many times to create a distribution of point estimates (the bootstrap distribution).
 7. Calculate the plausible range of values around our observed point estimate.
 
 ```{r 11-intro-bootstrap-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of the bootstrap process.", fig.retina = 2, out.width="100%"}
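
Aside: a base R sketch (not part of the commit) of the bootstrap steps in this hunk; the chapter itself uses `rep_sample_n` from `infer`, and `sample_data` below is a made-up stand-in for the one observed sample:

```r
set.seed(1234)
sample_data <- rnorm(40, mean = 17, sd = 8)  # stand-in for the observed sample
n <- length(sample_data)

# Steps 1--4: draw n observations with replacement; step 5: compute the estimate;
# step 6: repeat many times to build the bootstrap distribution
boot_means <- replicate(1000, mean(sample(sample_data, size = n, replace = TRUE)))

# Step 7: a plausible range around the observed estimate (95% percentile interval)
quantile(boot_means, probs = c(0.025, 0.975))
```
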
@@ -789,7 +789,7 @@ mean of the sample is \$`r round(estimates$sample_mean, 2)`.
 Remember, in practice, we usually only have this one sample from the population. So
 this sample and estimate are the only data we can work with.
 
-We now perform steps (1) - (5) listed above to generate a single bootstrap
+We now perform steps 1&ndash;5 listed above to generate a single bootstrap
 sample in R and calculate a point estimate from that bootstrap sample. We will
 use the `rep_sample_n` function as we did when we were
 creating our sampling distribution. But critically, note that we now
@@ -1173,4 +1173,4 @@ found in Chapter \@ref(move-to-your-own-machine).
 ## Additional resources
 
 - Chapters 7 to 10 of [*Modern Dive*](https://moderndive.com/) provide a great next step in learning about inference. In particular, Chapters 7 and 8 cover sampling and bootstrapping using `tidyverse` and `infer` in a slightly more in-depth manner than the present chapter. Chapters 9 and 10 take the next step beyond the scope of this chapter and begin to provide some of the initial mathematical underpinnings of inference and more advanced applications of the concept of inference in testing hypotheses and performing regression. This material offers a great starting point for getting more into the technical side of statistics.
-- Chapters 4 to 7 of [*OpenIntro Statistics - Fourth Edition*](https://www.openintro.org/) provide a good next step after *Modern Dive*. Although it is still certainly an introductory text, things get a bit more mathematical here. Depending on your background, you may actually want to start going through Chapters 1 to 3 first, where you will learn some fundamental concepts in probability theory. Although it may seem like a diversion, probability theory is *the language of statistics*; if you have a solid grasp of probability, more advanced statistics will come naturally to you!
+- Chapters 4 to 7 of [*OpenIntro Statistics*](https://www.openintro.org/) provide a good next step after *Modern Dive*. Although it is still certainly an introductory text, things get a bit more mathematical here. Depending on your background, you may actually want to start going through Chapters 1 to 3 first, where you will learn some fundamental concepts in probability theory. Although it may seem like a diversion, probability theory is *the language of statistics*; if you have a solid grasp of probability, more advanced statistics will come naturally to you!
jupyter.Rmd

Lines changed: 4 additions & 4 deletions
@@ -245,8 +245,8 @@ referenced in another distinct code cell (Figure \@ref(fig:out-of-order-1)).
 Together, this means that you could then write a code cell further above in the
 notebook that references `y` and execute it without error in the current session
 (Figure \@ref(fig:out-of-order-2)). This could also be done successfully in
-future sessions if, and only if, you run the cells in the same non-conventional
-order. However, it is difficult to remember this non-conventional order, and it
+future sessions if, and only if, you run the cells in the same unconventional
+order. However, it is difficult to remember this unconventional order, and it
 is not the order that others would expect your code to be executed in. Thus, in
 the future, this would lead
 to errors when the notebook is run in the conventional
@@ -287,7 +287,7 @@ is an issue. Knowing this sooner rather than later will allow you to
 fix the issue and ensure your notebook can be run linearly from start to finish.
 
 We recommend as a best practice to run the entire notebook in a fresh R session
-at least 2-3 times within any period of work. Note that,
+at least 2&ndash;3 times within any period of work. Note that,
 critically, you *must do this in a fresh R session* by restarting your kernel.
 We recommend using either the **Kernel** >>
 **Restart Kernel and Run All Cells...** command from the menu or the `r fa("fast-forward", height = "11px")`
@@ -328,7 +328,7 @@ their computer to run the analysis successfully.
 1. Write code so that it can be executed in a linear order.
 
 2. As you write code in a Jupyter notebook, run the notebook in a linear order
-and in its entirety often (2-3 times every work session) via the **Kernel** >>
+and in its entirety often (2&ndash;3 times every work session) via the **Kernel** >>
 **Restart Kernel and Run All Cells...** command from the Jupyter menu or the `r fa("fast-forward", height = "11px")`
 button in the toolbar.

reading.Rmd

Lines changed: 9 additions & 9 deletions
@@ -810,10 +810,10 @@ directly from what a website displays is called \index{web scraping}
 information manually is a painstaking and error-prone process, especially when
 there is a lot of information to gather. So instead of asking your browser to
 translate the information that the web server provides into something you can
-see, you can collect that data programmatically---in the form of
+see, you can collect that data programmatically&mdash;in the form of
 **h**yper**t**ext **m**arkup **l**anguage
 (HTML) \index{hypertext markup language|see{HTML}}\index{cascading style sheet|see{CSS}}\index{CSS}\index{HTML}
-and **c**ascading **s**tyle **s**heet (CSS) code---and process it
+and **c**ascading **s**tyle **s**heet (CSS) code&mdash;and process it
 to extract useful information. HTML provides the
 basic structure of a site and tells the webpage how to display the content
 (e.g., titles, paragraphs, bullet lists etc.), whereas CSS helps style the
@@ -921,8 +921,8 @@ take a look at another line of the source snippet above:
 
 It's yet another price for an apartment listing, and the tags surrounding it
 have the `"result-price"` class. Wonderful! Now that we know what pattern we
-are looking for---a dollar amount between opening and closing tags that have the
-`"result-price"` class---we should be able to use code to pull out all of the
+are looking for&mdash;a dollar amount between opening and closing tags that have the
+`"result-price"` class&mdash;we should be able to use code to pull out all of the
 matching patterns from the source code to obtain our data. This sort of "pattern"
 is known as a *CSS selector* (where CSS stands for **c**ascading **s**tyle **s**heet).
 
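
Aside: a minimal sketch (not part of the commit) of the CSS selector idea in this hunk, assuming the `rvest` package; the two-listing page below is made up:

```r
library(rvest)

# A made-up page standing in for real scraped HTML
page <- minimal_html('
  <span class="result-price">$800</span>
  <span class="result-price">$1,265</span>
')

page |>
  html_nodes(".result-price") |>  # select every tag with the result-price class
  html_text()                     # drop the tags, keep the text: "$800" "$1,265"
```
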
@@ -1063,7 +1063,7 @@ population_nodes <- html_nodes(page, selectors)
 head(population_nodes)
 ```
 
-Next we extract the meaningful data---in other words, we get rid of the HTML code syntax and tags---from
+Next we extract the meaningful data&mdash;in other words, we get rid of the HTML code syntax and tags&mdash;from
 the nodes using the `html_text`
 function. In the case of the example
 node above, `html_text` function returns `"London"`.
@@ -1090,7 +1090,7 @@ Rather than posting a data file at a URL for you to download, many websites thes
 provide an API \index{API} that must be accessed through a programming language like R. The benefit of this
 is that data owners have much more control over the data they provide to users. However, unlike
 web scraping, there is no consistent way to access an API across websites. Every website typically
-has its own API designed especially for its own use-case. Therefore we will just provide one example
+has its own API designed especially for its own use case. Therefore we will just provide one example
 of accessing data through an API in this book, with the hope that it gives you enough of a basic
 idea that you can learn how to use another API if needed.
 
@@ -1120,8 +1120,8 @@ knitr::include_graphics("img/tidyverse_twitter.png")
 When you access an API, you are initiating a transfer of data from a web server
 to your computer. Web servers are expensive to run and do not have infinite resources.
 If you try to ask for *too much data* at once, you can use up a huge amount of the server's bandwidth.
-If you try to ask for data *too frequently*---e.g., if you
-make many requests to the server in quick succession---you can also bog the server down and make
+If you try to ask for data *too frequently*&mdash;e.g., if you
+make many requests to the server in quick succession&mdash;you can also bog the server down and make
 it unable to talk to anyone else. Most servers have mechanisms to revoke your access if you are not
 careful, but you should try to prevent issues from happening in the first place by being extra careful
 with how you write and run your code. You should also keep in mind that when a website owner
@@ -1195,7 +1195,7 @@ tidyverse_tweets
 If you look back up at the image of the Tidyverse Twitter page, you will
 recognize the text of the most recent few tweets in the above data frame. In
 other words, we have successfully created a small data set using the Twitter
-API---neat! This data is also quite different from what we obtained from web scraping;
+API&mdash;neat! This data is also quite different from what we obtained from web scraping;
 it is already well-organized into a `tidyverse` data frame (although not *every* API
 will provide data in such a nice format).
 From this point onward, the `tidyverse_tweets` data frame is stored on your
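
Aside: a sketch (not part of the commit) of the kind of call that produces `tidyverse_tweets`, assuming the `rtweet` package and an already-configured Twitter developer token:

```r
library(rtweet)

# Hypothetical call: pull the 50 most recent tweets from the @tidyverse
# account into a data frame (requires valid API credentials)
tidyverse_tweets <- get_timeline("tidyverse", n = 50)
```
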

regression1.Rmd

Lines changed: 4 additions & 4 deletions
@@ -39,7 +39,7 @@ use `tidymodels` workflows, we will use a K-nearest neighbors (KNN)
 approach to make predictions, and we will use cross-validation to choose K.
 Because of how similar these procedures are, make sure to read Chapters
 \@ref(classification) and \@ref(classification2) before reading
-this one---we will move a little bit faster here with the
+this one&mdash;we will move a little bit faster here with the
 concepts that have already been covered.
 This chapter will primarily focus on the case where there is a single predictor,
 but the end of the chapter shows how to perform
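
Aside: a sketch (not part of the commit) of the procedure this hunk refers to, KNN regression with K chosen by cross-validation in `tidymodels`; the data, recipe, and grid below are made up, and the `kknn` engine is assumed to be installed:

```r
library(tidymodels)

set.seed(2023)
# Made-up data: one numerical predictor, one numerical response
dat <- tibble(x = runif(100, 0, 10), y = 2 * x + rnorm(100))

# KNN regression specification with the number of neighbors left to tune
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Standardize the predictor, then evaluate a small grid of K by 5-fold CV
knn_results <- workflow() |>
  add_recipe(recipe(y ~ x, data = dat) |>
               step_center(all_predictors()) |>
               step_scale(all_predictors())) |>
  add_model(knn_spec) |>
  tune_grid(resamples = vfold_cv(dat, v = 5),
            grid = tibble(neighbors = c(1, 5, 10, 25)))

collect_metrics(knn_results)  # RMSE for each K; choose the K with the lowest RMSE
```
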
@@ -69,7 +69,7 @@ The variable that you want to predict is often called the *response variable*. \
 For example, we could try to use the number of hours a person spends on
 exercise each week to predict their race time in the annual Boston marathon. As
 another example, we could try to use the size of a house to
-predict its sale price. Both of these response variables---race time and sale price---are
+predict its sale price. Both of these response variables&mdash;race time and sale price&mdash;are
 numerical, and so predicting them given past data is considered a regression problem.
 
 Just like in the \index{classification!comparison to regression}
@@ -91,8 +91,8 @@ choices of model parameters (e.g., K in a K-nearest neighbors model). The major
 is that we are now predicting numerical variables instead of categorical variables.
 
 > **Note:** You can usually tell whether a \index{categorical variable}\index{numerical variable}
-> variable is numerical or categorical---and therefore whether you
-> need to perform regression or classification---by taking two response variables X and Y from your
+> variable is numerical or categorical&mdash;and therefore whether you
+> need to perform regression or classification&mdash;by taking two response variables X and Y from your
 > data, and asking the question, "is response variable X *more* than response variable Y?"
 > If the variable is categorical, the question will make no sense (Is blue more than red?
 > Is benign more than malignant?). If the variable is numerical, it will make sense
