
Commit 57be6d1

Merge branch 'dev' into rohan-edits
2 parents: 599761e + 47d458a

11 files changed (+70, -71 lines changed)

classification1.Rmd

Lines changed: 1 addition & 1 deletion
@@ -147,7 +147,7 @@ Traditionally these procedures were quite invasive; modern methods such as fine
 needle aspiration, used to collect the present data set, extract only a small
 amount of tissue and are less invasive. Based on a digital image of each breast
 tissue sample collected for this data set, ten different variables were measured
-for each cell nucleus in the image (items 3-12 of the list of variables below), and then the mean
+for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean
 for each variable across the nuclei was recorded. As part of the
 data preparation, these values have been *standardized (centered and scaled)*; we will discuss what this
 means and why we do it later in this chapter. Each image additionally was given

classification2.Rmd

Lines changed: 7 additions & 7 deletions
@@ -643,7 +643,7 @@ knitr::include_graphics("img/cv.png")
 ```
 
 To perform 5-fold cross-validation in R with `tidymodels`, we use another
-function: `vfold_cv`. \index{tidymodels!vfold\_cv}\index{cross validation!vfold\_cv} This function splits our training data into `v` folds
+function: `vfold_cv`. \index{tidymodels!vfold\_cv}\index{cross-validation!vfold\_cv} This function splits our training data into `v` folds
 automatically. We set the `strata` argument to the categorical label variable
 (here, `Class`) to ensure that the training and validation subsets contain the
 right proportions of each category of observation.
@@ -653,7 +653,7 @@ cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
 cancer_vfold
 ```
 
-Then, when we create our data analysis workflow, we use the `fit_resamples` function \index{cross validation!fit\_resamples}\index{tidymodels!fit\_resamples}
+Then, when we create our data analysis workflow, we use the `fit_resamples` function \index{cross-validation!fit\_resamples}\index{tidymodels!fit\_resamples}
 instead of the `fit` function for training. This runs cross-validation on each
 train/validation split.
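
For context while reading this hunk: a minimal sketch of the cross-validation fit the changed sentence describes, assuming `cancer_recipe` and `knn_spec` were defined earlier in the chapter and `cancer_vfold` is the 5-fold split created with `vfold_cv` above (illustration only, not code from this commit).

```r
library(tidymodels)

# Fit the K-NN workflow once per fold instead of once on the whole training set.
knn_fit <- workflow() |>
  add_recipe(cancer_recipe) |>   # assumed preprocessing recipe from the chapter
  add_model(knn_spec) |>         # assumed K-NN model specification
  fit_resamples(resamples = cancer_vfold)
```
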
@@ -679,7 +679,7 @@ knn_fit <- workflow() |>
 knn_fit
 ```
 
-The `collect_metrics` \index{tidymodels!collect\_metrics}\index{cross validation!collect\_metrics} function is used to aggregate the *mean* and *standard error*
+The `collect_metrics` \index{tidymodels!collect\_metrics}\index{cross-validation!collect\_metrics} function is used to aggregate the *mean* and *standard error*
 of the classifier's validation accuracy across the folds. You will find results
 related to the accuracy in the row with `accuracy` listed under the `.metric` column.
 You should consider the mean (`mean`) to be the estimated accuracy, while the standard
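
A short illustrative follow-up, assuming `knn_fit` is the `fit_resamples` result sketched above: pulling out the accuracy row that the changed sentence refers to.

```r
library(tidymodels)

# Mean and standard error of validation accuracy across the five folds.
knn_fit |>
  collect_metrics() |>
  filter(.metric == "accuracy")
```
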
@@ -747,7 +747,7 @@ knn_spec <- nearest_neighbor(weight_func = "rectangular",
 set_mode("classification")
 ```
 
-Then instead of using `fit` or `fit_resamples`, we will use the `tune_grid` function \index{cross validation!tune\_grid}\index{tidymodels!tune\_grid}
+Then instead of using `fit` or `fit_resamples`, we will use the `tune_grid` function \index{cross-validation!tune\_grid}\index{tidymodels!tune\_grid}
 to fit the model for each value in a range of parameter values.
 In particular, we first create a data frame with a `neighbors`
 variable that contains the sequence of values of $K$ to try; below we create the `k_vals`
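
A hedged sketch of the tuning step this hunk describes; `cancer_recipe` and `cancer_vfold` are assumed from earlier in the chapter, and the grid values are illustrative only.

```r
library(tidymodels)

# Candidate numbers of neighbors to try.
k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

# Mark K as tunable in the model specification.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validate the model once per value in the grid, then aggregate the metrics.
knn_results <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_vals) |>
  collect_metrics()
```
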
@@ -1176,9 +1176,9 @@ Best subset selection is applicable to any classification method ($K$-NN or othe
 However, it becomes very slow when you have even a moderate
 number of predictors to choose from (say, around 10). This is because the number of possible predictor subsets
 grows very quickly with the number of predictors, and you have to train the model (itself
-a slow process!) for each one. For example, if we have $2$ predictors---let's call
-them A and B---then we have 3 variable sets to try: A alone, B alone, and finally A
-and B together. If we have $3$ predictors---A, B, and C---then we have 7
+a slow process!) for each one. For example, if we have $2$ predictors&mdash;let's call
+them A and B&mdash;then we have 3 variable sets to try: A alone, B alone, and finally A
+and B together. If we have $3$ predictors&mdash;A, B, and C&mdash;then we have 7
 to try: A, B, C, AB, BC, AC, and ABC. In general, the number of models
 we have to train for $m$ predictors is $2^m-1$; in other words, when we
 get to $10$ predictors we have over *one thousand* models to train, and
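
The "over one thousand" claim follows directly from the $2^m-1$ formula; a one-line check in R:

```r
2^10 - 1   # number of candidate predictor subsets for 10 predictors
#> [1] 1023
```
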

clustering.Rmd

Lines changed: 5 additions & 5 deletions
@@ -107,7 +107,7 @@ collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail
 the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/), and includes
 measurements for adult penguins found near there [@palmerpenguins]. We have
 modified the data set for use in this chapter. Here we will focus on using two
-variables---penguin bill and flipper length, both in millimeters---to determine whether
+variables&mdash;penguin bill and flipper length, both in millimeters&mdash;to determine whether
 there are distinct types of penguins in our data.
 Understanding this might help us with species discovery and classification in a data-driven
 way.
@@ -332,7 +332,7 @@ base <- base +
 base
 ```
 
-The larger the value of $S^2$, the more spread-out the cluster is, since large $S^2$ means that points are far from the cluster center.
+The larger the value of $S^2$, the more spread out the cluster is, since large $S^2$ means that points are far from the cluster center.
 Note, however, that "large" is relative to *both* the scale of the variables for clustering *and* the number of points in the cluster. A cluster where points are very close to the center might still have a large $S^2$ if there are many data points in the cluster.
 
 After we have calculated the WSSD for all the clusters,
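
A small self-contained illustration (not from the book) of the within-cluster sum-of-squared-distances $S^2$ for a single toy cluster:

```r
library(tidyverse)

cluster <- tibble(x = c(1, 2, 3), y = c(4, 6, 5))
center  <- summarize(cluster, x = mean(x), y = mean(y))  # cluster center

# Sum of squared distances from each point to the center.
wssd <- sum((cluster$x - center$x)^2 + (cluster$y - center$y)^2)
wssd
#> [1] 4
```
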
@@ -591,7 +591,7 @@ These, however, are beyond the scope of this book.
 
 ### Random restarts
 
-Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart,nstart} can get "stuck" in a bad solution.
+Unlike the classification and regression models we studied in previous chapters, K-means \index{K-means!restart, nstart} can get "stuck" in a bad solution.
 For example, Figure \@ref(fig:10-toy-kmeans-bad-init) illustrates an unlucky random initialization by K-means.
 
 ```{r 10-toy-kmeans-bad-init, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.cap = "Random initialization of labels."}
@@ -859,7 +859,7 @@ each other. Therefore, the *scale* of each of the variables in the data
 will influence which cluster data points end up being assigned.
 Variables with a large scale will have a much larger
 effect on deciding cluster assignment than variables with a small scale.
-To address this problem, we typically standardize \index{standardization!K-means}\index{K-means!stanardization} our data before clustering,
+To address this problem, we typically standardize \index{standardization!K-means}\index{K-means!standardization} our data before clustering,
 which ensures that each variable has a mean of 0 and standard deviation of 1.
 The `scale` function in R can be used to do this.
 We show an example of how to use this function
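
A hedged sketch of that standardization step; it uses the raw `penguins` data from the `palmerpenguins` package rather than the book's modified data set.

```r
library(tidyverse)
library(palmerpenguins)

standardized_data <- penguins |>
  select(bill_length_mm, flipper_length_mm) |>
  drop_na() |>
  mutate(across(everything(), ~ as.numeric(scale(.x))))  # mean 0, sd 1 per column
```
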
@@ -1050,7 +1050,7 @@ But why is there a "bump" in the total WSSD plot here?
 Shouldn't total WSSD always decrease as we add more clusters?
 Technically yes, but remember: K-means can get "stuck" in a bad solution.
 Unfortunately, for K = 8 we had an unlucky initialization
-and found a bad clustering! \index{K-means!restart,nstart}
+and found a bad clustering! \index{K-means!restart, nstart}
 We can help prevent finding a bad clustering
 by trying a few different random initializations
 via the `nstart` argument (Figure \@ref(fig:10-choose-k-nstart)
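
A minimal sketch of the `nstart` argument mentioned here; the data frame name comes from the standardization sketch above, not from this commit.

```r
set.seed(1234)
# nstart = 10 runs ten random initializations and keeps the one with the
# lowest total within-cluster sum of squared distances.
penguin_clust <- kmeans(standardized_data, centers = 8, nstart = 10)
penguin_clust$tot.withinss
```
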

inference.Rmd

Lines changed: 4 additions & 4 deletions
@@ -750,9 +750,9 @@ For a sample of size $n$, you would do the following:
 1. Randomly select an observation from the original sample, which was drawn from the population.
 2. Record the observation's value.
 3. Replace that observation.
-4. Repeat steps 1 - 3 (sampling *with* replacement) until you have $n$ observations, which form a bootstrap sample.
+4. Repeat steps 1&ndash;3 (sampling *with* replacement) until you have $n$ observations, which form a bootstrap sample.
 5. Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the $n$ observations in your bootstrap sample.
-6. Repeat steps (1) - (5) many times to create a distribution of point estimates (the bootstrap distribution).
+6. Repeat steps 1&ndash;5 many times to create a distribution of point estimates (the bootstrap distribution).
 7. Calculate the plausible range of values around our observed point estimate.
 
 ```{r 11-intro-bootstrap-image, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Overview of the bootstrap process.", fig.retina = 2, out.width="100%"}
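
A base R sketch of steps 1 to 7, assuming `one_sample` is a numeric vector holding the single observed sample (illustration only, not the chapter's implementation):

```r
set.seed(1)
n <- length(one_sample)

boot_means <- replicate(1000, {                                # step 6: repeat many times
  boot_sample <- sample(one_sample, size = n, replace = TRUE)  # steps 1-4: resample with replacement
  mean(boot_sample)                                            # step 5: bootstrap point estimate
})

quantile(boot_means, c(0.025, 0.975))                          # step 7: plausible range of values
```
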
@@ -789,7 +789,7 @@ mean of the sample is \$`r round(estimates$sample_mean, 2)`.
 Remember, in practice, we usually only have this one sample from the population. So
 this sample and estimate are the only data we can work with.
 
-We now perform steps (1) - (5) listed above to generate a single bootstrap
+We now perform steps 1&ndash;5 listed above to generate a single bootstrap
 sample in R and calculate a point estimate from that bootstrap sample. We will
 use the `rep_sample_n` function as we did when we were
 creating our sampling distribution. But critically, note that we now
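
A hedged sketch of that single bootstrap sample; `one_sample` and its `price` column are assumed names based on the surrounding text, and the key detail is `replace = TRUE`.

```r
library(tidyverse)
library(infer)

boot1 <- one_sample |>
  rep_sample_n(size = nrow(one_sample), replace = TRUE, reps = 1)

# Point estimate from this one bootstrap sample.
boot1 |>
  summarize(mean_price = mean(price))
```
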
@@ -1173,4 +1173,4 @@ found in Chapter \@ref(move-to-your-own-machine).
 ## Additional resources
 
 - Chapters 7 to 10 of [*Modern Dive*](https://moderndive.com/) provide a great next step in learning about inference. In particular, Chapters 7 and 8 cover sampling and bootstrapping using `tidyverse` and `infer` in a slightly more in-depth manner than the present chapter. Chapters 9 and 10 take the next step beyond the scope of this chapter and begin to provide some of the initial mathematical underpinnings of inference and more advanced applications of the concept of inference in testing hypotheses and performing regression. This material offers a great starting point for getting more into the technical side of statistics.
-- Chapters 4 to 7 of [*OpenIntro Statistics - Fourth Edition*](https://www.openintro.org/) provide a good next step after *Modern Dive*. Although it is still certainly an introductory text, things get a bit more mathematical here. Depending on your background, you may actually want to start going through Chapters 1 to 3 first, where you will learn some fundamental concepts in probability theory. Although it may seem like a diversion, probability theory is *the language of statistics*; if you have a solid grasp of probability, more advanced statistics will come naturally to you!
+- Chapters 4 to 7 of [*OpenIntro Statistics*](https://www.openintro.org/) provide a good next step after *Modern Dive*. Although it is still certainly an introductory text, things get a bit more mathematical here. Depending on your background, you may actually want to start going through Chapters 1 to 3 first, where you will learn some fundamental concepts in probability theory. Although it may seem like a diversion, probability theory is *the language of statistics*; if you have a solid grasp of probability, more advanced statistics will come naturally to you!

jupyter.Rmd

Lines changed: 5 additions & 5 deletions
@@ -144,7 +144,7 @@ that indicates the status of your kernel. If the circle is empty (`r fa("circle"
 the kernel is idle and ready to execute code. If the circle is filled in (`r fa("circle", fill = "black", stroke = "black", stroke_width = "10px", height = "12px")`),
 the kernel is busy running some code.
 
-You may run into problems where your kernel \index{kernel!interrupt,restart} is stuck for an excessive amount
+You may run into problems where your kernel \index{kernel!interrupt, restart} is stuck for an excessive amount
 of time, your notebook is very slow and unresponsive, or your kernel loses its
 connection. If this happens, try the following steps:
@@ -245,8 +245,8 @@ referenced in another distinct code cell (Figure \@ref(fig:out-of-order-1)).
 Together, this means that you could then write a code cell further above in the
 notebook that references `y` and execute it without error in the current session
 (Figure \@ref(fig:out-of-order-2)). This could also be done successfully in
-future sessions if, and only if, you run the cells in the same non-conventional
-order. However, it is difficult to remember this non-conventional order, and it
+future sessions if, and only if, you run the cells in the same unconventional
+order. However, it is difficult to remember this unconventional order, and it
 is not the order that others would expect your code to be executed in. Thus, in
 the future, this would lead
 to errors when the notebook is run in the conventional
@@ -287,7 +287,7 @@ is an issue. Knowing this sooner rather than later will allow you to
 fix the issue and ensure your notebook can be run linearly from start to finish.
 
 We recommend as a best practice to run the entire notebook in a fresh R session
-at least 2-3 times within any period of work. Note that,
+at least 2&ndash;3 times within any period of work. Note that,
 critically, you *must do this in a fresh R session* by restarting your kernel.
 We recommend using either the **Kernel** >>
 **Restart Kernel and Run All Cells...** command from the menu or the `r fa("fast-forward", height = "11px")`
@@ -328,7 +328,7 @@ their computer to run the analysis successfully.
 1. Write code so that it can be executed in a linear order.
 
 2. As you write code in a Jupyter notebook, run the notebook in a linear order
-and in its entirety often (2-3 times every work session) via the **Kernel** >>
+and in its entirety often (2&ndash;3 times every work session) via the **Kernel** >>
 **Restart Kernel and Run All Cells...** command from the Jupyter menu or the `r fa("fast-forward", height = "11px")`
 button in the toolbar.

reading.Rmd

Lines changed: 10 additions & 10 deletions
@@ -61,7 +61,7 @@ into R, but before we can talk about *how* we read the data into R with these
 functions, we first need to talk about *where* the data lives. When you load a
 data set into R, you first need to tell R where those files live. The file
 could live on your computer (*local*)
-\index{location|see{path}} \index{path!local,remote,relative,absolute}
+\index{location|see{path}} \index{path!local, remote, relative, absolute}
 or somewhere on the internet (*remote*).
 
 The place where the file lives on your computer is called the "path". You can
@@ -810,10 +810,10 @@ directly from what a website displays is called \index{web scraping}
 information manually is a painstaking and error-prone process, especially when
 there is a lot of information to gather. So instead of asking your browser to
 translate the information that the web server provides into something you can
-see, you can collect that data programmatically---in the form of
+see, you can collect that data programmatically&mdash;in the form of
 **h**yper**t**ext **m**arkup **l**anguage
 (HTML) \index{hypertext markup language|see{HTML}}\index{cascading style sheet|see{CSS}}\index{CSS}\index{HTML}
-and **c**ascading **s**tyle **s**heet (CSS) code---and process it
+and **c**ascading **s**tyle **s**heet (CSS) code&mdash;and process it
 to extract useful information. HTML provides the
 basic structure of a site and tells the webpage how to display the content
 (e.g., titles, paragraphs, bullet lists etc.), whereas CSS helps style the
@@ -921,8 +921,8 @@ take a look at another line of the source snippet above:
 
 It's yet another price for an apartment listing, and the tags surrounding it
 have the `"result-price"` class. Wonderful! Now that we know what pattern we
-are looking for---a dollar amount between opening and closing tags that have the
-`"result-price"` class---we should be able to use code to pull out all of the
+are looking for&mdash;a dollar amount between opening and closing tags that have the
+`"result-price"` class&mdash;we should be able to use code to pull out all of the
 matching patterns from the source code to obtain our data. This sort of "pattern"
 is known as a *CSS selector* (where CSS stands for **c**ascading **s**tyle **s**heet).
@@ -1063,7 +1063,7 @@ population_nodes <- html_nodes(page, selectors)
 head(population_nodes)
 ```
 
-Next we extract the meaningful data---in other words, we get rid of the HTML code syntax and tags---from
+Next we extract the meaningful data&mdash;in other words, we get rid of the HTML code syntax and tags&mdash;from
 the nodes using the `html_text`
 function. In the case of the example
 node above, `html_text` function returns `"London"`.
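
A self-contained toy version of the pattern described in these hunks; the HTML snippet is a stand-in, not the chapter's page, while `.result-price` is the CSS selector discussed above.

```r
library(rvest)

page <- read_html('<span class="result-price">$2,000</span>
                   <span class="result-price">$2,150</span>')

page |>
  html_nodes(".result-price") |>   # select nodes matching the CSS selector
  html_text()                      # strip the tags, keep the text
#> [1] "$2,000" "$2,150"
```
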
@@ -1090,7 +1090,7 @@ Rather than posting a data file at a URL for you to download, many websites thes
 provide an API \index{API} that must be accessed through a programming language like R. The benefit of this
 is that data owners have much more control over the data they provide to users. However, unlike
 web scraping, there is no consistent way to access an API across websites. Every website typically
-has its own API designed especially for its own use-case. Therefore we will just provide one example
+has its own API designed especially for its own use case. Therefore we will just provide one example
 of accessing data through an API in this book, with the hope that it gives you enough of a basic
 idea that you can learn how to use another API if needed.
@@ -1120,8 +1120,8 @@ knitr::include_graphics("img/tidyverse_twitter.png")
 When you access an API, you are initiating a transfer of data from a web server
 to your computer. Web servers are expensive to run and do not have infinite resources.
 If you try to ask for *too much data* at once, you can use up a huge amount of the server's bandwidth.
-If you try to ask for data *too frequently*---e.g., if you
-make many requests to the server in quick succession---you can also bog the server down and make
+If you try to ask for data *too frequently*&mdash;e.g., if you
+make many requests to the server in quick succession&mdash;you can also bog the server down and make
 it unable to talk to anyone else. Most servers have mechanisms to revoke your access if you are not
 careful, but you should try to prevent issues from happening in the first place by being extra careful
 with how you write and run your code. You should also keep in mind that when a website owner
@@ -1195,7 +1195,7 @@ tidyverse_tweets
 If you look back up at the image of the Tidyverse Twitter page, you will
 recognize the text of the most recent few tweets in the above data frame. In
 other words, we have successfully created a small data set using the Twitter
-API---neat! This data is also quite different from what we obtained from web scraping;
+API&mdash;neat! This data is also quite different from what we obtained from web scraping;
 it is already well-organized into a `tidyverse` data frame (although not *every* API
 will provide data in such a nice format).
 From this point onward, the `tidyverse_tweets` data frame is stored on your
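
For orientation only: the kind of call that produces a `tidyverse_tweets` data frame, assuming you have already set up your own Twitter API credentials with `rtweet` (without them the request will not be authorized, and the API itself may have changed since the chapter was written).

```r
library(rtweet)

# Fetch the 20 most recent tweets from the @tidyverse account.
tidyverse_tweets <- get_timeline("tidyverse", n = 20)
```
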
