
Commit 40741b6

Merge pull request #444 from UBC-DSCI/dev
Update master with dev

2 parents fec7a19 + 3dee232, commit 40741b6

17 files changed: +152 -147 lines

classification1.Rmd

Lines changed: 8 additions & 6 deletions
@@ -170,6 +170,7 @@ total set of variables per image in this data set is:
 11. Symmetry: how similar the nucleus is when mirrored
 12. Fractal Dimension: a measurement of how "rough" the perimeter is

+\pagebreak

 Below we use `glimpse` \index{glimpse} to preview the data frame. This function can
 make it easier to inspect the data when we have a lot of columns,
@@ -192,7 +193,7 @@ glimpse(cancer)
 ```

 Recall that factors have what are called "levels", which you can think of as categories. We
-can verify the levels of the `Class` column by using the `levels` \index{levels}\index{factor!levels} function.
+can verify the levels of the `Class` column by using the `levels`\index{levels}\index{factor!levels} function.
 This function should return the name of each category in that column. Given
 that we only have two different values in our `Class` column (B for benign and M
 for malignant), we only expect to get two names back. Note that the `levels` function requires a *vector* argument;
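For reference, a minimal sketch of the `levels` usage this hunk adjusts; the tiny `cancer` tibble below is a hypothetical stand-in, since the diff shows only the prose:

```r
library(tidyverse)

# hypothetical stand-in for the chapter's cancer data frame
cancer <- tibble(Class = factor(c("B", "M", "B")))

# levels() requires a vector, so pull() the Class column out first
cancer |>
  pull(Class) |>
  levels()
#> [1] "B" "M"
```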
@@ -582,7 +583,6 @@ three predictors.
 new_obs_Perimeter <- 0
 new_obs_Concavity <- 3.5
 new_obs_Symmetry <- 1
-
 cancer |>
   select(ID, Perimeter, Concavity, Symmetry, Class) |>
   mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
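The `mutate` call is cut off by the hunk boundary; a plausible completion of the three-predictor Euclidean distance computation (the closing `arrange`/`slice` steps are assumptions, not shown in the diff):

```r
new_obs_Perimeter <- 0
new_obs_Concavity <- 3.5
new_obs_Symmetry <- 1
cancer |>
  select(ID, Perimeter, Concavity, Symmetry, Class) |>
  mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 +
                              (Concavity - new_obs_Concavity)^2 +
                              (Symmetry - new_obs_Symmetry)^2)) |>
  arrange(dist_from_new) |>   # assumed: order by distance
  slice(1:5)                  # assumed: keep the 5 nearest neighbors
```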
@@ -846,8 +846,8 @@ loaded, and the standardized version of that same data. But first, we need to
 standardize the `unscaled_cancer` data set with `tidymodels`.

 In the `tidymodels` framework, all data preprocessing happens
-using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes]
-Here we will initialize a recipe \index{recipe} \index{tidymodels!recipe|see{recipe}} for
+using a `recipe` from [the `recipes` R package](https://recipes.tidymodels.org/) [@recipes].
+Here we will initialize a recipe\index{recipe} \index{tidymodels!recipe|see{recipe}} for
 the `unscaled_cancer` data above, specifying
 that the `Class` variable is the target, and all other variables are predictors:
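A minimal sketch of the recipe initialization the prose describes, assuming the standardization steps (`step_scale`, `step_center`) discussed elsewhere in the chapter:

```r
library(tidymodels)

# Class is the target; `.` makes all other columns predictors
uc_recipe <- recipe(Class ~ ., data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
```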
@@ -1296,7 +1296,7 @@ The `tidymodels` package collection also provides the `workflow`, a way to chain
 To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data.
 First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:

-```{r 05-workflow}
+```{r 05-workflow, message = FALSE, warning = FALSE}
 # load the unscaled cancer data
 # and make sure the target Class variable is a factor
 unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") |>
@@ -1320,7 +1320,7 @@ formula `Class ~ Area + Smoothness` (instead of `Class ~ .`) in the recipe.
 You will also notice that we did not call `prep()` on the recipe; this is unnecessary when it is
 placed in a workflow.

-We will now place these steps in a `workflow` using the `add_recipe` and `add_model` functions, \index{tidymodels!add\_recipe}\index{tidymodels!add\_model}
+We will now place these steps in a `workflow` using the `add_recipe` and `add_model` functions,\index{tidymodels!add\_recipe}\index{tidymodels!add\_model}
 and finally we will use the `fit` function to run the whole workflow on the `unscaled_cancer` data.
 Note another difference from earlier here: we do not include a formula in the `fit` function. This \index{tidymodels!fit}
 is again because we included the formula in the recipe, so there is no need to respecify it:
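A sketch of the workflow pattern being described; the model spec and the `neighbors` value are illustrative assumptions:

```r
# an illustrative K-NN spec; neighbors = 7 is an assumed value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
  set_engine("kknn") |>
  set_mode("classification")

# no formula passed to fit(): the recipe already carries it
knn_fit <- workflow() |>
  add_recipe(uc_recipe) |>
  add_model(knn_spec) |>
  fit(data = unscaled_cancer)
```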
@@ -1364,6 +1364,8 @@ The basic idea is to create a grid of synthetic new observations using the `expa
 predict the label of each, and visualize the predictions with a colored scatter having a very high transparency
 (low `alpha` value) and large point radius. See if you can figure out what each line is doing!

+\pagebreak
+
 > **Note:** Understanding this code is not required for the remainder of the
 > textbook. It is included for those readers who would like to use similar
 > visualizations in their own data analyses.

classification2.Rmd

Lines changed: 16 additions & 17 deletions
@@ -120,7 +120,7 @@ in the analysis, would we not get a different result each time?
 The trick is that in R&mdash;and other programming languages&mdash;randomness
 is not actually random! Instead, R uses a *random number generator* that
 produces a sequence of numbers that
-are completely determined by a \index{seed} \index{random seed|see{seed}}
+are completely determined by a\index{seed} \index{random seed|see{seed}}
 *seed value*. Once you set the seed value
 using the \index{seed!set.seed} `set.seed` function, everything after that point may *look* random,
 but is actually totally reproducible. As long as you pick the same seed
@@ -134,34 +134,34 @@ Here, we pass in the number `1`.

 ```{r}
 set.seed(1)
-random_numbers <- sample(0:9, 10, replace=TRUE)
-random_numbers
+random_numbers1 <- sample(0:9, 10, replace=TRUE)
+random_numbers1
 ```

-You can see that `random_numbers` is a list of 10 numbers
+You can see that `random_numbers1` is a list of 10 numbers
 from 0 to 9 that, from all appearances, looks random. If
 we run the `sample` function again, we will
 get a fresh batch of 10 numbers that also look random.

 ```{r}
-random_numbers <- sample(0:9, 10, replace=TRUE)
-random_numbers
+random_numbers2 <- sample(0:9, 10, replace=TRUE)
+random_numbers2
 ```

 If we want to force R to produce the same sequences of random numbers,
 we can simply call the `set.seed` function again with the same argument
-value. 
+value.

 ```{r}
 set.seed(1)
-random_numbers <- sample(0:9, 10, replace=TRUE)
-random_numbers
+random_numbers1_again <- sample(0:9, 10, replace=TRUE)
+random_numbers1_again

-random_numbers <- sample(0:9, 10, replace=TRUE)
-random_numbers
+random_numbers2_again <- sample(0:9, 10, replace=TRUE)
+random_numbers2_again
 ```

-And if we choose
+Notice that after setting the seed, we get the same two sequences of numbers in the same order. `random_numbers1` and `random_numbers1_again` produce the same sequence of numbers, and the same can be said about `random_numbers2` and `random_numbers2_again`. And if we choose
 a different value for the seed&mdash;say, 4235&mdash;we
 obtain a different sequence of random numbers.

@@ -323,7 +323,7 @@ our test data does not influence any aspect of our model training. Once we have
 created the standardization preprocessor, we can then apply it separately to both the
 training and test data sets.

-Fortunately, the `recipe` framework from `tidymodels` helps us handle \index{recipe}\index{recipe!step\_scale}\index{recipe!step\_center}
+Fortunately, the `recipe` framework from `tidymodels` helps us handle\index{recipe}\index{recipe!step\_scale}\index{recipe!step\_center}
 this properly. Below we construct and prepare the recipe using only the training
 data (due to `data = cancer_train` in the first line).

@@ -411,7 +411,6 @@ the table of predicted labels and correct labels, using the `conf_mat` function:
 ```{r 06-confusionmat}
 confusion <- cancer_test_predictions |>
   conf_mat(truth = Class, estimate = .pred_class)
-
 confusion
 ```

@@ -497,7 +496,7 @@ for the application.
 ## Tuning the classifier

 The vast majority of predictive models in statistics and machine learning have
-*parameters*. A *parameter* \index{parameter}\index{tuning parameter|see{parameter}}
+*parameters*. A *parameter*\index{parameter}\index{tuning parameter|see{parameter}}
 is a number you have to pick in advance that determines
 some aspect of how the model behaves. For example, in the $K$-nearest neighbors
 classification algorithm, $K$ is a parameter that we have to pick
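For context, marking such a parameter for tuning in `tidymodels` looks roughly like this (a sketch; the spec details are assumptions):

```r
# neighbors = tune() defers the choice of K to a tuning procedure
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")
```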
@@ -663,7 +662,7 @@ cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)
 cancer_vfold
 ```

-Then, when we create our data analysis workflow, we use the `fit_resamples` function \index{cross-validation!fit\_resamples}\index{tidymodels!fit\_resamples}
+Then, when we create our data analysis workflow, we use the `fit_resamples` function\index{cross-validation!fit\_resamples}\index{tidymodels!fit\_resamples}
 instead of the `fit` function for training. This runs cross-validation on each
 train/validation split.
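A minimal sketch of that substitution, assuming the recipe and model spec names used earlier in the chapter (`cancer_recipe`, `knn_spec`):

```r
# fit_resamples() trains and evaluates on each train/validation split
knn_fit <- workflow() |>
  add_recipe(cancer_recipe) |>   # assumed name for the chapter's recipe
  add_model(knn_spec) |>
  fit_resamples(resamples = cancer_vfold)
```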
@@ -689,7 +688,7 @@ knn_fit <- workflow() |>
 knn_fit
 ```

-The `collect_metrics` \index{tidymodels!collect\_metrics}\index{cross-validation!collect\_metrics} function is used to aggregate the *mean* and *standard error*
+The `collect_metrics`\index{tidymodels!collect\_metrics}\index{cross-validation!collect\_metrics} function is used to aggregate the *mean* and *standard error*
 of the classifier's validation accuracy across the folds. You will find results
 related to the accuracy in the row with `accuracy` listed under the `.metric` column.
 You should consider the mean (`mean`) to be the estimated accuracy, while the standard
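The aggregation step itself is a one-liner on the resampled fit:

```r
knn_fit |>
  collect_metrics()
```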

clustering.Rmd

Lines changed: 16 additions & 17 deletions
@@ -32,7 +32,7 @@ including techniques to choose the number of clusters.
 ## Chapter learning objectives
 By the end of the chapter, readers will be able to do the following:

-* Describe a case where clustering is appropriate,
+* Describe a situation in which clustering is an appropriate technique to use,
 and what insight it might extract from the data.
 * Explain the K-means clustering algorithm.
 * Interpret the output of a K-means analysis.
@@ -46,7 +46,7 @@ and do this using R.
 limitations and assumptions of the K-means clustering algorithm.

 ## Clustering
-Clustering \index{clustering} is a data analysis task
+Clustering \index{clustering} is a data analysis technique
 involving separating a data set into subgroups of related data.
 For example, we might use clustering to separate a
 data set of documents into groups that correspond to topics, a data set of
@@ -70,12 +70,11 @@ and examine the structure of data without any response variable labels
 or values to help us.
 This approach has both advantages and disadvantages.
 Clustering requires no additional annotation or input on the data.
-For example, it would be nearly impossible to annotate
-all the articles on Wikipedia with human-made topic labels.
-However, we can still cluster the articles without this information
+For example, while it would be nearly impossible to annotate
+all the articles on Wikipedia with human-made topic labels,
+we can cluster the articles without this information
 to find groupings corresponding to topics automatically.
-
-Given that there is no response variable, it is not as easy to evaluate
+However, given that there is no response variable, it is not as easy to evaluate
 the "quality" of a clustering. With classification, we can use a test data set
 to assess prediction performance. In clustering, there is not a single good
 choice for evaluation. In this book, we will use visualization to ascertain the
@@ -248,9 +247,9 @@ It starts with an initial clustering of the data, and then iteratively
 improves it by making adjustments to the assignment of data
 to clusters until it cannot improve any further. But how do we measure
 the "quality" of a clustering, and what does it mean to improve it?
-In K-means clustering, we measure the quality of a cluster by its
-\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD}
-*within-cluster sum-of-squared-distances* (WSSD). Computing this involves two steps.
+In K-means clustering, we measure the quality of a cluster
+by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
+Computing this involves two steps.
 First, we find the cluster centers by computing the mean of each variable
 over data points in the cluster. For example, suppose we have a
 cluster containing four observations, and we are using two variables, $x$ and $y$, to cluster the data.
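To make the two steps concrete, here is a toy WSSD computation (the four observations are invented for illustration):

```r
library(tidyverse)

# a toy cluster: four observations on two variables
cluster <- tibble(x = c(1, 2, 3, 4),
                  y = c(2, 4, 4, 6))

# step 1: the cluster center is the mean of each variable
center_x <- mean(cluster$x)
center_y <- mean(cluster$y)

# step 2: WSSD is the sum of squared distances to the center
cluster |>
  mutate(sq_dist = (x - center_x)^2 + (y - center_y)^2) |>
  summarize(wssd = sum(sq_dist))
```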
@@ -839,7 +838,7 @@ p1
 ```

 If we set K less than 3, then the clustering merges separate groups of data; this causes a large
-total WSSD, since the cluster center (denoted by an "x") is not close to any of the data in the cluster. On
+total WSSD, since the cluster center is not close to any of the data in the cluster. On
 the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
 decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
 clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
@@ -890,7 +889,7 @@ not_standardized_data
 not_standardized_data
 ```

 And then we apply the `scale` function to every column in the data frame
-using `mutate` + `across`.
+using `mutate` and `across`.

 ```{r 10-mapdf-scale-data}
 standardized_data <- not_standardized_data |>
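The chunk is truncated by the hunk boundary; it plausibly continues as:

```r
# apply scale() to every column (the across() call is an assumption)
standardized_data <- not_standardized_data |>
  mutate(across(everything(), scale))
```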
@@ -903,8 +902,8 @@ standardized_data

 To perform K-means clustering in R, we use the `kmeans` function. \index{K-means!kmeans function} It takes at
 least two arguments: the data frame containing the data you wish to cluster,
-and K, the number of clusters (here we choose K = 3). Note that since the K-means
-algorithm uses a random initialization of assignments, but since we set the random seed
+and K, the number of clusters (here we choose K = 3). Note that the K-means
+algorithm uses a random initialization of assignments; but since we set the random seed
 earlier, the clustering will be reproducible.

 ```{r 10-kmeans-seed, echo = FALSE, warning = FALSE, message = FALSE}
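A minimal sketch of the `kmeans` call the prose describes (the object name and seed value are assumptions):

```r
set.seed(1)  # assumed: a seed was set earlier in the chapter
penguin_clust <- kmeans(standardized_data, centers = 3)
penguin_clust
```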
@@ -1000,8 +999,8 @@ penguin_clust_ks
 If we wanted to get one of the clusterings out
 of the list column in the data frame,
 we could use a familiar friend: `pull`.
-`pull` will return to us a data frame column as a simpler data structure,
-here that would be a list.
+`pull` will return to us a data frame column as a simpler data structure;
+here, that would be a list.
 And then to extract the first item of the list,
 we can use the `pluck` function. We pass
 it the index for the element we would like to extract
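For context, the extraction pattern being described (the list-column name is an assumption for illustration):

```r
penguin_clust_ks |>
  pull(penguin_clusts) |>  # assumed name of the list column
  pluck(1)                 # first clustering in the list
```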
@@ -1074,7 +1073,7 @@ The more times we perform K-means clustering,
 the more likely we are to find a good clustering (if one exists).
 What value should you choose for `nstart`? The answer is that it depends
 on many factors: the size and characteristics of your data set,
-as well as the speed and size of your computer.
+as well as how powerful your computer is.
 The larger the `nstart` value the better from an analysis perspective,
 but there is a trade-off that doing many clusterings
 could take a long time.
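For reference, `nstart` is passed straight to `kmeans`; a sketch with an illustrative value:

```r
# run 10 random initializations and keep the best clustering
penguin_clust <- kmeans(standardized_data, centers = 3, nstart = 10)
```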

img/intro-bootstrap.jpeg

77 Bytes; binary file, preview not shown.

img/pivot_functions.key

-508 Bytes, 96 Bytes, -854 Bytes; binary files, previews not shown.

inference.Rmd

Lines changed: 8 additions & 8 deletions
@@ -39,7 +39,7 @@ By the end of the chapter, readers will be able to do the following:

 * Describe real-world examples of questions that can be answered with statistical inference.
 * Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample.
-* Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution).
+* Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution.
 * Explain the difference between a population parameter and a sample point estimate.
 * Use R to draw random samples from a finite population.
 * Use R to create a sampling distribution from a finite population.
@@ -90,14 +90,14 @@ knitr::include_graphics("img/population_vs_sample.png")
 Note that proportions are not the *only* kind of population parameter we might
 be interested in. For example, suppose an undergraduate student studying at the University
 of British Columbia in Canada is looking for an apartment
-to rent. They need to create a budget, so they want to know something about
-studio apartment rental prices in Vancouver, BC. This student might
-formulate the following question:
+to rent. They need to create a budget, so they want to know about
+studio apartment rental prices in Vancouver. This student might
+formulate the question:

-*What is the average price-per-month of studio apartment rentals in Vancouver, Canada?*
+*What is the average price per month of studio apartment rentals in Vancouver?*

 In this case, the population consists of all studio apartment rentals in Vancouver, and the
-population parameter is the *average price-per-month*. Here we used the average
+population parameter is the *average price per month*. Here we used the average
 as a measure of the center to describe the "typical value" of studio apartment
 rental prices. But even within this one example, we could also be interested in
 many other population parameters. For instance, we know that not every studio
@@ -1148,9 +1148,9 @@ boot_est_dist +

 To finish our estimation of the population parameter, we would report the point
 estimate and our confidence interval's lower and upper bounds. Here the sample
-mean price-per-night of 40 Airbnb listings was
+mean price per night of 40 Airbnb listings was
 \$`r round(mean(one_sample$price),2)`, and we are 95\% "confident" that the true
-population mean price-per-night for all Airbnb listings in Vancouver is between
+population mean price per night for all Airbnb listings in Vancouver is between
 \$(`r round(bounds[1],2)`, `r round(bounds[2],2)`).
 Notice that our interval does indeed contain the true
 population mean value, \$`r round(mean(airbnb$price),2)`\! However, in
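As a sketch, the `bounds` object referenced inline could come from base R's `quantile` applied to the bootstrap means (the `boot_means` name and its `mean` column are assumptions):

```r
# 2.5th and 97.5th percentiles of the bootstrap distribution
bounds <- boot_means |>
  pull(mean) |>
  quantile(c(0.025, 0.975))
bounds
```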

intro.Rmd

Lines changed: 5 additions & 5 deletions
@@ -25,10 +25,10 @@ By the end of the chapter, readers will be able to do the following:
 - Identify the different types of data analysis question and categorize a question into the correct type.
 - Load the `tidyverse` package into R.
 - Read tabular data with `read_csv`.
-- Use `?` to access help and documentation tools in R.
 - Create new variables and objects in R using the assignment symbol.
 - Create and organize subsets of tabular data using `filter`, `select`, `arrange`, and `slice`.
 - Visualize data with a `ggplot` bar plot.
+- Use `?` to access help and documentation tools in R.

 ## Canadian languages data set

@@ -312,7 +312,7 @@ to be surrounded by quotes.
 After making the assignment, we can use the special name words we have created in
 place of their values. For example, if we want to do something with the value `3` later on,
 we can just use `my_number` instead. Let's try adding 2 to `my_number`; you will see that
-R just interprets this as adding 2 and 3:
+R just interprets this as adding 3 and 2:
 ```{r naming-things2}
 my_number + 2
 ```
@@ -374,7 +374,7 @@ Aboriginal languages in the data set, and then use `select` to obtain only the
 columns we want to include in our table.

 ### Using `filter` to extract rows
-Looking at the `can_lang` data above, we see the column `category` contains different
+Looking at the `can_lang` data above, we see the `category` column contains different
 high-level categories of languages, which include "Aboriginal languages",
 "Non-Official & Non-Aboriginal languages" and "Official languages". To answer
 our question we want to filter our data set so we restrict our attention
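The restriction described here would plausibly be written as:

```r
# keep only rows whose category is "Aboriginal languages"
aboriginal_lang <- filter(can_lang, category == "Aboriginal languages")
```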
@@ -528,7 +528,7 @@ image_read("img/ggplot_function.jpeg") |>
   image_crop("1625x1900")
 ```

-```{r barplot-mother-tongue, fig.width=5, fig.height=3, warning=FALSE, fig.cap = "Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made."}
+```{r barplot-mother-tongue, fig.width=5, fig.height=3.1, warning=FALSE, fig.cap = "Bar plot of the ten Aboriginal languages most often reported by Canadian residents as their mother tongue. Note that this visualization is not done yet; there are still improvements to be made."}
 ggplot(ten_lang, aes(x = language, y = mother_tongue)) +
   geom_bar(stat = "identity")
 ```
@@ -687,7 +687,7 @@ Figure \@ref(fig:01-help) shows the documentation that will pop up,
 including a high-level description of the function, its arguments,
 a description of each, and more. Note that you may find some of the
 text in the documentation a bit too technical right now
-(for example, what is `dbplyr`, and what is grouped data?).
+(for example, what is `dbplyr`, and what is a lazy data frame?).
 Fear not: as you work through this book, many of these terms will be introduced
 to you, and slowly but surely you will become more adept at understanding and navigating
 documentation like that shown in Figure \@ref(fig:01-help). But do keep in mind that the documentation
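For example (assuming, as the `dbplyr` mention suggests, that the page in question is `filter`'s), the documentation opens from the console with:

```r
?filter
```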
