Commit b12e5de

Merge pull request #434 from UBC-DSCI/ch9-13
Ch9-13 Final check
2 parents 980c019 + 02a5fc8 commit b12e5de

4 files changed: +30 -32 lines changed

clustering.Rmd

Lines changed: 16 additions & 17 deletions
@@ -32,7 +32,7 @@ including techniques to choose the number of clusters.
 ## Chapter learning objectives
 By the end of the chapter, readers will be able to do the following:

-* Describe a case where clustering is appropriate,
+* Describe a situation in which clustering is an appropriate technique to use,
 and what insight it might extract from the data.
 * Explain the K-means clustering algorithm.
 * Interpret the output of a K-means analysis.
@@ -46,7 +46,7 @@ and do this using R.
 limitations and assumptions of the K-means clustering algorithm.

 ## Clustering
-Clustering \index{clustering} is a data analysis task
+Clustering \index{clustering} is a data analysis technique
 involving separating a data set into subgroups of related data.
 For example, we might use clustering to separate a
 data set of documents into groups that correspond to topics, a data set of
@@ -70,12 +70,11 @@ and examine the structure of data without any response variable labels
 or values to help us.
 This approach has both advantages and disadvantages.
 Clustering requires no additional annotation or input on the data.
-For example, it would be nearly impossible to annotate
-all the articles on Wikipedia with human-made topic labels.
-However, we can still cluster the articles without this information
+For example, while it would be nearly impossible to annotate
+all the articles on Wikipedia with human-made topic labels,
+we can cluster the articles without this information
 to find groupings corresponding to topics automatically.
-
-Given that there is no response variable, it is not as easy to evaluate
+However, given that there is no response variable, it is not as easy to evaluate
 the "quality" of a clustering. With classification, we can use a test data set
 to assess prediction performance. In clustering, there is not a single good
 choice for evaluation. In this book, we will use visualization to ascertain the
@@ -248,9 +247,9 @@ It starts with an initial clustering of the data, and then iteratively
 improves it by making adjustments to the assignment of data
 to clusters until it cannot improve any further. But how do we measure
 the "quality" of a clustering, and what does it mean to improve it?
-In K-means clustering, we measure the quality of a cluster by its
-\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD}
-*within-cluster sum-of-squared-distances* (WSSD). Computing this involves two steps.
+In K-means clustering, we measure the quality of a cluster
+by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
+Computing this involves two steps.
 First, we find the cluster centers by computing the mean of each variable
 over data points in the cluster. For example, suppose we have a
 cluster containing four observations, and we are using two variables, $x$ and $y$, to cluster the data.
@@ -839,7 +838,7 @@ p1
 ```

 If we set K less than 3, then the clustering merges separate groups of data; this causes a large
-total WSSD, since the cluster center (denoted by an "x") is not close to any of the data in the cluster. On
+total WSSD, since the cluster center is not close to any of the data in the cluster. On
 the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
 decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
 clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
@@ -890,7 +889,7 @@ not_standardized_data
 ```

 And then we apply the `scale` function to every column in the data frame
-using `mutate` + `across`.
+using `mutate` and `across`.

 ```{r 10-mapdf-scale-data}
 standardized_data <- not_standardized_data |>
@@ -903,8 +902,8 @@ standardized_data

 To perform K-means clustering in R, we use the `kmeans` function. \index{K-means!kmeans function} It takes at
 least two arguments: the data frame containing the data you wish to cluster,
-and K, the number of clusters (here we choose K = 3). Note that since the K-means
-algorithm uses a random initialization of assignments, but since we set the random seed
+and K, the number of clusters (here we choose K = 3). Note that the K-means
+algorithm uses a random initialization of assignments; but since we set the random seed
 earlier, the clustering will be reproducible.

 ```{r 10-kmeans-seed, echo = FALSE, warning = FALSE, message = FALSE}
@@ -1000,8 +999,8 @@ penguin_clust_ks
 If we wanted to get one of the clusterings out
 of the list column in the data frame,
 we could use a familiar friend: `pull`.
-`pull` will return to us a data frame column as a simpler data structure,
-here that would be a list.
+`pull` will return to us a data frame column as a simpler data structure;
+here, that would be a list.
 And then to extract the first item of the list,
 we can use the `pluck` function. We pass
 it the index for the element we would like to extract
@@ -1074,7 +1073,7 @@ The more times we perform K-means clustering,
 the more likely we are to find a good clustering (if one exists).
 What value should you choose for `nstart`? The answer is that it depends
 on many factors: the size and characteristics of your data set,
-as well as the speed and size of your computer.
+as well as how powerful your computer is.
 The larger the `nstart` value the better from an analysis perspective,
 but there is a trade-off that doing many clusterings
 could take a long time.

inference.Rmd

Lines changed: 8 additions & 8 deletions
@@ -39,7 +39,7 @@ By the end of the chapter, readers will be able to do the following:

 * Describe real-world examples of questions that can be answered with statistical inference.
 * Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample.
-* Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution).
+* Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution.
 * Explain the difference between a population parameter and a sample point estimate.
 * Use R to draw random samples from a finite population.
 * Use R to create a sampling distribution from a finite population.
@@ -90,14 +90,14 @@ knitr::include_graphics("img/population_vs_sample.png")
 Note that proportions are not the *only* kind of population parameter we might
 be interested in. For example, suppose an undergraduate student studying at the University
 of British Columbia in Canada is looking for an apartment
-to rent. They need to create a budget, so they want to know something about
-studio apartment rental prices in Vancouver, BC. This student might
-formulate the following question:
+to rent. They need to create a budget, so they want to know about
+studio apartment rental prices in Vancouver. This student might
+formulate the question:

-*What is the average price-per-month of studio apartment rentals in Vancouver, Canada?*
+*What is the average price per month of studio apartment rentals in Vancouver?*

 In this case, the population consists of all studio apartment rentals in Vancouver, and the
-population parameter is the *average price-per-month*. Here we used the average
+population parameter is the *average price per month*. Here we used the average
 as a measure of the center to describe the "typical value" of studio apartment
 rental prices. But even within this one example, we could also be interested in
 many other population parameters. For instance, we know that not every studio
@@ -1148,9 +1148,9 @@ boot_est_dist +

 To finish our estimation of the population parameter, we would report the point
 estimate and our confidence interval's lower and upper bounds. Here the sample
-mean price-per-night of 40 Airbnb listings was
+mean price per night of 40 Airbnb listings was
 \$`r round(mean(one_sample$price),2)`, and we are 95\% "confident" that the true
-population mean price-per-night for all Airbnb listings in Vancouver is between
+population mean price per night for all Airbnb listings in Vancouver is between
 \$(`r round(bounds[1],2)`, `r round(bounds[2],2)`).
 Notice that our interval does indeed contain the true
 population mean value, \$`r round(mean(airbnb$price),2)`\! However, in

jupyter.Rmd

Lines changed: 4 additions & 5 deletions
@@ -104,10 +104,9 @@ To run a code cell independently, the cell needs to first be activated. This
 is done by clicking on it with the cursor. Jupyter will indicate a cell has been
 activated by highlighting it with a blue rectangle to its left. After the cell
 has been activated (Figure \@ref(fig:activate-and-run-button)), the cell can be run by either pressing the **Run** (`r fa("play", height = "11px")`)
-button in the toolbar, or by using a keyboard shortcut of
-`Shift + Enter`.
+button in the toolbar, or by using the keyboard shortcut `Shift + Enter`.

-```{r activate-and-run-button, echo = FALSE, fig.cap = "An activated cell that is ready to be run. The red arrow points to the blue rectangle to the cell's left. The blue rectangle indicates that it is ready to be run. This can be done by clicking the run button (circled in red).", fig.retina = 2, out.width="100%"}
+```{r activate-and-run-button, echo = FALSE, fig.cap = "An activated cell that is ready to be run. The blue rectangle to the cell's left (annotated by a red arrow) indicates that it is ready to be run. The cell can be run by clicking the run button (circled in red).", fig.retina = 2, out.width="100%"}
 image_read("img/activate-and-run-button-annotated.png") |>
 image_crop("3632x900")
 ```
@@ -135,7 +134,7 @@ image_read("img/restart-kernel-run-all.png") |>
 ```

 ### The Kernel
-The kernel \index{kernel}\index{Jupyter notebook!kernel|see{kernel}} is a program that executes the code inside your notebook and
+The kernel\index{kernel}\index{Jupyter notebook!kernel|see{kernel}} is a program that executes the code inside your notebook and
 outputs the results. Kernels for many different programming languages have
 been created for Jupyter, which means that Jupyter can interpret and execute
 the code of many different programming languages. To run R code, your notebook
@@ -166,7 +165,7 @@ image_read("img/create-new-code-cell.png") |>

 ## Markdown cells

-Text cells inside a Jupyter notebook are \index{markdown} \index{Jupyter notebook!markdown cell} called Markdown cells. Markdown cells
+Text cells inside a Jupyter notebook are\index{markdown}\index{Jupyter notebook!markdown cell} called Markdown cells. Markdown cells
 are rich formatted text cells, which means you can **bold** and *italicize*
 text, create subject headers, create bullet and numbered lists, and more. These cells are
 given the name "Markdown" because they use *Markdown language* to specify the rich text formatting.

version-control.Rmd

Lines changed: 2 additions & 2 deletions
@@ -225,7 +225,7 @@ made. In Figure \@ref(fig:vc-ba3-commit), the message is `Message about changes.
 your work you should make sure to replace this with an
 informative message about what changed. It is also important to note here that
 these changes are only being committed to the local repository's history. The
-remote repository on GitHub has not changed, and collaborators would not yet be
+remote repository on GitHub has not changed, and collaborators are not yet
 able to see your new changes.

 ```{r vc-ba3-commit, fig.cap = "Committing the modified files in the staging area to the local repository history, with an informative message about what changed.", fig.retina = 2, out.width="100%"}
@@ -903,7 +903,7 @@ image_read("img/version_control/issue_01.png") |>

 Next click the "New issue" button (Figure \@ref(fig:issue-02)).

-(ref:issue-02) The "New issues" button on the GitHub web interface.
+(ref:issue-02) The "New issue" button on the GitHub web interface.

 ```{r issue-02, fig.pos = "H", out.extra="", fig.cap = '(ref:issue-02)', fig.retina = 2, out.width="100%"}
 image_read("img/version_control/issue_02.png") |>
