
Commit 45890e3

Merge pull request #394 from UBC-DSCI/clustering-edits
Copyediting for clustering
2 parents f8fa611 + fb539ae commit 45890e3

File tree

1 file changed: +7 −7 lines changed


clustering.Rmd

Lines changed: 7 additions & 7 deletions
@@ -28,7 +28,7 @@ using the K-means algorithm,
 including techniques to choose the number of clusters.
 
 ## Chapter learning objectives
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:
 
 * Describe a case where clustering is appropriate,
 and what insight it might extract from the data.
@@ -104,7 +104,7 @@ for where to begin learning more about these other methods.
 Here we will present an illustrative example using a data set \index{Palmer penguins} from the
 [{palmerpenguins} R data package](https://allisonhorst.github.io/palmerpenguins/). This data set was
 collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and
-the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/) and includes
+the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/), and includes
 measurements for adult penguins found near there [@palmerpenguins]. We have
 modified the data set for use in this chapter. Here we will focus on using two
 variables---penguin bill and flipper length, both in millimeters---to determine whether
@@ -184,7 +184,7 @@ including:
 2. a small flipper length, but large bill length group, and
 3. a large flipper and bill length group.
 
-Data visualization is a great tool to give us a rough sense for such patterns
+Data visualization is a great tool to give us a rough sense of such patterns
 when we have a small number of variables.
 But if we are to group data—and select the number of groups—as part of
 a reproducible analysis, we need something a bit more automated.
@@ -193,7 +193,7 @@ as we increase the number of variables we consider when clustering.
 The way to rigorously separate the data into groups
 is to use a clustering algorithm.
 In this chapter, we will focus on the *K-means* algorithm,
-\index{K-means} a widely-used and often very effective clustering method,
+\index{K-means} a widely used and often very effective clustering method,
 combined with the *elbow method* \index{elbow method}
 for selecting the number of clusters.
 This procedure will separate the data into groups;
@@ -911,7 +911,7 @@ As you can see above, the clustering object returned by `kmeans` has a lot of in
 that can be used to visualize the clusters, pick K, and evaluate the total WSSD.
 To obtain this information in a tidy format, we will call in help
 from the `broom` package. \index{broom} Let's start by visualizing the clustering
-as a colored scatter plot. To do that
+as a colored scatter plot. To do that,
 we use the `augment` function, \index{K-means!augment} \index{augment} which takes in the model and the original data
 frame, and returns a data frame with the data and the cluster assignments for
 each point:
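As a reading aid only (not part of the commit), here is a minimal sketch of the `kmeans` + `broom::augment` workflow this hunk describes. The data frame `toy_penguins` and its columns are hypothetical stand-ins, not the chapter's actual data set.

```r
# Hypothetical example data (not the chapter's data set).
library(tidyverse)
library(broom)

toy_penguins <- tibble(
  flipper_length_mm = c(180, 182, 186, 210, 214, 219),
  bill_length_mm    = c(38.5, 39.0, 45.5, 46.0, 50.2, 49.8)
)

# Fit K-means with three clusters.
toy_clust <- kmeans(toy_penguins, centers = 3)

# augment() takes the model and the original data frame, and returns the data
# with a .cluster column giving each point's cluster assignment.
clustered_data <- augment(toy_clust, toy_penguins)

# Colored scatter plot of the cluster assignments.
ggplot(clustered_data,
       aes(x = flipper_length_mm, y = bill_length_mm, color = .cluster)) +
  geom_point()
```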
@@ -965,7 +965,7 @@ Then we use `rowwise` \index{rowwise} + `mutate` to apply the `kmeans` function
 within each row to each K.
 However, given that the `kmeans` function
 returns a model object to us (not a vector),
-we will need to store the results as a list columm.
+we will need to store the results as a list column.
 This works because both vectors and lists are legitimate
 data structures for data frame columns.
 To make this work,
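Again as a reading aid rather than part of the commit, one way the `rowwise` + `mutate` list-column step described in this hunk could look is sketched below; the data frame `toy_penguins` and the range of K values are assumptions for illustration.

```r
# Hypothetical example data (not the chapter's data set).
library(tidyverse)
library(broom)

toy_penguins <- tibble(
  flipper_length_mm = c(180, 182, 186, 210, 214, 219),
  bill_length_mm    = c(38.5, 39.0, 45.5, 46.0, 50.2, 49.8)
)

elbow_stats <- tibble(k = 1:4) |>
  rowwise() |>
  # kmeans() returns a model object (not a vector), so wrap each fit in
  # list() to store it in a list column.
  mutate(clusters = list(kmeans(toy_penguins, centers = k, nstart = 10))) |>
  # glance() gives a one-row summary per model; tot.withinss is the total
  # WSSD used for the elbow plot.
  mutate(total_wssd = glance(clusters)$tot.withinss) |>
  ungroup()

elbow_stats
```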
@@ -1098,4 +1098,4 @@ please follow the instructions for computer setup needed to run the worksheets
 found in Chapter \@ref(move-to-your-own-machine).
 
 ## Additional resources
-- Chapter 10 of [An Introduction to Statistical Learning](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc. in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique in scientific applications for reducing the number of predictors in a dataset.
+- Chapter 10 of [*An Introduction to Statistical Learning*](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc., in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique in scientific applications for reducing the number of predictors in a dataset.
