clustering.Rmd: 7 additions & 7 deletions
@@ -28,7 +28,7 @@ using the K-means algorithm,
 including techniques to choose the number of clusters.

 ## Chapter learning objectives
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:

 * Describe a case where clustering is appropriate,
 and what insight it might extract from the data.
@@ -104,7 +104,7 @@ for where to begin learning more about these other methods.
 Here we will present an illustrative example using a data set \index{Palmer penguins} from the
 [{palmerpenguins} R data package](https://allisonhorst.github.io/palmerpenguins/). This data set was
 collected by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and
-the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/) and includes
+the [Palmer Station, Antarctica Long Term Ecological Research Site](https://pal.lternet.edu/), and includes
 measurements for adult penguins found near there [@palmerpenguins]. We have
 modified the data set for use in this chapter. Here we will focus on using two
 variables---penguin bill and flipper length, both in millimeters---to determine whether
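For readers following along with this hunk, a minimal sketch of loading the two named variables is below. The chapter works with a *modified* version of the data set that is not reproduced here; this sketch pulls the raw {palmerpenguins} package data instead, so the cleaning step is an assumption, not the book's code.

```r
# Sketch only: approximate the chapter's data with the raw package data.
library(palmerpenguins)
library(tidyverse)

penguin_data <- penguins |>
  select(bill_length_mm, flipper_length_mm) |>  # the two variables named above
  drop_na()                                     # drop rows with missing values
```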
@@ -184,7 +184,7 @@ including:
 2. a small flipper length, but large bill length group, and
 3. a large flipper and bill length group.

-Data visualization is a great tool to give us a rough sense for such patterns
+Data visualization is a great tool to give us a rough sense of such patterns
 when we have a small number of variables.
 But if we are to group data—and select the number of groups—as part of
 a reproducible analysis, we need something a bit more automated.
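The exploratory plot this passage refers to could be sketched as follows, assuming the `penguin_data` frame from the previous sketch; the axis labels are guesses at the book's styling.

```r
library(ggplot2)

# Get a rough sense of the three candidate groups by plotting the
# two measurements against each other.
ggplot(penguin_data, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point() +
  labs(x = "Flipper Length (mm)", y = "Bill Length (mm)")
```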
@@ -193,7 +193,7 @@ as we increase the number of variables we consider when clustering.
 The way to rigorously separate the data into groups
 is to use a clustering algorithm.
 In this chapter, we will focus on the *K-means* algorithm,
-\index{K-means} a widely-used and often very effective clustering method,
+\index{K-means} a widely used and often very effective clustering method,
 combined with the *elbow method* \index{elbow method}
 for selecting the number of clusters.
 This procedure will separate the data into groups;
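As context for this hunk, a single K-means fit with base R's `kmeans` might look like the sketch below. The choices of K = 3, `nstart = 10`, and the seed are illustrative rather than the book's; standardizing first is the usual practice for distance-based methods like K-means.

```r
set.seed(1)  # kmeans uses random starting assignments; fix the seed for reproducibility

# Standardize so both variables contribute comparably to distances
# (illustrative preprocessing, assuming penguin_data from the earlier sketch).
standardized_data <- penguin_data |>
  mutate(across(everything(), ~ as.numeric(scale(.x))))

penguin_clust <- kmeans(standardized_data, centers = 3, nstart = 10)
penguin_clust$tot.withinss  # total within-cluster sum of squared distances (WSSD)
```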
@@ -911,7 +911,7 @@ As you can see above, the clustering object returned by `kmeans` has a lot of in
 that can be used to visualize the clusters, pick K, and evaluate the total WSSD.
 To obtain this information in a tidy format, we will call in help
 from the `broom` package. \index{broom} Let's start by visualizing the clustering
-as a colored scatter plot. To do that
+as a colored scatter plot. To do that,
 we use the `augment` function, \index{K-means!augment} \index{augment} which takes in the model and the original data
 frame, and returns a data frame with the data and the cluster assignments for
 each point:
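A minimal sketch of the `augment` step this hunk describes, assuming the `penguin_clust` model and data frames from the earlier sketches:

```r
library(broom)

# augment() binds a .cluster column onto the data used to fit the model;
# we pass the unstandardized frame (same row order) to plot in original units.
clustered_data <- augment(penguin_clust, penguin_data)

ggplot(clustered_data,
       aes(x = flipper_length_mm, y = bill_length_mm, color = .cluster)) +
  geom_point() +
  labs(x = "Flipper Length (mm)", y = "Bill Length (mm)", color = "Cluster")
```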
@@ -965,7 +965,7 @@ Then we use `rowwise` \index{rowwise} + `mutate` to apply the `kmeans` function
 within each row to each K.
 However, given that the `kmeans` function
 returns a model object to us (not a vector),
-we will need to store the results as a list columm.
+we will need to store the results as a list column.
 This works because both vectors and lists are legitimate
 data structures for data frame columns.
 To make this work,
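The list-column pattern this hunk describes could be sketched like this, again assuming the objects and packages from the earlier sketches; the range of K and `nstart` are illustrative choices.

```r
# One kmeans model per row: list() wraps each model object so it can live
# in a data frame column; rowwise() makes k a scalar within each row.
# glance() (from broom) then extracts per-model summary statistics.
penguin_clust_ks <- tibble(k = 1:9) |>
  rowwise() |>
  mutate(penguin_clusts = list(kmeans(standardized_data, centers = k, nstart = 10)),
         glanced = list(glance(penguin_clusts)))

# Unnest the glance() results to get tot.withinss per K for an elbow plot.
clustering_statistics <- penguin_clust_ks |>
  ungroup() |>
  unnest(glanced)

ggplot(clustering_statistics, aes(x = k, y = tot.withinss)) +
  geom_point() +
  geom_line() +
  labs(x = "K", y = "Total within-cluster sum of squares")
```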
@@ -1098,4 +1098,4 @@ please follow the instructions for computer setup needed to run the worksheets
 found in Chapter \@ref(move-to-your-own-machine).

 ## Additional resources
-- Chapter 10 of [An Introduction to Statistical Learning](https://www.statlearning.com/)[-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc. in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique in scientific applications for reducing the number of predictors in a dataset.
+- Chapter 10 of [*An Introduction to Statistical Learning*](https://www.statlearning.com/)[-@james2013introduction] provides a great next stop in the process of learning about clustering and unsupervised learning in general. In the realm of clustering specifically, it provides a great companion introduction to K-means, but also covers *hierarchical* clustering for when you expect there to be subgroups, and then subgroups within subgroups, etc., in your data. In the realm of more general unsupervised learning, it covers *principal components analysis (PCA)*, which is a very popular technique in scientific applications for reducing the number of predictors in a dataset.