
Commit 883a7ef

ch 9 edits
1 parent 65d70b7 · commit 883a7ef

1 file changed: +15 -16 lines changed


clustering.Rmd

Lines changed: 15 additions & 16 deletions
@@ -32,7 +32,7 @@ including techniques to choose the number of clusters.
 ## Chapter learning objectives
 By the end of the chapter, readers will be able to do the following:

-* Describe a case where clustering is appropriate,
+* Describe a situation in which clustering is an appropriate technique to use,
 and what insight it might extract from the data.
 * Explain the K-means clustering algorithm.
 * Interpret the output of a K-means analysis.
@@ -46,7 +46,7 @@ and do this using R.
 limitations and assumptions of the K-means clustering algorithm.

 ## Clustering
-Clustering \index{clustering} is a data analysis task
+Clustering \index{clustering} is a data analysis technique
 involving separating a data set into subgroups of related data.
 For example, we might use clustering to separate a
 data set of documents into groups that correspond to topics, a data set of
@@ -70,12 +70,11 @@ and examine the structure of data without any response variable labels
 or values to help us.
 This approach has both advantages and disadvantages.
 Clustering requires no additional annotation or input on the data.
-For example, it would be nearly impossible to annotate
-all the articles on Wikipedia with human-made topic labels.
-However, we can still cluster the articles without this information
+For example, while it would be nearly impossible to annotate
+all the articles on Wikipedia with human-made topic labels,
+we can cluster the articles without this information
 to find groupings corresponding to topics automatically.
-
-Given that there is no response variable, it is not as easy to evaluate
+However, given that there is no response variable, it is not as easy to evaluate
 the "quality" of a clustering. With classification, we can use a test data set
 to assess prediction performance. In clustering, there is not a single good
 choice for evaluation. In this book, we will use visualization to ascertain the
@@ -248,9 +247,9 @@ It starts with an initial clustering of the data, and then iteratively
 improves it by making adjustments to the assignment of data
 to clusters until it cannot improve any further. But how do we measure
 the "quality" of a clustering, and what does it mean to improve it?
-In K-means clustering, we measure the quality of a cluster by its
-\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD}
-*within-cluster sum-of-squared-distances* (WSSD). Computing this involves two steps.
+In K-means clustering, we measure the quality of a cluster
+by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
+Computing this involves two steps.
 First, we find the cluster centers by computing the mean of each variable
 over data points in the cluster. For example, suppose we have a
 cluster containing four observations, and we are using two variables, $x$ and $y$, to cluster the data.
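As a small illustration of the two steps just described, here is a minimal R sketch that computes the center and WSSD for one made-up cluster of four observations on two variables `x` and `y`. The data values are invented for illustration and are not from the chapter.

```r
library(tidyverse)

# Four made-up observations assigned to a single cluster
one_cluster <- tibble(
  x = c(1, 2, 3, 4),
  y = c(2, 1, 3, 2)
)

# Step 1: the cluster center is the mean of each variable
center <- one_cluster |>
  summarize(x = mean(x), y = mean(y))

# Step 2: the WSSD is the sum of squared straight-line distances
# from each observation to the cluster center
wssd <- sum((one_cluster$x - center$x)^2 + (one_cluster$y - center$y)^2)
wssd
```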
@@ -839,7 +838,7 @@ p1
 ```

 If we set K less than 3, then the clustering merges separate groups of data; this causes a large
-total WSSD, since the cluster center (denoted by an "x") is not close to any of the data in the cluster. On
+total WSSD, since the cluster center is not close to any of the data in the cluster. On
 the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
 decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
 clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly
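One way to see this levelling-off concretely is to run K-means for a range of K and plot the total WSSD against the number of clusters. The sketch below is illustrative rather than the chapter's exact code: it assumes a data frame of standardized numeric variables named `standardized_data` (the name used later in the chapter) and uses `tot.withinss`, the total WSSD reported by `kmeans`.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed so the random restarts are reproducible

# Total WSSD for K = 1, ..., 9
elbow_stats <- tibble(
  k = 1:9,
  total_wssd = map_dbl(k, ~ kmeans(standardized_data, centers = .x, nstart = 10)$tot.withinss)
)

# Plot total WSSD versus the number of clusters to look for the "elbow"
ggplot(elbow_stats, aes(x = k, y = total_wssd)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of clusters", y = "Total within-cluster sum of squares")
```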
@@ -903,8 +902,8 @@ standardized_data

 To perform K-means clustering in R, we use the `kmeans` function. \index{K-means!kmeans function} It takes at
 least two arguments: the data frame containing the data you wish to cluster,
-and K, the number of clusters (here we choose K = 3). Note that since the K-means
-algorithm uses a random initialization of assignments, but since we set the random seed
+and K, the number of clusters (here we choose K = 3). Note that the K-means
+algorithm uses a random initialization of assignments; but since we set the random seed
 earlier, the clustering will be reproducible.

 ```{r 10-kmeans-seed, echo = FALSE, warning = FALSE, message = FALSE}
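# For orientation, a call matching the description above might look like the
# following sketch. The object name `penguin_clust` and the seed value are
# illustrative (not taken from this diff), and `standardized_data` is assumed
# to hold the standardized variables being clustered.
set.seed(1234)
penguin_clust <- kmeans(standardized_data, centers = 3)
penguin_clust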
@@ -1000,8 +999,8 @@ penguin_clust_ks
 If we wanted to get one of the clusterings out
 of the list column in the data frame,
 we could use a familiar friend: `pull`.
-`pull` will return to us a data frame column as a simpler data structure,
-here that would be a list.
+`pull` will return to us a data frame column as a simpler data structure;
+here, that would be a list.
 And then to extract the first item of the list,
 we can use the `pluck` function. We pass
 it the index for the element we would like to extract
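A sketch of that extraction, assuming the list column in `penguin_clust_ks` is named `clusters` (the column name is not shown in this diff):

```r
library(tidyverse)

# pull returns the list column as a plain list;
# pluck(1) then extracts the first clustering from that list
penguin_clust_ks |>
  pull(clusters) |>
  pluck(1)
```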
@@ -1074,7 +1073,7 @@ The more times we perform K-means clustering,
 the more likely we are to find a good clustering (if one exists).
 What value should you choose for `nstart`? The answer is that it depends
 on many factors: the size and characteristics of your data set,
-as well as the speed and size of your computer.
+as well as how powerful your computer is.
 The larger the `nstart` value the better from an analysis perspective,
 but there is a trade-off that doing many clusterings
 could take a long time.
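As a sketch, supplying `nstart` in the earlier call would look like the following; the value 10 is an arbitrary choice, and `standardized_data` is assumed as above.

```r
# Run K-means from 10 different random initializations and keep the
# clustering with the lowest total WSSD
penguin_clust_10 <- kmeans(standardized_data, centers = 3, nstart = 10)
```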
