@@ -32,7 +32,7 @@ including techniques to choose the number of clusters.
## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
- * Describe a case where clustering is appropriate,
+ * Describe a situation in which clustering is an appropriate technique to use,
and what insight it might extract from the data.
* Explain the K-means clustering algorithm.
* Interpret the output of a K-means analysis.
@@ -46,7 +46,7 @@ and do this using R.
limitations and assumptions of the K-means clustering algorithm.
## Clustering
- Clustering \index{clustering} is a data analysis task
+ Clustering \index{clustering} is a data analysis technique
involving separating a data set into subgroups of related data.
For example, we might use clustering to separate a
data set of documents into groups that correspond to topics, a data set of
@@ -70,12 +70,11 @@ and examine the structure of data without any response variable labels
or values to help us.
This approach has both advantages and disadvantages.
Clustering requires no additional annotation or input on the data.
- For example, it would be nearly impossible to annotate
- all the articles on Wikipedia with human-made topic labels.
- However, we can still cluster the articles without this information
+ For example, while it would be nearly impossible to annotate
+ all the articles on Wikipedia with human-made topic labels,
+ we can cluster the articles without this information
to find groupings corresponding to topics automatically.
-
- Given that there is no response variable, it is not as easy to evaluate
+ However, given that there is no response variable, it is not as easy to evaluate
the "quality" of a clustering. With classification, we can use a test data set
to assess prediction performance. In clustering, there is not a single good
choice for evaluation. In this book, we will use visualization to ascertain the
@@ -248,9 +247,9 @@ It starts with an initial clustering of the data, and then iteratively
improves it by making adjustments to the assignment of data
to clusters until it cannot improve any further. But how do we measure
the "quality" of a clustering, and what does it mean to improve it?
- In K-means clustering, we measure the quality of a cluster by its
- \index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD}
- *within-cluster sum-of-squared-distances* (WSSD). Computing this involves two steps.
+ In K-means clustering, we measure the quality of a cluster
+ by its \index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
+ Computing this involves two steps.
First, we find the cluster centers by computing the mean of each variable
over data points in the cluster. For example, suppose we have a
cluster containing four observations, and we are using two variables, $x$ and $y$, to cluster the data.
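
To make these two steps concrete, here is a minimal sketch in R for a single
hypothetical cluster of four observations (the data values are invented purely
for illustration):

```r
# A hypothetical cluster of four observations on two variables, x and y.
cluster <- data.frame(x = c(1, 2, 3, 4),
                      y = c(2, 1, 3, 2))

# Step 1: find the cluster center by averaging each variable.
center_x <- mean(cluster$x)
center_y <- mean(cluster$y)

# Step 2: sum the squared straight-line distances from each point to the center.
wssd <- sum((cluster$x - center_x)^2 + (cluster$y - center_y)^2)
wssd
```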
If we set K less than 3, then the clustering merges separate groups of data; this causes a large
- total WSSD, since the cluster center (denoted by an "x") is not close to any of the data in the cluster. On
+ total WSSD, since the cluster center is not close to any of the data in the cluster. On
the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly the right number of clusters.
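
As a sketch of how such a plot might be produced (assuming the standardized
data frame is called `standardized_data`, as later in the chapter; the exact
names here are illustrative):

```r
library(tidyverse)

# Total WSSD for K = 1, ..., 9; tot.withinss is the total within-cluster
# sum of squared distances that kmeans reports for a clustering.
elbow_df <- tibble(k = 1:9) |>
  mutate(total_wssd = map_dbl(k,
    ~ kmeans(standardized_data, centers = .x, nstart = 10)$tot.withinss))

ggplot(elbow_df, aes(x = k, y = total_wssd)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of clusters (K)", y = "Total WSSD")
```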
@@ -903,8 +902,8 @@ standardized_data
To perform K-means clustering in R, we use the `kmeans` function. \index{K-means!kmeans function} It takes at
least two arguments: the data frame containing the data you wish to cluster,
- and K, the number of clusters (here we choose K = 3). Note that since the K-means
- algorithm uses a random initialization of assignments, but since we set the random seed
+ and K, the number of clusters (here we choose K = 3). Note that the K-means
+ algorithm uses a random initialization of assignments; but since we set the random seed
earlier, the clustering will be reproducible.
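
As a sketch of the call this paragraph describes (the object name
`penguin_clust` is an assumption; `standardized_data` is the scaled data
frame from above):

```r
# Cluster the standardized data into K = 3 clusters. The random seed set
# earlier makes the random initialization, and hence the result, reproducible.
penguin_clust <- kmeans(standardized_data, centers = 3)
penguin_clust
```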
```{r 10-kmeans-seed, echo = FALSE, warning = FALSE, message = FALSE}
@@ -1000,8 +999,8 @@ penguin_clust_ks
If we wanted to get one of the clusterings out
of the list column in the data frame,
we could use a familiar friend: `pull`.
- `pull` will return to us a data frame column as a simpler data structure,
- here that would be a list.
+ `pull` will return to us a data frame column as a simpler data structure;
+ here, that would be a list.
And then to extract the first item of the list,
we can use the `pluck` function. We pass
it the index for the element we would like to extract.
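
A sketch of that extraction, assuming the tidyverse is loaded and the list
column holding the clusterings is named `penguin_clusts` (the column name is
an assumption; substitute the one used in your analysis):

```r
# Pull the list column out of the data frame, then pluck its first element.
penguin_clust_ks |>
  pull(penguin_clusts) |>
  pluck(1)
```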
@@ -1074,7 +1073,7 @@ The more times we perform K-means clustering,
the more likely we are to find a good clustering (if one exists).
What value should you choose for `nstart`? The answer is that it depends
on many factors: the size and characteristics of your data set,
- as well as the speed and size of your computer.
+ as well as how powerful your computer is.
The larger the `nstart` value, the better from an analysis perspective,
but there is a trade-off that doing many clusterings
could take a long time.
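
For instance, a sketch of a call with 10 restarts (reusing the assumed
`standardized_data` from above):

```r
# Run K-means from 10 different random initializations; kmeans() keeps only
# the run with the lowest total WSSD.
penguin_clust_10 <- kmeans(standardized_data, centers = 3, nstart = 10)
penguin_clust_10$tot.withinss
```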