@@ -32,7 +32,7 @@ including techniques to choose the number of clusters.
## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
- * Describe a case where clustering is appropriate,
+ * Describe a situation in which clustering is an appropriate technique to use,
and what insight it might extract from the data.
* Explain the K-means clustering algorithm.
* Interpret the output of a K-means analysis.
@@ -46,7 +46,7 @@ and do this using R.
limitations and assumptions of the K-means clustering algorithm.
## Clustering
- Clustering \index{clustering} is a data analysis task
+ Clustering \index{clustering} is a data analysis technique
involving separating a data set into subgroups of related data.
For example, we might use clustering to separate a
data set of documents into groups that correspond to topics, a data set of
@@ -70,12 +70,11 @@ and examine the structure of data without any response variable labels
or values to help us.
This approach has both advantages and disadvantages.
Clustering requires no additional annotation or input on the data.
- For example, it would be nearly impossible to annotate
- all the articles on Wikipedia with human-made topic labels.
- However, we can still cluster the articles without this information
+ For example, while it would be nearly impossible to annotate
+ all the articles on Wikipedia with human-made topic labels,
+ we can cluster the articles without this information
to find groupings corresponding to topics automatically.
-
- Given that there is no response variable, it is not as easy to evaluate
+ However, given that there is no response variable, it is not as easy to evaluate
the "quality" of a clustering. With classification, we can use a test data set
to assess prediction performance. In clustering, there is not a single good
choice for evaluation. In this book, we will use visualization to ascertain the
@@ -248,9 +247,9 @@ It starts with an initial clustering of the data, and then iteratively
improves it by making adjustments to the assignment of data
to clusters until it cannot improve any further. But how do we measure
the "quality" of a clustering, and what does it mean to improve it?
- In K-means clustering, we measure the quality of a cluster by its
- \index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD}
- *within-cluster sum-of-squared-distances* (WSSD). Computing this involves two steps.
+ In K-means clustering, we measure the quality of a cluster
+ by its \index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD).
+ Computing this involves two steps.
First, we find the cluster centers by computing the mean of each variable
over data points in the cluster. For example, suppose we have a
cluster containing four observations, and we are using two variables, $x$ and $y$, to cluster the data.
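
To make these two steps concrete, here is a minimal sketch in R for a single
hypothetical cluster of four observations (the data values are invented purely
for illustration):

```r
# A hypothetical cluster of four observations on two variables, x and y.
cluster <- data.frame(x = c(1, 2, 3, 4),
                      y = c(2, 1, 3, 2))

# Step 1: find the cluster center by averaging each variable.
center_x <- mean(cluster$x)
center_y <- mean(cluster$y)

# Step 2: sum the squared straight-line distances from each point to the center.
wssd <- sum((cluster$x - center_x)^2 + (cluster$y - center_y)^2)
wssd
```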
If we set K less than 3, then the clustering merges separate groups of data; this causes a large
- total WSSD, since the cluster center (denoted by an "x") is not close to any of the data in the cluster. On
+ total WSSD, since the cluster center is not close to any of the data in the cluster. On
the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still
decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of
clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly the right number of clusters.
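
As a sketch of how such a plot might be produced (assuming the standardized
data frame is called `standardized_data`, as later in the chapter; the exact
names here are illustrative):

```r
library(tidyverse)

# Total WSSD for K = 1, ..., 9; tot.withinss is the total within-cluster
# sum of squared distances that kmeans reports for a clustering.
elbow_df <- tibble(k = 1:9) |>
  mutate(total_wssd = map_dbl(k,
    ~ kmeans(standardized_data, centers = .x, nstart = 10)$tot.withinss))

ggplot(elbow_df, aes(x = k, y = total_wssd)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of clusters (K)", y = "Total WSSD")
```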
@@ -903,8 +902,8 @@ standardized_data
To perform K-means clustering in R, we use the `kmeans` function. \index{K-means!kmeans function} It takes at
least two arguments: the data frame containing the data you wish to cluster,
- and K, the number of clusters (here we choose K = 3). Note that since the K-means
- algorithm uses a random initialization of assignments, but since we set the random seed
+ and K, the number of clusters (here we choose K = 3). Note that the K-means
+ algorithm uses a random initialization of assignments; but since we set the random seed
earlier, the clustering will be reproducible.
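
As a sketch of the call this paragraph describes (the object name
`penguin_clust` is an assumption; `standardized_data` is the scaled data
frame from above):

```r
# Cluster the standardized data into K = 3 clusters. The random seed set
# earlier makes the random initialization, and hence the result, reproducible.
penguin_clust <- kmeans(standardized_data, centers = 3)
penguin_clust
```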
```{r 10-kmeans-seed, echo = FALSE, warning = FALSE, message = FALSE}
@@ -1000,8 +999,8 @@ penguin_clust_ks
If we wanted to get one of the clusterings out
of the list column in the data frame,
we could use a familiar friend: `pull`.
- `pull` will return to us a data frame column as a simpler data structure,
- here that would be a list.
+ `pull` will return to us a data frame column as a simpler data structure;
+ here, that would be a list.
And then to extract the first item of the list,
we can use the `pluck` function. We pass
it the index for the element we would like to extract.
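
A sketch of that extraction, assuming the tidyverse is loaded and the list
column holding the clusterings is named `penguin_clusts` (the column name is
an assumption; substitute the one used in your analysis):

```r
# Pull the list column out of the data frame, then pluck its first element.
penguin_clust_ks |>
  pull(penguin_clusts) |>
  pluck(1)
```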
@@ -1074,7 +1073,7 @@ The more times we perform K-means clustering,
the more likely we are to find a good clustering (if one exists).
What value should you choose for `nstart`? The answer is that it depends
on many factors: the size and characteristics of your data set,
- as well as the speed and size of your computer.
+ as well as how powerful your computer is.
The larger the `nstart` value, the better from an analysis perspective,
but there is a trade-off that doing many clusterings
could take a long time.
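
For instance, a sketch of a call with 10 restarts (reusing the assumed
`standardized_data` from above):

```r
# Run K-means from 10 different random initializations; kmeans() keeps only
# the run with the lowest total WSSD.
penguin_clust_10 <- kmeans(standardized_data, centers = 3, nstart = 10)
penguin_clust_10$tot.withinss
```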