Merge pull request #434 from UBC-DSCI/ch9-13

trevorcampbell · web-flow · commit b12e5de4da86 · 2022-04-26T11:51:38.000-07:00
Ch9-13 Final check
diff --git a/clustering.Rmd b/clustering.Rmd
@@ -32,7 +32,7 @@ including techniques to choose the number of clusters.
 ## Chapter learning objectives 
 By the end of the chapter, readers will be able to do the following:
 
-* Describe a case where clustering is appropriate, 
+* Describe a situation in which clustering is an appropriate technique to use, 
 and what insight it might extract from the data.
 * Explain the K-means clustering algorithm.
 * Interpret the output of a K-means analysis.
@@ -46,7 +46,7 @@ and do this using R.
 limitations and assumptions of the K-means clustering algorithm.
 
 ## Clustering
-Clustering \index{clustering} is a data analysis task 
+Clustering \index{clustering} is a data analysis technique 
 involving separating a data set into subgroups of related data. 
 For example, we might use clustering to separate a
 data set of documents into groups that correspond to topics, a data set of
@@ -70,12 +70,11 @@ and examine the structure of data without any response variable labels
 or values to help us. 
 This approach has both advantages and disadvantages. 
 Clustering requires no additional annotation or input on the data. 
-For example, it would be nearly impossible to annotate 
-all the articles on Wikipedia with human-made topic labels. 
-However, we can still cluster the articles without this information 
+For example, while it would be nearly impossible to annotate 
+all the articles on Wikipedia with human-made topic labels, 
+we can cluster the articles without this information 
 to find groupings corresponding to topics automatically. 
-
-Given that there is no response variable, it is not as easy to evaluate
+However, given that there is no response variable, it is not as easy to evaluate
 the "quality" of a clustering.  With classification, we can use a test data set
 to assess prediction performance. In clustering, there is not a single good
 choice for evaluation. In this book, we will use visualization to ascertain the
@@ -248,9 +247,9 @@ It starts with an initial clustering of the data, and then iteratively
 improves it by making adjustments to the assignment of data
 to clusters until it cannot improve any further. But how do we measure
 the "quality" of a clustering, and what does it mean to improve it? 
-In K-means clustering, we measure the quality of a cluster by its
-\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD}
-*within-cluster sum-of-squared-distances* (WSSD). Computing this involves two steps.
+In K-means clustering, we measure the quality of a cluster 
+by its\index{within-cluster sum-of-squared-distances|see{WSSD}}\index{WSSD} *within-cluster sum-of-squared-distances* (WSSD). 
+Computing this involves two steps.
 First, we find the cluster centers by computing the mean of each variable 
 over data points in the cluster. For example, suppose we have a 
 cluster containing four observations, and we are using two variables, $x$ and $y$, to cluster the data.
@@ -839,7 +838,7 @@ p1
 ```
 
 If we set K less than 3, then the clustering merges separate groups of data; this causes a large 
-total WSSD, since the cluster center (denoted by an "x") is not close to any of the data in the cluster. On 
+total WSSD, since the cluster center is not close to any of the data in the cluster. On 
 the other hand, if we set K greater than 3, the clustering subdivides subgroups of data; this does indeed still 
 decrease the total WSSD, but by only a *diminishing amount*. If we plot the total WSSD versus the number of 
 clusters, we see that the decrease in total WSSD levels off (or forms an "elbow shape") \index{elbow method} when we reach roughly 
@@ -890,7 +889,7 @@ not_standardized_data
 ```
 
 And then we apply the `scale` function to every column in the data frame 
-using `mutate` + `across`.
+using `mutate` and `across`.
 
 ```{r 10-mapdf-scale-data}
 standardized_data <- not_standardized_data |>
@@ -903,8 +902,8 @@ standardized_data
 
 To perform K-means clustering in R, we use the `kmeans` function. \index{K-means!kmeans function} It takes at
 least two arguments: the data frame containing the data you wish to cluster,
-and K, the number of clusters (here we choose K = 3). Note that since the K-means
-algorithm uses a random initialization of assignments, but since we set the random seed
+and K, the number of clusters (here we choose K = 3). Note that the K-means
+algorithm uses a random initialization of assignments; but since we set the random seed
 earlier, the clustering will be reproducible.
 
 ```{r 10-kmeans-seed, echo = FALSE, warning = FALSE, message = FALSE}
@@ -1000,8 +999,8 @@ penguin_clust_ks
 If we wanted to get one of the clusterings out 
 of the list column in the data frame,
 we could use a familiar friend: `pull`.
-`pull` will return to us a data frame column as a simpler data structure,
-here that would be a list.
+`pull` will return to us a data frame column as a simpler data structure;
+here, that would be a list.
 And then to extract the first item of the list, 
 we can use the `pluck` function. We pass  
 it the index for the element we would like to extract 
@@ -1074,7 +1073,7 @@ The more times we perform K-means clustering,
 the more likely we are to find a good clustering (if one exists).
 What value should you choose for `nstart`? The answer is that it depends
 on many factors: the size and characteristics of your data set,
-as well as the speed and size of your computer.
+as well as how powerful your computer is.
 The larger the `nstart` value the better from an analysis perspective, 
 but there is a trade-off that doing many clusterings 
 could take a long time.
diff --git a/inference.Rmd b/inference.Rmd
@@ -39,7 +39,7 @@ By the end of the chapter, readers will be able to do the following:
 
 * Describe real-world examples of questions that can be answered with statistical inference.
 * Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample.
-* Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution).
+* Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution.
 * Explain the difference between a population parameter and a sample point estimate.
 * Use R to draw random samples from a finite population.
 * Use R to create a sampling distribution from a finite population.
@@ -90,14 +90,14 @@ knitr::include_graphics("img/population_vs_sample.png")
 Note that proportions are not the *only* kind of population parameter we might
 be interested in. For example, suppose an undergraduate student studying at the University
 of British Columbia in Canada is looking for an apartment
-to rent. They need to create a budget, so they want to know something about
-studio apartment rental prices in Vancouver, BC. This student might 
-formulate the following question:
+to rent. They need to create a budget, so they want to know about
+studio apartment rental prices in Vancouver. This student might 
+formulate the question:
 
-*What is the average price-per-month of studio apartment rentals in Vancouver, Canada?*
+*What is the average price per month of studio apartment rentals in Vancouver?*
 
 In this case, the population consists of all studio apartment rentals in Vancouver, and the
-population parameter is the *average price-per-month*. Here we used the average
+population parameter is the *average price per month*. Here we used the average
 as a measure of the center to describe the "typical value" of studio apartment
 rental prices. But even within this one example, we could also be interested in
 many other population parameters. For instance, we know that not every studio
@@ -1148,9 +1148,9 @@ boot_est_dist +
 
 To finish our estimation of the population parameter, we would report the point
 estimate and our confidence interval's lower and upper bounds. Here the sample
-mean price-per-night of 40 Airbnb listings was 
+mean price per night of 40 Airbnb listings was 
 \$`r round(mean(one_sample$price),2)`, and we are 95\% "confident" that the true
-population mean price-per-night for all Airbnb listings in Vancouver is between
+population mean price per night for all Airbnb listings in Vancouver is between
 \$(`r round(bounds[1],2)`, `r round(bounds[2],2)`).
 Notice that our interval does indeed contain the true
 population mean value, \$`r round(mean(airbnb$price),2)`\! However, in
diff --git a/jupyter.Rmd b/jupyter.Rmd
@@ -104,10 +104,9 @@ To run a code cell independently, the cell needs to first be activated. This
 is done by clicking on it with the cursor. Jupyter will indicate a cell has been
 activated by highlighting it with a blue rectangle to its left. After the cell
 has been activated (Figure \@ref(fig:activate-and-run-button)), the cell can be run by either pressing the **Run** (`r fa("play", height = "11px")`) 
-button in the toolbar, or by using a keyboard shortcut of 
-`Shift + Enter`.
+button in the toolbar, or by using the keyboard shortcut `Shift + Enter`.
 
-```{r activate-and-run-button, echo = FALSE, fig.cap = "An activated cell that is ready to be run. The red arrow points to the blue rectangle to the cell's left. The blue rectangle indicates that it is ready to be run. This can be done by clicking the run button (circled in red).", fig.retina = 2, out.width="100%"}
+```{r activate-and-run-button, echo = FALSE, fig.cap = "An activated cell that is ready to be run. The blue rectangle to the cell's left (annotated by a red arrow) indicates that it is ready to be run. The cell can be run by clicking the run button (circled in red).", fig.retina = 2, out.width="100%"}
 image_read("img/activate-and-run-button-annotated.png") |> 
   image_crop("3632x900")
 ```
@@ -135,7 +134,7 @@ image_read("img/restart-kernel-run-all.png") |>
 ```
 
 ### The Kernel
-The kernel \index{kernel}\index{Jupyter notebook!kernel|see{kernel}} is a program that executes the code inside your notebook and 
+The kernel\index{kernel}\index{Jupyter notebook!kernel|see{kernel}} is a program that executes the code inside your notebook and 
 outputs the results. Kernels for many different programming languages have 
 been created for Jupyter, which means that Jupyter can interpret and execute 
 the code of many different programming languages. To run R code, your notebook 
@@ -166,7 +165,7 @@ image_read("img/create-new-code-cell.png") |>
 
 ## Markdown cells
 
-Text cells inside a Jupyter notebook are \index{markdown} \index{Jupyter notebook!markdown cell} called Markdown cells. Markdown cells 
+Text cells inside a Jupyter notebook are\index{markdown}\index{Jupyter notebook!markdown cell} called Markdown cells. Markdown cells 
 are rich formatted text cells, which means you can **bold** and *italicize* 
 text, create subject headers, create bullet and numbered lists, and more. These cells are 
 given the name "Markdown" because they use *Markdown language* to specify the rich text formatting.
diff --git a/version-control.Rmd b/version-control.Rmd
@@ -225,7 +225,7 @@ made. In Figure \@ref(fig:vc-ba3-commit), the message is `Message about changes.
 your work you should make sure to replace this with an
 informative message about what changed. It is also important to note here that
 these changes are only being committed to the local repository's history.  The
-remote repository on GitHub has not changed, and collaborators would not yet be
+remote repository on GitHub has not changed, and collaborators are not yet 
 able to see your new changes.
 
 ```{r vc-ba3-commit, fig.cap = "Committing the modified files in the staging area to the local repository history, with an informative message about what changed.", fig.retina = 2, out.width="100%"}
@@ -903,7 +903,7 @@ image_read("img/version_control/issue_01.png") |>
 
 Next click the "New issue" button (Figure \@ref(fig:issue-02)).
 
-(ref:issue-02) The "New issues" button on the GitHub web interface.
+(ref:issue-02) The "New issue" button on the GitHub web interface.
 
 ```{r issue-02, fig.pos = "H", out.extra="", fig.cap = '(ref:issue-02)', fig.retina = 2, out.width="100%"}
 image_read("img/version_control/issue_02.png") |>
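
Note on the `clustering.Rmd` hunks above: the reworded passages describe computing a cluster's WSSD in two steps and running `kmeans` with a seed and `nstart`. A minimal R sketch of both follows. It is not part of this commit; the data values and the names `cluster`, `toy_data`, and `clustering` are made up for illustration.

```r
# Hypothetical cluster of four observations measured on two variables.
cluster <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))

# Step 1: the cluster center is the mean of each variable.
center <- colMeans(cluster)

# Step 2: add up the squared straight-line distances to the center.
wssd <- sum((cluster$x - center[["x"]])^2 + (cluster$y - center[["y"]])^2)
wssd

# Hypothetical data set with three loose groups of three points each.
toy_data <- data.frame(
  x = c(1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1, 8.7),
  y = c(1.0, 0.7, 1.1, 5.0, 4.8, 5.4, 1.0, 1.3, 0.9)
)

# Setting the seed makes the random initialization reproducible;
# nstart = 10 reruns K-means from 10 random starts and keeps the best result.
set.seed(1)
clustering <- kmeans(toy_data, centers = 3, nstart = 10)

# Total WSSD is the sum of the per-cluster WSSD values.
clustering$tot.withinss
```

For the four-point cluster, the center is (2.5, 5) and the WSSD is 25. Larger `nstart` values trade longer run time for a better chance of finding a low total WSSD.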