Commit f8fa611

Merge pull request #396 from UBC-DSCI/regression2-edits
Copyediting for regression 2
2 parents b71e67a + 1b29ede

File tree

1 file changed (+12, -12 lines)


regression2.Rmd

Lines changed: 12 additions & 12 deletions
@@ -38,7 +38,7 @@ on the case where there is a single predictor and single response variable of in
 predictor.
 
 ## Chapter learning objectives
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:
 
 * Use R and `tidymodels` to fit a linear regression model on training data.
 * Evaluate the linear regression model on test data.
@@ -200,7 +200,7 @@ very similar manner to how we performed KNN regression.
 To do this, instead of creating a `nearest_neighbor` model specification with
 the `kknn` engine, we use a `linear_reg` model specification
 with the `lm` engine. Another difference is that we do not need to choose $K$ in the
-context of linear regression, and so we do not need to perform cross validation.
+context of linear regression, and so we do not need to perform cross-validation.
 Below we illustrate how we can use the usual `tidymodels` workflow to predict house sale
 price given house size using a simple linear regression approach using the full
 Sacramento real estate data set.
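
To make the workflow this hunk describes concrete, here is a minimal sketch (not taken from the chapter itself) of fitting a simple linear regression with `tidymodels`; it assumes the Sacramento data has already been split into a training set named `sacramento_train` with columns `price` and `sqft`, names that do not appear in this diff:

```r
library(tidymodels)

# model specification: linear regression with the lm engine
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# recipe: predict sale price from house size
lm_recipe <- recipe(price ~ sqft, data = sacramento_train)

# fit the full workflow on the training data; there is no K to tune,
# so no cross-validation is required
lm_fit <- workflow() |>
  add_recipe(lm_recipe) |>
  add_model(lm_spec) |>
  fit(data = sacramento_train)

lm_fit
```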
@@ -395,22 +395,22 @@ variable we predict given a unit increase in the predictor
 variable. KNN regression, as simple as it is to implement and understand, has no such
 interpretability from its wiggly line.
 
-There can however also be a disadvantage to using a simple linear regression
+There can, however, also be a disadvantage to using a simple linear regression
 model in some cases, particularly when the relationship between the target and
 the predictor is not linear, but instead some other shape (e.g., curved or oscillating). In
 these cases the prediction model from a simple linear regression
 will underfit \index{underfitting!regression} (have high bias), meaning that model/predicted values do not
 match the actual observed values very well. Such a model would probably have a
 quite high RMSE when assessing model goodness of fit on the training data and
-a quite high RMPSE when assessing model prediction quality on a test data
+a quite high RMSPE when assessing model prediction quality on a test data
 set. On such a data set, KNN regression may fare better. Additionally, there
 are other types of regression you can learn about in future books that may do
 even better at predicting with such data.
 
 How do these two models compare on the Sacramento house prices data set? In
-Figure \@ref(fig:08-compareRegression), we also printed the RMPSE as calculated from
-predicting on the test data set that was not used to train/fit the models. The RMPSE for the simple linear
-regression model is slightly lower than the RMPSE for the KNN regression model.
+Figure \@ref(fig:08-compareRegression), we also printed the RMSPE as calculated from
+predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear
+regression model is slightly lower than the RMSPE for the KNN regression model.
 Considering that the simple linear regression model is also more interpretable,
 if we were comparing these in practice we would likely choose to use the simple
 linear regression model.
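
As a hedged sketch of how the RMSPE mentioned in this hunk can be computed with `tidymodels`, assuming a held-out test set named `sacramento_test` and the fitted workflow `lm_fit` from the sketch above (neither name appears in the diff):

```r
library(tidymodels)

# predict on the held-out test set and compute error metrics;
# the rmse value computed on test data is what the chapter calls RMSPE
lm_test_results <- lm_fit |>
  predict(sacramento_test) |>
  bind_cols(sacramento_test) |>
  metrics(truth = price, estimate = .pred)

lm_test_results |>
  filter(.metric == "rmse")
```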
@@ -531,7 +531,7 @@ if(!is_latex_output()){
 We see that the predictions from linear regression with two predictors form a
 flat plane. This is the hallmark of linear regression, and differs from the
 wiggly, flexible surface we get from other methods such as KNN regression.
-As discussed this can be advantageous in one aspect, which is that for each
+As discussed, this can be advantageous in one aspect, which is that for each
 predictor, we can get slopes/intercept from linear regression, and thus describe the
 plane mathematically. We can extract those slope values from our model object
 as shown below:
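
The code that follows "as shown below" sits outside this hunk; a plausible sketch (not necessarily the chapter's own code) of extracting the coefficients from a fitted `tidymodels` workflow, assuming the two-predictor fit is named `mlm_fit`:

```r
library(tidymodels)

# pull the underlying lm fit out of the workflow and tidy it into a
# data frame of intercept and slope estimates
mlm_fit |>
  extract_fit_parsnip() |>
  tidy()
```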
@@ -620,7 +620,7 @@ the scope of this book.
 But to illustrate what can happen when you have outliers, Figure \@ref(fig:08-lm-outlier)
 shows a small subset of the Sacramento housing data again, except we have added a *single* data point (highlighted
 in red). This house is 5,000 square feet in size, and sold for only \$50,000. Unbeknownst to the
-data analyst, this house was sold by a parent to their child for an absurdly low price. Of course
+data analyst, this house was sold by a parent to their child for an absurdly low price. Of course,
 this is not representative of the real housing market values that the other data points follow;
 the data point is an *outlier*. In blue we plot the original line of best fit, and in red
 we plot the new line of best fit including the outlier. You can see how different the red line
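
A small sketch (not the chapter's code) of the effect this hunk describes, assuming the plotted subset is stored in a data frame named `sacramento_small` with columns `sqft` and `price`:

```r
library(tidyverse)

# add the single outlier described in the text: a 5,000 square foot house
# that sold for only $50,000
sacramento_outlier <- sacramento_small |>
  add_row(sqft = 5000, price = 50000)

# compare the line of best fit without and with the outlier
coef(lm(price ~ sqft, data = sacramento_small))
coef(lm(price ~ sqft, data = sacramento_outlier))
```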
@@ -777,7 +777,7 @@ sqft33 <- format(round(coeffs |>
 
 If we again fit the multivariable linear regression model on this data, then the plane of best fit
 has regression coefficients that are very sensitive to the exact values in the data. For example,
-if we change the data ever so slightly&mdash;e.g., by running cross validation, which splits
+if we change the data ever so slightly&mdash;e.g., by running cross-validation, which splits
 up the data randomly into different chunks&mdash;the coefficients vary by large amounts:
 
 Best Fit 1: $\text{house sale price} = `r icept1` + `r sqft1`\cdot (\text{house size 1 (ft$^2$)}) + `r sqft11` \cdot (\text{house size 2 (ft$^2$)}).$
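
A hedged sketch of the coefficient instability this hunk describes, assuming a hypothetical data frame `sacramento_2pred` containing the sale price and two nearly identical size predictors `sqft` and `sqft2` (names invented here for illustration):

```r
library(tidyverse)

set.seed(2024)

# refit the two-predictor model on two random subsets of the data;
# with nearly collinear predictors, the individual coefficients can swing
# wildly between fits even though the fitted plane barely changes
fit_a <- lm(price ~ sqft + sqft2, data = slice_sample(sacramento_2pred, prop = 0.8))
fit_b <- lm(price ~ sqft + sqft2, data = slice_sample(sacramento_2pred, prop = 0.8))

coef(fit_a)
coef(fit_b)
```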
@@ -900,5 +900,5 @@ found in Chapter \@ref(move-to-your-own-machine).
 
 ## Additional resources
 - The [`tidymodels` website](https://tidymodels.org/packages) is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a [nice beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list of more advanced examples](https://www.tidymodels.org/learn/) that you can use to continue learning beyond the scope of this book.
-- [Modern Dive](https://moderndive.com/) is another textbook that uses the `tidyverse` / `tidymodels` framework. Chapter 6 complements the material in the current chapter well; it covers some slightly more advanced concepts than we do without getting mathematical. Give this chapter a read before moving on to the next reference. It is also worth noting that this book takes a more "explanatory" / "inferential" approach to regression in general (in Chapters 5, 6, and 10), which provides a nice complement to the predictive tack we take in the present book.
-- [An Introduction to Statistical Learning](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about regression. Chapter 3 covers linear regression at a slightly more mathematical level than we do here, but it is not too large a leap and so should provide a good stepping stone. Chapter 6 discusses how to pick a subset of "informative" predictors when you have a data set with many predictors, and you expect only a few of them to be relevant. Chapter 7 covers regression models that are more flexible than linear regression models but still enjoy the computational efficiency of linear regression. In contrast, the KNN methods we covered earlier are indeed more flexible but become very slow when given lots of data.
+- [*Modern Dive*](https://moderndive.com/) is another textbook that uses the `tidyverse` / `tidymodels` framework. Chapter 6 complements the material in the current chapter well; it covers some slightly more advanced concepts than we do without getting mathematical. Give this chapter a read before moving on to the next reference. It is also worth noting that this book takes a more "explanatory" / "inferential" approach to regression in general (in Chapters 5, 6, and 10), which provides a nice complement to the predictive tack we take in the present book.
+- [*An Introduction to Statistical Learning*](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about regression. Chapter 3 covers linear regression at a slightly more mathematical level than we do here, but it is not too large a leap and so should provide a good stepping stone. Chapter 6 discusses how to pick a subset of "informative" predictors when you have a data set with many predictors, and you expect only a few of them to be relevant. Chapter 7 covers regression models that are more flexible than linear regression models but still enjoy the computational efficiency of linear regression. In contrast, the KNN methods we covered earlier are indeed more flexible but become very slow when given lots of data.
