- $n$ is the number of observations,
- $y_i$ is the observed value for the $i^\text{th}$ observation, and
- $\hat{y}_i$ is the forecasted/predicted value for the $i^\text{th}$ observation.

In other words, we compute the *squared* difference between the predicted and true response
value for each observation in our test (or validation) set, compute the average, and then finally
take the square root.
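To make that calculation concrete, here is a small sketch in R that computes RMSPE directly from its definition; the prices are made up purely for illustration:

```r
# made-up true and predicted house prices (USD) for five observations
y     <- c(250000, 300000, 450000, 500000, 620000)
y_hat <- c(260000, 310000, 430000, 510000, 600000)

# squared differences, then the average, then the square root
rmspe <- sqrt(mean((y - y_hat)^2))
rmspe
#> 14832.4
```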
If the predictions are very close to the true values, then
RMSPE will be small. If, on the other hand, the predictions are very
different from the true values, then RMSPE will be quite large. When we
use cross-validation, we will choose the $K$ that gives
us the smallest RMSPE.

```{r 07-verticalerrors, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Scatter plot of price (USD) versus house size (square feet) with example predictions (blue line) and the error in those predictions compared with true response values for three selected observations (vertical red lines).", fig.height = 3.5, fig.width = 4.5}
```
Now that we know how to assess how well our model predicts a numerical
value, let's use R to perform cross-validation and to choose the optimal $K$.
First, we will create a recipe for preprocessing our data.
Note that we include standardization
in our preprocessing to build good habits, but since we only have one
predictor, it is technically not necessary; there is no risk of comparing two predictors
of different scales.
Next we create a model specification for K-nearest neighbors regression. Note
that we use `set_mode("regression")`
rather than `set_mode("classification")` as in the classification
problems from the previous chapters.
The use of `set_mode("regression")` essentially
tells `tidymodels` that we need to use different metrics (RMSPE, not accuracy)
for tuning and evaluation.
Then we create a 5-fold cross-validation object, and put the recipe and model specification together
in a workflow.
\index{tidymodels}\index{recipe}\index{workflow}
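Assembled in code, these steps might look like the following sketch. The workflow name `sacr_wkflw` appears later in this section; the data frame and column names (`sacramento_train`, `price`, `sqft`) and the other object names are assumptions for illustration:

```r
library(tidymodels)

# recipe: standardize the single predictor (centering and scaling)
sacr_recipe <- recipe(price ~ sqft, data = sacramento_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# model specification: KNN regression, with the number of neighbors
# left free for tuning via cross-validation
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# 5-fold cross-validation object
sacr_vfold <- vfold_cv(sacramento_train, v = 5, strata = price)

# put the recipe and model specification together in a workflow
sacr_wkflw <- workflow() |>
  add_recipe(sacr_recipe) |>
  add_model(sacr_spec)
```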
```{r, echo = FALSE}
print_tidymodels(sacr_wkflw)
```

Next we run cross-validation for a grid of numbers of neighbors ranging from 1 to 200.
The following code tunes
the model and returns the RMSPE for each number of neighbors. In the output of the `sacr_results`
results data frame, we see that the `neighbors` variable contains the value of $K$,
What about the plots in Figure \@ref(fig:07-howK) where $K$ is quite large,
say, $K$ = 250 or 932?
In this case the blue line becomes extremely smooth, and actually becomes flat
once $K$ is equal to the number of data points in the entire data set.
This happens because our predicted values for a given x value (here, home
size) depend on many neighboring observations; in the case where $K$ is equal
to the size of the dataset, the prediction is just the mean of the house prices
in the dataset (completely ignoring the house size).
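As a quick numerical check of that last claim: when $K$ equals the number of observations, the neighbor set for any query point is the entire data set, so every prediction is simply the mean of the response column (made-up prices below):

```r
prices <- c(250000, 300000, 450000, 500000, 620000)
# with K = 5 neighbors out of 5 observations, the "nearest neighbors"
# are always all of the houses, regardless of the house size we query
mean(prices)
#> 424000
```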
in the chapter on evaluating and tuning classification models),
then we must compare the RMSPE estimated using only the training data via cross-validation.
Looking back, the estimated cross-validation RMSPE for the single-predictor
model was `r format(round(sacr_min$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
The estimated cross-validation RMSPE for the multivariable model is