Commit f8fa611

Merge pull request #396 from UBC-DSCI/regression2-edits
Copyediting for regression 2
2 parents b71e67a + 1b29ede

File tree

1 file changed (+12, -12 lines)


regression2.Rmd

Lines changed: 12 additions & 12 deletions
@@ -38,7 +38,7 @@ on the case where there is a single predictor and single response variable of in
 predictor.
 
 ## Chapter learning objectives
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:
 
 * Use R and `tidymodels` to fit a linear regression model on training data.
 * Evaluate the linear regression model on test data.
@@ -200,7 +200,7 @@ very similar manner to how we performed KNN regression.
 To do this, instead of creating a `nearest_neighbor` model specification with
 the `kknn` engine, we use a `linear_reg` model specification
 with the `lm` engine. Another difference is that we do not need to choose $K$ in the
-context of linear regression, and so we do not need to perform cross validation.
+context of linear regression, and so we do not need to perform cross-validation.
 Below we illustrate how we can use the usual `tidymodels` workflow to predict house sale
 price given house size using a simple linear regression approach using the full
 Sacramento real estate data set.
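
To make the workflow this hunk describes concrete, here is a minimal sketch (not taken from the chapter itself) of fitting a simple linear regression with `tidymodels`; it assumes the Sacramento data has already been split into a training set named `sacramento_train` with columns `price` and `sqft`, names that do not appear in this diff:

```r
library(tidymodels)

# model specification: linear regression with the lm engine
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# recipe: predict sale price from house size
lm_recipe <- recipe(price ~ sqft, data = sacramento_train)

# fit the full workflow on the training data; there is no K to tune,
# so no cross-validation is required
lm_fit <- workflow() |>
  add_recipe(lm_recipe) |>
  add_model(lm_spec) |>
  fit(data = sacramento_train)

lm_fit
```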
@@ -395,22 +395,22 @@ variable we predict given a unit increase in the predictor
 variable. KNN regression, as simple as it is to implement and understand, has no such
 interpretability from its wiggly line.
 
-There can however also be a disadvantage to using a simple linear regression
+There can, however, also be a disadvantage to using a simple linear regression
 model in some cases, particularly when the relationship between the target and
 the predictor is not linear, but instead some other shape (e.g., curved or oscillating). In
 these cases the prediction model from a simple linear regression
 will underfit \index{underfitting!regression} (have high bias), meaning that model/predicted values do not
 match the actual observed values very well. Such a model would probably have a
 quite high RMSE when assessing model goodness of fit on the training data and
-a quite high RMPSE when assessing model prediction quality on a test data
+a quite high RMSPE when assessing model prediction quality on a test data
 set. On such a data set, KNN regression may fare better. Additionally, there
 are other types of regression you can learn about in future books that may do
 even better at predicting with such data.
 
 How do these two models compare on the Sacramento house prices data set? In
-Figure \@ref(fig:08-compareRegression), we also printed the RMPSE as calculated from
-predicting on the test data set that was not used to train/fit the models. The RMPSE for the simple linear
-regression model is slightly lower than the RMPSE for the KNN regression model.
+Figure \@ref(fig:08-compareRegression), we also printed the RMSPE as calculated from
+predicting on the test data set that was not used to train/fit the models. The RMSPE for the simple linear
+regression model is slightly lower than the RMSPE for the KNN regression model.
 Considering that the simple linear regression model is also more interpretable,
 if we were comparing these in practice we would likely choose to use the simple
 linear regression model.
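
As a hedged sketch of how the RMSPE mentioned in this hunk can be computed with `tidymodels`, assuming a held-out test set named `sacramento_test` and the fitted workflow `lm_fit` from the sketch above (neither name appears in the diff):

```r
library(tidymodels)

# predict on the held-out test set and compute error metrics;
# the rmse value computed on test data is what the chapter calls RMSPE
lm_test_results <- lm_fit |>
  predict(sacramento_test) |>
  bind_cols(sacramento_test) |>
  metrics(truth = price, estimate = .pred)

lm_test_results |>
  filter(.metric == "rmse")
```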
@@ -531,7 +531,7 @@ if(!is_latex_output()){
 We see that the predictions from linear regression with two predictors form a
 flat plane. This is the hallmark of linear regression, and differs from the
 wiggly, flexible surface we get from other methods such as KNN regression.
-As discussed this can be advantageous in one aspect, which is that for each
+As discussed, this can be advantageous in one aspect, which is that for each
 predictor, we can get slopes/intercept from linear regression, and thus describe the
 plane mathematically. We can extract those slope values from our model object
 as shown below:
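
The code that follows "as shown below" sits outside this hunk; a plausible sketch (not necessarily the chapter's own code) of extracting the coefficients from a fitted `tidymodels` workflow, assuming the two-predictor fit is named `mlm_fit`:

```r
library(tidymodels)

# pull the underlying lm fit out of the workflow and tidy it into a
# data frame of intercept and slope estimates
mlm_fit |>
  extract_fit_parsnip() |>
  tidy()
```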
@@ -620,7 +620,7 @@ the scope of this book.
 But to illustrate what can happen when you have outliers, Figure \@ref(fig:08-lm-outlier)
 shows a small subset of the Sacramento housing data again, except we have added a *single* data point (highlighted
 in red). This house is 5,000 square feet in size, and sold for only \$50,000. Unbeknownst to the
-data analyst, this house was sold by a parent to their child for an absurdly low price. Of course
+data analyst, this house was sold by a parent to their child for an absurdly low price. Of course,
 this is not representative of the real housing market values that the other data points follow;
 the data point is an *outlier*. In blue we plot the original line of best fit, and in red
 we plot the new line of best fit including the outlier. You can see how different the red line
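
A small sketch (not the chapter's code) of the effect this hunk describes, assuming the plotted subset is stored in a data frame named `sacramento_small` with columns `sqft` and `price`:

```r
library(tidyverse)

# add the single outlier described in the text: a 5,000 square foot house
# that sold for only $50,000
sacramento_outlier <- sacramento_small |>
  add_row(sqft = 5000, price = 50000)

# compare the line of best fit without and with the outlier
coef(lm(price ~ sqft, data = sacramento_small))
coef(lm(price ~ sqft, data = sacramento_outlier))
```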
@@ -777,7 +777,7 @@ sqft33 <- format(round(coeffs |>
 
 If we again fit the multivariable linear regression model on this data, then the plane of best fit
 has regression coefficients that are very sensitive to the exact values in the data. For example,
-if we change the data ever so slightly&mdash;e.g., by running cross validation, which splits
+if we change the data ever so slightly&mdash;e.g., by running cross-validation, which splits
 up the data randomly into different chunks&mdash;the coefficients vary by large amounts:
 
 Best Fit 1: $\text{house sale price} = `r icept1` + `r sqft1`\cdot (\text{house size 1 (ft$^2$)}) + `r sqft11` \cdot (\text{house size 2 (ft$^2$)}).$
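
A hedged sketch of the coefficient instability this hunk describes, assuming a hypothetical data frame `sacramento_2pred` containing the sale price and two nearly identical size predictors `sqft` and `sqft2` (names invented here for illustration):

```r
library(tidyverse)

set.seed(2024)

# refit the two-predictor model on two random subsets of the data;
# with nearly collinear predictors, the individual coefficients can swing
# wildly between fits even though the fitted plane barely changes
fit_a <- lm(price ~ sqft + sqft2, data = slice_sample(sacramento_2pred, prop = 0.8))
fit_b <- lm(price ~ sqft + sqft2, data = slice_sample(sacramento_2pred, prop = 0.8))

coef(fit_a)
coef(fit_b)
```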
@@ -900,5 +900,5 @@ found in Chapter \@ref(move-to-your-own-machine).
 
 ## Additional resources
 - The [`tidymodels` website](https://tidymodels.org/packages) is an excellent reference for more details on, and advanced usage of, the functions and packages in the past two chapters. Aside from that, it also has a [nice beginner's tutorial](https://www.tidymodels.org/start/) and [an extensive list of more advanced examples](https://www.tidymodels.org/learn/) that you can use to continue learning beyond the scope of this book.
-- [Modern Dive](https://moderndive.com/) is another textbook that uses the `tidyverse` / `tidymodels` framework. Chapter 6 complements the material in the current chapter well; it covers some slightly more advanced concepts than we do without getting mathematical. Give this chapter a read before moving on to the next reference. It is also worth noting that this book takes a more "explanatory" / "inferential" approach to regression in general (in Chapters 5, 6, and 10), which provides a nice complement to the predictive tack we take in the present book.
-- [An Introduction to Statistical Learning](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about regression. Chapter 3 covers linear regression at a slightly more mathematical level than we do here, but it is not too large a leap and so should provide a good stepping stone. Chapter 6 discusses how to pick a subset of "informative" predictors when you have a data set with many predictors, and you expect only a few of them to be relevant. Chapter 7 covers regression models that are more flexible than linear regression models but still enjoy the computational efficiency of linear regression. In contrast, the KNN methods we covered earlier are indeed more flexible but become very slow when given lots of data.
+- [*Modern Dive*](https://moderndive.com/) is another textbook that uses the `tidyverse` / `tidymodels` framework. Chapter 6 complements the material in the current chapter well; it covers some slightly more advanced concepts than we do without getting mathematical. Give this chapter a read before moving on to the next reference. It is also worth noting that this book takes a more "explanatory" / "inferential" approach to regression in general (in Chapters 5, 6, and 10), which provides a nice complement to the predictive tack we take in the present book.
+- [*An Introduction to Statistical Learning*](https://www.statlearning.com/) [-@james2013introduction] provides a great next stop in the process of learning about regression. Chapter 3 covers linear regression at a slightly more mathematical level than we do here, but it is not too large a leap and so should provide a good stepping stone. Chapter 6 discusses how to pick a subset of "informative" predictors when you have a data set with many predictors, and you expect only a few of them to be relevant. Chapter 7 covers regression models that are more flexible than linear regression models but still enjoy the computational efficiency of linear regression. In contrast, the KNN methods we covered earlier are indeed more flexible but become very slow when given lots of data.
