Commit 5f284de

copy edits reg1
1 parent c22ffd2 commit 5f284de

File tree: 1 file changed (+23 −23 lines changed)

regression1.Rmd

Lines changed: 23 additions & 23 deletions
@@ -49,12 +49,12 @@ can also be used to answer inferential and causal questions,
 however that is beyond the scope of this book.
 
 ## Chapter learning objectives
-By the end of the chapter, readers will be able to:
+By the end of the chapter, readers will be able to do the following:
 
 * Recognize situations where a simple regression analysis would be appropriate for making predictions.
 * Explain the K-nearest neighbor (KNN) regression algorithm and describe how it differs from KNN classification.
 * Interpret the output of a KNN regression.
-* In a dataset with two or more variables, perform K-nearest neighbor regression in R using a `tidymodels` workflow
+* In a dataset with two or more variables, perform K-nearest neighbor regression in R using a `tidymodels` workflow.
 * Execute cross-validation in R to choose the number of neighbors.
 * Evaluate KNN regression prediction accuracy in R using a test data set and the root mean squared prediction error (RMSPE).
 * In the context of KNN regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
@@ -94,9 +94,9 @@ is that we are now predicting numerical variables instead of categorical variabl
 > variable is numerical or categorical---and therefore whether you
 > need to perform regression or classification---by taking two response variables X and Y from your
 > data, and asking the question, "is response variable X *more* than response variable Y?"
-> If the variable is categorical, the question will make no sense ("is blue more than red?",
-> or "is benign more than malignant?"). If the variable is numerical, it will make sense
-> ("is 1.5 hours more than 2.25 hours?", or "is \$500,000 more than \$400,000?").
+> If the variable is categorical, the question will make no sense (Is blue more than red?
+> Is benign more than malignant?). If the variable is numerical, it will make sense
+> (Is 1.5 hours more than 2.25 hours? Is \$500,000 more than \$400,000?).
 > Be careful when applying this heuristic, though: sometimes categorical variables will be encoded as
 > numbers in your data (e.g., "1" represents "benign", and "0" represents "malignant"). In these cases
 > you have to ask the question about the *meaning* of the labels ("benign" and "malignant"), not their values ("1" and "0").
@@ -105,10 +105,10 @@ is that we are now predicting numerical variables instead of categorical variabl
 
 In this chapter and the next, we will study the Sacramento \index{Sacramento real estate} real estate data
 set. This data set contains 932 real estate transactions in Sacramento,
-California [originally reported in the Sacramento Bee newspaper](https://support.spatialkey.com/spatialkey-sample-csv-data/).
+California, [originally reported in the *Sacramento Bee* newspaper](https://support.spatialkey.com/spatialkey-sample-csv-data/).
 We first need to formulate a precise question that
 we want to answer. In this example, our question is again predictive:
-\index{question!regression} can we use the size of a house in the Sacramento, CA area to predict
+\index{question!regression} Can we use the size of a house in the Sacramento, CA area to predict
 its sale price? A rigorous, quantitative answer to this question might help
 a realtor advise a client as to whether the price of a particular listing
 is fair, or perhaps how to set the price of a new listing.
@@ -304,9 +304,9 @@ $$\text{RMSPE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
 
 where:
 
-- $n$ is the number of observations
-- $y_i$ is the observed value for the $i^\text{th}$ observation
-- $\hat{y}_i$ is the forecasted/predicted value for the $i^\text{th}$ observation
+- $n$ is the number of observations,
+- $y_i$ is the observed value for the $i^\text{th}$ observation, and
+- $\hat{y}_i$ is the forecasted/predicted value for the $i^\text{th}$ observation.
 
 In other words, we compute the *squared* difference between the predicted and true response
 value for each observation in our test (or validation) set, compute the average, and then finally
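
The RMSPE formula in this hunk is straightforward to compute directly. A minimal sketch in R; the vectors `y` and `y_hat` below are hypothetical example values, not objects from the chapter:

```r
# Minimal RMSPE computation; `y` and `y_hat` are hypothetical example vectors.
y     <- c(350000, 420000, 510000)  # observed sale prices (USD)
y_hat <- c(365000, 400000, 530000)  # predicted sale prices (USD)

# square the differences, average them, then take the square root
rmspe <- sqrt(mean((y - y_hat)^2))
rmspe
```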
@@ -320,7 +320,7 @@ mistakes.
 If the predictions are very close to the true values, then
 RMSPE will be small. If, on the other hand, the predictions are very
 different from the true values, then RMSPE will be quite large. When we
-use cross validation, we will choose the $K$ that gives
+use cross-validation, we will choose the $K$ that gives
 us the smallest RMSPE.
 
 ```{r 07-verticalerrors, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Scatter plot of price (USD) versus house size (square feet) with example predictions (blue line) and the error in those predictions compared with true response values for three selected observations (vertical red lines).", fig.height = 3.5, fig.width = 4.5}
@@ -396,7 +396,7 @@ value, let's use R to perform cross-validation and to choose the optimal $K$.
 First, we will create a recipe for preprocessing our data.
 Note that we include standardization
 in our preprocessing to build good habits, but since we only have one
-predictor it is technically not necessary; there is no risk of comparing two predictors
+predictor, it is technically not necessary; there is no risk of comparing two predictors
 of different scales.
 Next we create a model specification for K-nearest neighbors regression. Note
 that we use `set_mode("regression")`
@@ -405,7 +405,7 @@ problems from the previous chapters.
 The use of `set_mode("regression")` essentially
 tells `tidymodels` that we need to use different metrics (RMSPE, not accuracy)
 for tuning and evaluation.
-Then we create a 5-fold cross validation object, and put the recipe and model specification together
+Then we create a 5-fold cross-validation object, and put the recipe and model specification together
 in a workflow.
 \index{tidymodels}\index{recipe}\index{workflow}
 
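The two hunks above describe a recipe, a model specification, a 5-fold cross-validation object, and a workflow, but the diff does not show that code. A hedged sketch of what it plausibly looks like; apart from `sacr_wkflw`, which appears later in this diff, the object and column names (`sacr_recipe`, `sacr_spec`, `sacr_vfold`, `sacramento_train`, `sqft`, `price`) are assumptions:

```r
# A sketch only: names other than `sacr_wkflw` are assumptions.
library(tidymodels)

# recipe: standardize the lone predictor (good habit; not strictly needed here)
sacr_recipe <- recipe(price ~ sqft, data = sacramento_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# model specification: KNN regression with the number of neighbors left to tune
sacr_spec <- nearest_neighbor(weight_func = "rectangular",
                              neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# 5-fold cross-validation object
sacr_vfold <- vfold_cv(sacramento_train, v = 5, strata = price)

# put the recipe and model specification together in a workflow
sacr_wkflw <- workflow() |>
  add_recipe(sacr_recipe) |>
  add_model(sacr_spec)
```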
@@ -432,7 +432,7 @@ sacr_wkflw
 print_tidymodels(sacr_wkflw)
 ```
 
-Next we run cross validation for a grid of numbers of neighbors ranging from 1 to 200.
+Next we run cross-validation for a grid of numbers of neighbors ranging from 1 to 200.
 The following code tunes
 the model and returns the RMSPE for each number of neighbors. In the output of the `sacr_results`
 results data frame, we see that the `neighbors` variable contains the value of $K$,
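
The tuning step this hunk refers to is likewise not shown in the diff. A hedged sketch consistent with the surrounding text; `gridvals` is an assumed name, while `sacr_results` and `sacr_min` do appear elsewhere in this diff:

```r
# Tune over K = 1..200 and collect the cross-validated metric for each K.
# `gridvals` is an assumed name; `sacr_results` and `sacr_min` appear in the diff.
gridvals <- tibble(neighbors = seq(from = 1, to = 200))

sacr_results <- sacr_wkflw |>
  tune_grid(resamples = sacr_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "rmse")  # held-out RMSE here plays the role of RMSPE

# choose the K that gives the smallest cross-validated RMSPE
sacr_min <- sacr_results |>
  filter(mean == min(mean))
```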
@@ -562,7 +562,7 @@ What about the plots in Figure \@ref(fig:07-howK) where $K$ is quite large,
 say, $K$ = 250 or 932?
 In this case the blue line becomes extremely smooth, and actually becomes flat
 once $K$ is equal to the number of datapoints in the entire data set.
-This happens because our predicted values for a given x value (here home
+This happens because our predicted values for a given x value (here, home
 size) depend on many neighboring observations; in the case where $K$ is equal
 to the size of the dataset, the prediction is just the mean of the house prices
 in the dataset (completely ignoring the house size).
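
To see why the line goes flat, note that when $K$ equals the training-set size, the $K$ nearest neighbors of any house size are the entire training set. A tiny illustrative sketch, with `sacramento_train` as an assumed training data frame:

```r
# With K = nrow(sacramento_train), every prediction averages all responses,
# so the fitted line collapses to a single constant:
mean(sacramento_train$price)  # the height of the flat blue line
```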
@@ -747,7 +747,7 @@ in the chapter on evaluating and tuning classification models),
 then we must compare the accuracy estimated using only the training data via cross-validation.
 Looking back, the estimated cross-validation accuracy for the single-predictor
 model was `r format(round(sacr_min$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
-The estimated cross validation accuracy for the multivariable model is
+The estimated cross-validation accuracy for the multivariable model is
 `r format(round(sacr_multi$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
 Thus in this case, we did not improve the model
 by a large amount by adding this additional predictor.
@@ -777,7 +777,7 @@ knn_mult_mets <- metrics(knn_mult_preds, truth = price, estimate = .pred) |>
 knn_mult_mets
 ```
 
-This time when we performed KNN regression on the same data set, but also
+This time, when we performed KNN regression on the same data set, but also
 included number of bedrooms as a predictor, we obtained a RMSPE test error
 of `r format(round(knn_mult_mets |> pull(.estimate)), big.mark=",", nsmall=0, scientific=FALSE)`.
 Figure \@ref(fig:07-knn-mult-viz) visualizes the model's predictions overlaid on top of the data. This
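
The hunk header above shows only the start of the metrics pipeline. A hedged sketch of the full test-set evaluation it plausibly belongs to; `knn_mult_fit` and `sacramento_test` are assumed names, while `knn_mult_preds` and `knn_mult_mets` appear in the diff:

```r
# Predict on the held-out test set and compute regression metrics.
# `knn_mult_fit` (a fitted workflow) and `sacramento_test` are assumptions.
knn_mult_preds <- knn_mult_fit |>
  predict(sacramento_test) |>
  bind_cols(sacramento_test)

knn_mult_mets <- metrics(knn_mult_preds, truth = price, estimate = .pred) |>
  filter(.metric == "rmse")  # report the test-set RMSE, i.e., the RMSPE

knn_mult_mets
```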
@@ -846,15 +846,15 @@ regression has both strengths and weaknesses. Some are listed here:
 
 **Strengths:** K-nearest neighbors regression
 
-1. is a simple, intuitive algorithm
-2. requires few assumptions about what the data must look like
-3. works well with non-linear relationships (i.e., if the relationship is not a straight line)
+1. is a simple, intuitive algorithm,
+2. requires few assumptions about what the data must look like, and
+3. works well with non-linear relationships (i.e., if the relationship is not a straight line).
 
 **Weaknesses:** K-nearest neighbors regression
 
-1. becomes very slow as the training data gets larger
-2. may not perform well with a large number of predictors
-3. may not predict well beyond the range of values input in your training data
+1. becomes very slow as the training data gets larger,
+2. may not perform well with a large number of predictors, and
+3. may not predict well beyond the range of values input in your training data.
 
 ## Exercises
 
