- $n$ is the number of observations,
- $y_i$ is the observed value for the $i^\text{th}$ observation, and
- $\hat{y}_i$ is the forecasted/predicted value for the $i^\text{th}$ observation.

In other words, we compute the *squared* difference between the predicted and true response
value for each observation in our test (or validation) set, compute the average, and then finally
take the square root.
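To make that calculation concrete, here is a small sketch in R that computes RMSPE directly from its definition; the prices are made up purely for illustration:

```r
# made-up true and predicted house prices (USD) for five observations
y     <- c(250000, 300000, 450000, 500000, 620000)
y_hat <- c(260000, 310000, 430000, 510000, 600000)

# squared differences, then the average, then the square root
rmspe <- sqrt(mean((y - y_hat)^2))
rmspe
#> 14832.4
```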
If the predictions are very close to the true values, then
RMSPE will be small. If, on the other hand, the predictions are very
different from the true values, then RMSPE will be quite large. When we
use cross-validation, we will choose the $K$ that gives
us the smallest RMSPE.

```{r 07-verticalerrors, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Scatter plot of price (USD) versus house size (square feet) with example predictions (blue line) and the error in those predictions compared with true response values for three selected observations (vertical red lines).", fig.height = 3.5, fig.width = 4.5}
```
Now that we know how to assess how well our model predicts a numerical
value, let's use R to perform cross-validation and to choose the optimal $K$.
First, we will create a recipe for preprocessing our data.
Note that we include standardization
in our preprocessing to build good habits, but since we only have one
predictor, it is technically not necessary; there is no risk of comparing two predictors
of different scales.
Next we create a model specification for K-nearest neighbors regression. Note
that we use `set_mode("regression")`
rather than `set_mode("classification")` as in the classification
problems from the previous chapters.
The use of `set_mode("regression")` essentially
tells `tidymodels` that we need to use different metrics (RMSPE, not accuracy)
for tuning and evaluation.
Then we create a 5-fold cross-validation object, and put the recipe and model specification together
in a workflow.
\index{tidymodels}\index{recipe}\index{workflow}
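Assembled in code, these steps might look like the following sketch. The workflow name `sacr_wkflw` appears later in this section; the data frame and column names (`sacramento_train`, `price`, `sqft`) and the other object names are assumptions for illustration:

```r
library(tidymodels)

# recipe: standardize the single predictor (centering and scaling)
sacr_recipe <- recipe(price ~ sqft, data = sacramento_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# model specification: KNN regression, with the number of neighbors
# left free for tuning via cross-validation
sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# 5-fold cross-validation object
sacr_vfold <- vfold_cv(sacramento_train, v = 5, strata = price)

# put the recipe and model specification together in a workflow
sacr_wkflw <- workflow() |>
  add_recipe(sacr_recipe) |>
  add_model(sacr_spec)
```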
```{r, echo = FALSE}
print_tidymodels(sacr_wkflw)
```

Next we run cross-validation for a grid of numbers of neighbors ranging from 1 to 200.
The following code tunes
the model and returns the RMSPE for each number of neighbors. In the output of the `sacr_results`
results data frame, we see that the `neighbors` variable contains the value of $K$,
What about the plots in Figure \@ref(fig:07-howK) where $K$ is quite large,
say, $K$ = 250 or 932?
In this case the blue line becomes extremely smooth, and actually becomes flat
once $K$ is equal to the number of data points in the entire data set.
This happens because our predicted values for a given x value (here, home
size) depend on many neighboring observations; in the case where $K$ is equal
to the size of the dataset, the prediction is just the mean of the house prices
in the dataset (completely ignoring the house size).
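As a quick numerical check of that last claim: when $K$ equals the number of observations, the neighbor set for any query point is the entire data set, so every prediction is simply the mean of the response column (made-up prices below):

```r
prices <- c(250000, 300000, 450000, 500000, 620000)
# with K = 5 neighbors out of 5 observations, the "nearest neighbors"
# are always all of the houses, regardless of the house size we query
mean(prices)
#> 424000
```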
in the chapter on evaluating and tuning classification models),
then we must compare the RMSPE estimated using only the training data via cross-validation.
Looking back, the estimated cross-validation RMSPE for the single-predictor
model was `r format(round(sacr_min$mean), big.mark=",", nsmall=0, scientific = FALSE)`.
The estimated cross-validation RMSPE for the multivariable model is