Commit 8a5aa3e

Remove geom_smooth and use the same viz approach as for knn instead
1 parent 8373e71 commit 8a5aa3e

2 files changed: 43 additions and 16 deletions

source/regression1.Rmd

Lines changed: 21 additions & 7 deletions
@@ -640,17 +640,31 @@ $`r format(round(sacr_summary |> pull(.estimate)), big.mark=",", nsmall=0, scien
 might represent a substantial fraction of a home buyer's budget, and
 could make or break whether or not they could afford put an offer on a house.
 
-Finally, Figure \@ref(fig:07-predict-all) shows the predictions that our final model makes across
-the range of house sizes we might encounter in the Sacramento area—from 500 to 5000 square feet.
-You have already seen a few plots like this in this chapter, but here we also provide the code that generated it
-as a learning challenge.
+Finally, Figure \@ref(fig:07-predict-all) shows the predictions that our final
+model makes across the range of house sizes we might encounter in the
+Sacramento area.
+Note that instead of predicting the house price only for those house sizes that happen to appear in our data,
+we predict it for evenly spaced values between the minimum and maximum in the data set
+(roughly 500 to 5000 square feet).
+We superimpose this prediction line on a scatter
+plot of the original housing price data,
+so that we can qualitatively assess if the model seems to fit the data well.
+You have already seen a
+few plots like this in this chapter, but here we also provide the code that
+generated it as a learning opportunity.
 
 ```{r 07-predict-all, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Predicted values of house price (blue line) for the final KNN regression model."}
-sacr_preds <- tibble(sqft = seq(from = 500, to = 5000, by = 10))
+sqft_prediction_grid <- tibble(
+  sqft = seq(
+    from = sacramento |> select(sqft) |> min(),
+    to = sacramento |> select(sqft) |> max(),
+    by = 10
+  )
+)
 
 sacr_preds <- sacr_fit |>
-  predict(sacr_preds) |>
-  bind_cols(sacr_preds)
+  predict(sqft_prediction_grid) |>
+  bind_cols(sqft_prediction_grid)
 
 plot_final <- ggplot(sacramento_train, aes(x = sqft, y = price)) +
   geom_point(alpha = 0.4) +
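
Note on the prediction grid (an illustrative sketch, not part of the commit): the new code builds evenly spaced `sqft` values between the observed minimum and maximum with `seq(..., by = 10)`. The base R snippet below uses a made-up `sqft` vector as a stand-in for the Sacramento column. It suggests that a `by`-based grid can stop slightly short of the maximum, while a `length.out`-based grid includes both endpoints. The `length.out` variant is only a possible alternative, not something the commit uses.

```r
# Hypothetical stand-in for the `sqft` column; these values are made up.
sqft <- c(484, 1120, 2350, 4303)

# Grid with a fixed step, as in the commit. seq() stops at the last value
# that does not exceed `to`, so the grid can end just below the maximum.
grid_by <- seq(from = min(sqft), to = max(sqft), by = 10)

# Alternative (an assumption, not in the commit): fix the number of points
# instead, which keeps both endpoints in the grid.
grid_n <- seq(from = min(sqft), to = max(sqft), length.out = 200)

range(grid_by)
range(grid_n)
```

For a plot, a gap of less than one step at the right edge is unlikely to be visible, so either form should give a similar figure.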

source/regression2.Rmd

Lines changed: 22 additions & 9 deletions
@@ -54,7 +54,6 @@ By the end of the chapter, readers will be able to do the following:
 * Use R and `tidymodels` to fit a linear regression model on training data.
 * Evaluate the linear regression model on test data.
 * Compare and contrast predictions obtained from K-nearest neighbor regression to those obtained using linear regression from the same data set.
-* In R, overlay predictions from linear regression on a scatter plot of data using `geom_smooth`.
 
 ## Simple linear regression
 
@@ -292,21 +291,35 @@ sale price based off of the predictor of home size? Again, answering this is
 tricky and requires knowledge of how you intend to use the prediction.
 
 To visualize the simple linear regression model, we can plot the predicted house
-sale price across all possible house sizes we might encounter superimposed on a scatter
-plot of the original housing price data. There is a plotting function in
-the `tidyverse`, `geom_smooth`, that
-allows us to add a layer on our plot with the simple
-linear regression predicted line of best fit. By default `geom_smooth` adds some other information
-to the plot that we are not interested in at this point; we provide the argument `se = FALSE` to
-tell `geom_smooth` not to show that information. Figure \@ref(fig:08-lm-predict-all) displays the result.
+sale price across all possible house sizes we might encounter.
+Since our model is linear,
+we only need to compute the predicted value of the min and max points,
+and then connect them with a straight line.
+We superimpose this prediction line on a scatter
+plot of the original housing price data,
+so that we can qualitatively assess if the model seems to fit the data well.
+Figure \@ref(fig:08-lm-predict-all) displays the result.
 
 ```{r 08-lm-predict-all, fig.height = 3.5, fig.width = 4.5, warning = FALSE, fig.pos = "H", out.extra="", message = FALSE, fig.cap = "Scatter plot of sale price versus size with line of best fit for the full Sacramento housing data."}
+sqft_prediction_grid <- tibble(
+  sqft = c(
+    sacramento |> select(sqft) |> min(),
+    sacramento |> select(sqft) |> max()
+  )
+)
+
+sacr_preds <- lm_fit |>
+  predict(sqft_prediction_grid) |>
+  bind_cols(sqft_prediction_grid)
+
 lm_plot_final <- ggplot(sacramento_train, aes(x = sqft, y = price)) +
   geom_point(alpha = 0.4) +
+  geom_line(data = sacr_preds,
+            mapping = aes(x = sqft, y = .pred),
+            color = "blue") +
   xlab("House size (square feet)") +
   ylab("Price (USD)") +
   scale_y_continuous(labels = dollar_format()) +
-  geom_smooth(method = "lm", se = FALSE) +
   theme(text = element_text(size = 12))
 
 lm_plot_final
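
Note on the two-point grid (an illustrative sketch, not part of the commit): the new text says that for a linear model it is enough to predict at the minimum and maximum and connect the two points. The base R snippet below checks this on simulated data. The `toy` data frame, its coefficients, and the use of `lm()` rather than the `tidymodels` workflow are all assumptions made to keep the example self-contained.

```r
# Simulated data standing in for house size and price; not the Sacramento data.
set.seed(1)
toy <- data.frame(sqft = runif(50, min = 500, max = 5000))
toy$price <- 15000 + 140 * toy$sqft + rnorm(50, sd = 40000)

# Simple linear regression of price on size.
fit <- lm(price ~ sqft, data = toy)

# Predictions at the two endpoints only, mirroring the two-row grid in the commit.
ends <- data.frame(sqft = range(toy$sqft))
ends$.pred <- unname(predict(fit, newdata = ends))

# Predictions on a dense grid over the same range, for comparison.
dense <- data.frame(sqft = seq(min(toy$sqft), max(toy$sqft), length.out = 100))
dense$.pred <- unname(predict(fit, newdata = dense))

# Straight line through the two endpoint predictions, evaluated on the dense grid.
slope <- diff(ends$.pred) / diff(ends$sqft)
interp <- ends$.pred[1] + slope * (dense$sqft - ends$sqft[1])

# TRUE when the two-point line matches the dense predictions up to rounding error.
all.equal(interp, dense$.pred)
```

The same shortcut would not apply to the KNN model in `regression1.Rmd`, whose predictions are not a straight line, which is why that chunk keeps a dense grid.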
