Merge pull request #520 from joelostblom/rm-smooth

trevorcampbell · web-flow · commit 60099ca97da4 · 2023-08-21T17:36:50.000-07:00
Remove geom_smooth and use the same viz approach as for knn instead
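
The linear regression change relies on the fact that a straight-line model is fully determined by its predictions at two points. This is a minimal sketch of that idea, not the book's code: it assumes base R's `lm` and the built-in `mtcars` data in place of `tidymodels` and the Sacramento data, and the names `fit`, `endpoints`, `mid`, `mid_pred`, and `interpolated` are hypothetical.

```r
# Illustration only: built-in mtcars data and base R lm(),
# standing in for the Sacramento data and the tidymodels fit.
fit <- lm(mpg ~ wt, data = mtcars)

# For a linear model, predictions at the smallest and largest
# predictor values are enough to draw the whole prediction line.
endpoints <- data.frame(wt = range(mtcars$wt))
endpoints$pred <- predict(fit, newdata = endpoints)

# Check: the prediction at the midpoint lies on the straight
# segment joining the two endpoint predictions.
mid <- data.frame(wt = mean(endpoints$wt))
mid_pred <- unname(predict(fit, newdata = mid))
interpolated <- mean(endpoints$pred)

stopifnot(isTRUE(all.equal(mid_pred, interpolated)))
```

The same shortcut does not hold for KNN regression, whose predictions are not linear in the predictor, which is why the KNN figure uses an evenly spaced grid instead of two endpoints.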
diff --git a/source/regression1.Rmd b/source/regression1.Rmd
@@ -640,19 +640,33 @@ $`r format(round(sacr_summary |> pull(.estimate)), big.mark=",", nsmall=0, scien
 might represent a substantial fraction of a home buyer's budget, and
 could make or break whether or not they could afford to put an offer on a house. 
 
-Finally, Figure \@ref(fig:07-predict-all) shows the predictions that our final model makes across
-the range of house sizes we might encounter in the Sacramento area&mdash;from 500 to 5000 square feet. 
-You have already seen a few plots like this in this chapter, but here we also provide the code that generated it
-as a learning challenge.
+Finally, Figure \@ref(fig:07-predict-all) shows the predictions that our final
+model makes across the range of house sizes we might encounter in the
+Sacramento area.
+Note that instead of predicting the house price only for those house sizes that happen to appear in our data,
+we predict it for evenly spaced values between the minimum and maximum in the data set
+(roughly 500 to 5000 square feet).
+We superimpose this prediction line on a scatter
+plot of the original housing price data,
+so that we can qualitatively assess whether the model seems to fit the data well.
+You have already seen a
+few plots like this in this chapter, but here we also provide the code that
+generated it as a learning opportunity.
 
 ```{r 07-predict-all, warning = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Predicted values of house price (blue line) for the final KNN regression model."}
-sacr_preds <- tibble(sqft = seq(from = 500, to = 5000, by = 10))
+sqft_prediction_grid <- tibble(
+    sqft = seq(
+        from = sacramento |> select(sqft) |> min(),
+        to = sacramento |> select(sqft) |> max(),
+        by = 10
+    )
+)
 
 sacr_preds <- sacr_fit |>
-  predict(sacr_preds) |>
-  bind_cols(sacr_preds)
+  predict(sqft_prediction_grid) |>
+  bind_cols(sqft_prediction_grid)
 
-plot_final <- ggplot(sacramento_train, aes(x = sqft, y = price)) +
+plot_final <- ggplot(sacramento, aes(x = sqft, y = price)) +
   geom_point(alpha = 0.4) +
   geom_line(data = sacr_preds, 
             mapping = aes(x = sqft, y = .pred), 
diff --git a/source/regression2.Rmd b/source/regression2.Rmd
@@ -54,7 +54,6 @@ By the end of the chapter, readers will be able to do the following:
 * Use R and `tidymodels` to fit a linear regression model on training data.
 * Evaluate the linear regression model on test data.
 * Compare and contrast predictions obtained from K-nearest neighbor regression to those obtained using linear regression from the same data set.
-* In R, overlay predictions from linear regression on a scatter plot of data using `geom_smooth`.
 
 ## Simple linear regression
 
@@ -292,21 +291,35 @@ sale price based off of the predictor of home size? Again, answering this is
 tricky and requires knowledge of how you intend to use the prediction.
 
 To visualize the simple linear regression model, we can plot the predicted house
-sale price across all possible house sizes we might encounter superimposed on a scatter
-plot of the original housing price data. There is a plotting function in 
-the `tidyverse`, `geom_smooth`, that
-allows us to add a layer on our plot with the simple
-linear regression predicted line of best fit. By default `geom_smooth` adds some other information
-to the plot that we are not interested in at this point; we provide the argument `se = FALSE` to
-tell `geom_smooth` not to show that information. Figure \@ref(fig:08-lm-predict-all) displays the result.
+sale price across all possible house sizes we might encounter.
+Since our model is linear,
+we only need to compute the predicted values at the minimum and maximum house sizes,
+and then connect them with a straight line.
+We superimpose this prediction line on a scatter
+plot of the original housing price data,
+so that we can qualitatively assess whether the model seems to fit the data well.
+Figure \@ref(fig:08-lm-predict-all) displays the result.
 
 ```{r 08-lm-predict-all, fig.height = 3.5, fig.width = 4.5, warning = FALSE, fig.pos = "H", out.extra="", message = FALSE, fig.cap = "Scatter plot of sale price versus size with line of best fit for the full Sacramento housing data."}
-lm_plot_final <- ggplot(sacramento_train, aes(x = sqft, y = price)) +
+sqft_prediction_grid <- tibble(
+    sqft = c(
+        sacramento |> select(sqft) |> min(),
+        sacramento |> select(sqft) |> max()
+    )
+)
+
+sacr_preds <- lm_fit |>
+  predict(sqft_prediction_grid) |>
+  bind_cols(sqft_prediction_grid)
+
+lm_plot_final <- ggplot(sacramento, aes(x = sqft, y = price)) +
   geom_point(alpha = 0.4) +
+  geom_line(data = sacr_preds, 
+            mapping = aes(x = sqft, y = .pred), 
+            color = "blue") +
   xlab("House size (square feet)") +
   ylab("Price (USD)") +
   scale_y_continuous(labels = dollar_format()) +
-  geom_smooth(method = "lm", se = FALSE) + 
   theme(text = element_text(size = 12))
 
 lm_plot_final