
Commit d15bcb5

Merge pull request #83 from UBC-DSCI/dev
Dev
2 parents 1243559 + c2facb5

File tree

329 files changed: +27203, -5078 lines


01-reading.Rmd: 174 additions & 103 deletions (large diff not rendered by default)
02-wrangling.Rmd: 278 additions & 173 deletions (large diff not rendered by default)
03-viz.Rmd: 177 additions & 167 deletions (large diff not rendered by default)
04-version_control.Rmd: 569 additions & 37 deletions (large diff not rendered by default)
05-classification.Rmd: 294 additions & 262 deletions (large diff not rendered by default)
06-classification_continued.Rmd: 120 additions & 124 deletions (large diff not rendered by default)
07-regression1.Rmd: 123 additions & 116 deletions (large diff not rendered by default)

08-regression2.Rmd: 93 additions & 88 deletions
@@ -40,7 +40,7 @@ in purchasing with an advertised list price of
 To answer this question using simple linear regression, we use the data we have
 to draw the straight line of best fit through our existing data points:

-```{r 08-lin-reg1, message = FALSE, warning = FALSE, echo = FALSE, fig.height = 4, fig.width = 5}
+```{r 08-lin-reg1, message = FALSE, warning = FALSE, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of price (USD) versus house size (square footage) with line of best fit for subset of the Sacramento housing data set"}
 library(tidyverse)
 library(gridExtra)
 library(caret)
@@ -53,8 +53,8 @@ small_plot <- ggplot(small_sacramento, aes(x = sqft, y = price)) +
   geom_point() +
   xlab("House size (square footage)") +
   ylab("Price (USD)") +
-  scale_y_continuous(labels=dollar_format()) +
-  geom_smooth(method = "lm", se = FALSE)
+  scale_y_continuous(labels = dollar_format()) +
+  geom_smooth(method = "lm", se = FALSE)
 small_plot
 ```

@@ -71,14 +71,14 @@ $\beta_0$ and $\beta_1$ that *parametrize* (correspond to) the line of best fit.
 Once we have the coefficients, we can use the equation above to evaluate the predicted price given the value we
 have for the predictor/explanatory variable&mdash;here 2,000 square feet.

-```{r 08-lin-reg2, message = FALSE, warning = FALSE, echo = FALSE, fig.height = 4, fig.width = 5}
+```{r 08-lin-reg2, message = FALSE, warning = FALSE, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of price (USD) versus house size (square footage) with line of best fit and predicted price for a 2000 square foot home represented as a red dot"}
 small_model <- lm(price ~ sqft, data = small_sacramento)
 prediction <- predict(small_model, data.frame(sqft = 2000))

-small_plot +
+small_plot +
   geom_vline(xintercept = 2000, linetype = "dotted") +
   geom_point(aes(x = 2000, y = prediction[[1]], color = "red", size = 2.5)) +
-  theme(legend.position="none")
+  theme(legend.position = "none")

 print(prediction[[1]])
 ```
@@ -90,12 +90,12 @@ exactly does simple linear regression choose the line of best fit? Many
 different lines could be drawn through the data points. We show some examples
 below:

-```{r 08-several-lines, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 5}
+```{r 08-several-lines, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of price (USD) versus house size (square footage) with many possible lines that could be drawn through the data points"}

-small_plot +
+small_plot +
   geom_abline(intercept = -64542.23, slope = 190, color = "green") +
   geom_abline(intercept = -6900, slope = 175, color = "purple") +
-  geom_abline(intercept = -64542.23, slope = 160, color = "red")
+  geom_abline(intercept = -64542.23, slope = 160, color = "red")
 ```

 Simple linear regression chooses the straight line of best fit by choosing
@@ -105,13 +105,11 @@ line. What exactly do we mean by the vertical distance between the predicted
 values (which fall along the line of best fit) and the observed data points?
 We illustrate these distances in the plot below with a red line:

-```{r 08-verticalDistToMin, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 5}
-small_sacramento <- small_sacramento %>%
+```{r 08-verticalDistToMin, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of price (USD) versus house size (square footage) with the vertical distances between the predicted values and the observed data points"}
+small_sacramento <- small_sacramento %>%
   mutate(predicted = predict(small_model))
 small_plot +
-  geom_segment(data = small_sacramento, aes(xend = sqft, yend = predicted), colour = "red")
-
-
+  geom_segment(data = small_sacramento, aes(xend = sqft, yend = predicted), colour = "red")
 ```

 To assess the predictive accuracy of a simple linear regression model,
@@ -150,15 +148,15 @@ Now that we have our training data, we will create the model specification
 and recipe, and fit our simple linear regression model:
 ```{r 08-fitLM, fig.height = 4, fig.width = 5}
 lm_spec <- linear_reg() %>%
-  set_engine("lm") %>%
-  set_mode("regression")
+  set_engine("lm") %>%
+  set_mode("regression")

-lm_recipe <- recipe(price ~ sqft, data = sacramento_train)
+lm_recipe <- recipe(price ~ sqft, data = sacramento_train)

 lm_fit <- workflow() %>%
-  add_recipe(lm_recipe) %>%
-  add_model(lm_spec) %>%
-  fit(data = sacramento_train)
+  add_recipe(lm_recipe) %>%
+  add_model(lm_spec) %>%
+  fit(data = sacramento_train)
 lm_fit
 ```
 Our coefficients are
@@ -172,9 +170,9 @@ every extra square foot increases the cost of the house by \$`r format(round(pul

 ```{r 08-assessFinal}
 lm_test_results <- lm_fit %>%
-  predict(sacramento_test) %>%
-  bind_cols(sacramento_test) %>%
-  metrics(truth = price, estimate = .pred)
+  predict(sacramento_test) %>%
+  bind_cols(sacramento_test) %>%
+  metrics(truth = price, estimate = .pred)
 lm_test_results
 ```

@@ -196,14 +194,14 @@ plausible range to this line that we are not interested in at this point, so to
 avoid plotting it, we provide the argument `se = FALSE` in our call to
 `geom_smooth`.

-```{r 08-lm-predict-all, fig.height = 4, fig.width = 5, warning = FALSE, message = FALSE}
+```{r 08-lm-predict-all, fig.height = 4, fig.width = 5, warning = FALSE, message = FALSE, fig.cap = "Scatter plot of price (USD) versus house size (square footage) with line of best fit for complete Sacramento housing data set"}

 lm_plot_final <- ggplot(sacramento_train, aes(x = sqft, y = price)) +
-  geom_point(alpha = 0.4) +
-  xlab("House size (square footage)") +
-  ylab("Price (USD)") +
-  scale_y_continuous(labels = dollar_format()) +
-  geom_smooth(method = "lm", se = FALSE)
+  geom_point(alpha = 0.4) +
+  xlab("House size (square footage)") +
+  ylab("Price (USD)") +
+  scale_y_continuous(labels = dollar_format()) +
+  geom_smooth(method = "lm", se = FALSE)
 lm_plot_final
 ```

@@ -226,50 +224,51 @@ simple linear regression model predictions for the Sacramento real estate data
 (predicting price from house size) and the "best" K-NN regression model
 obtained from the same problem:

-```{r 08-compareRegression, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 4, fig.width = 10}
+```{r 08-compareRegression, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 4, fig.width = 10, fig.cap = "Comparison of simple linear regression and K-NN regression"}
 set.seed(1234)
 sacr_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 30) %>%
-  set_engine("kknn") %>%
-  set_mode("regression")
+  set_engine("kknn") %>%
+  set_mode("regression")

 sacr_wkflw <- workflow() %>%
-  add_recipe(sacr_recipe) %>%
-  add_model(sacr_spec)
+  add_recipe(sacr_recipe) %>%
+  add_model(sacr_spec)

 sacr_fit <- sacr_wkflw %>%
-  fit(data = sacramento_train)
+  fit(data = sacramento_train)

 sacr_preds <- sacr_fit %>%
-  predict(sacramento_train) %>%
-  bind_cols(sacramento_train)
+  predict(sacramento_train) %>%
+  bind_cols(sacramento_train)

 sacr_rmse <- sacr_preds %>%
-  metrics(truth = price, estimate = .pred) %>%
-  filter(.metric == 'rmse') %>%
-  pull(.estimate) %>%
-  round(2)
+  metrics(truth = price, estimate = .pred) %>%
+  filter(.metric == "rmse") %>%
+  pull(.estimate) %>%
+  round(2)

 sacr_rmspe <- sacr_fit %>%
-  predict(sacramento_test) %>%
-  bind_cols(sacramento_test) %>%
-  metrics(truth = price, estimate = .pred) %>%
-  filter(.metric == 'rmse') %>%
-  pull(.estimate) %>% round()
+  predict(sacramento_test) %>%
+  bind_cols(sacramento_test) %>%
+  metrics(truth = price, estimate = .pred) %>%
+  filter(.metric == "rmse") %>%
+  pull(.estimate) %>%
+  round()


 knn_plot_final <- ggplot(sacr_preds, aes(x = sqft, y = price)) +
-  geom_point(alpha = 0.4) +
-  xlab("House size (square footage)") +
-  ylab("Price (USD)") +
-  scale_y_continuous(labels = dollar_format()) +
-  geom_line(data = sacr_preds, aes(x = sqft, y = .pred), color = "blue") +
-  ggtitle("K-NN regression") +
-  annotate("text", x = 3500, y = 100000, label = paste("RMSPE =", sacr_rmspe))
-
-lm_rmspe <- lm_test_results %>%
-  filter(.metric == 'rmse') %>%
-  pull(.estimate) %>%
-  round()
+  geom_point(alpha = 0.4) +
+  xlab("House size (square footage)") +
+  ylab("Price (USD)") +
+  scale_y_continuous(labels = dollar_format()) +
+  geom_line(data = sacr_preds, aes(x = sqft, y = .pred), color = "blue") +
+  ggtitle("K-NN regression") +
+  annotate("text", x = 3500, y = 100000, label = paste("RMSPE =", sacr_rmspe))
+
+lm_rmspe <- lm_test_results %>%
+  filter(.metric == "rmse") %>%
+  pull(.estimate) %>%
+  round()

 lm_plot_final <- lm_plot_final +
   annotate("text", x = 3500, y = 100000, label = paste("RMSPE =", lm_rmspe)) +
@@ -339,54 +338,60 @@ trying to predict. We will start by changing the formula in the recipe to
 include both the `sqft` and `beds` variables as predictors:

 ```{r 08-lm-mult-test-train-split}
-lm_recipe <- recipe(price ~ sqft + beds, data = sacramento_train)
+lm_recipe <- recipe(price ~ sqft + beds, data = sacramento_train)
 ```

 Now we can build our workflow and fit the model:
 ```{r 08-fitlm}
 lm_fit <- workflow() %>%
-  add_recipe(lm_recipe) %>%
-  add_model(lm_spec) %>%
-  fit(data = sacramento_train)
+  add_recipe(lm_recipe) %>%
+  add_model(lm_spec) %>%
+  fit(data = sacramento_train)
 lm_fit
 ```

 And finally, we predict on the test data set to assess how well our model does:

 ```{r 08-assessFinal-multi}
 lm_mult_test_results <- lm_fit %>%
-  predict(sacramento_test) %>%
-  bind_cols(sacramento_test) %>%
-  metrics(truth = price, estimate = .pred)
+  predict(sacramento_test) %>%
+  bind_cols(sacramento_test) %>%
+  metrics(truth = price, estimate = .pred)
 lm_mult_test_results
 ```

 In the case of two predictors, our linear regression creates a *plane* of best fit, shown below:

-```{r 08-3DlinReg, echo = FALSE, message = FALSE, warning = FALSE}
+```{r 08-3DlinReg, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Simple linear regression model's predictions represented as a plane overlaid on top of the data using three predictors (price, house size, and the number of bedrooms)"}
 library(plotly)
 xvals <- seq(from = min(sacramento_train$sqft), to = max(sacramento_train$sqft), length = 50)
 yvals <- seq(from = min(sacramento_train$beds), to = max(sacramento_train$beds), length = 50)

 zvals <- lm_fit %>%
-  predict(crossing(xvals, yvals) %>% mutate(sqft = xvals, beds = yvals)) %>%
-  pull(.pred)
-
-zvalsm <- matrix(zvals, nrow=length(xvals))
-
-plot_ly() %>%
-  add_markers(data = sacramento_train,
-              x = ~sqft,
-              y = ~beds,
-              z = ~price,
-              marker = list(size = 5, opacity = 0.4, color = "red")) %>%
-  layout(scene = list(xaxis = list(title = 'House size (square feet)'),
-                      zaxis = list(title = 'Price (USD)'),
-                      yaxis = list(title = 'Number of bedrooms'))) %>%
-  add_surface(x = ~xvals,
-              y = ~yvals,
-              z = ~zvalsm,
-              colorbar=list(title='Price (USD)'))
+  predict(crossing(xvals, yvals) %>% mutate(sqft = xvals, beds = yvals)) %>%
+  pull(.pred)
+
+zvalsm <- matrix(zvals, nrow = length(xvals))
+
+plot_ly() %>%
+  add_markers(
+    data = sacramento_train,
+    x = ~sqft,
+    y = ~beds,
+    z = ~price,
+    marker = list(size = 5, opacity = 0.4, color = "red")
+  ) %>%
+  layout(scene = list(
+    xaxis = list(title = "House size (square feet)"),
+    zaxis = list(title = "Price (USD)"),
+    yaxis = list(title = "Number of bedrooms")
+  )) %>%
+  add_surface(
+    x = ~xvals,
+    y = ~yvals,
+    z = ~zvalsm,
+    colorbar = list(title = "Price (USD)")
+  )
 ```
 We see that the predictions from linear regression with two predictors form a
 flat plane. This is the hallmark of linear regression, and differs from the
@@ -411,9 +416,9 @@ where:
 Finally, we can fill in the values for $\beta_0$, $\beta_1$ and $\beta_2$ from the model output above
 to create the equation of the plane of best fit to the data:
 ```{r 08-lm-multi-get-coeffs-hidden, echo = FALSE}
-icept <- format(round(coeffs %>% filter(term == '(Intercept)') %>% pull(estimate)), scientific = FALSE)
-sqftc <- format(round(coeffs %>% filter(term == 'sqft') %>% pull(estimate)), scientific = FALSE)
-bedsc <- format(round(coeffs %>% filter(term == 'beds') %>% pull(estimate)), scientific = FALSE)
+icept <- format(round(coeffs %>% filter(term == "(Intercept)") %>% pull(estimate)), scientific = FALSE)
+sqftc <- format(round(coeffs %>% filter(term == "sqft") %>% pull(estimate)), scientific = FALSE)
+bedsc <- format(round(coeffs %>% filter(term == "beds") %>% pull(estimate)), scientific = FALSE)
 ```

 $$\text{house price} = `r icept` + `r sqftc`\cdot (\text{house size}) `r bedsc` \cdot (\text{number of bedrooms})$$
@@ -470,7 +475,7 @@ quantifying how big each of these effects are, and assessing how accurately we
 can estimate each of these effects. This side of regression is the topic of
 many follow-on statistics courses and beyond the scope of this course.

-## Additional readings/resources
+## Additional resources
 - Pages 59-71 of [Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
 - Pages 104 - 109 of [An Introduction to Statistical Learning with Applications in R](https://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
 - [The `caret` Package](https://topepo.github.io/caret/index.html)
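
This diff is essentially a style pass on 08-regression2.Rmd: strings move to double quotes, spacing around `=` is normalized, pipe chains get consistent indentation, and each figure chunk gains a `fig.cap` argument; the modeling code itself is unchanged. For context, below is a minimal self-contained sketch of the simple linear regression workflow the chapter builds with these pieces. The `initial_split()` call and its `prop = 0.75` value are illustrative assumptions, since the chapter's actual train/test split is defined outside this diff.

```r
# Minimal sketch of the chapter's simple linear regression workflow.
# Assumption: the Sacramento data set shipped with the caret package
# stands in for the chapter's data, and the 75/25 split is illustrative.
library(tidyverse)
library(tidymodels)

data(Sacramento, package = "caret")

set.seed(1234)
sacramento_split <- initial_split(Sacramento, prop = 0.75) # assumed proportion
sacramento_train <- training(sacramento_split)
sacramento_test <- testing(sacramento_split)

# model specification: ordinary least squares via the "lm" engine
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# recipe: predict price from house size only
lm_recipe <- recipe(price ~ sqft, data = sacramento_train)

# bundle recipe and model into a workflow, then fit on the training set
lm_fit <- workflow() %>%
  add_recipe(lm_recipe) %>%
  add_model(lm_spec) %>%
  fit(data = sacramento_train)

# evaluate on the held-out test set (reports RMSE, R-squared, and MAE)
lm_fit %>%
  predict(sacramento_test) %>%
  bind_cols(sacramento_test) %>%
  metrics(truth = price, estimate = .pred)
```

The restyled chunks in the diff follow the same tidyverse conventions used here: two-space indents for `%>%` chains, double-quoted strings, and spaces around `=`.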
