classification2.Rmd: 18 additions & 13 deletions
@@ -206,7 +206,7 @@ tumor cell concavity versus smoothness colored by diagnosis in Figure \@ref(fig:
 You will also notice that we set the random seed here at the beginning of the analysis
 using the `set.seed` function, as described in Section \@ref(randomseeds).
 
-```{r 06-precode, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label.", message = F, warning = F}
+```{r 06-precode, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label.", message = F, warning = F}
 # load packages
 library(tidyverse)
 library(tidymodels)
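The reproducibility point made in the hunk above can be demonstrated with a minimal base-R sketch (an aside, not part of the diff): setting the same seed makes the "random" draws repeat exactly.

```r
# Minimal sketch (not from the diff): same seed => identical pseudo-random draws.
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)                       # reset to the same seed
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)  # TRUE: the analysis is reproducible
```

This is why the book sets the seed once at the top of the analysis: every downstream resampling step then produces the same results on each run.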
@@ -778,7 +778,7 @@ We can select the best value of the number of neighbors (i.e., the one that results
 in the highest classifier accuracy estimate) by plotting the accuracy versus $K$
 in Figure \@ref(fig:06-find-k).
 
-```{r 06-find-k, fig.height = 4, fig.width = 5, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
+```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
 accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
   geom_point() +
   geom_line() +
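The selection step that this plot supports can be sketched in base R. Note the `accuracies` values below are invented for illustration; they are not the book's actual tuning results.

```r
# Hypothetical sketch: pick the K with the highest estimated accuracy.
# These accuracy values are made up for illustration only.
accuracies <- data.frame(
  neighbors = c(1, 3, 5, 7, 9),
  mean      = c(0.85, 0.88, 0.91, 0.90, 0.89)
)
best_k <- accuracies$neighbors[which.max(accuracies$mean)]
best_k  # 5: the K whose estimated accuracy is largest
```

In practice one reads the plot rather than blindly taking the maximum, since nearby values of $K$ with similar accuracy but more stable behavior may be preferable.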
@@ -824,7 +824,7 @@ we vary $K$ from 1 to almost the number of observations in the data set.
 set.seed(1)
 ```
 
-```{r 06-lots-of-ks, message = FALSE, fig.height = 4, fig.width = 5, fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
+```{r 06-lots-of-ks, message = FALSE, fig.height = 3.5, fig.width = 4, fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
 k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10))
@@ … @@
 this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls off more quickly.
 
-```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "100%", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors."}
+```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "60%", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors."}
 plt_irrelevant_nghbrs <- ggplot(res) +
   geom_line(mapping = aes(x=ks, y=nghbrs)) +
   labs(x = "Number of Irrelevant Predictors",
-       y = "Number of neighbors")
+       y = "Number of neighbors") +
+  theme(text = element_text(size = 18))
 
 plt_irrelevant_nghbrs
 ```
 
-```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "100%", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
-res_tmp <- res |> pivot_longer(cols=c("accs", "fixedaccs"),
+```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
+res_tmp <- res %>% pivot_longer(cols=c("accs", "fixedaccs"),
@@ -1362,11 +1366,12 @@ where the elbow occurs, and whether adding a variable provides a meaningful increase
 > part of tuning your classifier, you *cannot use your test data* for this
 > process!
 
-```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "100%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection."}
+```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "60%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection."}
 fwd_sel_accuracies_plot <- accuracies |>
   ggplot(aes(x = size, y = accuracy)) +
   geom_line() +
-  labs(x = "Number of Predictors", y = "Estimated Accuracy")
+  labs(x = "Number of Predictors", y = "Estimated Accuracy") +
@@ … @@
 geom_histogram(fill = "dodgerblue3", color = "lightgrey", bins = 12) +
   ylab("Count") +
@@ -335,7 +335,7 @@ We can visualize the population distribution of the price per night with a histogram
 options(pillar.sigfig = 5)
 ```
 
-```{r 11-example-means2, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Population distribution of price per night (Canadian dollars) for all Airbnb listings in Vancouver, Canada.", fig.retina = 2, out.width = "100%"}
+```{r 11-example-means2, echo = TRUE, message = FALSE, warning = FALSE, fig.cap = "Population distribution of price per night (Canadian dollars) for all Airbnb listings in Vancouver, Canada.", fig.height = 3.5, fig.width = 4.5}
@@ … @@
 geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
   ylab("Count") +
@@ -422,7 +422,7 @@ samples
 Now we can calculate the sample mean for each replicate and plot the sampling
 distribution of sample means for samples of size 40.
 
-```{r 11-example-means4, echo = TRUE, message = FALSE, warning = FALSE, fig.cap= "Sampling distribution of the sample means for sample size of 40.", fig.retina = 2, out.width = "100%"}
+```{r 11-example-means4, echo = TRUE, message = FALSE, warning = FALSE, fig.cap= "Sampling distribution of the sample means for sample size of 40.", fig.height = 3.5, fig.width = 4.5}
 sample_estimates <- samples |>
   group_by(replicate) |>
   summarize(sample_mean = mean(price))
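The tidyverse pipeline in the hunk above (group by replicate, take each sample's mean) has a compact base-R analogue. This is an aside, not part of the diff; the exponential "population" is simulated stand-in data, not the Airbnb prices.

```r
# Hedged base-R sketch of a sampling distribution of the sample mean.
# The population is simulated (rexp), standing in for the real price data.
set.seed(1)
population <- rexp(5000, rate = 1 / 150)   # skewed "prices", mean around 150

# 1000 replicates: draw a sample of size 40, record that sample's mean
sample_means <- replicate(1000, mean(sample(population, 40)))

# The sample means center on the population mean, with much smaller spread
c(mean(population), mean(sample_means))
```

Plotting a histogram of `sample_means` would reproduce the shape shown in the figure: roughly bell-shaped and centered at the population mean, even though the population itself is skewed.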
@@ -468,15 +468,15 @@ Notice that the mean of the sample means is \$`r round(mean(sample_estimates$sam
 was \$`r round(mean(airbnb$price),2)`.
 -->
 
-```{r 11-example-means5, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Comparison of population distribution, sample distribution, and sampling distribution."}
+```{r 11-example-means5, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 5.5, fig.width = 4, fig.cap = "Comparison of population distribution, sample distribution, and sampling distribution."}
 grid.arrange(population_distribution +
                ggtitle("Population") +
                xlim(min(airbnb$price), 600),
              sample_distribution +
                ggtitle("Sample (n = 40)") +
                xlim(min(airbnb$price), 600),
              sampling_distribution_40 +
-               ggtitle("Sampling distribution of the mean for samples of size 40") +
+               ggtitle("Sampling distribution of the mean \n for samples of size 40") +
                xlim(min(airbnb$price), 600),
              nrow = 3
 )
@@ -664,7 +664,7 @@ see that the sample’s distribution looks like that of the population for a
 large enough sample.
 
 
-```{r 11-example-bootstrapping0, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Comparison of samples of different sizes from the population."}
+```{r 11-example-bootstrapping0, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 7, fig.cap = "Comparison of samples of different sizes from the population."}
@@ … @@
 Let's compare the bootstrap distribution—which we construct by taking many samples from our original sample of size 40—with
 the true sampling distribution—which corresponds to taking many samples from the population.
 
-```{r 11-bootstrapping6, echo = F, message = FALSE, warning = FALSE, fig.cap = "Comparison of the distribution of the bootstrap sample means and sampling distribution.", out.height = "70%"}
+```{r 11-bootstrapping6, echo = F, message = FALSE, warning = FALSE, fig.cap = "Comparison of the distribution of the bootstrap sample means and sampling distribution.", fig.height = 3.5}
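The comparison in this hunk rests on the bootstrap idea: resample the one observed sample, with replacement, many times. A minimal base-R sketch of that idea, using simulated data rather than the book's Airbnb sample:

```r
# Hedged sketch: bootstrap distribution of the mean from one sample of size 40.
# The original sample is simulated here, standing in for real data.
set.seed(1)
original_sample <- rexp(40, rate = 1 / 150)

# 1000 bootstrap replicates: resample n = 40 WITH replacement, take the mean
boot_means <- replicate(1000,
  mean(sample(original_sample, size = 40, replace = TRUE)))

# A 95% percentile interval from the bootstrap distribution
quantile(boot_means, c(0.025, 0.975))
```

The key contrast with the earlier sampling-distribution code is the source of the resamples: here we draw from `original_sample` (all we have in practice), whereas the true sampling distribution draws from the full population.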