README.md: 5 additions & 2 deletions
@@ -142,13 +142,16 @@ bookdown::gitbook:
 #### Figures
 - make sure all figures get (capitalized) labels ("Figure \@ref(blah)", not "figure below" or "figure above")
 - make sure all figures get captions
-- specify image widths in terms of linewidth percent (e.g. `out.width="70%"`)
+- specify image widths of pngs and jpegs in terms of linewidth percent
+  (e.g. `out.width="70%"`);
+  for plots we create in R, use `fig.width` and `fig.height`
 - center align all images via `fig.align = "center"`
 - make sure we have permission for every figure/logo that we use
 - Make sure all figures follow the visualization principles in Chapter 4
 - Make sure axes are set appropriately to not inflate/deflate differences artificially *where it does not compromise clarity* (e.g. in the classification
   chapter there are a few examples where zoomed-in accuracy axes are better than using the full range 0 to 1)
-
+- Fig size for bar charts should be `fig.width=5, fig.height=3` (exceptions are Figures 1.7 and 1.8, so that we can read the axis labels)
+- cropping width for syntax diagrams is 1625 pixels (done using `image_crop`)
 
 #### Tables
 - make sure all tables get capitalized labels ("Table \@ref(blah)", not "table below" or "table above")
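As a concrete companion to the figure guidelines above, here is a hedged example of chunks that follow them; the chunk labels, data, and file path are hypothetical, and the tidyverse is assumed to be loaded in an earlier setup chunk.

```{r example-bar-chart, fig.width = 5, fig.height = 3, fig.align = "center", fig.cap = "Example bar chart caption."}
# an R-generated bar chart: sized with fig.width/fig.height and center aligned
ggplot(example_counts, aes(x = category, y = count)) +
  geom_col()
```

```{r example-image, echo = FALSE, out.width = "70%", fig.align = "center", fig.cap = "Example image caption."}
# a png/jpeg: width given as a percentage of the line width via out.width
knitr::include_graphics("img/example_figure.png")
```

A syntax diagram would additionally be cropped to the standard 1625-pixel width beforehand, for example with `magick::image_crop(image, "1625")`.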
classification1.Rmd: 24 additions & 20 deletions
@@ -5,6 +5,7 @@ library(formatR)
 library(plotly)
 library(knitr)
 library(kableExtra)
+library(ggpubr)
 
 knitr::opts_chunk$set(echo = TRUE,
                       fig.align = "center")
@@ -209,7 +210,7 @@ for light orange and `"steelblue2"` for light blue—and
 We also make the category labels ("B" and "M") more readable by
 changing them to "Benign" and "Malignant" using the `labels` argument.
 
-```{r 05-scatter, fig.height = 4, fig.width = 5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
+```{r 05-scatter, fig.height = 3.5, fig.width = 4.5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
 perim_concav <- cancer %>%
   ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
   geom_point(alpha = 0.6) +
@@ -285,7 +286,7 @@ new observation, with standardized perimeter of `r new_point[1]` and standardize
 diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in
 Figure \@ref(fig:05-knn-1).
 
-```{r 05-knn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-knn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav_with_new_point <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
          Concavity = new_point[2],
@@ -317,7 +318,7 @@ then the perimeter and concavity values are similar, and so we may expect that
 they would have the same diagnosis.
 
 
-```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
+```{r 05-knn-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
 perim_concav_with_new_point +
   geom_segment(aes(
     x = new_point[1],
@@ -342,7 +343,7 @@ Does this seem like the right prediction to make for this observation? Probably
 not, if you consider the other nearby points...
 
 
-```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
+```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
 
 perim_concav_with_new_point2 <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
@@ -382,7 +383,7 @@ see that the diagnoses of 2 of the 3 nearest neighbors to our new observation
 are malignant. Therefore we take majority vote and classify our new red, diamond
 observation as malignant.
 
-```{r 05-knn-5, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors."}
+```{r 05-knn-5, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors."}
 perim_concav_with_new_point2 +
   geom_segment(aes(
     x = new_point[1], y = new_point[2],
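The majority vote described in the hunk above can be made concrete with a minimal sketch; the `neighbors` tibble below is hypothetical, standing in for the K nearest training observations and their diagnosis labels.

```r
library(tidyverse)

# hypothetical tibble of the K = 3 nearest neighbors and their labels
neighbors <- tibble(Class = c("Malignant", "Malignant", "Benign"))

# tally the labels and take the most common one as the predicted class
neighbors |>
  count(Class) |>
  slice_max(n, n = 1) |>
  pull(Class)
```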
@@ -432,7 +433,7 @@ You will see in the `mutate` \index{mutate} step below, we compute the straight-
 distance using the formula above: we square the differences between the two observations' perimeter
 and concavity coordinates, add the squared differences, and then take the square root.
 
-```{r 05-multiknn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
          Concavity = new_point[2],
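For reference, a minimal sketch of the straight-line (Euclidean) distance computation that the passage above describes, assuming a `cancer` data frame with standardized `Perimeter` and `Concavity` columns; the `new_point` values are placeholders.

```r
library(tidyverse)

# hypothetical new observation: standardized (perimeter, concavity)
new_point <- c(2, 4)

# square the coordinate differences, add them, and take the square root
cancer |>
  select(Perimeter, Concavity, Class) |>
  mutate(dist_from_new = sqrt((Perimeter - new_point[1])^2 +
                                (Concavity - new_point[2])^2)) |>
  arrange(dist_from_new) |>
  slice(1:5)  # the 5 closest training observations
```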
@@ -514,7 +515,7 @@ The result of this computation shows that 3 of the 5 nearest neighbors to our ne
 malignant (`M`); since this is the majority, we classify our new observation as malignant.
 These 5 neighbors are circled in Figure \@ref(fig:05-multiknn-3).
 
-```{r 05-multiknn-3, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
+```{r 05-multiknn-3, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
 perim_concav + annotate("path",
   x = new_point[1] + 1.4 * cos(seq(0, 2 * pi,
                                    length.out = 100
@@ -903,7 +904,7 @@ Standardizing your data should be a part of the preprocessing you do
 before predictive modeling and you should always think carefully about your problem domain and
 whether you need to standardize your data.
 
-```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.width = 10, fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data."}
+```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data."}
 cancer |> filter(Class == "M") |> slice_head(n = 3)
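As a companion to the standardization advice above, a minimal sketch of how centering and scaling might be done with a tidymodels recipe; this is an assumption, since the hunk itself only changes the plotting chunk, and `step_center`/`step_scale` come from the recipes package.

```r
library(tidymodels)

# scale (divide by the standard deviation) and center (subtract the mean)
# every numeric predictor in the cancer data frame
standardization_recipe <- recipe(Class ~ ., data = cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

standardized_cancer <- standardization_recipe |>
  prep() |>
  bake(new_data = NULL)  # NULL returns the preprocessed training data
```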
@@ -1093,7 +1095,7 @@ benign, and the benign vote will always win. For example, Figure \@ref(fig:05-up
 shows what happens for a new tumor observation that is quite close to three observations
 in the training data that were tagged as malignant.
 
-```{r 05-upsample, echo=FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with 7 nearest neighbors to a new observation highlighted."}
+```{r 05-upsample, echo=FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data with 7 nearest neighbors to a new observation highlighted."}
@@ -1145,7 +1147,7 @@ each area of the plot to the predictions the $K$-nearest neighbor
 classifier would make. We can see that the decision is
 always "benign," corresponding to the blue color.
 
-```{r 05-upsample-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with background color indicating the decision of the classifier and points representing the labeled data."}
+```{r 05-upsample-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data with background color indicating the decision of the classifier and points representing the labeled data."}
@@ -1223,7 +1225,7 @@ classifier would make. We can see that the decision is more reasonable; when the
 to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.
 
-```{r 05-upsample-plot, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
+```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
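A hedged sketch of how the upsampling referred to in these hunks might be performed; this is an assumption, since the hunks only change the plotting chunks. `step_upsample` is from the themis package, and `rare_cancer` is a hypothetical imbalanced version of the training data.

```r
library(tidymodels)
library(themis)

# oversample the minority class until both classes appear equally often
upsample_recipe <- recipe(Class ~ Perimeter + Concavity, data = rare_cancer) |>
  step_upsample(Class, over_ratio = 1, skip = FALSE) |>
  prep()

upsampled_cancer <- bake(upsample_recipe, new_data = NULL)
```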
@@ -1333,7 +1335,7 @@ predict the label of each, and visualize the predictions with a colored scatter
 > textbook. It is included for those readers who would like to use similar
 > visualizations in their own data analyses.
 
-```{r 05-workflow-plot-show, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
+```{r 05-workflow-plot-show, fig.height = 3.5, fig.width = 4.6, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
 # create the grid of area/smoothness vals, and arrange in a data frame
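A hedged sketch of the prediction grid that the comment above refers to; the `cancer_train` column names and the 100-by-100 grid resolution are assumptions.

```r
library(tidyverse)

# evenly spaced values spanning the observed range of each predictor
are_grid <- seq(min(cancer_train$Area),
                max(cancer_train$Area),
                length.out = 100)
smo_grid <- seq(min(cancer_train$Smoothness),
                max(cancer_train$Smoothness),
                length.out = 100)

# every combination of the two, as a data frame of points to predict on
asgrid <- as_tibble(expand.grid(Area = are_grid, Smoothness = smo_grid))
```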
@@ -187,7 +188,7 @@ tumor cell concavity versus smoothness colored by diagnosis in Figure \@ref(fig:
 You will also notice that we set the random seed here at the beginning of the analysis
 using the `set.seed` function, as described in Section \@ref(randomseeds).
 
-```{r 06-precode, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label.", message = F, warning = F}
+```{r 06-precode, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label.", message = F, warning = F}
 # load packages
 library(tidyverse)
 library(tidymodels)
@@ -753,7 +754,7 @@ We can select the best value of the number of neighbors (i.e., the one that resu
 in the highest classifier accuracy estimate) by plotting the accuracy versus $K$
 in Figure \@ref(fig:06-find-k).
 
-```{r 06-find-k, fig.height = 4, fig.width = 5, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
+```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
 accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
   geom_point() +
   geom_line() +
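To complement the plot-based selection described above, a small sketch of extracting the best value programmatically, assuming the `accuracies` tibble has the `neighbors` and `mean` columns used in the chunk.

```r
library(dplyr)

# number of neighbors with the highest estimated accuracy
best_k <- accuracies |>
  arrange(desc(mean)) |>
  slice(1) |>
  pull(neighbors)
best_k
```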
@@ -799,7 +800,7 @@ we vary $K$ from 1 to almost the number of observations in the data set.
 set.seed(1)
 ```
 
-```{r 06-lots-of-ks, message = FALSE, fig.height = 4, fig.width = 5, fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
+```{r 06-lots-of-ks, message = FALSE, fig.height = 3.5, fig.width = 4, fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
 k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10))
 
 knn_results <- workflow() |>
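For orientation, a hedged sketch of how the `knn_results` pipeline begun in the context line above might continue; `knn_recipe`, `knn_spec` (with `neighbors = tune()`), and the resampling object `cancer_vfold` are assumed to have been created earlier in the chapter and are not shown in this diff.

```r
library(tidymodels)

# tune K over the grid `k_lots` using cross-validation, then collect metrics
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_lots) |>
  collect_metrics()

# keep only the accuracy estimates for plotting against K
accuracies <- knn_results |>
  filter(.metric == "accuracy")
```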
@@ -848,7 +849,7 @@ a balance between the two. You can see these two effects in Figure
 \@ref(fig:06-decision-grid-K), which shows how the classifier changes as
 we set the number of neighbors $K$ to 1, 7, 20, and 300.
 
-```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 7, fig.width = 10, fig.cap = "Effect of K in overfitting and underfitting."}
+```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.cap = "Effect of K in overfitting and underfitting."}
@@ … @@ this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls off more quickly.
 
-```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "100%", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors."}
+```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "60%", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors."}
 plt_irrelevant_nghbrs <- ggplot(res) +
   geom_line(mapping = aes(x=ks, y=nghbrs)) +
   labs(x = "Number of Irrelevant Predictors",
-       y = "Number of neighbors")
+       y = "Number of neighbors") +
+  theme(text = element_text(size = 18))
 
 plt_irrelevant_nghbrs
 ```
 
-```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "100%", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
+```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
 res_tmp <- res %>% pivot_longer(cols=c("accs", "fixedaccs"),
@@ -1333,11 +1342,12 @@ where the elbow occurs, and whether adding a variable provides a meaningful incr
 > part of tuning your classifier, you *cannot use your test data* for this
 > process!
 
-```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "100%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection."}
+```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "60%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection."}
 fwd_sel_accuracies_plot <- accuracies |>
   ggplot(aes(x = size, y = accuracy)) +
   geom_line() +
-  labs(x = "Number of Predictors", y = "Estimated Accuracy")
+  labs(x = "Number of Predictors", y = "Estimated Accuracy") +