
Commit 83c635d

Merge pull request #368 from UBC-DSCI/patch-fig-sizes
Patch fig sizes
2 parents ba89491 + d019d48 commit 83c635d

20 files changed: +389 -244 lines changed

README.md

Lines changed: 5 additions & 2 deletions

@@ -142,13 +142,16 @@ bookdown::gitbook:
 #### Figures
 - make sure all figures get (capitalized) labels ("Figure \\@ref(blah)", not "figure below" or "figure above")
 - make sure all figures get captions
-- specify image widths in terms of linewidth percent (e.g. `out.width="70%"`)
+- specify image widths of pngs and jpegs in terms of linewidth percent
+  (e.g. `out.width="70%"`),
+  for plots we create in R use `fig.width` and `fig.height`.
 - center align all images via `fig.align = "center"`
 - make sure we have permission for every figure/logo that we use
 - Make sure all figures follow the visualization principles in Chapter 4
 - Make sure axes are set appropriately to not inflate/deflate differences artificially *where it does not compromise clarity* (e.g. in the classification
   chapter there are a few examples where zoomed-in accuracy axes are better than using the full range 0 to 1)
--
+- Fig size for bar charts should be: `fig.width=5, fig.height=3` (an exception are figs 1.7 & 1.8 so that we can read the axis labels)
+- cropping width for syntax diagrams is 1625 (done using `image_crop`)
 
 #### Tables
 - make sure all tables get capitalized labels ("Table \\@ref(blah)", not "table below" or "table above")
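
For reference, a chunk that follows these figure conventions could look like the sketch below; the chunk labels, the `cancer` data frame, the image path, and the captions are hypothetical and only illustrate the options named above.

```{r example-bar-chart, fig.width = 5, fig.height = 3, fig.align = "center", fig.cap = "Counts of each diagnosis class."}
# R-generated bar charts use fig.width = 5, fig.height = 3 per the convention above
ggplot(cancer, aes(x = Class)) +
  geom_bar()
```

while a static png or jpeg is sized relative to the line width instead:

```{r example-syntax-diagram, echo = FALSE, out.width = "70%", fig.align = "center", fig.cap = "Syntax diagram."}
# external images use out.width (a linewidth percent), not fig.width/fig.height
knitr::include_graphics("img/example_syntax_diagram.png")
```
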

build_html.sh

Lines changed: 1 addition & 1 deletion

@@ -1,2 +1,2 @@
 # Script to generate HTML book
-docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.19.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience; Rscript _build_html.r"
+docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.21.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience; Rscript _build_html.r"

build_pdf.sh

Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ cp -r data/ pdf/data
 cp -r img/ pdf/img
 
 # Build the book with bookdown
-docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.19.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience/pdf; Rscript _build_pdf.r"
+docker run --rm -m 5g -v $(pwd):/home/rstudio/introduction-to-datascience ubcdsci/intro-to-ds:v0.21.0 /bin/bash -c "cd /home/rstudio/introduction-to-datascience/pdf; Rscript _build_pdf.r"
 
 # clean files in pdf dir
 rm -rf pdf/references.bib

classification1.Rmd

Lines changed: 24 additions & 20 deletions

@@ -5,6 +5,7 @@ library(formatR)
 library(plotly)
 library(knitr)
 library(kableExtra)
+library(ggpubr)
 
 knitr::opts_chunk$set(echo = TRUE,
                       fig.align = "center")
@@ -209,7 +210,7 @@ for light orange and `"steelblue2"` for light blue—and
 We also make the category labels ("B" and "M") more readable by
 changing them to "Benign" and "Malignant" using the `labels` argument.
 
-```{r 05-scatter, fig.height = 4, fig.width = 5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
+```{r 05-scatter, fig.height = 3.5, fig.width = 4.5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
 perim_concav <- cancer %>%
   ggplot(aes(x = Perimeter, y = Concavity, color = Class)) +
   geom_point(alpha = 0.6) +
@@ -285,7 +286,7 @@ new observation, with standardized perimeter of `r new_point[1]` and standardize
 diagnosis "Class" is unknown. This new observation is depicted by the red, diamond point in
 Figure \@ref(fig:05-knn-1).
 
-```{r 05-knn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-knn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav_with_new_point <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
          Concavity = new_point[2],
@@ -317,7 +318,7 @@ then the perimeter and concavity values are similar, and so we may expect that
 they would have the same diagnosis.
 
 
-```{r 05-knn-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
+```{r 05-knn-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label."}
 perim_concav_with_new_point +
   geom_segment(aes(
     x = new_point[1],
@@ -342,7 +343,7 @@ Does this seem like the right prediction to make for this observation? Probably
 not, if you consider the other nearby points...
 
 
-```{r 05-knn-4, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
+```{r 05-knn-4, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label."}
 
 perim_concav_with_new_point2 <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
@@ -382,7 +383,7 @@ see that the diagnoses of 2 of the 3 nearest neighbors to our new observation
 are malignant. Therefore we take majority vote and classify our new red, diamond
 observation as malignant.
 
-```{r 05-knn-5, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors."}
+```{r 05-knn-5, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with three nearest neighbors."}
 perim_concav_with_new_point2 +
   geom_segment(aes(
     x = new_point[1], y = new_point[2],
@@ -432,7 +433,7 @@ You will see in the `mutate` \index{mutate} step below, we compute the straight-
 distance using the formula above: we square the differences between the two observations' perimeter
 and concavity coordinates, add the squared differences, and then take the square root.
 
-```{r 05-multiknn-1, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
+```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav <- bind_rows(cancer,
   tibble(Perimeter = new_point[1],
          Concavity = new_point[2],
@@ -514,7 +515,7 @@ The result of this computation shows that 3 of the 5 nearest neighbors to our ne
 malignant (`M`); since this is the majority, we classify our new observation as malignant.
 These 5 neighbors are circled in Figure \@ref(fig:05-multiknn-3).
 
-```{r 05-multiknn-3, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
+```{r 05-multiknn-3, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
 perim_concav + annotate("path",
   x = new_point[1] + 1.4 * cos(seq(0, 2 * pi,
                                    length.out = 100
@@ -903,7 +904,7 @@ Standardizing your data should be a part of the preprocessing you do
 before predictive modeling and you should always think carefully about your problem domain and
 whether you need to standardize your data.
 
-```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.width = 10, fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data."}
+```{r 05-scaling-plt, echo = FALSE, fig.height = 4, fig.cap = "Comparison of K = 3 nearest neighbors with standardized and unstandardized data."}
 
 attrs <- c("Area", "Smoothness")
 
@@ -994,10 +995,11 @@ scaled <- ggplot(scaled_cancer, aes(x = Area,
   yend = unlist(neighbors_scaled[3, attrs[2]])
   ), color = "black", size = 0.5)
 
-gridExtra::grid.arrange(unscaled, scaled, ncol = 2)
+ggarrange(unscaled, scaled, ncol = 2, common.legend = TRUE, legend = "bottom")
+
 ```
 
-```{r 05-scaling-plt-zoomed, fig.height = 4, fig.width = 10, echo = FALSE, fig.cap = "Close up of three nearest neighbors for unstandardized data."}
+```{r 05-scaling-plt-zoomed, fig.height = 4.5, fig.width = 9, echo = FALSE, fig.cap = "Close up of three nearest neighbors for unstandardized data."}
 library(ggforce)
 ggplot(unscaled_cancer, aes(x = Area,
                             y = Smoothness,
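
Aside: the replacement of `gridExtra::grid.arrange` with `ggpubr::ggarrange` above is what lets the standardized and unstandardized panels share one legend. A minimal sketch of the pattern, using two made-up panels on a built-in dataset (only the `ggarrange` call with `common.legend` and `legend` mirrors the change):

```r
library(ggplot2)
library(ggpubr)

# two panels that use the same color mapping, each drawn with its own legend
p1 <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + geom_point()
p2 <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) + geom_point()

# ggarrange collapses the duplicate legends into a single shared one at the bottom
ggarrange(p1, p2, ncol = 2, common.legend = TRUE, legend = "bottom")
```
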
@@ -1029,11 +1031,11 @@ ggplot(unscaled_cancer, aes(x = Area,
   x = unlist(new_obs[1]), y = unlist(new_obs[2]),
   xend = unlist(neighbors[3, attrs[1]]),
   yend = unlist(neighbors[3, attrs[2]])
-  ), color = "black") + theme_light() +
-  # facet_zoom( xlim = c(399.7, 401.6), ylim = c(0.08, 0.14), zoom.size = 2) +
+  ), color = "black") +
   facet_zoom(x = ( Area > 380 & Area < 420) ,
              y = (Smoothness > 0.08 & Smoothness < 0.14), zoom.size = 2) +
-  theme_bw()
+  theme_bw() +
+  theme(text = element_text(size = 14), legend.position="bottom")
 ```
 
 ### Balancing
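
Aside: `ggforce::facet_zoom`, used in the chunk above, draws the full scatter plot alongside a second panel zoomed to a region selected by logical conditions on the axes. A rough, self-contained sketch on a built-in dataset (the variables and limits here are invented; only the `facet_zoom` usage mirrors the chunk):

```r
library(ggplot2)
library(ggforce)

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  # second panel zoomed to the rectangle where both conditions hold;
  # zoom.size sets the zoomed panel's size relative to the original
  facet_zoom(x = (Petal.Length > 4 & Petal.Length < 5),
             y = (Petal.Width > 1.2 & Petal.Width < 1.8),
             zoom.size = 2) +
  theme_bw()
```
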
@@ -1058,14 +1060,14 @@ function, which takes two arguments: a data frame-like object,
 and the number of rows to select from the top (`n`).
 The new imbalanced data is shown in Figure \@ref(fig:05-unbalanced).
 
-```{r 05-unbalanced-seed, echo = FALSE, fig.height = 4, fig.width = 5, warning = FALSE, message = FALSE}
+```{r 05-unbalanced-seed, echo = FALSE, fig.height = 3.5, fig.width = 4.5, warning = FALSE, message = FALSE}
 # hidden seed here for reproducibility
 # randomness shouldn't affect much in this use of step_upsample,
 # but just in case...
 set.seed(3)
 ```
 
-```{r 05-unbalanced, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data."}
+```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data."}
 rare_cancer <- bind_rows(
   filter(cancer, Class == "B"),
   cancer |> filter(Class == "M") |> slice_head(n = 3)
@@ -1093,7 +1095,7 @@ benign, and the benign vote will always win. For example, Figure \@ref(fig:05-up
 shows what happens for a new tumor observation that is quite close to three observations
 in the training data that were tagged as malignant.
 
-```{r 05-upsample, echo=FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with 7 nearest neighbors to a new observation highlighted."}
+```{r 05-upsample, echo=FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data with 7 nearest neighbors to a new observation highlighted."}
 new_point <- c(2, 2)
 attrs <- c("Perimeter", "Concavity")
 my_distances <- table_with_distances(rare_cancer[, attrs], new_point)
@@ -1145,7 +1147,7 @@ each area of the plot to the predictions the $K$-nearest neighbor
 classifier would make. We can see that the decision is
 always "benign," corresponding to the blue color.
 
-```{r 05-upsample-2, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data."}
+```{r 05-upsample-2, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data."}
 
 knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
   set_engine("kknn") |>
@@ -1223,7 +1225,7 @@ classifier would make. We can see that the decision is more reasonable; when the
 to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are
 closer to the benign tumor observations.
 
-```{r 05-upsample-plot, echo = FALSE, fig.height = 4, fig.width = 5, fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
+```{r 05-upsample-plot, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap = "Upsampled data with background color indicating the decision of the classifier."}
 knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
   set_engine("kknn") |>
   set_mode("classification")
@@ -1333,7 +1335,7 @@ predict the label of each, and visualize the predictions with a colored scatter
 > textbook. It is included for those readers who would like to use similar
 > visualizations in their own data analyses.
 
-```{r 05-workflow-plot-show, fig.height = 4, fig.width = 5, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
+```{r 05-workflow-plot-show, fig.height = 3.5, fig.width = 4.6, fig.cap = "Scatter plot of smoothness versus area where background color indicates the decision of the classifier."}
 # create the grid of area/smoothness vals, and arrange in a data frame
 are_grid <- seq(min(unscaled_cancer$Area),
                 max(unscaled_cancer$Area),
@@ -1367,7 +1369,9 @@ wkflw_plot <-
   color = Class),
   alpha = 0.02,
   size = 5) +
-  labs(color = "Diagnosis") +
+  labs(color = "Diagnosis",
+       x = "Area (standardized)",
+       y = "Smoothness (standardized)") +
   scale_color_manual(labels = c("Malignant", "Benign"),
                      values = c("orange2", "steelblue2"))

classification2.Rmd

Lines changed: 25 additions & 15 deletions

@@ -2,6 +2,7 @@
 
 ```{r classification2-setup, echo = FALSE, message = FALSE, warning = FALSE}
 library(gridExtra)
+library(cowplot)
 
 knitr::opts_chunk$set(fig.align = "center")
 ```
@@ -187,7 +188,7 @@ tumor cell concavity versus smoothness colored by diagnosis in Figure \@ref(fig:
 You will also notice that we set the random seed here at the beginning of the analysis
 using the `set.seed` function, as described in Section \@ref(randomseeds).
 
-```{r 06-precode, fig.height = 4, fig.width = 5, fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label.", message = F, warning = F}
+```{r 06-precode, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label.", message = F, warning = F}
 # load packages
 library(tidyverse)
 library(tidymodels)
@@ -753,7 +754,7 @@ We can select the best value of the number of neighbors (i.e., the one that resu
 in the highest classifier accuracy estimate) by plotting the accuracy versus $K$
 in Figure \@ref(fig:06-find-k).
 
-```{r 06-find-k, fig.height = 4, fig.width = 5, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
+```{r 06-find-k, fig.height = 3.5, fig.width = 4, fig.cap= "Plot of estimated accuracy versus the number of neighbors."}
 accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
   geom_point() +
   geom_line() +
@@ -799,7 +800,7 @@ we vary $K$ from 1 to almost the number of observations in the data set.
 set.seed(1)
 ```
 
-```{r 06-lots-of-ks, message = FALSE, fig.height = 4, fig.width = 5, fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
+```{r 06-lots-of-ks, message = FALSE, fig.height = 3.5, fig.width = 4, fig.cap="Plot of accuracy estimate versus number of neighbors for many K values."}
 k_lots <- tibble(neighbors = seq(from = 1, to = 385, by = 10))
 
 knn_results <- workflow() |>
@@ -848,7 +849,7 @@ a balance between the two. You can see these two effects in Figure
 \@ref(fig:06-decision-grid-K), which shows how the classifier changes as
 we set the number of neighbors $K$ to 1, 7, 20, and 300.
 
-```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 7, fig.width = 10, fig.cap = "Effect of K in overfitting and underfitting."}
+```{r 06-decision-grid-K, echo = FALSE, message = FALSE, fig.height = 10, fig.width = 10, fig.cap = "Effect of K in overfitting and underfitting."}
 ks <- c(1, 7, 20, 300)
 plots <- list()
 
@@ -893,9 +894,14 @@ for (i in 1:length(ks)) {
   labs(color = "Diagnosis") +
   ggtitle(paste("K = ", ks[[i]])) +
   scale_color_manual(labels = c("Malignant", "Benign"),
-                     values = c("orange2", "steelblue2"))
-}
-grid.arrange(grobs = plots)
+                     values = c("orange2", "steelblue2")) +
+  theme(text = element_text(size = 18))
+}
+
+p_no_legend <- lapply(plots, function(x) x + theme(legend.position = "none"))
+legend <- get_legend(plots[[1]] + theme(legend.position = "bottom"))
+p_grid <- plot_grid(plotlist = p_no_legend, ncol = 2)
+plot_grid(p_grid, legend, ncol = 1, rel_heights = c(1, 0.2))
 ```
 
 ## Summary
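
Aside: the added `cowplot` code implements the shared-legend idea for the grid of decision-boundary plots: remove each panel's legend, extract one legend, and stack it under the grid. A minimal sketch of that pattern with a made-up list of panels (the `get_legend`/`plot_grid` calls mirror the change; everything else is illustrative):

```r
library(ggplot2)
library(cowplot)

# a hypothetical list of panels that all share the same color mapping
plots <- lapply(c("wt", "hp", "disp", "qsec"), function(v) {
  ggplot(mtcars, aes(x = .data[[v]], y = mpg, color = factor(cyl))) +
    geom_point() +
    ggtitle(v)
})

# strip the per-panel legends and keep one copy, placed at the bottom
p_no_legend <- lapply(plots, function(x) x + theme(legend.position = "none"))
legend <- get_legend(plots[[1]] + theme(legend.position = "bottom"))

# arrange the panels in a 2x2 grid, then stack the shared legend underneath
p_grid <- plot_grid(plotlist = p_no_legend, ncol = 2)
plot_grid(p_grid, legend, ncol = 1, rel_heights = c(1, 0.2))
```
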
@@ -999,7 +1005,7 @@ variables there are, the more (random) influence they have, and the more they
 corrupt the set of nearest neighbors that vote on the class of the new
 observation to predict.
 
-```{r 06-performance-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "100%", fig.cap = "Effect of inclusion of irrelevant predictors."}
+```{r 06-performance-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "60%", fig.cap = "Effect of inclusion of irrelevant predictors."}
 # get accuracies after including k irrelevant features
 ks <- c(0, 5, 10, 15, 20, 40)
 fixedaccs <- list()
@@ -1072,7 +1078,8 @@ res <- tibble(ks = ks, accs = accs, fixedaccs = fixedaccs, nghbrs = nghbrs)
 plt_irrelevant_accuracies <- ggplot(res) +
   geom_line(mapping = aes(x=ks, y=accs)) +
   labs(x = "Number of Irrelevant Predictors",
-       y = "Model Accuracy Estimate")
+       y = "Model Accuracy Estimate") +
+  theme(text = element_text(size = 18))
 
 plt_irrelevant_accuracies
 ```
@@ -1088,24 +1095,26 @@ variables, the number of neighbors does not increase smoothly; but the general t
 Figure \@ref(fig:06-fixed-irrelevant-features) corroborates
 this evidence; if we fix the number of neighbors to $K=3$, the accuracy falls off more quickly.
 
-```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "100%", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors."}
+```{r 06-neighbors-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "60%", fig.cap = "Tuned number of neighbors for varying number of irrelevant predictors."}
 plt_irrelevant_nghbrs <- ggplot(res) +
   geom_line(mapping = aes(x=ks, y=nghbrs)) +
   labs(x = "Number of Irrelevant Predictors",
-       y = "Number of neighbors")
+       y = "Number of neighbors") +
+  theme(text = element_text(size = 18))
 
 plt_irrelevant_nghbrs
 ```
 
-```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "100%", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
+```{r 06-fixed-irrelevant-features, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "75%", fig.cap = "Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors."}
 res_tmp <- res %>% pivot_longer(cols=c("accs", "fixedaccs"),
                                 names_to="Type",
                                 values_to="accuracy")
 
 plt_irrelevant_nghbrs <- ggplot(res_tmp) +
   geom_line(mapping = aes(x=ks, y=accuracy, color=Type)) +
   labs(x = "Number of Irrelevant Predictors", y = "Accuracy") +
-  scale_color_discrete(labels= c("Tuned K", "K = 3"))
+  scale_color_discrete(labels= c("Tuned K", "K = 3")) +
+  theme(text = element_text(size = 16))
 
 plt_irrelevant_nghbrs
 ```
@@ -1333,11 +1342,12 @@ where the elbow occurs, and whether adding a variable provides a meaningful incr
 > part of tuning your classifier, you *cannot use your test data* for this
 > process!
 
-```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "100%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection."}
+```{r 06-fwdsel-3, echo = FALSE, warning = FALSE, fig.retina = 2, out.width = "60%", fig.cap = "Estimated accuracy versus the number of predictors for the sequence of models built using forward selection."}
 fwd_sel_accuracies_plot <- accuracies |>
   ggplot(aes(x = size, y = accuracy)) +
   geom_line() +
-  labs(x = "Number of Predictors", y = "Estimated Accuracy")
+  labs(x = "Number of Predictors", y = "Estimated Accuracy") +
+  theme(text = element_text(size = 18))
 
 fwd_sel_accuracies_plot
 ```
