Commit 67e8218

Merge pull request #501 from UBC-DSCI/no-legend-editing
Remove legend editing, replace with renaming values in the df itself
2 parents 0cde56c + 6d31c81 commit 67e8218
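
The change described above replaces per-plot legend relabeling with a one-time rename of the factor values in the data frame. A minimal sketch of that idea follows, assuming a small made-up data frame named `toy` (hypothetical, not the book's cancer data):

```r
library(dplyr)
library(forcats)

# hypothetical toy data standing in for the cancer data frame
toy <- tibble(Class = c("M", "B", "B", "M"))

toy <- toy |>
  # convert the character column to a factor
  mutate(Class = as_factor(Class)) |>
  # rename the factor values once, using "new name" = "old name"
  mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))

# the readable names now travel with the data into every plot and table
levels(toy$Class)
```

Because the readable names live in the data itself, later plotting code no longer needs a `labels` argument to keep the legend text in sync.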

2 files changed: 51 additions and 71 deletions

source/classification1.Rmd

Lines changed: 45 additions & 65 deletions
Original file line number | Diff line number | Diff line change
@@ -190,29 +190,33 @@ glimpse(cancer)
190190
```
191191

192192
From the summary of the data above, we can see that `Class` is of type character
193-
(denoted by `<chr>`). Since we will be working with `Class` as a
194-
categorical statistical variable, we will convert it to a factor using the
195-
function `as_factor`. \index{factor!as\_factor}
196-
193+
(denoted by `<chr>`). We can use the `distinct`\index{distinct} function to see all the unique
194+
values present in that column. We see that there are two diagnoses: benign, represented by "B",
195+
and malignant, represented by "M".
196+
```{r 05-distinct}
197+
cancer |>
198+
distinct(Class)
199+
```
200+
Since we will be working with `Class` as a categorical
201+
variable, it is a good idea to convert it to a factor type using the `as_factor` function. \index{factor!as\_factor}
202+
We will also improve the readability of our analysis by renaming "M" to
203+
"Malignant" and "B" to "Benign" using the `fct_recode` method. The `fct_recode` method \index{factor!fct\_recode}
204+
is used to replace the names of factor values with other names. The arguments of `fct_recode` are the column that you
205+
want to modify, followed by any number of arguments of the form `"new name" = "old name"` to specify the renaming scheme.
206+
197207
```{r 05-class}
198208
cancer <- cancer |>
199-
mutate(Class = as_factor(Class))
209+
mutate(Class = as_factor(Class)) |>
210+
mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
200211
glimpse(cancer)
201212
```
202213

203-
Recall that factors have what are called "levels", which you can think of as categories. We
204-
can verify the levels of the `Class` column by using the `levels`\index{levels}\index{factor!levels} function.
205-
This function should return the name of each category in that column. Given
206-
that we only have two different values in our `Class` column (B for benign and M
207-
for malignant), we only expect to get two names back. Note that the `levels` function requires a *vector* argument;
208-
so we use the `pull` function to extract a single column (`Class`) and
209-
pass that into the `levels` function to see the categories
210-
in the `Class` column.
214+
Let's verify that we have successfully converted the `Class` column to a factor variable
215+
and renamed its values to "Benign" and "Malignant" using the `distinct` function once more.
211216

212-
```{r 05-levels}
217+
```{r 05-distinct2}
213218
cancer |>
214-
pull(Class) |>
215-
levels()
219+
distinct(Class)
216220
```
217221

218222
### Exploring the cancer data
@@ -238,8 +242,6 @@ perimeter and concavity variables. Rather than use `ggplot's` default palette,
238242
we select our own colorblind-friendly colors&mdash;`"orange2"`
239243
for light orange and `"steelblue2"` for light blue&mdash;and
240244
pass them as the `values` argument to the `scale_color_manual` function.
241-
We also make the category labels ("B" and "M") more readable by
242-
changing them to "Benign" and "Malignant" using the `labels` argument.
243245

244246
```{r 05-scatter, fig.height = 3.5, fig.width = 4.5, fig.cap= "Scatter plot of concavity versus perimeter colored by diagnosis label."}
245247
perim_concav <- cancer |>
@@ -248,8 +250,7 @@ perim_concav <- cancer |>
248250
labs(x = "Perimeter (standardized)",
249251
y = "Concavity (standardized)",
250252
color = "Diagnosis") +
251-
scale_color_manual(labels = c("Malignant", "Benign"),
252-
values = c("orange2", "steelblue2")) +
253+
scale_color_manual(values = c("orange2", "steelblue2")) +
253254
theme(text = element_text(size = 12))
254255
perim_concav
255256
```
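
With the `labels` argument removed, the legend text comes straight from the factor levels, and unnamed `values` are matched to those levels in order. One way to make the color assignment independent of level order is to name the entries of `values`. The sketch below assumes a hypothetical toy data frame and is not the book's own plotting code:

```r
library(ggplot2)

# hypothetical toy data with already-renamed factor values
toy <- data.frame(
  Perimeter = c(-1, 0.5, 1.2, -0.3),
  Concavity = c(-0.8, 1.1, 0.9, -0.2),
  Class = factor(c("Benign", "Malignant", "Malignant", "Benign"))
)

toy_plot <- ggplot(toy, aes(x = Perimeter, y = Concavity, color = Class)) +
  geom_point() +
  labs(color = "Diagnosis") +
  # named values pin each color to a level regardless of level order
  scale_color_manual(values = c("Malignant" = "orange2", "Benign" = "steelblue2"))

toy_plot
```

Naming the colors is optional. The unnamed form used in the diff also works, provided the order of `values` agrees with the order of the factor levels.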
@@ -333,13 +334,10 @@ perim_concav_with_new_point <- bind_rows(cancer,
333334
labs(color = "Diagnosis", x = "Perimeter (standardized)",
334335
y = "Concavity (standardized)") +
335336
scale_color_manual(name = "Diagnosis",
336-
labels = c("Benign", "Malignant", "Unknown"),
337337
values = c("steelblue2", "orange2", "red")) +
338338
scale_shape_manual(name = "Diagnosis",
339-
labels = c("Benign", "Malignant", "Unknown"),
340339
values= c(16, 16, 18))+
341340
scale_size_manual(name = "Diagnosis",
342-
labels = c("Benign", "Malignant", "Unknown"),
343341
values= c(2, 2, 2.5))
344342
perim_concav_with_new_point
345343
```
@@ -391,13 +389,10 @@ perim_concav_with_new_point2 <- bind_rows(cancer,
391389
x = "Perimeter (standardized)",
392390
y = "Concavity (standardized)") +
393391
scale_color_manual(name = "Diagnosis",
394-
labels = c("Benign", "Malignant", "Unknown"),
395392
values = c("steelblue2", "orange2", "red")) +
396393
scale_shape_manual(name = "Diagnosis",
397-
labels = c("Benign", "Malignant", "Unknown"),
398394
values= c(16, 16, 18))+
399395
scale_size_manual(name = "Diagnosis",
400-
labels = c("Benign", "Malignant", "Unknown"),
401396
values= c(2, 2, 2.5))
402397
perim_concav_with_new_point2 +
403398
geom_segment(aes(
@@ -488,13 +483,10 @@ perim_concav <- bind_rows(cancer,
488483
breaks = seq(-2, 4, 1)) +
489484
labs(color = "Diagnosis") +
490485
scale_color_manual(name = "Diagnosis",
491-
labels = c("Benign", "Malignant", "Unknown"),
492486
values = c("steelblue2", "orange2", "red")) +
493487
scale_shape_manual(name = "Diagnosis",
494-
labels = c("Benign", "Malignant", "Unknown"),
495488
values= c(16, 16, 18))+
496489
scale_size_manual(name = "Diagnosis",
497-
labels = c("Benign", "Malignant", "Unknown"),
498490
values= c(2, 2, 2.5))
499491
500492
perim_concav
@@ -545,7 +537,7 @@ kable(math_table, booktabs = TRUE,
545537
```
546538

547539
The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are
548-
malignant (`M`); since this is the majority, we classify our new observation as malignant.
540+
malignant; since this is the majority, we classify our new observation as malignant.
549541
These 5 neighbors are circled in Figure \@ref(fig:05-multiknn-3).
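
The rule described here (find the $K$ closest observations, then predict the majority class) can be sketched as follows. The points in `toy` and the new observation `new_obs` are hypothetical values chosen for illustration, not the book's data:

```r
library(dplyr)

# hypothetical labelled points
toy <- tibble(
  Perimeter = c(0.2, 0.4, 1.8, 2.1, 2.4, -1.0, -1.2),
  Concavity = c(3.1, 2.9, 4.2, 3.9, 3.0, -0.5, -0.9),
  Class = factor(c("Benign", "Benign", "Malignant", "Malignant",
                   "Malignant", "Benign", "Benign"))
)

# hypothetical new observation to classify
new_obs <- c(Perimeter = 2, Concavity = 4)

# straight-line distance from each point to the new observation,
# keeping the 5 closest points
nearest <- toy |>
  mutate(dist = sqrt((Perimeter - new_obs[["Perimeter"]])^2 +
                     (Concavity - new_obs[["Concavity"]])^2)) |>
  slice_min(dist, n = 5)

# majority vote among the 5 neighbors
prediction <- nearest |>
  count(Class, sort = TRUE) |>
  slice(1) |>
  pull(Class)

prediction
```

In this toy example, 3 of the 5 nearest neighbors are malignant, so the vote yields "Malignant", mirroring the situation described in the text.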
550542

551543
```{r 05-multiknn-3, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.cap="Scatter plot of concavity versus perimeter with 5 nearest neighbors circled."}
@@ -602,7 +594,8 @@ cancer |>
602594
slice(1:5) # take the first 5 rows
603595
```
604596

605-
Based on $K=5$ nearest neighbors with these three predictors we would classify the new observation as malignant since 4 out of 5 of the nearest neighbors are malignant class.
597+
Based on $K=5$ nearest neighbors with these three predictors, we would classify
598+
the new observation as malignant since 4 out of 5 of the nearest neighbors are from the malignant class.
606599
Figure \@ref(fig:05-more) shows what the data look like when we visualize them
607600
as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors.
608601

@@ -621,8 +614,7 @@ neighbors_3 <- cancer[order(my_distances_3$Distance), ]
621614
data <- neighbors_3 |> select(Perimeter, Concavity, Symmetry) |> slice(1:5)
622615
623616
# add to the df
624-
scaled_cancer_3 <- bind_rows(cancer, new_obs_3) |>
625-
mutate(Class = fct_recode(Class, "Benign" = "B", "Malignant"= "M"))
617+
scaled_cancer_3 <- bind_rows(cancer, new_obs_3)
626618
627619
plot_3d <- scaled_cancer_3 |>
628620
plot_ly() |>
@@ -637,7 +629,7 @@ plot_3d <- scaled_cancer_3 |>
637629
color = ~Class,
638630
opacity = 0.4,
639631
size = 2,
640-
colors = c("orange2", "steelblue2", "red"),
632+
colors = c("steelblue2", "orange2", "red"),
641633
symbol = ~Class, symbols = c('circle','circle','diamond'))
642634
643635
x1 <- c(pull(new_obs_3[1]), data$Perimeter[1])
@@ -662,15 +654,15 @@ z5 <- c(pull(new_obs_3[3]), data$Symmetry[5])
662654
663655
plot_3d <- plot_3d |>
664656
add_trace(x = x1, y = y1, z = z1, type = "scatter3d", mode = "lines",
665-
name = "lines", showlegend = FALSE, color = I("steelblue2")) |>
657+
name = "lines", showlegend = FALSE, color = I("orange2")) |>
666658
add_trace(x = x2, y = y2, z = z2, type = "scatter3d", mode = "lines",
667-
name = "lines", showlegend = FALSE, color = I("steelblue2")) |>
659+
name = "lines", showlegend = FALSE, color = I("orange2")) |>
668660
add_trace(x = x3, y = y3, z = z3, type = "scatter3d", mode = "lines",
669-
name = "lines", showlegend = FALSE, color = I("steelblue2")) |>
670-
add_trace(x = x4, y = y4, z = z4, type = "scatter3d", mode = "lines",
671661
name = "lines", showlegend = FALSE, color = I("orange2")) |>
662+
add_trace(x = x4, y = y4, z = z4, type = "scatter3d", mode = "lines",
663+
name = "lines", showlegend = FALSE, color = I("steelblue2")) |>
672664
add_trace(x = x5, y = y5, z = z5, type = "scatter3d", mode = "lines",
673-
name = "lines", showlegend = FALSE, color = I("steelblue2"))
665+
name = "lines", showlegend = FALSE, color = I("orange2"))
674666
675667
if(!is_latex_output()){
676668
plot_3d
@@ -786,7 +778,7 @@ Finally, we make the prediction on the new observation by calling the `predict`
786778
passing both the fit object we just created and the new observation itself. As above,
787779
when we ran the $K$-nearest neighbors
788780
classification algorithm manually, the `knn_fit` object classifies the new observation as
789-
malignant ("M"). Note that the `predict` function outputs a data frame with a single
781+
malignant. Note that the `predict` function outputs a data frame with a single
790782
variable named `.pred_class`.
791783

792784
```{r 05-predict}
@@ -837,12 +829,15 @@ is said to be *standardized*, \index{standardization!K-nearest neighbors} and al
837829
and a standard deviation of 1. To illustrate the effect that standardization can have on the $K$-nearest
838830
neighbor algorithm, we will read in the original, unstandardized Wisconsin breast
839831
cancer data set; we have been using a standardized version of the data set up
840-
until now. To keep things simple, we will just use the `Area`, `Smoothness`, and `Class`
832+
until now. As before, we will convert the `Class` variable to the factor type
833+
and rename the values to "Malignant" and "Benign."
834+
To keep things simple, we will just use the `Area`, `Smoothness`, and `Class`
841835
variables:
842836

843837
```{r 05-scaling-1, message = FALSE}
844838
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") |>
845839
mutate(Class = as_factor(Class)) |>
840+
mutate(Class = fct_recode(Class, "Benign" = "B", "Malignant" = "M")) |>
846841
select(Class, Area, Smoothness)
847842
unscaled_cancer
848843
```
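
To illustrate what "standardized" means in the passage above (mean 0 and standard deviation 1 for each variable), here is a minimal sketch that standardizes by hand. The data frame `toy` and its values are hypothetical, and this manual arithmetic is only an illustration of the definition, not necessarily how the book preprocesses its data:

```r
library(dplyr)

# hypothetical unstandardized measurements on very different scales
toy <- tibble(
  Area = c(500, 1200, 800, 650),
  Smoothness = c(0.08, 0.12, 0.10, 0.09)
)

# subtract each column's mean and divide by its standard deviation
standardized <- toy |>
  mutate(across(everything(), \(x) (x - mean(x)) / sd(x)))

standardized
```

After this step both columns are on a comparable scale, so neither dominates a distance calculation simply because of its units.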
@@ -972,13 +967,10 @@ unscaled <- ggplot(unscaled_cancer, aes(x = Area,
972967
shape = Class, size = Class)) +
973968
geom_point(alpha = 0.6) +
974969
scale_color_manual(name = "Diagnosis",
975-
labels = c("Benign", "Malignant", "Unknown"),
976970
values = c("steelblue2", "orange2", "red")) +
977971
scale_shape_manual(name = "Diagnosis",
978-
labels = c("Benign", "Malignant", "Unknown"),
979972
values= c(16, 16, 18)) +
980973
scale_size_manual(name = "Diagnosis",
981-
labels = c("Benign", "Malignant", "Unknown"),
982974
values=c(2,2,2.5)) +
983975
ggtitle("Unstandardized Data") +
984976
geom_segment(aes(
@@ -1015,13 +1007,10 @@ scaled <- ggplot(scaled_cancer, aes(x = Area,
10151007
size = Class)) +
10161008
geom_point(alpha = 0.6) +
10171009
scale_color_manual(name = "Diagnosis",
1018-
labels = c("Benign", "Malignant", "Unknown"),
10191010
values = c("steelblue2", "orange2", "red")) +
10201011
scale_shape_manual(name = "Diagnosis",
1021-
labels = c("Benign", "Malignant", "Unknown"),
10221012
values= c(16, 16, 18)) +
10231013
scale_size_manual(name = "Diagnosis",
1024-
labels = c("Benign", "Malignant", "Unknown"),
10251014
values=c(2,2,2.5)) +
10261015
ggtitle("Standardized Data") +
10271016
labs(x = "Area (standardized)", y = "Smoothness (standardized)") +
@@ -1055,13 +1044,10 @@ ggplot(unscaled_cancer, aes(x = Area,
10551044
shape = Class)) +
10561045
geom_point(size = 2.5, alpha = 0.6) +
10571046
scale_color_manual(name = "Diagnosis",
1058-
labels = c("Benign", "Malignant", "Unknown"),
10591047
values = c("steelblue2", "orange2", "red")) +
10601048
scale_shape_manual(name = "Diagnosis",
1061-
labels = c("Benign", "Malignant", "Unknown"),
10621049
values= c(16, 16, 18)) +
10631050
scale_size_manual(name = "Diagnosis",
1064-
labels = c("Benign", "Malignant", "Unknown"),
10651051
values = c(1, 1, 2.5)) +
10661052
ggtitle("Unstandardized Data") +
10671053
geom_segment(aes(
@@ -1119,8 +1105,8 @@ set.seed(3)
11191105

11201106
```{r 05-unbalanced, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap = "Imbalanced data."}
11211107
rare_cancer <- bind_rows(
1122-
filter(cancer, Class == "B"),
1123-
cancer |> filter(Class == "M") |> slice_head(n = 3)
1108+
filter(cancer, Class == "Benign"),
1109+
cancer |> filter(Class == "Malignant") |> slice_head(n = 3)
11241110
) |>
11251111
select(Class, Perimeter, Concavity)
11261112
@@ -1130,8 +1116,7 @@ rare_plot <- rare_cancer |>
11301116
labs(x = "Perimeter (standardized)",
11311117
y = "Concavity (standardized)",
11321118
color = "Diagnosis") +
1133-
scale_color_manual(labels = c("Malignant", "Benign"),
1134-
values = c("orange2", "steelblue2")) +
1119+
scale_color_manual(values = c("orange2", "steelblue2")) +
11351120
theme(text = element_text(size = 12))
11361121
11371122
rare_plot
@@ -1164,18 +1149,15 @@ rare_plot <- bind_rows(rare_cancer,
11641149
x = "Perimeter (standardized)",
11651150
y = "Concavity (standardized)") +
11661151
scale_color_manual(name = "Diagnosis",
1167-
labels = c("Benign", "Malignant", "Unknown"),
11681152
values = c("steelblue2", "orange2", "red")) +
11691153
scale_shape_manual(name = "Diagnosis",
1170-
labels = c("Benign", "Malignant", "Unknown"),
11711154
values= c(16, 16, 18))+
11721155
scale_size_manual(name = "Diagnosis",
1173-
labels = c("Benign", "Malignant", "Unknown"),
11741156
values= c(2, 2, 2.5))
11751157
11761158
for (i in 1:7) {
11771159
clr <- "steelblue2"
1178-
if (neighbors$Class[i] == "M") {
1160+
if (neighbors$Class[i] == "Malignant") {
11791161
clr <- "orange2"
11801162
}
11811163
rare_plot <- rare_plot +
@@ -1236,8 +1218,7 @@ rare_plot <-
12361218
labs(color = "Diagnosis",
12371219
x = "Perimeter (standardized)",
12381220
y = "Concavity (standardized)") +
1239-
scale_color_manual(labels = c("Malignant", "Benign"),
1240-
values = c("orange2", "steelblue2"))
1221+
scale_color_manual(values = c("orange2", "steelblue2"))
12411222
12421223
rare_plot
12431224
```
@@ -1308,8 +1289,7 @@ upsampled_plot <-
13081289
labs(color = "Diagnosis",
13091290
x = "Perimeter (standardized)",
13101291
y = "Concavity (standardized)") +
1311-
scale_color_manual(labels = c("Malignant", "Benign"),
1312-
values = c("orange2", "steelblue2"))
1292+
scale_color_manual(values = c("orange2", "steelblue2"))
13131293
13141294
upsampled_plot
13151295
```
@@ -1324,7 +1304,8 @@ First we will load the data, create a model, and specify a recipe for how the da
13241304
# load the unscaled cancer data
13251305
# and make sure the response variable, Class, is a factor
13261306
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") |>
1327-
mutate(Class = as_factor(Class))
1307+
mutate(Class = as_factor(Class)) |>
1308+
mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
13281309
13291310
# create the KNN model
13301311
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
@@ -1431,8 +1412,7 @@ wkflw_plot <-
14311412
labs(color = "Diagnosis",
14321413
x = "Area",
14331414
y = "Smoothness") +
1434-
scale_color_manual(labels = c("Malignant", "Benign"),
1435-
values = c("orange2", "steelblue2")) +
1415+
scale_color_manual(values = c("orange2", "steelblue2")) +
14361416
theme(text = element_text(size = 12))
14371417
14381418
wkflw_plot

source/classification2.Rmd

Lines changed: 6 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -234,16 +234,17 @@ set.seed(1)
234234
# load data
235235
cancer <- read_csv("data/unscaled_wdbc.csv") |>
236236
# convert the character Class variable to the factor datatype
237-
mutate(Class = as_factor(Class))
237+
mutate(Class = as_factor(Class)) |>
238+
# rename the factor values to be more readable
239+
mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
238240
239241
# create scatter plot of tumor cell concavity versus smoothness,
240242
# labeling the points be diagnosis class
241243
perim_concav <- cancer |>
242244
ggplot(aes(x = Smoothness, y = Concavity, color = Class)) +
243245
geom_point(alpha = 0.5) +
244246
labs(color = "Diagnosis") +
245-
scale_color_manual(labels = c("Malignant", "Benign"),
246-
values = c("orange2", "steelblue2")) +
247+
scale_color_manual(values = c("orange2", "steelblue2")) +
247248
theme(text = element_text(size = 12))
248249
249250
perim_concav
@@ -268,7 +269,7 @@ in the data does not influence the data that ends up in the training and testing
268269
Second, it **stratifies** the \index{stratification} data by the class label, to ensure that roughly
269270
the same proportion of each class ends up in both the training and testing sets. For example,
270271
in our data set, roughly 63% of the
271-
observations are from the benign class (`B`), and 37% are from the malignant class (`M`),
272+
observations are from the benign class, and 37% are from the malignant class,
272273
so `initial_split` ensures that roughly 63% of the training data are benign,
273274
37% of the training data are malignant,
274275
and the same proportions exist in the testing data.
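
The stratified split described above can be sketched with `initial_split` from the `rsample` package. The data frame `toy`, its 63/37 class mix, and the 75% training proportion are assumptions made for this illustration:

```r
library(rsample)

set.seed(1)

# hypothetical data with roughly the class balance mentioned in the text
toy <- data.frame(
  Class = factor(rep(c("Benign", "Malignant"), times = c(63, 37))),
  Smoothness = rnorm(100)
)

# stratify on Class so both sets keep about the same class proportions
toy_split <- initial_split(toy, prop = 0.75, strata = Class)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# class proportions in the training set
prop.table(table(toy_train$Class))
```

Comparing the printed proportions with `prop.table(table(toy$Class))` is a quick way to check that stratification kept the class balance close to that of the full data.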
@@ -958,8 +959,7 @@ for (i in 1:length(ks)) {
958959
size = 5.) +
959960
labs(color = "Diagnosis") +
960961
ggtitle(paste("K = ", ks[[i]])) +
961-
scale_color_manual(labels = c("Malignant", "Benign"),
962-
values = c("orange2", "steelblue2")) +
962+
scale_color_manual(values = c("orange2", "steelblue2")) +
963963
theme(text = element_text(size = 18), axis.title=element_text(size=18))
964964
}
965965
