Merge pull request #516 from UBC-DSCI/slice-min-max

trevorcampbell · web-flow · commit e02391af4c16 · 2023-08-06T11:13:02.000-07:00
slice_min / slice_max
diff --git a/source/classification1.Rmd b/source/classification1.Rmd
@@ -460,6 +460,7 @@ the $K=5$ neighbors that are nearest to our new point.
 You will see in the `mutate` \index{mutate} step below, we compute the straight-line
 distance using the formula above: we square the differences between the two observations' perimeter 
 and concavity coordinates, add the squared differences, and then take the square root.
+To find the $K=5$ nearest neighbors, we will use the `slice_min` function. \index{slice\_min}
 
 ```{r 05-multiknn-1, echo = FALSE, fig.height = 3.5, fig.width = 4.5, fig.pos = "H", out.extra="", fig.cap="Scatter plot of concavity versus perimeter with new observation represented as a red diamond."}
 perim_concav <- bind_rows(cancer, 
@@ -499,8 +500,7 @@ cancer |>
   select(ID, Perimeter, Concavity, Class) |>
   mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + 
                               (Concavity - new_obs_Concavity)^2)) |>
-  arrange(dist_from_new) |>
-  slice(1:5) # take the first 5 rows
+  slice_min(dist_from_new, n = 5) # take the 5 rows of minimum distance
 ```
 
 In Table \@ref(tab:05-multiknn-mathtable) we show in mathematical detail how
@@ -590,8 +590,7 @@ cancer |>
   mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + 
                               (Concavity - new_obs_Concavity)^2 +
                                 (Symmetry - new_obs_Symmetry)^2)) |>
-  arrange(dist_from_new) |>
-  slice(1:5) # take the first 5 rows
+  slice_min(dist_from_new, n = 5) # take the 5 rows of minimum distance
 ```
 
 Based on $K=5$ nearest neighbors with these three predictors, we would classify 
diff --git a/source/regression1.Rmd b/source/regression1.Rmd
@@ -233,13 +233,12 @@ sale price might be.
 For the example shown in Figure \@ref(fig:07-small-eda-regr), 
 we find and label the 5 nearest neighbors to our observation 
 of a house that is 2,000 square feet.
-\index{mutate}\index{slice}\index{arrange}\index{abs}
+\index{mutate}\index{slice\_min}\index{abs}
 
 ```{r 07-find-k3}
 nearest_neighbors <- small_sacramento |>
   mutate(diff = abs(2000 - sqft)) |>
-  arrange(diff) |>
-  slice(1:5) #subset the first 5 rows
+  slice_min(diff, n = 5)
 
 nearest_neighbors
 ```
diff --git a/source/viz.Rmd b/source/viz.Rmd
@@ -922,10 +922,18 @@ are hard to distinguish, and the names of the landmasses are obscuring each
 other as they have been squished into too little space. But remember that the
 question we asked was only about the largest landmasses; let's make the plot a
 little bit clearer by keeping only the largest 12 landmasses. We do this using
-the `slice_max` function.  Then to give the labels enough
+the `slice_max` function: the `order_by` argument names the column whose values
+determine which rows are largest, and the `n` argument specifies how many
+rows to keep. Then, to give the labels enough
 space, we'll use horizontal bars instead of vertical ones. We do this by
-swapping the `x` and `y` variables:
-\index{slice\_max}
+swapping the `x` and `y` variables.\index{slice\_max}\index{slice\_min}
+
+> **Note:** Recall that in Chapter \@ref(intro), we used `arrange` followed by `slice` to
+> obtain the ten rows with the largest values of a variable. We could have instead used
+> the `slice_max` function for this purpose. The `slice_max` and `slice_min` functions achieve
+> the same goal as `arrange` followed by `slice` in one step, and are slightly more efficient because
+> they are specialized for this purpose. By default they also keep rows tied for the last position, so
+> they may return more than `n` rows. In general, prefer more specialized functions when they are available!
 
 ```{r 03-data-islands-bar-2, warning=FALSE, message=FALSE, fig.width=5, fig.height=2.75, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Bar plot of size for Earth's largest 12 landmasses."}
 islands_top12 <- slice_max(islands_df, order_by = size, n = 12)