@@ -610,7 +610,7 @@ especially if we want to handle multiple classes, more than two variables,
 or predicting the class for multiple new observations. Thankfully, in R,
 the $K$-nearest neighbors algorithm is implemented in the `parsnip` package
 included in the
-[`tidymodels` meta package](https://www.tidymodels.org/), along with
+[`tidymodels` package](https://www.tidymodels.org/), along with
 many [other models](https://www.tidymodels.org/find/parsnip/)
 that you will encounter in this and future chapters of the book. The `tidymodels` collection
 provides tools to help make and use models, such as classifiers. Using the packages
@@ -627,7 +627,7 @@ We will use the `cancer` data set from above, with
 perimeter and concavity as predictors and $K = 5$ neighbors to build our classifier. Then
 we will use the classifier to predict the diagnosis label for a new observation with
 perimeter 0, concavity 3.5, and an unknown diagnosis label. Let's pick out our two desired
-predictor variables and class label and store it as a new dataset named `cancer_train`:
+predictor variables and class label and store them as a new data set named `cancer_train`:

 ```{r 05-tidymodels-2}
 cancer_train <- cancer |>
@@ -655,7 +655,7 @@ knn_spec
 ```

 In order to fit the model on the breast cancer data, we need to pass the model specification
-and the dataset to the `fit` function. We also need to specify what variables to use as predictors
+and the data set to the `fit` function. We also need to specify what variables to use as predictors
 and what variable to use as the target. Below, the `Class ~ Perimeter + Concavity` argument specifies
 that `Class` is the target variable (the one we want to predict),
 and both `Perimeter` and `Concavity` are to be used as the predictors.
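As an aside, the formula notation discussed in this hunk is plain base R, so it can be explored without `tidymodels`. The sketch below (base R only, reusing the variable names from the text, with no data attached) shows how a formula object separates the target from the predictors:

```r
# A formula records which variable is the target and which are predictors.
f <- Class ~ Perimeter + Concavity

all.vars(f)  # the variables involved: "Class", "Perimeter", "Concavity"
f[[2]]       # the left-hand side of the formula, i.e., the target: Class
```

Functions like `fit` inspect the formula in exactly this way to decide which columns of the data to use.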
@@ -698,8 +698,8 @@ predict(knn_fit, new_obs)

 Is this predicted malignant label the true class for this observation?
 Well, we don't know because we do not have this
-observation's diagnosis&mdash;that is what we were trying to predict.
-In the next chapter, we will
+observation's diagnosis&mdash;that is what we were trying to predict! The
+classifier's prediction is not necessarily correct, but in the next chapter, we will
 learn ways to quantify how accurate we think our predictions are.

 ## Data preprocessing with `tidymodels`
@@ -731,7 +731,8 @@ $K$-nearest neighbor classification algorithm, this large shift can change the
 outcome of using many other predictive models.

 To scale and center our data, we need to find
-our variables' mean and *standard deviation* (a number quantifying how spread out values are).
+our variables' *mean* (the average, which quantifies the "central" value of a
+set of numbers) and *standard deviation* (a number quantifying how spread out values are).
 For each observed value of the variable, we subtract the mean (center the variable)
 and divide by the standard deviation (scale the variable). When we do this, the data
 is said to be *standardized*, and all variables in a data set will have a mean of 0
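To make the arithmetic concrete, here is a minimal base-R sketch of standardization (the area values are made up for illustration, not taken from the real `cancer` data):

```r
# Standardize a variable: subtract the mean, then divide by the
# standard deviation.
area <- c(1001, 748, 520, 1130, 575)
area_std <- (area - mean(area)) / sd(area)

mean(area_std)  # effectively 0 (up to floating-point rounding)
sd(area_std)    # exactly 1
```

This is the same computation that the `recipes` preprocessing steps described below carry out for every selected column.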
@@ -795,7 +796,7 @@ For example:
 You can find [a full set of all the steps and variable selection functions](https://tidymodels.github.io/recipes/reference/index.html)
 on the recipes home page.

-Here we have calculated the required statistics based on the data input into the
+At this point, we have calculated the required statistics based on the data input into the
 recipe, but the data are not yet scaled and centered. To actually scale and center
 the data, we need to apply the `bake` function to the unscaled data.

@@ -805,10 +806,10 @@ scaled_cancer
 ```

 It may seem redundant that we had to both `bake` *and* `prep` to scale and center the data.
-However, we do this in two steps so we could specify a different data set in the `bake` step
-if desired, say, new data you want to predict, which were not part of the training set.
+However, we do this in two steps so we can specify a different data set in the `bake` step,
+for instance, new data that were not part of the training set.

-At this point, you may wonder why we are doing so much work just to center and
+You may wonder why we are doing so much work just to center and
 scale our variables. Can't we just manually scale and center the `Area` and
 `Smoothness` variables ourselves before building our $K$-nearest neighbor model? Well,
 technically *yes*; but doing so is error-prone. In particular, we might
@@ -951,7 +952,10 @@ ggplot(unscaled_cancer, aes(x = Area, y = Smoothness, group = Class, color = Class
     xend = unlist(neighbors[3, attrs[1]]),
     yend = unlist(neighbors[3, attrs[2]])
   ), color = "black") + theme_light() +
-  facet_zoom(xlim = c(399.7, 401.6), ylim = c(0.08, 0.14), zoom.size = 2)
+  # facet_zoom(xlim = c(399.7, 401.6), ylim = c(0.08, 0.14), zoom.size = 2) +
+  facet_zoom(x = (Area > 380 & Area < 420),
+             y = (Smoothness > 0.08 & Smoothness < 0.14), zoom.size = 2) +
+  theme_bw()
 ```

 ### Balancing
@@ -1000,7 +1004,10 @@ rare_plot
 > process, which then guarantees the same result, i.e., the same choice of 3
 > observations, each time the code is run. In general, when your code involves
 > random numbers, if you want *the same result* each time, you should use
-> `set.seed`; if you want a *different result* each time, you should not.
+> `set.seed`; if you want a *different result* each time, you should not.
+> You only need to `set.seed` once at the beginning of your analysis, so the
+> rest of the analysis uses seemingly random numbers.
+

 Suppose we now decided to use $K = 7$ in $K$-nearest neighbor classification.
 With only 3 observations of malignant tumors, the classifier
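The reproducibility behavior that the note above describes is easy to check directly in base R; this sketch (arbitrary seed value chosen for illustration) shows that the same seed yields the same draws:

```r
# Setting the same seed before each call makes the "random" draws repeat.
set.seed(2021)
first <- sample(1:100, 3)

set.seed(2021)
second <- sample(1:100, 3)

identical(first, second)  # TRUE: same seed, same result

# Continuing without resetting the seed produces fresh draws, which is
# why one set.seed call at the start of an analysis is enough.
third <- sample(1:100, 3)
```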