Commit 183370a

80ch line limit
1 parent 058357f commit 183370a

1 file changed: source/classification1.Rmd (+41, -31 lines)
@@ -1296,17 +1296,23 @@ upsampled_plot

### Missing data

One of the most common issues in real data sets in the wild is *missing data*,
i.e., observations where the values of some of the variables were not recorded.
Unfortunately, as common as it is, handling missing data properly is very
challenging and generally relies on expert knowledge about the data, setting,
and how the data were collected. One typical challenge with missing data is
that missing entries can be *informative*: the very fact that an entry is
missing can be related to the values of other variables. For example, survey
participants from a marginalized group of people may be less likely to respond
to certain kinds of questions if they fear that answering honestly will come
with negative consequences. In that case, if we were to simply throw away data
with missing entries, we would bias the conclusions of the survey by
inadvertently removing many members of that group of respondents. So ignoring
this issue in real problems can easily lead to misleading analyses, with
detrimental impacts. In this book, we will cover only those techniques for
dealing with missing entries in situations where entries are just "randomly
missing", i.e., where the fact that certain entries are missing *isn't related
to anything else* about the observation.

Let's load and examine a modified subset of the tumor image data
that has a few missing entries:
@@ -1317,25 +1323,27 @@ missing_cancer <- read_csv("data/wdbc_missing.csv") |>
  mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
missing_cancer
```

Recall that K-nearest neighbor classification makes predictions by computing
the straight-line distance to nearby training observations, and hence requires
access to the values of *all* variables for *all* observations in the training
data. So how can we perform K-nearest neighbor classification in the presence
of missing data? Well, since there are not too many observations with missing
entries, one option is to simply remove those observations prior to building
the K-nearest neighbor classifier. We can accomplish this by using the
`drop_na` function from `tidyverse` prior to working with the data.

```{r 05-naomit}
no_missing_cancer <- missing_cancer |> drop_na()
no_missing_cancer
```
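Before dropping rows, it is worth checking how many entries are actually missing, since `drop_na` discards an entire row for a single missing value. Here is a rough sketch of that check on a small hypothetical tibble (we do not reproduce the contents of `missing_cancer` here, so the names and values below are made up for illustration):

```r
library(tidyverse)

# Hypothetical stand-in for a data frame with a few missing entries.
toy <- tibble(
  Symmetry = c(0.18, NA, 0.21, 0.19),
  Radius   = c(14.2, 13.0, NA, NA),
  Class    = c("Benign", "Malignant", "Benign", "Benign")
)

# Count the missing entries in each column...
toy |> summarize(across(everything(), ~ sum(is.na(.x))))

# ...and see how many complete rows drop_na() would keep (here, 1 of 4).
toy |> drop_na() |> nrow()
```

If only a small fraction of the rows is complete, dropping rows throws away too much data, and imputation becomes the more attractive option.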

However, this strategy will not work when many of the rows have missing
entries, as we may end up throwing away too much data. In this case, another
possible approach is to *impute* the missing entries, i.e., fill in synthetic
values based on the other observations in the data set. One reasonable choice
is to perform *mean imputation*, where missing entries are filled in using the
mean of the present entries in each variable. To perform mean imputation, we
add the `step_impute_mean` step to the `tidymodels` preprocessing recipe.

```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE}
impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
  step_impute_mean(all_predictors()) |>
@@ -1347,19 +1355,21 @@ impute_missing_recipe
  hidden_print_cli(impute_missing_recipe)
```

We can now include this recipe in a `workflow`. To visualize what mean
imputation does, let's just apply the recipe directly to the `missing_cancer`
data frame using the `bake` function. The imputation step fills in the missing
entries with the mean values of their corresponding variables.

```{r 05-impute-bake}
imputed_cancer <- bake(impute_missing_recipe, missing_cancer)
imputed_cancer
```
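To demystify what `step_impute_mean` computes, here is a sketch of the same operation written with plain `tidyverse` verbs on a small hypothetical tibble (the column names and values are invented for illustration; in a real analysis you would use the recipe inside a `workflow`):

```r
library(tidyverse)

# Hypothetical tibble with one missing entry per column.
toy <- tibble(
  Symmetry = c(0.18, NA, 0.22),
  Radius   = c(14.0, 12.0, NA)
)

# Replace each NA with the mean of the observed values in its column,
# which is exactly the quantity that mean imputation fills in.
imputed <- toy |>
  mutate(across(
    where(is.numeric),
    ~ replace_na(.x, mean(.x, na.rm = TRUE))
  ))
imputed
# Symmetry becomes (0.18, 0.20, 0.22); Radius becomes (14, 12, 13).
```

One caveat: in a real analysis the means should be estimated from the training set only and then applied to new data, which is what prepping the recipe inside a `workflow` handles for you automatically.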

Many other options for missing data imputation can be found in
[the `recipes` documentation](https://recipes.tidymodels.org/reference/index.html).
However you decide to handle missing data in your data analysis, it is always
crucial to think critically about the setting, how the data were collected,
and the question you are answering.
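As one example from that reference, `step_impute_median` swaps the mean for the median, which can be more robust when a variable is skewed. A sketch on a small hypothetical data frame (not the book's `missing_cancer`; the names and values below are made up):

```r
library(recipes)

# Hypothetical data frame with one missing predictor value.
toy <- tibble::tibble(
  Radius = c(14, NA, 10, 12),
  Class  = factor(c("Benign", "Malignant", "Benign", "Benign"))
)

# Same recipe structure as before, with median imputation instead of mean.
median_recipe <- recipe(Class ~ ., data = toy) |>
  step_impute_median(all_predictors())

# prep() estimates the medians; bake() fills in the missing entries.
baked <- bake(prep(median_recipe), toy)
baked$Radius
# The NA is replaced by median(14, 10, 12) = 12.
```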

## Putting it together in a `workflow` {#puttingittogetherworkflow}
