missing data polish/bugfixes/adding example data

trevorcampbell · trevorcampbell · commit 97eada939b46 · 2023-07-13T16:01:59.000-07:00
diff --git a/data/wdbc_missing.csv b/data/wdbc_missing.csv
@@ -0,0 +1,9 @@
+ID,Class,Radius,Texture,Perimeter,Area,Smoothness,Compactness,Concavity,Concave_Points,Symmetry,Fractal_Dimension
+842302,M,,,1.2688172627037921,0.983509520104142,1.5670874574786582,3.2806280641246857,2.650541786383573,2.530248864134298,2.215565541846305,2.25376381072807
+842517,M,1.8282119737343598,-0.3533215225500966,1.684472552277101,1.9070302686337925,-0.826235446757039,-0.486643477616135,-0.023824891805531347,0.5476622708254778,0.001391139243576388,-0.8678888068037953
+84300903,M,1.5784992020342323,,1.5651259839837746,1.5575131853441093,0.941382123037953,1.051999895332493,1.362279788963212,2.0354397832616953,0.9388587199172193,-0.39765801323729066
+84348301,M,-0.7682333229203782,0.25350905052192196,-0.5921661228907633,-0.7637917361139566,3.280666839299224,3.3999174223523045,1.9142128745181868,1.4504311303550237,2.864862154141668,4.906601992505377
+84358402,M,1.7487579100115918,-1.1508038465489563,1.7750113282237618,1.8246238018419159,0.2801253491403896,0.5388663067660666,1.3698061492207798,1.4272369546891206,-0.009552062087244153,-0.5619555194231786
+843786,M,-0.4759558742259106,-0.8346009425727322,-0.3868077174481091,-0.5052059265256544,2.2354545192675923,1.2432415648720105,0.8655400119637346,0.8239306743126811,1.0045179279021434,1.888343495245663
+844359,M,1.1698783028885684,0.16050819641126807,1.1371244976904666,1.0943320099277,-0.12302797430038338,0.08821762012839307,0.2998085992698855,0.646366373937044,-0.06426806874134787,-0.7616619709077471
+
diff --git a/source/classification1.Rmd b/source/classification1.Rmd
@@ -1305,14 +1305,14 @@ For example, survey participants from a marginalized group of people may be less
 fear that answering honestly will come with negative consequences. In that case, if we were to simply throw away data with missing entries,
 we would bias the conclusions of the survey by inadvertently removing many members of that group of respondents.
 So ignoring this issue in real problems can easily lead to misleading analyses, with detrimental impacts.
-In this book, we will only give you techniques for dealing with missing entries in situations
-where missing entries are just "randomly missing", i.e.,
-where *the fact that entries are missing isn't related to anything else about the observation*.
+In this book, we will cover only those techniques for dealing with missing entries in situations
+where missing entries are just "randomly missing", i.e., where the fact that certain entries are missing *isn't related to anything else* about the observation.
 
-As an example, let's load and examine a modified version of the tumor image data
-that has missing entries:
+Let's load and examine a modified subset of the tumor image data
+that has a few missing entries:
 ```{r 05-missing-entries, message = FALSE, warning = FALSE}
-missing_cancer <- read_csv("data/missing_wdbc.csv") |>
+missing_cancer <- read_csv("data/wdbc_missing.csv") |>
+  select(Class, Radius, Texture, Perimeter) |>
   mutate(Class = as_factor(Class)) |>
   mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
 missing_cancer
@@ -1321,37 +1321,24 @@ Recall that K-nearest neighbor classification makes predictions by
 computing the straight-line distance to nearby training observations, and hence requires access to the values
 of *all* variables for *all* observations in the training data. 
 So how can we perform K-nearest neighbor classification in the presence of missing data?
-
 Well, since there are not too many observations with missing entries, one option is to simply remove
 those observations prior to building the K-nearest neighbor classifier. We can accomplish this by
-adding a `step_naomit` to the recipe.
-```{r 05-naomit, results=FALSE, message=FALSE, echo=TRUE}
-remove_missing_recipe <- recipe(Class ~ ., data = missing_cancer)
-remove_missing_recipe <- remove_missing_recipe |>
-  step_naomit(all_predictors())
-  prep()
-remove_missing_recipe
-```
+using the `drop_na` function from `tidyverse` prior to working with the data.
 
-```{r 05-naomit-print, echo=FALSE}
-hidden_print_cli(remove_missing_recipe)
-```
-Applying the recipe to the `missing_cancer` data frame removes the rows with missing entries.
-
-```{r 05-naomit-bake}
-no_missing_cancer <- bake(remove_missing_recipe, data = missing_cancer)
+```{r 05-naomit}
+no_missing_cancer <- missing_cancer |> drop_na()
 no_missing_cancer
 ```
+
 However, this strategy will not work when many of the rows have missing entries, as we may end up throwing away
 too much data. In this case, another
 possible approach is to *impute* the missing entries, i.e., fill in synthetic values based on the other
 observations in the data set. One reasonable choice is to perform *mean imputation*, where missing entries
 are filled in using the mean of the present entries in each variable. To perform mean imputation,
-we can use the `step_impute_mean` recipe step.
+we add the `step_impute_mean` step to the `tidymodels` preprocessing recipe.
 ```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE}
-impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer)
-impute_missing_recipe <- impute_missing_recipe |>
-  step_impute_mean(all_predictors())
+impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
+  step_impute_mean(all_predictors()) |>
   prep()
 impute_missing_recipe
 ```
@@ -1363,12 +1350,13 @@ hidden_print_cli(impute_missing_recipe)
 Applying the recipe to the `missing_cancer` data frame fills in the missing entries with the mean values of their corresponding variables.
 
 ```{r 05-impute-bake}
-imputed_cancer <- bake(impute_missing_recipe, data = missing_cancer)
+imputed_cancer <- bake(impute_missing_recipe, missing_cancer)
 imputed_cancer
 ```
 
-However you decide to handle missing data in your data analysis, it is always crucial to think critically about
-the setting, how the data were collected, and the question you are answering.
+Many other options for missing data imputation can be found in [the `recipes` documentation](https://recipes.tidymodels.org/reference/index.html).
+However you decide to handle missing data in your data analysis, it is always crucial 
+to think critically about the setting, how the data were collected, and the question you are answering.
 
 
 ## Putting it together in a `workflow` {#puttingittogetherworkflow}