Merge pull request #509 from UBC-DSCI/missing-data

trevorcampbell · web-flow · commit 9124b800570d · 2023-08-04T09:07:42.000-07:00
Missing data
diff --git a/data/wdbc_missing.csv b/data/wdbc_missing.csv
@@ -0,0 +1,9 @@
+ID,Class,Radius,Texture,Perimeter,Area,Smoothness,Compactness,Concavity,Concave_Points,Symmetry,Fractal_Dimension
+842302,M,,,1.2688172627037921,0.983509520104142,1.5670874574786582,3.2806280641246857,2.650541786383573,2.530248864134298,2.215565541846305,2.25376381072807
+842517,M,1.8282119737343598,-0.3533215225500966,1.684472552277101,1.9070302686337925,-0.826235446757039,-0.486643477616135,-0.023824891805531347,0.5476622708254778,0.001391139243576388,-0.8678888068037953
+84300903,M,1.5784992020342323,,1.5651259839837746,1.5575131853441093,0.941382123037953,1.051999895332493,1.362279788963212,2.0354397832616953,0.9388587199172193,-0.39765801323729066
+84348301,M,-0.7682333229203782,0.25350905052192196,-0.5921661228907633,-0.7637917361139566,3.280666839299224,3.3999174223523045,1.9142128745181868,1.4504311303550237,2.864862154141668,4.906601992505377
+84358402,M,1.7487579100115918,-1.1508038465489563,1.7750113282237618,1.8246238018419159,0.2801253491403896,0.5388663067660666,1.3698061492207798,1.4272369546891206,-0.009552062087244153,-0.5619555194231786
+843786,M,-0.4759558742259106,-0.8346009425727322,-0.3868077174481091,-0.5052059265256544,2.2354545192675923,1.2432415648720105,0.8655400119637346,0.8239306743126811,1.0045179279021434,1.888343495245663
+844359,M,1.1698783028885684,0.16050819641126807,1.1371244976904666,1.0943320099277,-0.12302797430038338,0.08821762012839307,0.2998085992698855,0.646366373937044,-0.06426806874134787,-0.7616619709077471
+
diff --git a/source/classification1.Rmd b/source/classification1.Rmd
@@ -1294,11 +1294,92 @@ upsampled_plot <-
 upsampled_plot
 ```
 
+### Missing data
+
+One of the most common issues in real data sets is *missing data*,
+i.e., observations where the values of some of the variables were not recorded.
+Unfortunately, as common as it is, handling missing data properly is very
+challenging and generally relies on expert knowledge about the data, setting,
+and how the data were collected. One typical challenge with missing data is
+that missing entries can be *informative*: the very fact that an entry is
+missing is related to the values of other variables.  For example, survey
+participants from a marginalized group of people may be less likely to respond
+to certain kinds of questions if they fear that answering honestly will come
+with negative consequences. In that case, if we were to simply throw away data
+with missing entries, we would bias the conclusions of the survey by
+inadvertently removing many members of that group of respondents.  So ignoring
+this issue in real problems can easily lead to misleading analyses, with
+detrimental impacts.  In this book, we will cover only those techniques for
+dealing with missing entries in situations where missing entries are just
+"randomly missing", i.e., where the fact that certain entries are missing
+*isn't related to anything else* about the observation.
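+
+To make the difference between informative and random missingness concrete,
+here is a minimal simulated sketch. It is not based on the tumor image data:
+the variable names (`score`, `informative_missing`, `random_missing`) and the
+rule that lower scores go missing more often are assumptions made purely for
+illustration.
+
+```r
+set.seed(1)
+
+# hypothetical survey scores for 1000 respondents
+n <- 1000
+score <- rnorm(n, mean = 50, sd = 10)
+
+# informative missingness (assumed rule): the lower the score,
+# the more likely the entry is to be missing
+p_missing <- plogis((50 - score) / 5)
+informative_missing <- ifelse(runif(n) < p_missing, NA, score)
+
+# random missingness: every entry has the same 50% chance of being missing
+random_missing <- ifelse(runif(n) < 0.5, NA, score)
+
+# compare the true mean with the means after throwing away missing entries
+mean(score)
+mean(informative_missing, na.rm = TRUE)
+mean(random_missing, na.rm = TRUE)
+```
+
+In this simulation, discarding the informatively missing entries pushes the
+mean upward because low scores are removed more often, while discarding the
+randomly missing entries leaves the mean roughly unchanged.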
+
+Let's load and examine a modified subset of the tumor image data
+that has a few missing entries:
+```{r 05-missing-entries, message = FALSE, warning = FALSE}
+missing_cancer <- read_csv("data/wdbc_missing.csv") |>
+  select(Class, Radius, Texture, Perimeter) |>
+  mutate(Class = as_factor(Class)) |>
+  mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
+missing_cancer
+```
+Recall that K-nearest neighbor classification makes predictions by computing
+the straight-line distance to nearby training observations, and hence requires
+access to the values of *all* variables for *all* observations in the training
+data.  So how can we perform K-nearest neighbor classification in the presence
+of missing data?  Well, since there are not too many observations with missing
+entries, one option is to simply remove those observations prior to building
+the K-nearest neighbor classifier. We can accomplish this by using the
+`drop_na` function from `tidyverse` prior to working with the data.
+
+```{r 05-naomit}
+no_missing_cancer <- missing_cancer |> drop_na()
+no_missing_cancer
+```
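+
+Before removing rows, it can be worth checking how much data would be lost.
+The sketch below is one way to do that. It uses a small hand-typed stand-in
+for `missing_cancer`, named `toy_cancer` (a hypothetical name), whose values
+are rounded from the first four rows of the data file.
+
+```r
+library(tidyverse)
+
+# hypothetical stand-in for the first four rows of missing_cancer (rounded)
+toy_cancer <- tibble(
+  Class = factor(rep("Malignant", 4)),
+  Radius = c(NA, 1.83, 1.58, -0.77),
+  Texture = c(NA, -0.35, NA, 0.25),
+  Perimeter = c(1.27, 1.68, 1.57, -0.59)
+)
+
+# count the missing entries in each column
+missing_counts <- toy_cancer |>
+  summarize(across(everything(), ~ sum(is.na(.x))))
+missing_counts
+
+# count the rows that drop_na would remove
+rows_removed <- nrow(toy_cancer) - nrow(drop_na(toy_cancer))
+rows_removed
+```
+
+If the number of removed rows is a large fraction of the data set, removing
+them may discard too much information.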
+
+However, this strategy will not work when many of the rows have missing
+entries, as we may end up throwing away too much data. In this case, another
+possible approach is to *impute* the missing entries, i.e., fill in synthetic
+values based on the other observations in the data set. One reasonable choice
+is to perform *mean imputation*, where missing entries are filled in using the
+mean of the present entries in each variable. To perform mean imputation, we
+add the `step_impute_mean` step to the `tidymodels` preprocessing recipe.
+```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE}
+impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
+  step_impute_mean(all_predictors()) |>
+  prep()
+impute_missing_recipe
+```
+
+```{r 05-impute-print, echo=FALSE}
+hidden_print_cli(impute_missing_recipe)
+```
+
+We can now include this recipe in a `workflow`. To visualize what mean
+imputation does, let's just apply the recipe directly to the `missing_cancer`
+data frame using the `bake` function.  The imputation step fills in the missing
+entries with the mean values of their corresponding variables.
+
+```{r 05-impute-bake}
+imputed_cancer <- bake(impute_missing_recipe, missing_cancer)
+imputed_cancer
+```
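+
+As a sanity check on what mean imputation computes, the same fill-in can be
+sketched by hand with `mutate`, `across`, and `replace_na`. This is only an
+illustration on a hypothetical stand-in data frame (`toy_cancer`, with values
+rounded from the first four rows of the data file), not a replacement for the
+recipe step.
+
+```r
+library(tidyverse)
+
+# hypothetical stand-in for the first four rows of missing_cancer (rounded)
+toy_cancer <- tibble(
+  Class = factor(rep("Malignant", 4)),
+  Radius = c(NA, 1.83, 1.58, -0.77),
+  Texture = c(NA, -0.35, NA, 0.25),
+  Perimeter = c(1.27, 1.68, 1.57, -0.59)
+)
+
+# replace each missing numeric entry with the mean of the
+# non-missing entries in the same column
+imputed_by_hand <- toy_cancer |>
+  mutate(across(where(is.numeric),
+                ~ replace_na(.x, mean(.x, na.rm = TRUE))))
+imputed_by_hand
+```
+
+Here the missing `Radius` entry is filled in with the mean of the three
+recorded `Radius` values, and both missing `Texture` entries are filled in with
+the mean of the two recorded `Texture` values.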
+
+Many other options for missing data imputation can be found in
+[the `recipes` documentation](https://recipes.tidymodels.org/reference/index.html).
+Regardless of how you decide to handle missing data in your data analysis, it
+is always crucial to think critically about the setting, how the data were
+collected, and the question you are answering.
+
+
 ## Putting it together in a `workflow` {#puttingittogetherworkflow}
 
-The `tidymodels` package collection also provides the `workflow`, a way to chain\index{tidymodels!workflow}\index{workflow|see{tidymodels}} together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
-To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data.
-First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:
+The `tidymodels` package collection also provides the `workflow`, a way to
+chain\index{tidymodels!workflow}\index{workflow|see{tidymodels}} together
+multiple data analysis steps without a lot of otherwise necessary code for
+intermediate steps. To illustrate the whole pipeline, let's start from scratch
+with the `unscaled_wdbc.csv` data.  First we will load the data, create a
+model, and specify a recipe for how the data should be preprocessed:
 
 ```{r 05-workflow, message = FALSE, warning = FALSE}
 # load the unscaled cancer data