Commit 41fb506

initial draft of missing data bit
1 parent 67e8218 commit 41fb506

1 file changed

+83
-3
lines changed

source/classification1.Rmd

Lines changed: 83 additions & 3 deletions
@@ -1294,11 +1294,91 @@ upsampled_plot <-
12941294
upsampled_plot
12951295
```
12961296

1297+
### Missing data
1298+
1299+
One of the most common issues in real data sets is *missing data*, i.e., observations
1300+
where the values of some of the variables were not recorded.
1301+
Unfortunately, as common as it is, handling missing data properly is very challenging and generally
1302+
relies on expert knowledge about the data, setting, and how the data were collected. One typical challenge with missing data
1303+
is that missing entries can be *informative*: the very fact that an entry is missing is related to the values of other variables.
1304+
For example, survey participants from a marginalized group of people may be less likely to respond to certain kinds of questions if they
1305+
fear that answering honestly will come with negative consequences. In that case, if we were to simply throw away data with missing entries,
1306+
we would bias the conclusions of the survey by inadvertently removing many members of that group of respondents.
1307+
So ignoring this issue in real problems can easily lead to misleading analyses, with detrimental impacts.
1308+
In this book, we will only give you techniques for dealing with missing entries in situations
1309+
where missing entries are just "randomly missing", i.e.,
1310+
where *the fact that entries are missing isn't related to anything else about the observation*.
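
To make the danger of informative missingness concrete, here is a minimal sketch using an entirely made-up survey. The `survey` data frame, its `group` and `income` columns, and the `true_income` vector are all hypothetical and not part of the book's data. Respondents in group B tend to skip the income question, so dropping incomplete rows removes mostly group B members and shifts the estimated mean income.

```r
# Hypothetical survey data; every value is invented for illustration.
survey <- data.frame(
  group  = c("A", "A", "A", "B", "B", "B"),
  income = c(50, 60, 70, 20, NA, NA)
)

# The incomes we would have seen had everyone answered
# (unknowable in practice; included only to expose the bias).
true_income <- c(50, 60, 70, 20, 30, 40)

# Keep only the rows where income was recorded.
complete_rows <- survey[!is.na(survey$income), ]

mean_true     <- mean(true_income)           # mean over all respondents
mean_complete <- mean(complete_rows$income)  # mean after dropping rows
group_counts  <- table(complete_rows$group)  # who is left in the data
```

In this toy example the mean over everyone is 45, but the mean after dropping incomplete rows is 50, and only one of the three group B respondents remains in the data.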
1311+
1312+
As an example, let's load and examine a modified version of the tumor image data
1313+
that has missing entries:
1314+
```{r 05-missing-entries, message = FALSE, warning = FALSE}
1315+
missing_cancer <- read_csv("data/missing_wdbc.csv") |>
1316+
  mutate(Class = as_factor(Class)) |>
1317+
  mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
1318+
missing_cancer
1319+
```
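
Before choosing a strategy, it can help to count how much is actually missing. The following is a small base R sketch on a made-up data frame; `toy` and its values are assumptions for illustration and only stand in for a data frame like `missing_cancer`.

```r
# Small made-up stand-in for a data frame with missing entries.
toy <- data.frame(
  Class     = c("Malignant", "Benign", "Benign", "Malignant"),
  Radius    = c(1.2, NA, -0.4, 0.3),
  Perimeter = c(NA, 0.8, -0.5, NA)
)

# Number of missing entries in each column
missing_per_column <- colSums(is.na(toy))

# Number of rows that contain at least one missing entry
rows_with_missing <- sum(!complete.cases(toy))
```

The same two calls could be applied to any data frame to judge whether dropping incomplete rows would discard too much data.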
1320+
Recall that K-nearest neighbor classification makes predictions by
1321+
computing the straight-line distance to nearby training observations, and hence requires access to the values
1322+
of *all* variables for *all* observations in the training data.
1323+
So how can we perform K-nearest neighbor classification in the presence of missing data?
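
To see why a single missing entry is a problem for a distance calculation, consider this minimal sketch with two made-up observations (`obs_a` and `obs_b` are hypothetical vectors, not rows from the tumor data):

```r
# Two made-up observations with three standardized predictors each;
# the second one is missing its third value.
obs_a <- c(0.5, -1.0, 2.0)
obs_b <- c(1.5,  1.0, NA)

# Straight-line (Euclidean) distance between the two observations
distance <- sqrt(sum((obs_a - obs_b)^2))
distance  # NA: the missing value propagates through the calculation
```

Because arithmetic involving `NA` yields `NA` in R, the distance is undefined, and the observation cannot be ranked among the nearest neighbors.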
1324+
1325+
Well, since there are not too many observations with missing entries, one option is to simply remove
1326+
those observations prior to building the K-nearest neighbor classifier. We can accomplish this by
1327+
adding a `step_naomit` step to the recipe.
1328+
```{r 05-naomit, results=FALSE, message=FALSE, echo=TRUE}
1329+
remove_missing_recipe <- recipe(Class ~ ., data = missing_cancer)
1330+
remove_missing_recipe <- remove_missing_recipe |>
1331+
  step_naomit(all_predictors()) |>
1331+
  prep()
1333+
remove_missing_recipe
1334+
```
1335+
1336+
```{r 05-naomit-print, echo=FALSE}
1337+
hidden_print_cli(remove_missing_recipe)
1338+
```
1339+
Applying the recipe to the `missing_cancer` data frame removes the rows with missing entries.
1340+
1341+
```{r 05-naomit-bake}
1342+
no_missing_cancer <- bake(remove_missing_recipe, new_data = missing_cancer)
1343+
no_missing_cancer
1344+
```
1345+
However, this strategy will not work when many of the rows have missing entries, as we may end up throwing away
1346+
too much data. In this case, another
1347+
possible approach is to *impute* the missing entries, i.e., fill in synthetic values based on the other
1348+
observations in the data set. One reasonable choice is to perform *mean imputation*, where missing entries
1349+
are filled in using the mean of the present entries in each variable. To perform mean imputation,
1350+
we can use the `step_impute_mean` recipe step.
1351+
```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE}
1352+
impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer)
1353+
impute_missing_recipe <- impute_missing_recipe |>
1354+
  step_impute_mean(all_predictors()) |>
1355+
  prep()
1356+
impute_missing_recipe
1357+
```
1358+
1359+
```{r 05-impute-print, echo=FALSE}
1360+
hidden_print_cli(impute_missing_recipe)
1361+
```
1362+
1363+
Applying the recipe to the `missing_cancer` data frame fills in the missing entries with the mean values of their corresponding variables.
1364+
1365+
```{r 05-impute-bake}
1366+
imputed_cancer <- bake(impute_missing_recipe, new_data = missing_cancer)
1367+
imputed_cancer
1368+
```
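
To show what mean imputation computes, here is a base R sketch of the idea on a made-up data frame. The `toy` data and the `impute_mean` helper are hypothetical names introduced for illustration; this is not how `recipes` implements `step_impute_mean`, only the same arithmetic done by hand.

```r
# Made-up predictor columns with missing entries.
toy <- data.frame(
  Radius    = c(1.0, NA, 3.0, 2.0),
  Perimeter = c(NA, 4.0, 6.0, NA)
)

# Replace each NA with the mean of the observed values in that column.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to every column and rebuild the data frame.
toy_imputed <- as.data.frame(lapply(toy, impute_mean))
```

Here the missing `Radius` entry becomes 2 (the mean of 1, 3, and 2) and both missing `Perimeter` entries become 5 (the mean of 4 and 6). Note that every imputed entry in a column receives the same value, which reduces the variability of that column.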
1369+
1370+
However you decide to handle missing data in your data analysis, it is always crucial to think critically about
1371+
the setting, how the data were collected, and the question you are answering.
1372+
1373+
12971374
## Putting it together in a `workflow` {#puttingittogetherworkflow}
12981375

1299-
The `tidymodels` package collection also provides the `workflow`, a way to chain\index{tidymodels!workflow}\index{workflow|see{tidymodels}} together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
1300-
To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data.
1301-
First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:
1376+
The `tidymodels` package collection also provides the `workflow`, a way to
1377+
chain\index{tidymodels!workflow}\index{workflow|see{tidymodels}} together
1378+
multiple data analysis steps without a lot of otherwise necessary code for
1379+
intermediate steps. To illustrate the whole pipeline, let's start from scratch
1380+
with the `unscaled_wdbc.csv` data. First we will load the data, create a
1381+
model, and specify a recipe for how the data should be preprocessed:
13021382

13031383
```{r 05-workflow, message = FALSE, warning = FALSE}
13041384
# load the unscaled cancer data
