Commit 9124b80

Merge pull request #509 from UBC-DSCI/missing-data
Missing data
2 parents 532fd87 + 183370a commit 9124b80

2 files changed: +93 -3 lines changed

data/wdbc_missing.csv

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+ID,Class,Radius,Texture,Perimeter,Area,Smoothness,Compactness,Concavity,Concave_Points,Symmetry,Fractal_Dimension
+842302,M,,,1.2688172627037921,0.983509520104142,1.5670874574786582,3.2806280641246857,2.650541786383573,2.530248864134298,2.215565541846305,2.25376381072807
+842517,M,1.8282119737343598,-0.3533215225500966,1.684472552277101,1.9070302686337925,-0.826235446757039,-0.486643477616135,-0.023824891805531347,0.5476622708254778,0.001391139243576388,-0.8678888068037953
+84300903,M,1.5784992020342323,,1.5651259839837746,1.5575131853441093,0.941382123037953,1.051999895332493,1.362279788963212,2.0354397832616953,0.9388587199172193,-0.39765801323729066
+84348301,M,-0.7682333229203782,0.25350905052192196,-0.5921661228907633,-0.7637917361139566,3.280666839299224,3.3999174223523045,1.9142128745181868,1.4504311303550237,2.864862154141668,4.906601992505377
+84358402,M,1.7487579100115918,-1.1508038465489563,1.7750113282237618,1.8246238018419159,0.2801253491403896,0.5388663067660666,1.3698061492207798,1.4272369546891206,-0.009552062087244153,-0.5619555194231786
+843786,M,-0.4759558742259106,-0.8346009425727322,-0.3868077174481091,-0.5052059265256544,2.2354545192675923,1.2432415648720105,0.8655400119637346,0.8239306743126811,1.0045179279021434,1.888343495245663
+844359,M,1.1698783028885684,0.16050819641126807,1.1371244976904666,1.0943320099277,-0.12302797430038338,0.08821762012839307,0.2998085992698855,0.646366373937044,-0.06426806874134787,-0.7616619709077471
+

source/classification1.Rmd

Lines changed: 84 additions & 3 deletions
@@ -1294,11 +1294,92 @@ upsampled_plot <-
 upsampled_plot
 ```
 
+### Missing data
+
+One of the most common issues in real data sets in the wild is *missing data*,
+i.e., observations where the values of some of the variables were not recorded.
+Unfortunately, as common as it is, handling missing data properly is very
+challenging and generally relies on expert knowledge about the data, setting,
+and how the data were collected. One typical challenge with missing data is
+that missing entries can be *informative*: the very fact that an entry is
+missing is related to the values of other variables. For example, survey
+participants from a marginalized group of people may be less likely to respond
+to certain kinds of questions if they fear that answering honestly will come
+with negative consequences. In that case, if we were to simply throw away data
+with missing entries, we would bias the conclusions of the survey by
+inadvertently removing many members of that group of respondents. So ignoring
+this issue in real problems can easily lead to misleading analyses, with
+detrimental impacts. In this book, we will cover only those techniques for
+dealing with missing entries in situations where missing entries are just
+"randomly missing", i.e., where the fact that certain entries are missing
+*isn't related to anything else* about the observation.
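
As a rough illustration of why informative missingness matters, the following base R sketch simulates a made-up survey. The group labels, the score distributions, and the 60% skip rate are all hypothetical numbers chosen for this example, not values from any real data set. It compares the mean of the full responses (which would normally be unobservable) with the mean computed after discarding the missing entries.

```r
set.seed(1)

# hypothetical survey: 1000 respondents, roughly 30% in group "B"
n <- 1000
group <- sample(c("A", "B"), n, replace = TRUE, prob = c(0.7, 0.3))

# hypothetical response: group B has a lower average score than group A
score <- ifelse(group == "A",
                rnorm(n, mean = 60, sd = 5),
                rnorm(n, mean = 40, sd = 5))
true_mean <- mean(score)

# informative missingness: group B skips the question about 60% of the time
skipped <- group == "B" & runif(n) < 0.6
observed <- ifelse(skipped, NA, score)

# mean after throwing away the missing entries
dropped_mean <- mean(observed, na.rm = TRUE)

c(true_mean = true_mean, dropped_mean = dropped_mean)
```

Under these assumptions, the mean of the remaining responses is pulled toward the group that answers more often, which is the kind of bias described above.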
1316+
1317+
Let's load and examine a modified subset of the tumor image data
1318+
that has a few missing entries:
1319+
```{r 05-missing-entries, message = FALSE, warning = FALSE}
1320+
missing_cancer <- read_csv("data/wdbc_missing.csv") |>
1321+
select(Class, Radius, Texture, Perimeter) |>
1322+
mutate(Class = as_factor(Class)) |>
1323+
mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
1324+
missing_cancer
1325+
```
1326+
Recall that K-nearest neighbor classification makes predictions by computing
1327+
the straight-line distance to nearby training observations, and hence requires
1328+
access to the values of *all* variables for *all* observations in the training
1329+
data. So how can we perform K-nearest neighbor classification in the presence
1330+
of missing data? Well, since there are not too many observations with missing
1331+
entries, one option is to simply remove those observations prior to building
1332+
the K-nearest neighbor classifier. We can accomplish this by using the
1333+
`drop_na` function from `tidyverse` prior to working with the data.
1334+
1335+
```{r 05-naomit}
1336+
no_missing_cancer <- missing_cancer |> drop_na()
1337+
no_missing_cancer
1338+
```
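
To see why a missing entry is a problem for a distance computation, here is a minimal sketch on a toy tibble. The `toy` data frame copies the first four rows of `data/wdbc_missing.csv` for three columns, rounded to two decimals; `new_obs` is a hypothetical new observation invented for the example.

```r
library(tidyr)
library(tibble)

# first four rows of data/wdbc_missing.csv, rounded to two decimals
toy <- tibble(
  Radius    = c(NA, 1.83, 1.58, -0.77),
  Texture   = c(NA, -0.35, NA, 0.25),
  Perimeter = c(1.27, 1.68, 1.57, -0.59)
)

# hypothetical new observation to classify
new_obs <- c(Radius = 1.0, Texture = 0.0, Perimeter = 1.0)

# straight-line distance to row 3, which has a missing Texture value;
# the NA propagates through the sum, so the distance itself is NA
dist_to_row3 <- sqrt(sum((unlist(toy[3, ]) - new_obs)^2))
dist_to_row3

# number of missing entries in each column
na_counts <- colSums(is.na(toy))
na_counts

# rows that survive drop_na(): only those with no missing entries
complete_rows <- drop_na(toy)
nrow(complete_rows)
```

Counting the missing entries per column before calling `drop_na` gives a quick sense of how much data the removal would cost.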
+
+However, this strategy will not work when many of the rows have missing
+entries, as we may end up throwing away too much data. In this case, another
+possible approach is to *impute* the missing entries, i.e., fill in synthetic
+values based on the other observations in the data set. One reasonable choice
+is to perform *mean imputation*, where missing entries are filled in using the
+mean of the present entries in each variable. To perform mean imputation, we
+add the `step_impute_mean` step to the `tidymodels` preprocessing recipe.
+```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE}
+impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
+  step_impute_mean(all_predictors()) |>
+  prep()
+impute_missing_recipe
+```
+
+```{r 05-impute-print, echo=FALSE}
+hidden_print_cli(impute_missing_recipe)
+```
+
+We can now include this recipe in a `workflow`. To visualize what mean
+imputation does, let's just apply the recipe directly to the `missing_cancer`
+data frame using the `bake` function. The imputation step fills in the missing
+entries with the mean values of their corresponding variables.
+
+```{r 05-impute-bake}
+imputed_cancer <- bake(impute_missing_recipe, missing_cancer)
+imputed_cancer
+```
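
For intuition about what mean imputation computes, the sketch below does the same kind of fill-in by hand with `dplyr`, outside of any recipe. It is only an illustration of the idea, not the internals of `step_impute_mean`. The `toy` tibble holds the `Radius` and `Texture` columns of `data/wdbc_missing.csv`, rounded to two decimals.

```r
library(dplyr)
library(tibble)

# Radius and Texture from data/wdbc_missing.csv, rounded to two decimals
toy <- tibble(
  Radius  = c(NA, 1.83, 1.58, -0.77, 1.75, -0.48, 1.17),
  Texture = c(NA, -0.35, NA, 0.25, -1.15, -0.83, 0.16)
)

# mean of the entries that are present in each column
col_means <- toy |>
  summarize(across(everything(), \(x) mean(x, na.rm = TRUE)))

# replace each missing entry with the mean of its own column
by_hand <- toy |>
  mutate(across(everything(), \(x) replace(x, is.na(x), mean(x, na.rm = TRUE))))

by_hand
```

One difference worth keeping in mind: in a recipe, the column means are estimated when `prep` is called on the training data and are then reused by `bake`, so new observations are filled in with the training means rather than their own.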
+
+Many other options for missing data imputation can be found in
+[the `recipes` documentation](https://recipes.tidymodels.org/reference/index.html). However
+you decide to handle missing data in your data analysis, it is always crucial
+to think critically about the setting, how the data were collected, and the
+question you are answering.
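
As one example of an alternative, the sketch below swaps in `step_impute_median` from `recipes`, which fills each missing entry with the median of the present entries in its column. This is an illustration of a different option, not a recommendation from the text above. The `toy` tibble again uses rounded values from `data/wdbc_missing.csv`.

```r
library(recipes)
library(tibble)

# Class, Radius, and Texture from data/wdbc_missing.csv, rounded to two decimals
toy <- tibble(
  Class   = factor(rep("M", 7)),
  Radius  = c(NA, 1.83, 1.58, -0.77, 1.75, -0.48, 1.17),
  Texture = c(NA, -0.35, NA, 0.25, -1.15, -0.83, 0.16)
)

# same recipe structure as for mean imputation, with the median step swapped in
median_recipe <- recipe(Class ~ ., data = toy) |>
  step_impute_median(all_predictors()) |>
  prep()

# apply the prepared recipe to fill in the missing entries
median_imputed <- bake(median_recipe, new_data = toy)
median_imputed
```

The median is less sensitive to extreme values than the mean, which may matter when a variable is heavily skewed.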
+
+
 ## Putting it together in a `workflow` {#puttingittogetherworkflow}
 
-The `tidymodels` package collection also provides the `workflow`, a way to chain\index{tidymodels!workflow}\index{workflow|see{tidymodels}} together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
-To illustrate the whole pipeline, let's start from scratch with the `unscaled_wdbc.csv` data.
-First we will load the data, create a model, and specify a recipe for how the data should be preprocessed:
+The `tidymodels` package collection also provides the `workflow`, a way to
+chain\index{tidymodels!workflow}\index{workflow|see{tidymodels}} together
+multiple data analysis steps without a lot of otherwise necessary code for
+intermediate steps. To illustrate the whole pipeline, let's start from scratch
+with the `unscaled_wdbc.csv` data. First we will load the data, create a
+model, and specify a recipe for how the data should be preprocessed:
 
 ```{r 05-workflow, message = FALSE, warning = FALSE}
 # load the unscaled cancer data