Commit 183370a

80ch line limit
1 parent 058357f commit 183370a

1 file changed: source/classification1.Rmd (+41, -31 lines)
@@ -1296,17 +1296,23 @@ upsampled_plot

### Missing data

One of the most common issues in real data sets in the wild is *missing data*,
i.e., observations where the values of some of the variables were not recorded.
Unfortunately, as common as it is, handling missing data properly is very
challenging and generally relies on expert knowledge about the data, setting,
and how the data were collected. One typical challenge with missing data is
that missing entries can be *informative*: the very fact that an entry is
missing can be related to the values of other variables. For example, survey
participants from a marginalized group of people may be less likely to respond
to certain kinds of questions if they fear that answering honestly will come
with negative consequences. In that case, if we were to simply throw away data
with missing entries, we would bias the conclusions of the survey by
inadvertently removing many members of that group of respondents. So ignoring
this issue in real problems can easily lead to misleading analyses, with
detrimental impacts. In this book, we will cover only those techniques for
dealing with missing entries in situations where entries are just "randomly
missing", i.e., where the fact that certain entries are missing *isn't related
to anything else* about the observation.

Let's load and examine a modified subset of the tumor image data
that has a few missing entries:
@@ -1317,25 +1323,27 @@ missing_cancer <- read_csv("data/wdbc_missing.csv") |>
  mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
missing_cancer
```

Recall that K-nearest neighbor classification makes predictions by computing
the straight-line distance to nearby training observations, and hence requires
access to the values of *all* variables for *all* observations in the training
data. So how can we perform K-nearest neighbor classification in the presence
of missing data? Well, since there are not too many observations with missing
entries, one option is to simply remove those observations prior to building
the K-nearest neighbor classifier. We can accomplish this by using the
`drop_na` function from `tidyverse` prior to working with the data.

```{r 05-naomit}
no_missing_cancer <- missing_cancer |> drop_na()
no_missing_cancer
```
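Before dropping rows, it is worth checking how many entries are actually missing, since `drop_na` discards an entire row for a single missing value. Here is a rough sketch of that check on a small hypothetical tibble (we do not reproduce the contents of `missing_cancer` here, so the names and values below are made up for illustration):

```r
library(tidyverse)

# Hypothetical stand-in for a data frame with a few missing entries.
toy <- tibble(
  Symmetry = c(0.18, NA, 0.21, 0.19),
  Radius   = c(14.2, 13.0, NA, NA),
  Class    = c("Benign", "Malignant", "Benign", "Benign")
)

# Count the missing entries in each column...
toy |> summarize(across(everything(), ~ sum(is.na(.x))))

# ...and see how many complete rows drop_na() would keep (here, 1 of 4).
toy |> drop_na() |> nrow()
```

If only a small fraction of the rows is complete, dropping rows throws away too much data, and imputation becomes the more attractive option.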

However, this strategy will not work when many of the rows have missing
entries, as we may end up throwing away too much data. In this case, another
possible approach is to *impute* the missing entries, i.e., fill in synthetic
values based on the other observations in the data set. One reasonable choice
is to perform *mean imputation*, where missing entries are filled in using the
mean of the present entries in each variable. To perform mean imputation, we
add the `step_impute_mean` step to the `tidymodels` preprocessing recipe.

```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE}
impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
  step_impute_mean(all_predictors()) |>
@@ -1347,19 +1355,21 @@ impute_missing_recipe
  hidden_print_cli(impute_missing_recipe)
```

We can now include this recipe in a `workflow`. To visualize what mean
imputation does, let's just apply the recipe directly to the `missing_cancer`
data frame using the `bake` function. The imputation step fills in the missing
entries with the mean values of their corresponding variables.

```{r 05-impute-bake}
imputed_cancer <- bake(impute_missing_recipe, missing_cancer)
imputed_cancer
```
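To demystify what `step_impute_mean` computes, here is a sketch of the same operation written with plain `tidyverse` verbs on a small hypothetical tibble (the column names and values are invented for illustration; in a real analysis you would use the recipe inside a `workflow`):

```r
library(tidyverse)

# Hypothetical tibble with one missing entry per column.
toy <- tibble(
  Symmetry = c(0.18, NA, 0.22),
  Radius   = c(14.0, 12.0, NA)
)

# Replace each NA with the mean of the observed values in its column,
# which is exactly the quantity that mean imputation fills in.
imputed <- toy |>
  mutate(across(
    where(is.numeric),
    ~ replace_na(.x, mean(.x, na.rm = TRUE))
  ))
imputed
# Symmetry becomes (0.18, 0.20, 0.22); Radius becomes (14, 12, 13).
```

One caveat: in a real analysis the means should be estimated from the training set only and then applied to new data, which is what prepping the recipe inside a `workflow` handles for you automatically.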

Many other options for missing data imputation can be found in
[the `recipes` documentation](https://recipes.tidymodels.org/reference/index.html).
However you decide to handle missing data in your data analysis, it is always
crucial to think critically about the setting, how the data were collected,
and the question you are answering.
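As one example from that reference, `step_impute_median` swaps the mean for the median, which can be more robust when a variable is skewed. A sketch on a small hypothetical data frame (not the book's `missing_cancer`; the names and values below are made up):

```r
library(recipes)

# Hypothetical data frame with one missing predictor value.
toy <- tibble::tibble(
  Radius = c(14, NA, 10, 12),
  Class  = factor(c("Benign", "Malignant", "Benign", "Benign"))
)

# Same recipe structure as before, with median imputation instead of mean.
median_recipe <- recipe(Class ~ ., data = toy) |>
  step_impute_median(all_predictors())

# prep() estimates the medians; bake() fills in the missing entries.
baked <- bake(prep(median_recipe), toy)
baked$Radius
# The NA is replaced by median(14, 10, 12) = 12.
```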

## Putting it together in a `workflow` {#puttingittogetherworkflow}
