@@ -1295,7 +1295,7 @@ upsampled_plot
 
 ### Missing data
 
-One of the most common issues in real data sets in the wild is *missing data*,
+One of the most common issues in real data sets in the wild is *missing data*,\index{missing data}
 i.e., observations where the values of some of the variables were not recorded.
 Unfortunately, as common as it is, handling missing data properly is very
 challenging and generally relies on expert knowledge about the data, setting,
@@ -1329,7 +1329,7 @@ data. So how can we perform K-nearest neighbors classification in the presence
 of missing data? Well, since there are not too many observations with missing
 entries, one option is to simply remove those observations prior to building
 the K-nearest neighbors classifier. We can accomplish this by using the
-`drop_na` function from `tidyverse` prior to working with the data.
+`drop_na` function from `tidyverse` prior to working with the data.\label{missing data!drop\_na}
 
 ```{r 05-naomit}
 no_missing_cancer <- missing_cancer |> drop_na()
@@ -1342,7 +1342,8 @@ possible approach is to *impute* the missing entries, i.e., fill in synthetic
 values based on the other observations in the data set. One reasonable choice
 is to perform *mean imputation*, where missing entries are filled in using the
 mean of the present entries in each variable. To perform mean imputation, we
-add the `step_impute_mean` step to the `tidymodels` preprocessing recipe.
+add the `step_impute_mean` \index{recipe!step\_impute\_mean}\index{missing data!mean imputation}
+step to the `tidymodels` preprocessing recipe.
 ```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE}
 impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
   step_impute_mean(all_predictors()) |>
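The passages touched by this diff describe two ways of handling missing entries: dropping incomplete rows with `drop_na`, and mean imputation with `step_impute_mean`. A minimal sketch comparing the two on a small data set might look like the following. The tibble `toy` and its columns `Radius` and `Texture` are hypothetical stand-ins for `missing_cancer`, and the `prep()` / `bake()` calls are an assumption about how one might inspect the imputed values, not necessarily the workflow the chapter itself uses.

```r
library(tibble)   # tibble()
library(tidyr)    # drop_na()
library(recipes)  # recipe(), step_impute_mean(), prep(), bake()

# Hypothetical toy data with one missing entry in each predictor
toy <- tibble(
  Class   = factor(c("M", "B", "B", "M")),
  Radius  = c(1, NA, 3, 5),
  Texture = c(2, 4, NA, 6)
)

# Option 1: remove every row that contains at least one missing entry
toy_dropped <- toy |> drop_na()

# Option 2: fill each missing entry with the mean of the present
# entries in the same column (mean imputation)
toy_imputed <- recipe(Class ~ ., data = toy) |>
  step_impute_mean(all_predictors()) |>
  prep() |>
  bake(new_data = toy)

# Compare how many observations each approach keeps
nrow(toy_dropped)
nrow(toy_imputed)
```

Dropping rows discards two of the four toy observations, while imputation keeps all four but replaces the missing `Radius` with the mean of 1, 3, and 5, and the missing `Texture` with the mean of 2, 4, and 6. Which trade-off is acceptable depends on the data and setting, as the surrounding text notes.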