Commit 97eada9

missing data polish/bugfixes/adding example data
1 parent 41fb506 commit 97eada9

2 files changed: +26 -29 lines changed

data/wdbc_missing.csv

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+ID,Class,Radius,Texture,Perimeter,Area,Smoothness,Compactness,Concavity,Concave_Points,Symmetry,Fractal_Dimension
+842302,M,,,1.2688172627037921,0.983509520104142,1.5670874574786582,3.2806280641246857,2.650541786383573,2.530248864134298,2.215565541846305,2.25376381072807
+842517,M,1.8282119737343598,-0.3533215225500966,1.684472552277101,1.9070302686337925,-0.826235446757039,-0.486643477616135,-0.023824891805531347,0.5476622708254778,0.001391139243576388,-0.8678888068037953
+84300903,M,1.5784992020342323,,1.5651259839837746,1.5575131853441093,0.941382123037953,1.051999895332493,1.362279788963212,2.0354397832616953,0.9388587199172193,-0.39765801323729066
+84348301,M,-0.7682333229203782,0.25350905052192196,-0.5921661228907633,-0.7637917361139566,3.280666839299224,3.3999174223523045,1.9142128745181868,1.4504311303550237,2.864862154141668,4.906601992505377
+84358402,M,1.7487579100115918,-1.1508038465489563,1.7750113282237618,1.8246238018419159,0.2801253491403896,0.5388663067660666,1.3698061492207798,1.4272369546891206,-0.009552062087244153,-0.5619555194231786
+843786,M,-0.4759558742259106,-0.8346009425727322,-0.3868077174481091,-0.5052059265256544,2.2354545192675923,1.2432415648720105,0.8655400119637346,0.8239306743126811,1.0045179279021434,1.888343495245663
+844359,M,1.1698783028885684,0.16050819641126807,1.1371244976904666,1.0943320099277,-0.12302797430038338,0.08821762012839307,0.2998085992698855,0.646366373937044,-0.06426806874134787,-0.7616619709077471
+

source/classification1.Rmd

Lines changed: 17 additions & 29 deletions
@@ -1305,14 +1305,14 @@ For example, survey participants from a marginalized group of people may be less
 fear that answering honestly will come with negative consequences. In that case, if we were to simply throw away data with missing entries,
 we would bias the conclusions of the survey by inadvertently removing many members of that group of respondents.
 So ignoring this issue in real problems can easily lead to misleading analyses, with detrimental impacts.
-In this book, we will only give you techniques for dealing with missing entries in situations
-where missing entries are just "randomly missing", i.e.,
-where *the fact that entries are missing isn't related to anything else about the observation*.
+In this book, we will cover only those techniques for dealing with missing entries in situations
+where missing entries are just "randomly missing", i.e., where the fact that certain entries are missing *isn't related to anything else* about the observation.

-As an example, let's load and examine a modified version of the tumor image data
-that has missing entries:
+Let's load and examine a modified subset of the tumor image data
+that has a few missing entries:
 ```{r 05-missing-entries, message = FALSE, warning = FALSE}
-missing_cancer <- read_csv("data/missing_wdbc.csv") |>
+missing_cancer <- read_csv("data/wdbc_missing.csv") |>
+select(Class, Radius, Texture, Perimeter) |>
 mutate(Class = as_factor(Class)) |>
 mutate(Class = fct_recode(Class, "Malignant" = "M", "Benign" = "B"))
 missing_cancer
@@ -1321,37 +1321,24 @@ Recall that K-nearest neighbor classification makes predictions by
 computing the straight-line distance to nearby training observations, and hence requires access to the values
 of *all* variables for *all* observations in the training data.
 So how can we perform K-nearest neighbor classification in the presence of missing data?
-
 Well, since there are not too many observations with missing entries, one option is to simply remove
 those observations prior to building the K-nearest neighbor classifier. We can accomplish this by
-adding a `step_naomit` to the recipe.
-```{r 05-naomit, results=FALSE, message=FALSE, echo=TRUE}
-remove_missing_recipe <- recipe(Class ~ ., data = missing_cancer)
-remove_missing_recipe <- remove_missing_recipe |>
-step_naomit(all_predictors())
-prep()
-remove_missing_recipe
-```
+using the `drop_na` function from `tidyverse` prior to working with the data.

-```{r 05-naomit-print, echo=FALSE}
-hidden_print_cli(remove_missing_recipe)
-```
-Applying the recipe to the `missing_cancer` data frame removes the rows with missing entries.
-
-```{r 05-naomit-bake}
-no_missing_cancer <- bake(remove_missing_recipe, data = missing_cancer)
+```{r 05-naomit}
+no_missing_cancer <- missing_cancer |> drop_na()
 no_missing_cancer
 ```
+
 However, this strategy will not work when many of the rows have missing entries, as we may end up throwing away
 too much data. In this case, another
 possible approach is to *impute* the missing entries, i.e., fill in synthetic values based on the other
 observations in the data set. One reasonable choice is to perform *mean imputation*, where missing entries
 are filled in using the mean of the present entries in each variable. To perform mean imputation,
-we can use the `step_impute_mean` recipe step.
+we add the `step_impute_mean` step to the `tidymodels` preprocessing recipe.
 ```{r 05-impute, results=FALSE, message=FALSE, echo=TRUE}
-impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer)
-impute_missing_recipe <- impute_missing_recipe |>
-step_impute_mean(all_predictors())
+impute_missing_recipe <- recipe(Class ~ ., data = missing_cancer) |>
+step_impute_mean(all_predictors()) |>
 prep()
 impute_missing_recipe
 ```
@@ -1363,12 +1350,13 @@ hidden_print_cli(impute_missing_recipe)
 Applying the recipe to the `missing_cancer` data frame fills in the missing entries with the mean values of their corresponding variables.

 ```{r 05-impute-bake}
-imputed_cancer <- bake(impute_missing_recipe, data = missing_cancer)
+imputed_cancer <- bake(impute_missing_recipe, missing_cancer)
 imputed_cancer
 ```

-However you decide to handle missing data in your data analysis, it is always crucial to think critically about
-the setting, how the data were collected, and the question you are answering.
+Many other options for missing data imputation can be found in [the `recipes` documentation](https://recipes.tidymodels.org/reference/index.html).
+However you decide to handle missing data in your data analysis, it is always crucial
+to think critically about the setting, how the data were collected, and the question you are answering.


 ## Putting it together in a `workflow` {#puttingittogetherworkflow}
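
Note (not part of the commit): the two strategies touched by this diff, dropping incomplete rows and mean imputation, can be sketched in base R without `tidyverse` or `recipes`. The data frame `toy` and the helper `impute_mean` are hypothetical names introduced only for illustration. The values are the first four rows of `data/wdbc_missing.csv` above, restricted to the selected columns and rounded to three decimals. This is a rough sketch of the idea, not the implementation used by `drop_na` or `step_impute_mean`.

```r
# Illustrative subset: first four rows of data/wdbc_missing.csv,
# columns Class, Radius, Texture, Perimeter, rounded to 3 decimals.
toy <- data.frame(
  Class     = c("M", "M", "M", "M"),
  Radius    = c(NA, 1.828, 1.578, -0.768),
  Texture   = c(NA, -0.353, NA, 0.254),
  Perimeter = c(1.269, 1.684, 1.565, -0.592)
)

# Strategy 1: remove every row that has at least one missing entry
# (a base R analogue of what drop_na() does here).
dropped <- toy[complete.cases(toy), ]

# Strategy 2: mean imputation. Each NA is replaced by the mean of the
# observed (non-missing) values in the same column.
# impute_mean is a hypothetical helper, not a recipes function.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
numeric_cols <- c("Radius", "Texture", "Perimeter")
imputed <- toy
imputed[numeric_cols] <- lapply(toy[numeric_cols], impute_mean)

nrow(dropped)    # only the two complete rows remain
anyNA(imputed)   # no missing entries are left after imputation
```

On this subset, dropping rows discards half of the observations while imputation keeps all four, which is the trade-off the changed text describes.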
