| 1 | +--- |
| 2 | +output: hugodown::hugo_document |
| 3 | + |
| 4 | +slug: recipes-1-3-0 |
| 5 | +title: recipes 1.3.0 |
| 6 | +date: 2025-04-28 |
| 7 | +author: Emil Hvitfeldt |
| 8 | +description: > |
| 9 | + This release brings changes for strings_as_factors, step_select(), step_dummy(), and step_impute_bag(). |
| 10 | + |
| 11 | +photo: |
| 12 | + url: https://unsplash.com/photos/background-pattern-3b7sos3CD2c |
| 13 | + author: James Trenda |
| 14 | + |
| 15 | +# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other" |
| 16 | +categories: [package] |
| 17 | +tags: [tidymodels, recipes] |
| 18 | +--- |
| 19 | + |
| 20 | +```{=html} |
| 21 | +<!-- |
| 22 | +TODO: |
| 23 | +* [x] Look over / edit the post's title in the yaml |
| 24 | +* [x] Edit (or delete) the description; note this appears in the Twitter card |
| 25 | +* [x] Pick category and tags (see existing with `hugodown::tidy_show_meta()`) |
| 26 | +* [x] Find photo & update yaml metadata |
| 27 | +* [x] Create `thumbnail-sq.jpg`; height and width should be equal |
| 28 | +* [x] Create `thumbnail-wd.jpg`; width should be >5x height |
| 29 | +* [x] `hugodown::use_tidy_thumbnails()` |
| 30 | +* [x] Add intro sentence, e.g. the standard tagline for the package |
| 31 | +* [x] `usethis::use_tidy_thanks()` |
| 32 | +--> |
| 33 | +``` |
| 34 | + |
| 35 | +We're thrilled to announce the release of [recipes](https://recipes.tidymodels.org/) 1.3.0. recipes lets you create a pipeable sequence of feature engineering steps. |
| 36 | + |
| 37 | +You can install it from CRAN with: |
| 38 | + |
| 39 | +```{r, eval = FALSE} |
| 40 | +install.packages("recipes") |
| 41 | +``` |
| 42 | + |
| 43 | +This blog post walks through some of the highlights of this release: a change to where `strings_as_factors` is specified, the deprecation of `step_select()`, a new `contrasts` argument for `step_dummy()`, and improvements to `step_impute_bag()`. |
| 44 | + |
| 45 | + |
| 46 | +You can see a full list of changes in the [release notes](https://recipes.tidymodels.org/news/index.html#recipes-130). |
| 47 | + |
| 48 | +Let's first load the package: |
| 49 | + |
| 50 | +```{r setup, message=FALSE} |
| 51 | +library(recipes) |
| 52 | +``` |
| 53 | + |
| 54 | +## `strings_as_factors` |
| 55 | + |
| 56 | +By default, recipes converts predictor strings to factors. The option controlling this used to live in `prep()`, which caused problems when you wanted to set `strings_as_factors = FALSE` for a recipe that is used somewhere else, such as in a workflow, where you don't call `prep()` yourself. |
| 57 | + |
| 58 | +This is no longer an issue as we have moved the argument to `recipe()` itself. We are at the same time deprecating the use of `strings_as_factors` when used in `prep()`. Here is an example: |
| 59 | + |
| 60 | +```{r} |
| 61 | +library(modeldata) |
| 62 | +tate_text |
| 63 | +``` |
| 64 | + |
| 65 | +We load the modeldata package to get `tate_text`, which has a character column `title`. If we don't do anything, that column is turned into a factor. |
| 66 | + |
| 67 | +```{r} |
| 68 | +recipe(~., data = tate_text) |> |
| 69 | + prep() |> |
| 70 | + bake(tate_text) |
| 71 | +``` |
| 72 | + |
| 73 | +But if we set `strings_as_factors = FALSE` in `recipe()`, the column stays a character vector. |
| 74 | + |
| 75 | +```{r} |
| 76 | +recipe(~., data = tate_text, strings_as_factors = FALSE) |> |
| 77 | + prep() |> |
| 78 | + bake(tate_text) |
| 79 | +``` |
| 80 | + |
| 81 | +This change also makes pragmatic sense: whether you want to turn strings into factors is something that should be encoded in the recipe itself. |
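
As a small migration sketch, the move from `prep()` to `recipe()` could look like the following. The data frame `df` is made up for illustration; only the placement of `strings_as_factors` comes from the release.

```r
library(recipes)

# Hypothetical toy data with one character predictor
df <- data.frame(
  x = c("a", "b", "a"),
  y = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Before 1.3.0 (now deprecated): prep(rec, strings_as_factors = FALSE)
# From 1.3.0: set the option when the recipe is created
rec <- recipe(y ~ x, data = df, strings_as_factors = FALSE)

baked <- rec |>
  prep() |>
  bake(new_data = NULL)

# Check whether the predictor is still a character vector
is.character(baked$x)
```

Because the setting now travels with the recipe object, whatever calls `prep()` later does not need to know about it.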
| 82 | + |
| 83 | +## Deprecating `step_select()` |
| 84 | + |
| 85 | +We have started the process of deprecating `step_select()`. Given the number of issues people are having with the step, and the fact that it doesn't play well with workflows, we think this is the right call. |
| 86 | + |
| 87 | +There are two main use cases for `step_select()`: removing variables and selecting variables. Removing variables was done with `-` in `step_select()`: |
| 88 | + |
| 89 | +```{r, warning=FALSE} |
| 90 | +recipe(mpg ~ ., mtcars) |> |
| 91 | + step_select(-starts_with("d")) |> |
| 92 | + prep() |> |
| 93 | + bake(new_data = NULL) |
| 94 | +``` |
| 95 | + |
| 96 | +These use cases can seamlessly be converted to use `step_rm()` without the `-` for the same result. |
| 97 | + |
| 98 | +```{r} |
| 99 | +recipe(mpg ~ ., mtcars) |> |
| 100 | + step_rm(starts_with("d")) |> |
| 101 | + prep() |> |
| 102 | + bake(new_data = NULL) |
| 103 | +``` |
| 104 | + |
| 105 | +For selecting variables there are two cases. The first is as a tool to select which variables to use in our model. We recommend that you use `select()` to do that before passing the data into the `recipe()`. This is especially helpful since [recipes are tighter with respect to their input types](https://www.tidyverse.org/blog/2024/07/recipes-1-1-0/#column-type-checking), so only passing the data you need to use is helpful. |
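
A minimal sketch of selecting up front might look like this. The choice of columns and the name `cars_small` are only for illustration.

```r
library(recipes)
library(dplyr)

# Keep only the outcome and the predictors we intend to model with
cars_small <- mtcars |>
  select(mpg, disp, hp, wt)

rec <- recipe(mpg ~ ., data = cars_small)

# The recipe only knows about the columns that were passed in
summary(rec)$variable
```

Since the recipe never sees the dropped columns, new data passed to `bake()` only has to supply the ones listed here.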
| 106 | + |
| 107 | +If you need to do the selection after another step takes effect you should still be able to do so, by using `step_rm()` in the following manner. |
| 108 | + |
| 109 | +```r |
| 110 | +step_rm(recipe, all_predictors(), -all_of(<variables that you want to keep>)) |
| 111 | +``` |
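
Filled in with concrete names, that pattern could be sketched as below. The normalization step and the `keep` vector are assumptions chosen for the example, not something prescribed by the release.

```r
library(recipes)

# Hypothetical set of predictors to keep after earlier steps have run
keep <- c("disp", "hp")

baked <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  # Remove every predictor except the ones named in `keep`
  step_rm(all_predictors(), -all_of(keep)) |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```

The outcome `mpg` is untouched because `all_predictors()` never selects it.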
| 112 | + |
| 113 | +## `step_dummy()` contrasts argument |
| 114 | + |
| 115 | +Contrasts such as `contr.treatment()` and `contr.poly()` are used in `step_dummy()` to determine how the step should translate categorical values into one or more numeric columns. Traditionally, the contrasts were set using `options()` like so: |
| 116 | + |
| 117 | +```{r} |
| 118 | +options(contrasts = c(unordered = "contr.poly", ordered = "contr.poly")) |
| 119 | +``` |
| 120 | + |
| 121 | +```{r, warning=FALSE} |
| 122 | +recipe(~species + island, penguins) |> |
| 123 | + step_dummy(all_nominal_predictors()) |> |
| 124 | + prep() |> |
| 125 | + bake(new_data = penguins) |
| 126 | +``` |
| 127 | + |
| 128 | +The issue with this approach is that the step pulls the contrasts from `options()` whenever it needs them instead of storing them. This means that if you put this recipe in production, you need to set the option in the production environment to match the training environment. |
| 129 | + |
| 130 | +```{r} |
| 131 | +#| echo: false |
| 132 | +options(contrasts = c(unordered = "contr.treatment", ordered = "contr.poly")) |
| 133 | +``` |
| 134 | + |
| 135 | +To fix this issue, we have given `step_dummy()` a `contrasts` argument that works in much the same way. You simply specify the contrast you want, and it is stored in the object for easy deployment. |
| 136 | + |
| 137 | +```{r} |
| 138 | +recipe(~species + island, penguins) |> |
| 139 | + step_dummy( |
| 140 | + all_nominal_predictors(), contrasts = "contr.poly") |> |
| 141 | + prep() |> |
| 142 | + bake(new_data = penguins) |
| 143 | +``` |
| 144 | + |
| 145 | +If you are using a contrast from an external package, such as `hardhat::contr_one_hot()`, you need to load the package with `library(hardhat)` in the environment you are working in and then set `contrasts = "contr_one_hot"`. You also need to call `library(hardhat)` in any production environment where you use this recipe. |
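
A minimal sketch of that setup, assuming hardhat is installed and using the built-in `iris` data purely for illustration:

```r
library(recipes)
library(hardhat) # must be attached so "contr_one_hot" can be found by name

baked <- recipe(~ Species, data = iris) |>
  step_dummy(Species, contrasts = "contr_one_hot") |>
  prep() |>
  bake(new_data = NULL)

# Inspect the indicator columns that were created
names(baked)
```

The same `library(hardhat)` call has to run wherever the prepped recipe is later baked.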
| 146 | + |
| 147 | +## tidyselect can be used everywhere |
| 148 | + |
| 149 | +Several steps such as `step_pls()` and `step_impute_bag()` require the selection of more than just the affected columns. `step_pls()` needs you to select an `outcome` variable and `step_impute_bag()` needs you to select which variables to impute with, `impute_with`, if you don't want to use all predictors. Previously these needed to be strings or use special selectors like `imp_vars()`. You don't have to do that anymore. You can now use tidyselect in these arguments too. |
| 150 | + |
| 151 | +```{r} |
| 152 | +recipe(mpg ~ ., mtcars) |> |
| 153 | + step_pls(all_predictors(), outcome = mpg) |> |
| 154 | + prep() |> |
| 155 | + bake(new_data = mtcars) |
| 156 | +``` |
| 157 | + |
| 158 | +Arguments that allow multiple selections now also work with recipes selectors like `all_numeric_predictors()` and `has_role()`. |
| 159 | + |
| 160 | +```{r} |
| 161 | +recipe(mpg ~ ., mtcars) |> |
| 162 | + step_impute_bag(all_predictors(), impute_with = has_role("predictor")) |> |
| 163 | + prep() |> |
| 164 | + bake(new_data = mtcars) |
| 165 | +``` |
| 166 | + |
| 167 | +These changes are backwards compatible, meaning that the old ways still work with minimal warnings. |
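
As a further sketch, bare column names can be passed to `impute_with` in the same way. The missing values below are injected artificially, since `mtcars` has none, and the choice of `disp`, `cyl`, and `wt` as imputation predictors is arbitrary.

```r
library(recipes)

# mtcars has no missing values, so add a couple for the example
cars_na <- mtcars
cars_na$hp[c(3, 7)] <- NA

baked <- recipe(mpg ~ ., data = cars_na) |>
  # tidyselect-style selection of the variables to impute with
  step_impute_bag(hp, impute_with = c(disp, cyl, wt)) |>
  prep() |>
  bake(new_data = NULL)

# Check whether any missing values remain in hp
anyNA(baked$hp)
```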
| 168 | + |
| 169 | +## `step_impute_bag()` now takes up less memory |
| 170 | + |
| 171 | +There is another benefit for users of `step_impute_bag()`. For each variable it imputes, the step fits a bagged tree model, which is then used to predict the missing values. These models had a larger memory footprint than needed. This has been remedied, so there should now be a noticeable decrease in size for recipes with `step_impute_bag()`. |
| 172 | + |
| 173 | +```{r} |
| 174 | +rec <- recipe(Sale_Price ~ ., data = ames) |> |
| 175 | + step_impute_bag(starts_with("Lot_"), impute_with = all_numeric_predictors()) |> |
| 176 | + prep() |
| 177 | + |
| 178 | +lobstr::obj_size(rec) |
| 179 | +``` |
| 180 | + |
| 181 | +This recipe previously took up over `75 MB` and now takes up `20 MB`. |
| 182 | + |
| 183 | +## Acknowledgements |
| 184 | + |
| 185 | +Many thanks to all the people who contributed to recipes since the last release! |
| 186 | + |
| 187 | +[@chillerb](https://github.com/chillerb), [@dshemetov](https://github.com/dshemetov), [@EmilHvitfeldt](https://github.com/EmilHvitfeldt), [@kevbaer](https://github.com/kevbaer), [@nhward](https://github.com/nhward), [@regisely](https://github.com/regisely), and [@topepo](https://github.com/topepo). |