+ "markdown": "---\ntitle: \"Preprocessing data with recipes :: Cheatsheet\"\ndescription: \" \"\nimage-alt: \"\"\nexecute:\n eval: true\n output: false\n warning: false\n---\n\n\n::: {.cell .column-margin}\n<img src=\"images/logo-recipes.png\" height=\"138\" alt=\"Hex logo for recipes, drawing of a cartoon cupcake that has arms and legs and it is smiling .\" />\n<br><br><a href=\"../ml-preprocessing-data.pdf\">\n<p><i class=\"bi bi-file-pdf\"></i> Download PDF</p>\n<img src=\"../pngs/ml-preprocessing-data.png\" width=\"200\" alt=\"\"/>\n</a>\n<br><br>\n:::\n\n\n## Basics\n\nGet your data ready for modeling using ‘pipable’ sequences of feature engineering steps with **recipes.**\n\n```r\n# Initialize the recipe and add steps \nrec <- recipe(x ~ ., data = train_data) |>\n step_normalize(all_numeric_predictors())\n\n# Run the steps using training data\npr <- prep(rec, training = train_data)\n\n# Apply estimates to new data \nbake(pr, new_data = new_data)\n```\n\n- `recipe(x, ...)`: Begins a new recipe specification.\n\n- `prep(x, ...)`: Prepares the recipe with training data.\n\n- `bake(object, ...)`: Applies estimates from prep().\n\n- `update(object, ...)`: Updates and re-fits a model.\n\n### Common `step_` arguments\n\n| | |\n|---------------|---------------------------------------------------------|\n| `recipe` | A recipe object. New steps are appended to the recipe. 
|\n| `...` | Selector functions to choose variables for this step |\n| `options` | Arguments passed to the external R function accessed by the step function |\n\n## Filters\n\n- `step_nzv(recipe, ..., freq_cut = 95/5, unique_cut = 10, options = list(freq_cut = 95/5, unique_cut = 10))`: Removes variables that are highly sparse and unbalanced.\n\n- `step_zv(recipe, ..., group = NULL)`: Removes variables that contain only a single value.\n\n- `step_lincomb(recipe, ..., max_steps = 5)`: Removes numeric variables that have exact linear combinations between them.\n\n- `step_corr(recipe, ..., threshold = 0.9, use = \"pairwise.complete.obs\", method = \"pearson\")`: Removes variables that have large absolute correlations with other variables.\n\n- `step_filter_missing(recipe, ..., threshold = 0.1)`: Removes variables that have too many missing values.\n\n- `step_rm(recipe, ...)`: Removes selected variables.\n\n## In-place Transformations\n\n- `step_mutate(recipe, ..., .pkgs = character())`: General purpose transformer using `dplyr`.\n\n- `step_relu(recipe, ..., shift = 0, reverse = FALSE, smooth = FALSE, prefix = \"right_relu_\")`: Applies a rectified linear transformation, optionally smoothed.\n\n- `step_sqrt(recipe, ...)`: Applies a square root transformation.\n\n### Basis functions\n\n- `step_spline_natural(recipe, ..., deg_free = 10, options = NULL, keep_original_cols = FALSE)`: Creates natural spline (a.k.a. restricted cubic spline) features.\n\n- `step_spline_b(recipe, ..., deg_free = 10, degree = 3, options = NULL, keep_original_cols = FALSE)`: Creates B-spline features.\n\n- `step_spline_convex(recipe, ..., deg_free = 10, degree = 3, options = NULL, keep_original_cols = FALSE)`: Creates convex spline features.\n\n- `step_spline_monotone(recipe, ..., deg_free = 10, degree = 3, options = NULL, keep_original_cols = FALSE)`: Creates monotone spline features.\n\n- `step_spline_nonnegative(recipe, ..., deg_free = 10, degree = 3, options = NULL, keep_original_cols = FALSE)`: Creates non-negative spline features.\n\n- `step_poly(recipe, ..., degree = 2L, options = list(), keep_original_cols = FALSE)`: Creates 
new columns that are basis expansions of variables using orthogonal polynomials.\n\n- `step_poly_bernstein(recipe, ..., degree = 10, options = NULL, results = NULL, keep_original_cols = FALSE)`: Creates Bernstein polynomial features.\n\n### Normalization\n\n- `step_normalize(recipe, ..., na_rm = TRUE)`: Normalizes to have a standard deviation of 1 and mean of 0.\n\n- `step_YeoJohnson(recipe, ...)`: Makes data look more like a normal distribution.\n\n- `step_percentile(recipe, ..., options = list(probs = (0:100)/100), outside = \"none\")`: Replaces the value of a variable with its percentile from the training set.\n\n- `step_range(recipe, ..., min = 0, max = 1, clipping = TRUE)`: Normalizes numeric data to be within a pre-defined range of values.\n\n- `step_spatialsign(recipe, ..., na_rm = TRUE)`: Converts numeric data into a projection on to a unit sphere.\n\n### Discretize\n\n- `step_discretize(recipe, ..., num_breaks = 4, min_unique = 10, options = list(prefix = \"bin\"))`: Converts numeric data into a factor with bins having approximately the same number of data points.\n\n- `step_cut(recipe, ..., breaks, include_outside_range = FALSE)`: Cuts a numeric variable into a factor based on provided boundary values.\n\n\n## Imputation\n\n- `step_impute_bag(recipe, ..., impute_with = all_predictors(), trees = 25, options = list(keepX = FALSE))`: Creates a bagged tree model for data. 
Good for categorical data.\n\n- `step_impute_knn(recipe, ..., neighbors = 5, impute_with = all_predictors(), options = list(nthread = 1, eps = 1e-08))`: Uses Gower's distance which can be used for mixtures of nominal and numeric data.\n\n- `step_impute_linear(recipe, ..., impute_with = all_predictors())`: Creates linear regression models to impute missing data.\n\n- `step_impute_lower(recipe, ..., threshold = NULL)`: Substitutes the truncated value by a random number between zero and the truncation point.\n\n- `step_impute_mean(recipe, ..., trim = 0)`: Substitutes missing values of numeric variables by the training set mean of those variables.\n\n- `step_impute_median(recipe, ...)`: Substitutes missing values of numeric variables by the training set median of those variables.\n\n- `step_impute_mode(recipe, ...)`: Imputes nominal data using the most common value.\n\n- `step_impute_roll(recipe, ..., statistic = median, window = 5L)`: Imputes numeric data using a rolling window statistic.\n\n- `step_unknown(recipe, ..., new_level = \"unknown\")`: Assigns a missing value in a factor level to “unknown”.\n\n\n## Encodings\n\n### Type Converters\n\n- `step_factor2string(recipe, ...)`: Converts one or more factor vectors to strings.\n\n- `step_string2factor(recipe, ...)`: Converts one or more character vectors to factors (ordered or unordered).\n\n- `step_num2factor(recipe, ..., transform = function(x) x)`: Converts one or more numeric vectors to factors (ordered or unordered). 
This can be useful when categories are encoded as integers.\n\n- `step_integer(recipe, ..., strict = TRUE, zero_based = FALSE)`: Converts data into a set of ascending integers based on the ascending order from the training data.\n\n### Value Converters\n\n- `step_indicate_na(recipe, ..., sparse = \"auto\", keep_original_cols = TRUE)`: Creates and appends additional binary columns to the data set to indicate which observations are missing.\n\n- `step_ordinalscore(recipe, ..., convert = as.numeric)`: Converts ordinal factor variables into numeric scores.\n\n- `step_unorder(recipe, ...)`: Turns ordered factor variables into unordered factor variables.\n\n### Other\n\n- `step_relevel(recipe, ..., ref_level)`: Reorders factor columns so that the level specified by `ref_level` is first. This is useful for `contr.treatment()` contrasts which take the first level as the reference.\n\n- `step_novel(recipe, ..., new_level = \"new\")`: Assigns a previously unseen factor level to \"new\".\n\n- `step_other(recipe, ..., threshold = 0.05, other = \"other\")`: Pools infrequently occurring values into an \"other\" category.\n\n\n## Dummy Variables\n\n- `step_dummy(recipe, ..., threshold = 0, other = \"other\", naming = dummy_names, prefix = NULL, keep_original_cols = TRUE)`: Standard dummy variable converter.\n\n- `step_dummy_extract(recipe, ..., sep = NULL, pattern = NULL, threshold = 0, other = \"other\", keep_original_cols = TRUE)`: Converts multiple nominal data into one or more numeric integer terms for the levels of the original data.\n\n- `step_dummy_multi_choice(recipe, ..., threshold = 0, other = \"other\", keep_original_cols = TRUE)`: Converts multiple nominal data into one or more numeric binary terms for the levels of the original data.\n\n\n### Convert\n\n- `step_bin2factor(recipe, ..., levels = c(\"yes\", \"no\"), ref_first = TRUE)`: Converts a dummy variable into a 2-level factor.\n\n### Text\n\n- `step_regex(recipe, ..., pattern = \".\", options = 
list(), result = make.names(pattern), sparse = \"auto\", keep_original_cols = TRUE)`: Creates a dummy variable that detects the given regular expression.\n\n- `step_count(recipe, ..., normalize = FALSE, pattern = \".\", options = list(), result = make.names(pattern), sparse = \"auto\", keep_original_cols = TRUE)`: Creates counts of patterns using regular expressions.\n\n## Date & Time\n\n- `step_date(recipe, ..., features = c(\"dow\", \"month\", \"year\"), abbr = TRUE, label = TRUE, ordinal = FALSE, locale = clock::clock_locale()$labels, keep_original_cols = TRUE)`: Converts date data into one or more factor or numeric variables (dow = day of week).\n\n- `step_time(recipe, ..., features = c(\"hour\", \"minute\", \"second\"), keep_original_cols = TRUE)`: Converts date-time data into one or more factor or numeric variables.\n\n- `step_holiday(recipe, ..., holidays = c(\"LaborDay\", \"NewYearsDay\", \"ChristmasDay\"), sparse = \"auto\", keep_original_cols = TRUE)`: Converts date data into binary indicator variables for common holidays.\n\n## Multivariate Transformation\n\n- `step_pca(recipe, ..., num_comp = 5, threshold = NA, options = list(), keep_original_cols = TRUE)`: Converts numeric variables into one or more principal components.\n\n- `step_ica(recipe, ..., num_comp = 5, options = list(method = \"C\"), keep_original_cols = TRUE)`: Converts numeric data into one or more independent components.\n\n- `step_kpca_poly(recipe, ..., num_comp = 5, degree = 2, scale_factor = 1, offset = 1, keep_original_cols = TRUE)`: Converts numeric data into principal components using a polynomial kernel basis expansion.\n\n- `step_kpca_rbf(recipe, ..., num_comp = 5, sigma = 0.2, keep_original_cols = TRUE)`: Converts numeric data into principal components using a radial basis function kernel basis expansion.\n\n- `step_isomap(recipe, ..., num_terms = 5, neighbors = 50, options = list(.mute = c(\"message\", \"output\")), keep_original_cols = TRUE)`: Uses multidimensional scaling to 
convert numeric data into new dimensions.\n\n- `step_nnmf_sparse(recipe, ..., num_comp = 2, penalty = 0.001, options = list(), keep_original_cols = TRUE)`: Converts numeric data into non-negative components.\n\n- `step_pls(recipe, ..., num_comp = 2, predictor_prop = 1, outcome = NULL, options = list(scale = TRUE), preserve = deprecated(), prefix = \"PLS\", keep_original_cols = TRUE)`: Converts numeric data into one or more new dimensions.\n\n\n### Centroids\n\n- `step_classdist(recipe, ..., class, mean_func = mean, cov_func = cov, pool = FALSE, log = TRUE, prefix = \"classdist_\", keep_original_cols = TRUE)`: Converts numeric data into Mahalanobis distance measurements to the data centroid.\n\n- `step_classdist_shrunken(recipe, ..., class = NULL, threshold = 1/2, sd_offset = 1/2, log = TRUE, prefix = \"classdist_\", keep_original_cols = TRUE)`: Converts numeric data into Euclidean distance to the regularized class centroid.\n\n- `step_depth(recipe, ..., class, metric = \"halfspace\", options = list(), data = NULL, prefix = \"depth_\", keep_original_cols = TRUE)`: Converts numeric data into a measurement of data depth by category\n\n### Other\n\n- `step_geodist(recipe, lat = NULL, lon = NULL, ref_lat = NULL, ref_lon = NULL, is_lat_lon = TRUE, log = FALSE, name = \"geo_dist\", keep_original_cols = TRUE)`: Calculates the distance between points on a map to a reference location.\n\n- `step_ratio(recipe, ..., denom = denom_vars(), naming = function(numer, denom) {make.names(paste(numer, denom, sep = \"_o_\")) }, keep_original_cols = TRUE)`: Creates ratios from selected numeric variables (denom).\n\n## Row Operations\n\n- `step_naomit(recipe, ...)`: Removes observations if they contain `NA` or `NaN` values.\n\n- `step_sample(recipe, ..., size = NULL, replace = FALSE)`: Samples rows using `dplyr::sample_n()` or `dplyr::sample_frac()`.\n\n- `step_shuffle(recipe, ...)`: Randomly changes the order of rows for selected variables.\n\n- `step_slice(recipe, ...)`: Filters rows 
using `dplyr::slice()`.\n\n## Other\n\n- `step_interact(recipe, terms, sep = \"_x_\", keep_original_cols = TRUE)` - Creates new columns that are interaction terms between two or more variables.\n\n- `step_rename(recipe, ...)` - Renames variables using `dplyr::rename()`.\n\n- `step_window(recipe, ..., size = 3, na_rm = TRUE, statistic = \"mean\", keep_original_cols = TRUE)` - Creates new columns that are the results of functions that compute statistics across moving windows.\n\n## Role & Type\n\n### Selectors\n\n- `all_outcomes()` / `all_predictors()` - Select variables from the formula based on the two most common roles.\n\n- `has_role(match = \"predictor\")` - Select by passing the role name required.\n\n- `has_type(match = \"numeric\")` - Select by type of variable.\n\n### Convenience selectors\n\n| | Double | Integer | Text |Logical|Factor \\n Unordered | Factor \\n Ordered|\n|---------------|------|------|------|------|------|------|\n| `all_string_predictors()` | | |✅| | | |\n| `all_logical_predictors()` | | | |✅| | |\n| `all_numeric_predictors()` | ✅|✅| | | | |\n| `all_integer_predictors()` | |✅| | | | |\n| `all_double_predictors()` | ✅| | | | | |\n| `all_factor_predictors()` | | | | |✅|✅|\n| `all_ordered_predictors()` | | | | | |✅|\n| `all_unordered_predictors()` | | | | |✅| |\n| `all_nominal_predictors()` | | |✅| |✅|✅|\n\n\n- `all_date_predictors()` / `all_datetime_predictors()`\n\n\n\n### Role Management\n\n*If a variable is neither an outcome nor a predictor but needs to be retained, create a new role and set it to not ‘bake’.*\n\n```r\nrec <- recipe(x ~ ., data = train_data) |>\n update_role(my_id, new_role = \"id\") |>\n update_role_requirements(\"id\", bake = FALSE)\n```\n\n- `add_role(recipe, ..., new_role = \"predictor\", new_type = NULL)` - Adds an additional role to variables that already have a role in the recipe. 
\n\n- `update_role(recipe, ..., new_role = \"predictor\", old_role = NULL)` - Alters an existing role in the recipe or assigns an initial role to variables that do not yet have a declared role.\n\n- `remove_role(recipe, ..., old_role)` - Eliminates a single existing role in the recipe.\n\n- `update_role_requirements(recipe, ..., bake = NULL)` - Allows fine-tuning of the requirements of the various roles you might come across in recipes.\n\nTo learn more about roles see: [https://recipes.tidymodels.org/reference/roles.html](https://recipes.tidymodels.org/reference/roles.html).\n",