Hello there!
I work in financial services, where monotone constraints are particularly important for the interpretability of our predictive models. Fortunately, the xgboost framework supports monotone constraints for numeric predictors. My team heavily uses tidymodels for model research. While parsnip supports passing additional arguments to the computational engine via ... in parsnip::set_engine(), the xgboost engine requires a vector of values where -1 corresponds to a negative constraint, 0 corresponds to no constraint, and 1 corresponds to a positive constraint.
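For context, here is a minimal sketch of that positional interface when the engine is used directly, outside of tidymodels (the objective and nrounds values are just illustrative assumptions):
library(xgboost)
dtrain <- xgboost::xgb.DMatrix(
  data = as.matrix(mtcars[, c("mpg", "hp", "qsec")]),
  label = mtcars$disp
)
raw_fit <- xgboost::xgb.train(
  params = list(
    objective = "reg:squarederror",
    # Positional: must match the column order of the matrix above
    monotone_constraints = c(-1, 1, 0)
  ),
  data = dtrain,
  nrounds = 50
)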
This gets a little tricky when working with recipes, because we typically think of predictors by their column names rather than their positions in the design matrix. I have found a workaround: I first prep the recipe, extract the predictors, and use pattern matching to create the constraint vector. I then pass it to parsnip::set_engine() using {{ }} so that it injects the vector and the workflow can still be used in cross-validation and other tuning exercises.
#### Setup ####
library(tidyverse)
library(tidymodels)
#### Model ####
xgb_rec <-
recipes::recipe(
disp ~ mpg + hp + qsec,
data = mtcars
)
# Create vector of monotone constraints
# First, prep the recipe and pull predictors
# to simulate design matrix. Then, use term
# names to constrain positive or negative
# and snag the vector. This is a little janky.
monotone <-
xgb_rec |>
recipes::prep() |>
purrr::pluck("term_info") |>
dplyr::filter(
role == "predictor"
) |>
dplyr::mutate(
monotone = dplyr::case_match(
variable,
"mpg" ~ -1, # Negative constraint
"hp" ~ 1, # Positive constraint
.default = 0 # No constraint
)
) |>
dplyr::pull(monotone)
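# Result: monotone is c(-1, 1, 0), matching the
# mpg, hp, qsec order of the prepped predictors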
xgb_wf <-
parsnip::boost_tree(
mode = "regression"
) |>
parsnip::set_engine(
engine = "xgboost",
monotone_constraints = {{ monotone }}
) |>
workflows::workflow(
preprocessor = xgb_rec,
spec = _
)
xgb_fit <- parsnip::fit(xgb_wf, mtcars)
#### Validate HP Constraint ####
# Create synthetic data to verify a positive,
# monotonic relationship between HP and the
# predicted output, holding different MPG
# values constant.
synthetic <-
mtcars |>
dplyr::slice_head(
n = 1
) |>
dplyr::select(
-c(hp, mpg)
) |>
tidyr::expand_grid(
mpg = c(10, 20, 30),
hp = seq(50, 235)
)
# Looks good!
synthetic |>
parsnip::augment(
x = xgb_fit
) |>
ggplot2::ggplot(
ggplot2::aes(
x = hp,
y = .pred,
color = factor(mpg)
)
) +
ggplot2::geom_line()
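As a complement to the plot, here is a quick non-visual check (a sketch reusing the same synthetic grid): within each mpg group, the rows are ordered by ascending hp, so the predictions should be non-decreasing.
synthetic |>
  parsnip::augment(
    x = xgb_fit
  ) |>
  dplyr::group_by(mpg) |>
  dplyr::summarise(
    # TRUE if predictions never decrease as hp increases
    monotone_in_hp = all(diff(.pred) >= 0)
  )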
My proposal is to provide more official support for monotone constraints via the parsnip::boost_tree() function, so that we can tune our models with and without these constraints and refer to predictors by their names in the prepped recipe rather than relying on a workaround like the one above. Maybe it would look something like this:
# Explicit
parsnip::boost_tree(
mode = "regression",
engine = "xgboost",
monotone_positive = "hp",
monotone_negative = "mpg"
)
# For use later in tune grid
parsnip::boost_tree(
mode = "regression",
engine = "xgboost",
monotone_positive = tune::tune()
)
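Under the hood, the name-to-position translation could happen at fit time, once the prepped predictor order is known. A rough sketch (make_monotone_vector and its arguments are purely hypothetical, not existing parsnip API):
make_monotone_vector <- function(predictors,
                                 positive = character(),
                                 negative = character()) {
  # Map each predictor name to -1/0/1 by membership,
  # preserving the design-matrix column order
  vapply(
    predictors,
    function(p) {
      if (p %in% positive) 1L else if (p %in% negative) -1L else 0L
    },
    integer(1)
  )
}
make_monotone_vector(
  predictors = c("mpg", "hp", "qsec"),
  positive = "hp",
  negative = "mpg"
)
#>  mpg   hp qsec
#>   -1    1    0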
I am also interested in contributing to open source! I have one package of my own on CRAN and develop R packages internally at my organization, so I am happy to take a crack at developing this change if folks are open to it.
Let me know your thoughts!