Add a parsnip Supported Argument for Monotone Constraints on the XGBoost Engine #1259

@walkerjameschris

Hello there!

I work in financial services, where monotone constraints are particularly important for the interpretability of our predictive models. Fortunately, the xgboost framework supports monotone constraints for numeric predictors. My team uses tidymodels heavily for model research. parsnip passes additional arguments to the computational engine via ... in parsnip::set_engine(), but the xgboost engine requires a vector of values, one per predictor, where -1 corresponds to a negative constraint, 0 corresponds to no constraint, and 1 corresponds to a positive constraint.
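As a rough illustration of that positional encoding, the base R sketch below uses model.matrix() as a stand-in for the design matrix. This is an assumption made for illustration only: recipes and parsnip build the real matrix differently, and the factor(cyl) term is added here just to show that one predictor can become several columns.

```r
# Base R stand-in for the design matrix handed to xgboost.
# factor(cyl) is a hypothetical extra predictor that expands into
# two dummy columns under the default treatment contrasts.
design <- model.matrix(disp ~ mpg + hp + factor(cyl), data = mtcars)

# Drop the intercept column to get the predictor columns in order.
predictors <- setdiff(colnames(design), "(Intercept)")

# The constraint vector is positional: one entry per column, in the
# same order as the columns above.
constraints <- c(-1, 1, 0, 0)

# Pair the two for inspection only; the engine sees just the values.
setNames(constraints, predictors)
```

Reordering the predictors or adding a dummy-encoded column changes which position each constraint refers to, which is why writing the vector by hand is fragile.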

This gets a little tricky when working with recipes because we typically think of predictors by their column name, not their position in the design matrix. I have found a workaround: I first prep the recipe, extract the predictors, and use pattern matching to create this vector. I then have to pass it to parsnip::set_engine() using {{ }} so that the vector is injected and the workflow can be used in cross-validation and other tuning exercises.

#### Setup ####

library(tidyverse)
library(tidymodels)

#### Model ####

xgb_rec <-
  recipes::recipe(
    disp ~ mpg + hp + qsec,
    data = mtcars
  )

# Create vector of monotone constraints
# First, prep the recipe and pull predictors
# to simulate design matrix. Then, use term
# names to constrain positive or negative
# and snag the vector. This is a little janky.
monotone <-
  xgb_rec |>
  recipes::prep() |>
  purrr::pluck("term_info") |>
  dplyr::filter(
    role == "predictor"
  ) |>
  dplyr::mutate(
    monotone = dplyr::case_match(
      variable,
      "mpg" ~ -1, # Negative constraint
      "hp" ~ 1, # Positive constraint
      .default = 0 # No constraint
    )
  ) |>
  dplyr::pull(monotone)

xgb_wf <-
  parsnip::boost_tree(
    mode = "regression"
  ) |>
  parsnip::set_engine(
    engine = "xgboost",
    monotone_constraints = {{ monotone }}
  ) |>
  workflows::workflow(
    preprocessor = xgb_rec,
    spec = _
  )

xgb_fit <- parsnip::fit(xgb_wf, mtcars)

#### Validate HP Constraint ####

# Create synthetic data to check that the
# predicted output increases monotonically
# with HP while MPG is held constant at
# several different values.
synthetic <-
  mtcars |>
  dplyr::slice_head(
    n = 1
  ) |>
  dplyr::select(
    - c(hp, mpg)
  ) |>
  tidyr::expand_grid(
    mpg = c(10, 20, 30),
    hp = seq(50, 235)
  )

# Looks good!
synthetic |>
  parsnip::augment(
    x = xgb_fit
  ) |>
  ggplot2::ggplot(
    ggplot2::aes(
      x = hp,
      y = .pred,
      color = factor(mpg)
    )
  ) +
  ggplot2::geom_line()

[Image: line plot of .pred against hp, one line per mpg value]
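The visual check above could also be done numerically. The sketch below is a minimal base R version, assuming a data frame with mpg, hp, and .pred columns like the one parsnip::augment() returns. The helper name is_monotone() and the toy values are made up for illustration and are not output from the fitted model.

```r
# Hypothetical helper: TRUE when successive values never move in the
# wrong direction. A small tolerance absorbs floating-point noise.
is_monotone <- function(pred, increasing = TRUE, tol = 1e-8) {
  steps <- diff(pred)
  if (increasing) all(steps >= -tol) else all(steps <= tol)
}

# Toy stand-in for the augmented synthetic data (invented values).
toy <- data.frame(
  mpg = rep(c(10, 20), each = 4),
  hp = rep(c(50, 100, 150, 200), times = 2),
  .pred = c(100, 120, 120, 160, 90, 95, 110, 110)
)

# Sort by hp within each mpg level, then test each level separately.
toy <- toy[order(toy$mpg, toy$hp), ]
checks <- tapply(toy$.pred, toy$mpg, is_monotone)
all(checks)
```

With the real fit, the same tapply() call could be applied to the result of parsnip::augment(xgb_fit, synthetic), assuming it keeps the mpg and hp columns alongside .pred.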

My proposal is to add first-class support for monotone constraints to parsnip::boost_tree(). That would let us tune our models with and without these constraints and refer to predictors by their names in the prepped recipe, rather than relying on a workaround like the one I showed above. Maybe it would look something like this:

# Explicit
parsnip::boost_tree(
  mode = "regression",
  engine = "xgboost",
  monotone_positive = "hp",
  monotone_negative = "mpg"
)

# For use later in tune grid
parsnip::boost_tree(
  mode = "regression",
  engine = "xgboost",
  monotone_positive = tune::tune()
)
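One way the name-based arguments might be translated into the positional vector is sketched below. resolve_monotone() is a hypothetical helper written for illustration, not an existing parsnip function, and it assumes the predictor names of the processed data are available at fit time.

```r
# Hypothetical helper (not part of parsnip): map predictor names to
# the positional -1/0/1 vector that the xgboost engine expects.
resolve_monotone <- function(predictors,
                             positive = character(),
                             negative = character()) {
  # Fail early on names that are not in the design matrix.
  unknown <- setdiff(c(positive, negative), predictors)
  if (length(unknown) > 0) {
    stop("Unknown predictors: ", paste(unknown, collapse = ", "))
  }

  # A predictor cannot be constrained in both directions.
  both <- intersect(positive, negative)
  if (length(both) > 0) {
    stop("Constrained in both directions: ", paste(both, collapse = ", "))
  }

  # Start unconstrained, then mark each requested direction.
  constraints <- integer(length(predictors))
  constraints[predictors %in% positive] <- 1L
  constraints[predictors %in% negative] <- -1L
  constraints
}

resolve_monotone(
  predictors = c("mpg", "hp", "qsec"),
  positive = "hp",
  negative = "mpg"
)
```

Resolving names only after preprocessing would keep the specification valid when a recipe reorders or adds columns, though how dummy-encoded predictors should be handled is left open here.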

I am also interested in contributing to open source! I have experience with one of my own packages on CRAN and develop R packages internal to my organization, so I am happy to make an attempt at developing this change, if folks are open to it.

Let me know your thoughts!
