Categorical Variables: Adding approach described by ESL as alternative to sufrep or dummies? #1512

@skranz

Description

I wanted to ask whether, if it is not too complicated, you might still consider implementing, at least for regression forests, the original proposal for dealing with categorical variables here: #109

> We should likely implement the approach suggested in ESL where, in each node, factor variables are ordered by their mean outcome before performing the split. This should be properly generalized to handle non-regression forests.

Being able to order factor levels by their mean outcome at each node seems like the most intuitive way to deal with categorical variables. For example, I imagine that using a sufrep representation based on the mean of the outcome variable in the complete data set might violate the clean honest sample splitting and thus might not be ideal from a statistical perspective. Using the approach described in ESL (Section 9.2.4, p. 311), one would only use outcome means computed on the half-sample used to build the tree, so honesty should still be preserved. Also, the ordering based on the subsample at each node might sometimes deviate substantially from a global ordering.
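To make the idea concrete, here is a minimal C++-style sketch of the per-node ordering step. All names (`order_levels_by_mean_outcome`, `node_x`, `node_y`, `K`) are made up for illustration and are not grf's actual internals; this is just the ESL trick under the assumption that the categorical variable is encoded as integers 0..K-1.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Order the K levels of a categorical predictor by their mean outcome within
// the current node. Per ESL (Sec. 9.2.4), treating the levels in this order
// as if they were ordinal means only K-1 candidate splits have to be scanned
// instead of 2^(K-1) - 1 subsets. Levels with no observations in the node are
// sorted last (their handling is discussed further below).
std::vector<std::size_t> order_levels_by_mean_outcome(
    const std::vector<int>& node_x,     // values 0..K-1 within the node
    const std::vector<double>& node_y,  // outcomes within the node
    std::size_t K) {
  std::vector<double> sum(K, 0.0);
  std::vector<std::size_t> count(K, 0);
  for (std::size_t i = 0; i < node_x.size(); ++i) {
    sum[node_x[i]] += node_y[i];
    ++count[node_x[i]];
  }
  std::vector<std::size_t> order(K);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b) {
              double mean_a = count[a] ? sum[a] / count[a]
                                       : std::numeric_limits<double>::infinity();
              double mean_b = count[b] ? sum[b] / count[b]
                                       : std::numeric_limits<double>::infinity();
              return mean_a < mean_b;
            });
  return order;
}
```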

I know it is probably a lot of work, and it would be much better if I implemented it myself and could offer a pull request. But unfortunately I am not at all proficient in C++, and I don't really trust what ChatGPT proposed when I asked it for the code changes. I added a rough conceptual sketch below in case it is any help. But I can fully understand if you think it would require overly complex changes and is not worth the gains.

A sketch of a possible code change for training a tree

tree trainer: subsample, options → tree

  1. draw half sample for honesty
  2. until a termination condition is reached:
    2.1 draw possible split variables
    2.2 compute pseudo outcomes
    2.3 find best split
  3. repopulate leaves for honesty
  4. precompute sufficient statistics

Extension for categorical variables in a regression tree

Consider a categorical variable x with K levels. It must be stored as an integer taking values 0, ..., K-1.

  1. For every possible split variable x that is categorical, do the following (the sub-step numbers refer to the corresponding steps of the tree trainer above):

2.1 Compute a new numeric variable z with K levels, where z[k] is the mean of the outcome variable y over all observations in the node (a subset of the honest half-sample used to build the tree) with x = k.

2.2 I don't fully understand that step, but if possible use z instead of x.

2.3 Use z to find the best split for variable x.

  2. To repopulate the leaves or to make predictions for a completely new data set, we need to store which x values go to the left and which go to the right. A simple threshold value, as for numeric variables, does not suffice, since z in general does not have the same ordering as x. One way is to store a simple boolean vector L of size K (L for left), with L[k] = 1 if and only if z[k] was smaller than the selected threshold value.

Problem: which values should be assigned to L[k] for levels k that did not occur for any observation in the subsample at the node we split? We could set L[k] = 0 by default, meaning we only identify the subset of levels that send us left and let all other levels go right. I could imagine rpart does the same. A sketch of how steps 2.1 and 2.3 and this bookkeeping could fit together is given below.
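Here is a hedged, self-contained sketch of steps 2.1 and 2.3 plus the boolean vector L for a plain regression split. The function and struct names (`find_categorical_split`, `CategoricalSplit`) are invented for illustration and do not correspond to grf's code, and the pseudo-outcome step 2.2 is left out. Unseen levels keep the default "go right", as proposed in the Problem paragraph above.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

struct CategoricalSplit {
  std::vector<bool> go_left;  // the vector L: go_left[k] == true sends level k left
};

// Find the best split on a categorical variable with levels 0..K-1 by ordering
// the levels by their within-node means z[k] (step 2.1) and scanning ordered
// prefixes (step 2.3). The result is encoded directly as the vector L.
CategoricalSplit find_categorical_split(const std::vector<int>& node_x,
                                        const std::vector<double>& node_y,
                                        std::size_t K) {
  const std::size_t n = node_x.size();
  CategoricalSplit split;
  split.go_left.assign(K, false);  // default: every level goes right

  // Step 2.1: per-level outcome sums and counts; z[k] = sum[k] / count[k].
  std::vector<double> sum(K, 0.0);
  std::vector<std::size_t> count(K, 0);
  for (std::size_t i = 0; i < n; ++i) {
    sum[node_x[i]] += node_y[i];
    ++count[node_x[i]];
  }

  // Order the levels actually observed in this node by z[k].
  std::vector<std::size_t> observed;
  for (std::size_t k = 0; k < K; ++k)
    if (count[k] > 0) observed.push_back(k);
  if (observed.size() < 2) return split;  // no valid split on this variable
  std::sort(observed.begin(), observed.end(),
            [&](std::size_t a, std::size_t b) {
              return sum[a] / count[a] < sum[b] / count[b];
            });

  // Step 2.3: scan prefixes of the ordered levels as left children. Minimizing
  // the within-child squared error is equivalent to maximizing
  // left_sum^2 / n_left + right_sum^2 / n_right.
  const double total = std::accumulate(node_y.begin(), node_y.end(), 0.0);
  double best_score = -std::numeric_limits<double>::infinity();
  std::size_t best_prefix = 1;
  double left_sum = 0.0;
  std::size_t n_left = 0;
  for (std::size_t j = 0; j + 1 < observed.size(); ++j) {
    left_sum += sum[observed[j]];
    n_left += count[observed[j]];
    const double right_sum = total - left_sum;
    const std::size_t n_right = n - n_left;
    const double score =
        left_sum * left_sum / n_left + right_sum * right_sum / n_right;
    if (score > best_score) {
      best_score = score;
      best_prefix = j + 1;
    }
  }

  // Build L: levels in the best prefix go left; all other levels, including
  // levels never seen in this node, keep the default "go right".
  for (std::size_t j = 0; j < best_prefix; ++j)
    split.go_left[observed[j]] = true;
  return split;
}
```

With this encoding, routing at prediction or leaf-repopulation time is just a table lookup, e.g. `bool left = split.go_left[x_value];`, so no threshold on z needs to be stored at all.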
