Categorical Variables: Adding approach described by ESL as alternative to sufrep or dummies? #1512

@skranz

Description

I wanted to ask whether, if it is not too complicated, you might still consider implementing, at least for regression forests, the original proposal for dealing with categorical variables here: #109

> We should likely implement the approach suggested in ESL where, in each node, factor variables are ordered by their mean outcome before performing the split. This should be properly generalized to handle non-regression forests.

Being able to order factor levels by their mean outcome at each node seems like the most intuitive way to deal with categorical variables. For example, I imagine that using a sufrep representation based on the mean of the outcome variable in the complete data set might violate the clean honest sample splitting and thus might not be ideal from a statistical perspective. Using the approach described in ESL (Section 9.2.4, p. 311), one would only use outcome means computed on the half-sample used to build the tree, so honesty should still be preserved. Also, the ordering based on the subsample at each node might sometimes deviate substantially from a global ordering.
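To make the idea concrete, here is a minimal C++-style sketch of the per-node ordering step. All names (`order_levels_by_mean_outcome`, `node_x`, `node_y`, `K`) are made up for illustration and are not grf's actual internals; this is just the ESL trick under the assumption that the categorical variable is encoded as integers 0..K-1.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Order the K levels of a categorical predictor by their mean outcome within
// the current node. Per ESL (Sec. 9.2.4), treating the levels in this order
// as if they were ordinal means only K-1 candidate splits have to be scanned
// instead of 2^(K-1) - 1 subsets. Levels with no observations in the node are
// sorted last (their handling is discussed further below).
std::vector<std::size_t> order_levels_by_mean_outcome(
    const std::vector<int>& node_x,     // values 0..K-1 within the node
    const std::vector<double>& node_y,  // outcomes within the node
    std::size_t K) {
  std::vector<double> sum(K, 0.0);
  std::vector<std::size_t> count(K, 0);
  for (std::size_t i = 0; i < node_x.size(); ++i) {
    sum[node_x[i]] += node_y[i];
    ++count[node_x[i]];
  }
  std::vector<std::size_t> order(K);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b) {
              double mean_a = count[a] ? sum[a] / count[a]
                                       : std::numeric_limits<double>::infinity();
              double mean_b = count[b] ? sum[b] / count[b]
                                       : std::numeric_limits<double>::infinity();
              return mean_a < mean_b;
            });
  return order;
}
```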

I know it is probably a lot of work, and it would be much better if I implemented it myself and could offer a pull request. But unfortunately I am not at all proficient in C++, and I don't really trust what ChatGPT proposed when I asked it for the code changes. I added a rough conceptual sketch below in case it is any help. But I can fully understand if you think it would require overly complex changes and is not worth the gains.

A sketch of a possible code change for training a tree

tree trainer: subsample, options → tree

  1. draw half sample for honesty
  2. until a termination condition is reached:
    2.1 draw possible split variables
    2.2 compute pseudo outcomes
    2.3 find best split
  3. repopulate leaves for honesty
  4. precompute sufficient statistics

Extension for categorical variables in a regression tree

Consider a categorical variable x with K levels. It must be stored as an integer taking values 0, ..., K-1.

  1. For every possible split variable x that is categorical, do the following (the sub-step numbers refer to the corresponding steps of the tree trainer above):

2.1 Compute a new numeric variable z with K levels, where z[k] is the mean of the outcome variable y over all observations in the node (a subset of the honest half-sample used to build the tree) with x = k.

2.2 I don't fully understand that step, but if possible use z instead of x.

2.3 Use z to find the best split for variable x.

  2. To repopulate the leaves or to make predictions for a completely new data set, we need to store which x values go to the left and which go to the right. A simple threshold value, as for numeric variables, does not suffice, since z in general does not have the same ordering as x. One way is to store a simple boolean vector L of size K (L for left), with L[k] = 1 if and only if z[k] was smaller than the selected threshold value.

Problem: which values should be assigned to L[k] for levels k that did not occur for any observation in the subsample at the node we split? We could set L[k] = 0 by default, meaning we only identify the subset of levels that send us left and let all other levels go right. I could imagine rpart does the same. A sketch of how steps 2.1 and 2.3 and this bookkeeping could fit together is given below.
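Here is a hedged, self-contained sketch of steps 2.1 and 2.3 plus the boolean vector L for a plain regression split. The function and struct names (`find_categorical_split`, `CategoricalSplit`) are invented for illustration and do not correspond to grf's code, and the pseudo-outcome step 2.2 is left out. Unseen levels keep the default "go right", as proposed in the Problem paragraph above.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

struct CategoricalSplit {
  std::vector<bool> go_left;  // the vector L: go_left[k] == true sends level k left
};

// Find the best split on a categorical variable with levels 0..K-1 by ordering
// the levels by their within-node means z[k] (step 2.1) and scanning ordered
// prefixes (step 2.3). The result is encoded directly as the vector L.
CategoricalSplit find_categorical_split(const std::vector<int>& node_x,
                                        const std::vector<double>& node_y,
                                        std::size_t K) {
  const std::size_t n = node_x.size();
  CategoricalSplit split;
  split.go_left.assign(K, false);  // default: every level goes right

  // Step 2.1: per-level outcome sums and counts; z[k] = sum[k] / count[k].
  std::vector<double> sum(K, 0.0);
  std::vector<std::size_t> count(K, 0);
  for (std::size_t i = 0; i < n; ++i) {
    sum[node_x[i]] += node_y[i];
    ++count[node_x[i]];
  }

  // Order the levels actually observed in this node by z[k].
  std::vector<std::size_t> observed;
  for (std::size_t k = 0; k < K; ++k)
    if (count[k] > 0) observed.push_back(k);
  if (observed.size() < 2) return split;  // no valid split on this variable
  std::sort(observed.begin(), observed.end(),
            [&](std::size_t a, std::size_t b) {
              return sum[a] / count[a] < sum[b] / count[b];
            });

  // Step 2.3: scan prefixes of the ordered levels as left children. Minimizing
  // the within-child squared error is equivalent to maximizing
  // left_sum^2 / n_left + right_sum^2 / n_right.
  const double total = std::accumulate(node_y.begin(), node_y.end(), 0.0);
  double best_score = -std::numeric_limits<double>::infinity();
  std::size_t best_prefix = 1;
  double left_sum = 0.0;
  std::size_t n_left = 0;
  for (std::size_t j = 0; j + 1 < observed.size(); ++j) {
    left_sum += sum[observed[j]];
    n_left += count[observed[j]];
    const double right_sum = total - left_sum;
    const std::size_t n_right = n - n_left;
    const double score =
        left_sum * left_sum / n_left + right_sum * right_sum / n_right;
    if (score > best_score) {
      best_score = score;
      best_prefix = j + 1;
    }
  }

  // Build L: levels in the best prefix go left; all other levels, including
  // levels never seen in this node, keep the default "go right".
  for (std::size_t j = 0; j < best_prefix; ++j)
    split.go_left[observed[j]] = true;
  return split;
}
```

With this encoding, routing at prediction or leaf-repopulation time is just a table lookup, e.g. `bool left = split.go_left[x_value];`, so no threshold on z needs to be stored at all.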
