Handling categorical variables with a large number of levels


Hi All,

I have a data set whose variables are a mix of numeric and factor variables, several of the factors having hundreds of levels.
I can one-hot encode the factor variables, but this greatly increases the number of variables and leaves little information in each per-level column.

What I would do in ranger is use the option **respect.unordered.factors = TRUE**. This orders the levels of the factor according to the mean of the response variable. It is known that this gives the same result as the "partition" method if it is done at every node. In this case it is only done once, at the root node, but it still seems to give a good approximation and is the recommended method.
In the absence of that option, I can build an X matrix in which the factor variables are encoded the way **respect.unordered.factors = TRUE** would encode them, and use that matrix to fit the random forest.
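
To make the manual encoding concrete, here is a minimal sketch of what I mean by ordering the levels on the mean of the response. It is written in plain Python only for illustration, it is not ranger's actual implementation, and the function names `order_levels_by_mean` and `encode` are made up for this example.

```python
from collections import defaultdict

def order_levels_by_mean(levels, response):
    """Map each factor level to its 1-based rank when levels are sorted by mean response."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lev, y in zip(levels, response):
        sums[lev] += y
        counts[lev] += 1
    means = {lev: sums[lev] / counts[lev] for lev in sums}
    # Sort by mean response; ties are broken by level name so the result is deterministic.
    ordered = sorted(means, key=lambda lev: (means[lev], lev))
    return {lev: rank for rank, lev in enumerate(ordered, start=1)}

def encode(levels, mapping):
    """Replace each level by its rank. A level missing from the mapping raises KeyError."""
    return [mapping[lev] for lev in levels]

# Toy data: one factor with three levels and a numeric response.
levels = ["a", "b", "c", "a", "b", "c"]
y = [3.0, 1.0, 2.0, 5.0, 1.0, 2.5]

mapping = order_levels_by_mean(levels, y)  # level means: a = 4.0, b = 1.0, c = 2.25
encoded = encode(levels, mapping)
print(mapping)
print(encoded)
```

The resulting integer column would replace the factor in the X matrix. Levels that appear only at prediction time are not handled here and would need an explicit rule.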

However, **causal_forest** requires numeric variables. It first fits two regression forests, forest.Y and forest.W (to estimate the conditional means of Y and W), and then fits **causal_train** to estimate the causal effect.

So far I have considered:
1) making an X matrix with the factor variables ordered on the response Y; this seems right for the forest.Y regression
2) the forest.W regression only predicts two levels, so the same encoded matrix should be fine
3) the **causal_train** forest is more complex, and I am not sure how to handle the factor variables there.


Alternatively:
1) I use **ranger** with **respect.unordered.factors = TRUE** to estimate tau
2) I then make a matrix with the factor variables ordered by the estimated tau
3) I use that matrix in **causal_forest**


Does either of these methods seem right? What are other people doing?

