
No split found when target/output of RandomForestRegressor is very low. #191

@Hugemiler


Hello! I'm running into an issue when I use the RandomForestRegressor on a dataset whose outputs (Y) are very small (in the range of 1e-3, 1e-4 and lower). When the values fall into that range, each DecisionTree consists of a single leaf that always predicts a constant value: the mean of the desired output. Curiously, simply multiplying the output vector by 100 or 1000 seems to solve the problem, and valid splits can then be found.

I'll try to provide a shareable MWE in the next few days. It might take a while, because the feature set is large and the data is protected, so I will have to do some masking. However, going through the source code, I'm wondering whether the culprit might be

|| tsum * node.label > -1e-7 * wsum + tssq)

because the cases where the issue happens coincide with the cases where sum(y) * mean(y) > -1e-7 * length(y) + sum(y .^ 2) evaluates to true. Sure enough, when I multiply everything by a small power of 10, this expression becomes false, and valid splits can then be found.
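To make the arithmetic concrete: since sum(y) * mean(y) equals n * mean(y)^2 and sum(y .^ 2) - n * mean(y)^2 equals n times the population variance, the expression is algebraically the same as asking whether the population variance of y is below 1e-7. The sketch below is a Python transliteration of that expression only, not of the library's code; the name `looks_pure`, the `tol` parameter, and the sample values are illustrative assumptions.

```python
from statistics import pvariance

def looks_pure(y, tol=1e-7):
    """Mirror of the quoted check: sum(y) * mean(y) > -tol * n + sum(y^2).

    Hypothetical helper for illustration; only the inequality itself
    comes from the issue text.
    """
    n = len(y)
    total = sum(y)
    sum_sq = sum(v * v for v in y)
    return total * (total / n) > -tol * n + sum_sq

# Made-up targets in the 1e-4 range, and the same targets scaled by 1000.
y_small = [1e-4, 3e-4, 2e-4, 5e-4, 4e-4, 6e-4]
y_scaled = [1000 * v for v in y_small]

for name, y in (("small", y_small), ("scaled", y_scaled)):
    # The check fires exactly when the population variance is below tol.
    print(name, looks_pure(y), pvariance(y) < 1e-7)
```

With these values the check is true for the small targets and false for the scaled ones. Scaling y by a factor k multiplies the variance by k^2, which would explain why a factor of 100 or 1000 is enough to get past a fixed 1e-7 threshold.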

Now, it is my understanding that, to a certain extent, Decision Trees should be invariant to any kind of normalization. I expect valid splits to be found even when the outputs are very, very small. Also, looking through Breiman's original C code (available here), I have not found an equivalent of the specific line I quoted above (but that could be my poor C reading skills).
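The invariance expectation can be illustrated with a generic variance-reduction split search. The sketch below is a textbook-style exhaustive search written for this illustration, not DecisionTree.jl's or Breiman's implementation; `best_threshold` and the data are hypothetical. Without an absolute tolerance, rescaling the targets multiplies every candidate's cost by the same factor, so the chosen threshold should not change.

```python
def best_threshold(x, y):
    """Return the threshold on x minimizing the children's summed squared error."""
    def sse(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, thr)
    return best[1]

# Made-up data: targets jump from about 1e-4 to about 5e-4 between x = 3 and x = 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1e-4, 1.2e-4, 1.1e-4, 5e-4, 5.2e-4, 5.1e-4]

print(best_threshold(x, y))                      # small targets
print(best_threshold(x, [1000 * v for v in y]))  # same targets scaled by 1000
```

Both calls pick the same threshold, between x = 3 and x = 4. Any scale dependence therefore has to come from something outside the split criterion itself, such as a fixed absolute tolerance.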

Can you please clarify the purpose of this line, confirm whether this is intended behavior, and see what can be done to mitigate the issue? (I know the lack of an MWE makes things harder; I just wanted to get the issue registered for now.)
