R: setting weights_column = NULL causes unwanted variables to be used as predictors #14936

@exalate-issue-sync

Setting "weights_column = NULL" explicitly when calling h2o.gbm (or h2o.randomForest) causes unwanted variables to be used as predictors. The following example illustrates this issue using the prostate data that comes with the h2o R package.

library(h2o)
h2o.init()
pros_path <- system.file("extdata", "prostate.csv", package="h2o")
pros_data <- h2o.importFile(pros_path)

pros_data$CAPSULE <- as.factor(pros_data$CAPSULE)

Without setting "weights_column = NULL" explicitly in h2o.gbm, only the seven selected columns are used as predictors.

model <- h2o.gbm(x = c(3:9), y = 2, pros_data, distribution = "bernoulli")

h2o.varimp(model)
#Variable Importances:
#  variable relative_importance scaled_importance percentage
#1  GLEASON          112.808281          1.000000   0.352302
#2      PSA           70.384674          0.623932   0.219813
#3      VOL           51.909214          0.460154   0.162113
#4    DPROS           42.125877          0.373429   0.131560
#5      AGE           36.616043          0.324586   0.114353
#6     RACE            3.215139          0.028501   0.010041
#7    DCAPS            3.143716          0.027868   0.009818
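Since the call that leaves weights_column out entirely picks up only the requested predictors, code that computes the weights column programmatically may want to omit the argument rather than pass NULL. The following is a minimal sketch of that idea, not a confirmed fix. call_without_nulls and report_weights are hypothetical names introduced here: the first drops NULL-valued arguments before the call, the second is a stand-in function so the example runs without an H2O cluster. It makes no claim about how h2o.gbm handles the argument internally.

```r
# Hypothetical helper (not part of h2o): call `fun` with only the arguments
# whose values are not NULL, so an optional argument such as weights_column
# is left out of the call instead of being passed explicitly as NULL.
call_without_nulls <- function(fun, args) {
  keep <- !vapply(args, is.null, logical(1))
  do.call(fun, args[keep])
}

# Stand-in for a modelling function, used only to show the effect of the
# helper. It reports whether weights_column was supplied by the caller.
report_weights <- function(x, y, weights_column = NULL) {
  if (missing(weights_column)) {
    "weights_column omitted"
  } else {
    "weights_column supplied"
  }
}

# list() keeps NULL-valued elements, so this list has three entries.
args <- list(x = 3:9, y = 2, weights_column = NULL)

direct   <- do.call(report_weights, args)            # NULL passed explicitly
filtered <- call_without_nulls(report_weights, args) # NULL argument dropped

print(direct)
print(filtered)

# Assumed usage with h2o (untested here; `w` is a hypothetical variable that
# holds either a column name or NULL):
# model <- call_without_nulls(h2o.gbm, list(
#   x = 3:9, y = 2, training_frame = pros_data,
#   distribution = "bernoulli", weights_column = w))
```

One caveat of this design is that every NULL-valued argument is dropped, so it is only suitable when NULL always means "not specified" for the function being called.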

Setting "weights_column = NULL" explicitly in h2o.gbm causes ID (not selected initially) to be used as a predictor.

model <- h2o.gbm(x = c(3:9), y = 2, pros_data, distribution = "bernoulli", weights_column = NULL)

h2o.varimp(model)
#Variable Importances:
#  variable relative_importance scaled_importance percentage
#1  GLEASON          114.578041          1.000000   0.331926
#2       ID           55.921295          0.488063   0.162001
#3      PSA           54.399891          0.474785   0.157593
#4      VOL           39.352005          0.343452   0.114000
#5    DPROS           37.653904          0.328631   0.109081
#6      AGE           34.698425          0.302837   0.100519
#7    DCAPS            6.719040          0.058642   0.019465
#8     RACE            1.869231          0.016314   0.005415
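The difference between the two runs can be checked mechanically rather than by eye. The sketch below compares the variables reported in each importance table against the intended predictors. The variable names are copied from the two tables in this report, and unexpected_predictors is a hypothetical helper name, so the example runs in base R without an H2O cluster.

```r
# Predictors requested via x = c(3:9); names taken from the importance table
# of the run without weights_column.
intended <- c("AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON")

# Variables reported by h2o.varimp() in the two runs, copied from this report.
vars_default <- c("GLEASON", "PSA", "VOL", "DPROS", "AGE", "RACE", "DCAPS")
vars_null    <- c("GLEASON", "ID", "PSA", "VOL", "DPROS", "AGE", "DCAPS", "RACE")

# Hypothetical helper: variables the model used that were never requested.
unexpected_predictors <- function(used, intended) {
  setdiff(used, intended)
}

extra_default <- unexpected_predictors(vars_default, intended)
extra_null    <- unexpected_predictors(vars_null, intended)

print(extra_default)  # no unexpected variables in the default run
print(extra_null)     # the unrequested ID column
```

Against a live model, the same check would presumably take h2o.varimp(model)$variable as the used set and names(pros_data)[3:9] as the intended set.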
