Description
Explicitly setting "weights_column = NULL" when calling h2o.gbm (or h2o.randomForest) causes columns that were not listed in x to be used as predictors. The following example illustrates the issue using the prostate data that ships with the h2o R package.
library(h2o)
h2o.init()
pros_path <- system.file("extdata", "prostate.csv", package="h2o")
pros_data <- h2o.importFile(pros_path)
pros_data$CAPSULE <- as.factor(pros_data$CAPSULE)
WITHOUT setting "weights_column = NULL" explicitly in h2o.gbm, things work fine: only the selected columns (3 to 9) appear in the variable importances.
model <- h2o.gbm(x = c(3:9), y = 2, pros_data, distribution = "bernoulli")
h2o.varimp(model)
# Variable Importances:
#   variable relative_importance scaled_importance percentage
# 1  GLEASON          112.808281          1.000000   0.352302
# 2      PSA           70.384674          0.623932   0.219813
# 3      VOL           51.909214          0.460154   0.162113
# 4    DPROS           42.125877          0.373429   0.131560
# 5      AGE           36.616043          0.324586   0.114353
# 6     RACE            3.215139          0.028501   0.010041
# 7    DCAPS            3.143716          0.027868   0.009818
Setting "weights_column = NULL" explicitly in h2o.gbm causes ID (not selected initially) to be used as a predictor.
model <- h2o.gbm(x = c(3:9), y = 2, pros_data, distribution = "bernoulli", weights_column = NULL)
h2o.varimp(model)
# Variable Importances:
#   variable relative_importance scaled_importance percentage
# 1  GLEASON          114.578041          1.000000   0.331926
# 2       ID           55.921295          0.488063   0.162001
# 3      PSA           54.399891          0.474785   0.157593
# 4      VOL           39.352005          0.343452   0.114000
# 5    DPROS           37.653904          0.328631   0.109081
# 6      AGE           34.698425          0.302837   0.100519
# 7    DCAPS            6.719040          0.058642   0.019465
# 8     RACE            1.869231          0.016314   0.005415
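Until this is fixed, a possible stopgap for code that builds its arguments programmatically is to strip NULL entries before calling h2o.gbm, and then to check the variable importances for columns that were never selected. The sketch below is only an illustration, not a fix for the underlying bug: `drop_null_args` and `unexpected_predictors` are hypothetical helper names (not part of h2o), and the `reported` data frame just re-types the variable names from the second output above so the check runs without a cluster.

```r
# Hypothetical helper (not part of h2o): remove NULL-valued entries from an
# argument list so the wrapped function never receives an explicit NULL.
drop_null_args <- function(args) {
  args[!vapply(args, is.null, logical(1))]
}

# Hypothetical helper (not part of h2o): return the variables that appear in a
# variable-importance table but were not among the selected predictors.
unexpected_predictors <- function(varimp, selected) {
  setdiff(as.character(varimp$variable), selected)
}

# Arguments as a wrapper might assemble them; list() keeps the NULL entry.
gbm_args <- list(
  x = 3:9,
  y = 2,
  distribution = "bernoulli",
  weights_column = NULL
)
clean_args <- drop_null_args(gbm_args)  # weights_column is no longer present

# The seven predictors selected by x = c(3:9) in the example above.
selected <- c("AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON")

# Variable names copied from the second h2o.varimp() output above.
reported <- data.frame(
  variable = c("GLEASON", "ID", "PSA", "VOL", "DPROS", "AGE", "DCAPS", "RACE"),
  stringsAsFactors = FALSE
)
extra <- unexpected_predictors(reported, selected)

# With a running cluster, the same pieces would be used roughly like this
# (assumes pros_data from the example above):
# model <- do.call(h2o.gbm, c(clean_args, list(training_frame = pros_data)))
# unexpected_predictors(as.data.frame(h2o.varimp(model)), selected)
```

For the second output above, `extra` contains only "ID", which matches the column that was pulled in unintentionally.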