Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new experimental GLM option remove_offset_effects to keep offsets during training but remove their effect during scoring/prediction and model metrics, aligning with the “restricted vs unrestricted” model pattern already used for control_variables.
Changes:
- Introduces
remove_offset_effectsparameter in GLM (backend + REST schema) and exposes it in R/Python clients. - Updates GLM scoring/metrics/scoring-history flow to compute both restricted (offset removed) and unrestricted metrics, and enables
make_unrestricted_glm_modelfor this use case. - Adds docs + new tests/examples across Java/R/Python to exercise the feature.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 20 comments.
Show a summary per file
| File | Description |
|---|---|
| h2o-algos/src/main/java/hex/glm/GLM.java | Implements restricted/unrestricted scoring-history + metrics computation when remove_offset_effects is enabled. |
| h2o-algos/src/main/java/hex/glm/GLMModel.java | Adds new parameter + basic validation for remove_offset_effects. |
| h2o-algos/src/main/java/hex/glm/GLMScore.java | Skips adding offset into the linear predictor when restricted scoring is enabled. |
| h2o-algos/src/main/java/hex/glm/GLMUtils.java | Renames/extends scoring history combiner for “restricted” use. |
| h2o-algos/src/main/java/hex/schemas/GLMV3.java | Exposes remove_offset_effects via REST schema. |
| h2o-algos/src/main/java/hex/api/MakeGLMModelHandler.java | Allows creating unrestricted model when remove_offset_effects was used; resets the flag on the derived model. |
| h2o-algos/src/test/java/hex/glm/GLMControlVariablesTest.java | Adds backend tests for remove_offset_effects behavior and its interaction with control_variables. |
| h2o-r/h2o-package/R/glm.R | Adds R API parameter + expands make_unrestricted_glm_model guard. |
| h2o-bindings/bin/custom/R/gen_glm.py | Updates generated R binding template for make_unrestricted_glm_model guard. |
| h2o-r/tests/testdir_algos/glm/runit_GLM_remove_offset_effects_explain.R | Adds an R explain/learning-curve smoke test with remove_offset_effects. |
| h2o-py/h2o/estimators/glm.py | Adds Python API parameter + getter/setter. |
| h2o-py/tests/testdir_algos/glm/pyunit_remove_offset_effects.py | Adds Python test comparing behavior with/without remove_offset_effects. |
| h2o-py/tests/testdir_algos/glm/pyunit_remove_offset_glm.py | Adds Python test scaffold around offset scoring behavior. |
| h2o-docs/src/product/data-science/algo-params/remove_offset_effects.rst | Documents the new parameter and provides examples. |
| h2o-docs/src/product/data-science/algo-params/control_variables.rst | Links control_variables docs to remove_offset_effects docs. |
| h2o-core/src/main/java/hex/ModelMetricsBinomial.java | Minor signature cleanup (parameter rename). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| glm_control_variables_explain <- function() { | ||
| df <- h2o.importFile("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv") |
There was a problem hiding this comment.
The test helper function name is still glm_control_variables_explain, which is misleading for a remove_offset_effects test and makes failures harder to triage. Please rename it (and the doTest description) to reflect remove_offset_effects.
| h2o.explain(unrestricted_prostate_glm, df) | ||
| } | ||
|
|
||
| doTest("GLM: Control variables works with expain", glm_control_variables_explain) |
There was a problem hiding this comment.
Fix typo in test description string: "expain" → "explain".
| doTest("GLM: Control variables works with expain", glm_control_variables_explain) | |
| doTest("GLM: Control variables works with explain", glm_control_variables_explain) |
| if(preds.vec(2).at(i) != preds2.vec(2).at(i)) differ++; | ||
| } | ||
| System.out.println(differ + " " + threshold); | ||
| assert differ > threshold; |
There was a problem hiding this comment.
This uses the Java assert keyword, which is typically disabled unless tests are run with -ea, so the check may not execute in CI. Prefer JUnit assertions (eg Assert.assertTrue(...)) to ensure the test always enforces the condition.
| assert differ > threshold; | |
| assertTrue("Expected number of differing predictions to exceed threshold", differ > threshold); |
| } if (_model._parms._remove_offset_effects) { | ||
| _scoringHistoryUnrestrictedModel.addIterationScore(_state._iter, _state.likelihood(), _state.objective()); | ||
| _scoringHistory.addIterationScore(_state._iter, _state.likelihood(), _state.objective()); |
There was a problem hiding this comment.
updateProgress(): this if (_remove_offset_effects) ... else ... block is not an else if to the preceding control_variables branch, so when control_variables is enabled the code will still fall into the else here and add an extra _scoringHistory.addIterationScore(...) each iteration. This can corrupt scoring history / early stopping bookkeeping; make these branches mutually exclusive.
| } if (_model._parms._remove_offset_effects) { | |
| _scoringHistoryUnrestrictedModel.addIterationScore(_state._iter, _state.likelihood(), _state.objective()); | |
| _scoringHistory.addIterationScore(_state._iter, _state.likelihood(), _state.objective()); | |
| } else if (_model._parms._remove_offset_effects) { | |
| _scoringHistoryUnrestrictedModel.addIterationScore(_state._iter, _state.likelihood(), _state.objective()); | |
| _scoringHistory.addIterationScore(_state._iter, _state.likelihood(), _state.objective()); |
|
|
||
| Model metrics and scoring history are calculated for both the restricted model (with offset effects removed) and the unrestricted model (with offset effect included). | ||
|
|
||
| To get the unrestricted model with its own metrics use ``glm.make_unrestriced_glm_model()``/``h2o.make_unrestricted_glm_model(glm)``. |
There was a problem hiding this comment.
Docs: the reference to glm.make_unrestriced_glm_model() appears to be a typo/non-existent function (also "unrestriced" misspelled). This should point to the actual API (h2o.make_unrestricted_glm_model in R / make_unrestricted_glm_model in Python) to avoid broken guidance.
| To get the unrestricted model with its own metrics use ``glm.make_unrestriced_glm_model()``/``h2o.make_unrestricted_glm_model(glm)``. | |
| To get the unrestricted model with its own metrics use ``h2o.make_unrestricted_glm_model(glm)`` (R) or ``make_unrestricted_glm_model(glm)`` (Python). |
| mtrain._nobs, _model._output._validation_metrics._nobs, _state.lambda(), _state.alpha()); | ||
| } else { // only doing training deviance | ||
| _scoringHistory.addIterationScore(true, false, _state._iter, _state.likelihood(), | ||
| _state.objective(), _state.deviance(), Double.NaN, mtrain._nobs, 1, _state.lambda(), |
There was a problem hiding this comment.
| glm_model.train(x=["name", "power", "year"], y="economy_20mpg", training_frame=cars) | ||
|
|
||
| predictions_train = glm_model.predict(cars).as_data_frame() | ||
| metrics = glm_model.training_model_metrics() |
There was a problem hiding this comment.
Variable metrics is not used.
| metrics_2 = glm_model_2.training_model_metrics() | ||
| #print(metrics_2) |
There was a problem hiding this comment.
Variable metrics_2 is not used.
| metrics_2 = glm_model_2.training_model_metrics() | |
| #print(metrics_2) | |
| #print(glm_model_2.training_model_metrics()) |
| glm_model_roe.train(x=["name", "power", "year"], y="economy_20mpg", training_frame=cars) | ||
|
|
||
| predictions_train_cv = glm_model_roe.predict(cars).as_data_frame() | ||
| metrics_cv = glm_model_roe.training_model_metrics() |
There was a problem hiding this comment.
Variable metrics_cv is not used.
| metrics_cv = glm_model_roe.training_model_metrics() | |
| glm_model_roe.training_model_metrics() |
| generate_scoring_history=True) | ||
| glm_model_roe_2.train(x=["name", "power", "year"], y="economy_20mpg", training_frame=cars) | ||
| predictions_train_cv2 = glm_model_roe_2.predict(cars).as_data_frame() | ||
| metrics_cv_2 = glm_model_roe_2.training_model_metrics() |
There was a problem hiding this comment.
Variable metrics_cv_2 is not used.
| metrics_cv_2 = glm_model_roe_2.training_model_metrics() | |
| glm_model_roe_2.training_model_metrics() |
Issue: #16676