Lrnr_density_semiparametric bug: density estimates don't always integrate to 1#434
Open
ctesta01 wants to merge 1 commit intotlverse:masterfrom
Open
Lrnr_density_semiparametric bug: density estimates don't always integrate to 1#434ctesta01 wants to merge 1 commit intotlverse:masterfrom
ctesta01 wants to merge 1 commit intotlverse:masterfrom
Conversation
Core issue: in some cases, Lrnr_density_semiparametric produces density models that don't have area-under-the-curve = 1. Diagnosis: The problem is that min_obs_error can get arbitrarily close to 0 (while being positive). When min_obs_error is used to replace values in the var_preds vector, those are then sqrt()'d and divided by in the `errors / sd_preds` step, leading to the rescaled errors being sent towards +/- Infinity inappropriately. This causes `approx(density(...))` to yield density curves that don't have area under them approximately equal to 1. The proposed bugfix is to replace var_preds that are too small with an tolerance based on 10% of the observed standard deviation of the errors from the conditional mean model.
ctesta01
added a commit
to ctesta01/nadir
that referenced
this pull request
Aug 6, 2025
Now that the bugfix related to area-under-the-curve=1 (see tlverse/sl3#434 ) has been fixed, the fix has been incorporated and some re-writing of the Density Learning article was needed.
ctesta01
added a commit
to ctesta01/nadir
that referenced
this pull request
Aug 6, 2025
Now that the bugfix related to area-under-the-curve=1 (see tlverse/sl3#434 ) has been fixed, the fix has been incorporated and some re-writing of the Density Learning article was needed.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
In some uses, I have found that the density models that
Lrnr_density_semiparametricdon't pass a simple test:Does the area under the density curve estimated integrate to 1, as it should be for a probability density model?
what's going on
In the training step, on line 124,
min_obs_error <- 2 * min(se_task$Y)can be arbitrarily small.For context, the
se_task$Yis a regression of the squared errors from the conditional mean model on the predictors. So if the conditional mean model makes low error predictions, nothing stops thismin_obs_errorfrom being extremely small.Later on, the
min_obs_errorvalue is used to replace problematic values in thevar_predsobject on line 127. Thevar_predsobject is thensqrt()ed and used to scale the errors (errors <- errors / sd_predson line 129) to make them constant variance before fitting anapprox(density(...))to the scaled error distribution.When
sd_predsis very small (as can happen from themin_obs_errorbeing used), these scaled errors are pushed towards +/- infinity unstably and this seems to mess up the density estimation.For reference, what this does to the density estimates is it seems to make them more coarsely discretized, resulting in the area under the density curve being noisier and more prone to being far from 1.
proposed bugfix
The core issue is that
min_obs_erroris too small. Rather than using the minimum of the squared error, I have tried using the square of 10% of the standard deviation of the prediction errors from the conditional mean model. My reasoning is that we do want to replace the values wherevar_preds < 0with something small, but not too/arbitrarily small.The proposed bugfix included in this PR result in the tests (also included in this PR) now passing, and the density curve estimated is a lot smoother by comparison, yielding area-under-the-curve very close to 1.
test to illustrate the bug
The core logic of the tests is this:
Lrnr_density_semiparametricmodel to a dataset with an outcomeYfor which we estimate the density and predictorsX1throughXn. Say the training dataset isdata. Call the fit modeldensity_model.data[1,].X1throughXnand form a grid of possibleYvalues — say betweenmin(data$Y) - 10*sd(data$Y)tomax(data$Y) + 10*sd(data$Y).Yvalues usingdensity_modelthe fixed values ofdata[1,X1], ..., data[1, Xn]as the predictors, we are evaluating whether or notIn my case, I found that this was fine for the models I fit on the
mtcarsdataset, but using theMASS::Bostondataset, I ran into lots of situations where this test failed. I've included an example from each dataset in the tests I wrote.