
Lrnr_density_semiparametric bug: density estimates don't always integrate to 1 #434

Open
ctesta01 wants to merge 1 commit into tlverse:master from ctesta01:master


In some uses, I have found that the density models that Lrnr_density_semiparametric produces don't pass a simple test:

Does the area under the estimated density curve integrate to 1, as it should for a probability density model?

what's going on

In the training step, on line 124, min_obs_error <- 2 * min(se_task$Y) can be arbitrarily small.

For context, se_task is the regression of the squared errors from the conditional mean model on the predictors, so se_task$Y holds those squared errors. If the conditional mean model predicts even one observation almost exactly, nothing stops min_obs_error from being extremely small.

Later on, the min_obs_error value is used to replace problematic values in the var_preds object on line 127. The var_preds object is then sqrt()ed and used to scale the errors (errors <- errors / sd_preds on line 129) to make them constant variance before fitting an approx(density(...)) to the scaled error distribution.

When sd_preds is very small (as can happen when min_obs_error is used as the replacement value), the corresponding scaled errors are pushed towards +/- infinity, and this seems to destabilize the density estimation.

For reference, the effect on the density estimates seems to be that they become more coarsely discretized, which makes the area under the density curve noisier and more prone to being far from 1.
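The mechanism described above can be reproduced in a rough way with base R alone. This is only a sketch on simulated stand-in data, not the sl3 code itself: the variable names mirror the ones quoted above, and the use of pmax() to replace problematic values is an assumption about how the replacement is done.

```r
set.seed(1)
n <- 500

# Stand-in for residuals from a conditional mean model (simulated, not sl3 output).
errors <- rnorm(n)
squared_errors <- errors^2

# Stand-in for variance-model predictions: the first few are negative
# and therefore need to be replaced.
var_preds <- c(rep(-0.1, 5), rep(1, n - 5))

# Floor as described above: twice the smallest squared error.
min_obs_error <- 2 * min(squared_errors)

# Assumption: problematic values are replaced with the floor via pmax().
var_preds <- pmax(var_preds, min_obs_error)
sd_preds <- sqrt(var_preds)
scaled_errors <- errors / sd_preds

# density() evaluates on a fixed number of grid points (512 by default),
# so a few extreme scaled errors stretch the grid and coarsen it.
grid_step_raw <- diff(density(errors)$x[1:2])
grid_step_scaled <- diff(density(scaled_errors)$x[1:2])

print(range(errors))
print(range(scaled_errors))
print(c(raw = grid_step_raw, scaled = grid_step_scaled))
```

In this sketch only five observations receive the tiny floor, yet they are enough to widen the range of the scaled errors and, with it, the spacing of the grid that approx() later interpolates over.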


proposed bugfix

The core issue is that min_obs_error is too small. Rather than using the minimum of the squared error, I have tried using the square of 10% of the standard deviation of the prediction errors from the conditional mean model. My reasoning is that we do want to replace the values where var_preds < 0 with something small, but not too/arbitrarily small.
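As a rough illustration of the two floors, the sketch below compares them on simulated stand-in residuals. The helper floor_variance() and its frac argument are hypothetical names introduced here, and pmax() is an assumed way of doing the replacement; the actual change is the one in this PR's diff.

```r
set.seed(2)

# Stand-in residuals from a conditional mean model (simulated).
errors <- rnorm(500, sd = 3)

# Stand-in variance predictions, including negative and near-zero values.
var_preds <- c(-0.5, 0, 1e-8, 4, 9)

# Hypothetical helper: floor variance predictions at (frac * sd(errors))^2.
floor_variance <- function(var_preds, errors, frac = 0.1) {
  tol <- (frac * sd(errors))^2
  pmax(var_preds, tol)
}

old_floor <- 2 * min(errors^2)      # floor based on the minimum squared error
new_floor <- (0.1 * sd(errors))^2   # floor based on 10% of the residual sd

floored <- floor_variance(var_preds, errors)

print(c(old_floor = old_floor, new_floor = new_floor))
print(floored)
```

Values of var_preds that are already above the tolerance pass through unchanged, so the floor only affects the problematic predictions.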

The proposed bugfix included in this PR results in the tests (also included in this PR) now passing, and the estimated density curve is a lot smoother by comparison, yielding an area under the curve very close to 1.


test to illustrate the bug

The core logic of the tests is this:

  1. Fit a Lrnr_density_semiparametric model to a dataset with an outcome Y for which we estimate the density and predictors X1 through Xn. Say the training dataset is data. Call the fit model density_model.
  2. Take an arbitrary row of the dataset, say data[1,].
  3. For that row, fix the values of X1 through Xn and form a grid of possible Y values, say from min(data$Y) - 10*sd(data$Y) to max(data$Y) + 10*sd(data$Y).
  4. Integrate the density predictions from density_model across the grid of Y values, with the fixed values data[1, X1], ..., data[1, Xn] as the predictors. This evaluates whether $\hat f_{Y}(y | X_1, ..., X_n)$ (the estimated density model) integrates to 1 when the conditioning variables are fixed, and this test fails for some datasets.
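The steps above can be sketched in base R as follows. To keep the example self-contained, a normal density around an lm() fit on mtcars stands in for density_model; cond_density() and trapezoid_area() are hypothetical helpers introduced here, and with sl3 the density values would instead come from the fitted learner's predictions over the same grid.

```r
# Step 1 (stand-in): a conditional mean model on mtcars with a normal error density.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Step 2: take an arbitrary row and hold its predictors fixed.
row <- mtcars[1, ]

# Step 3: grid of possible outcome values, padded by 10 standard deviations.
y <- mtcars$mpg
y_grid <- seq(min(y) - 10 * sd(y), max(y) + 10 * sd(y), length.out = 2000)

# Hypothetical stand-in for the density model's predictions at fixed predictors.
cond_density <- function(y_values, newdata) {
  dnorm(y_values, mean = predict(fit, newdata = newdata), sd = sigma(fit))
}

# Trapezoid rule for the area under a curve given on a grid.
trapezoid_area <- function(x, fx) {
  sum(diff(x) * (head(fx, -1) + tail(fx, -1)) / 2)
}

# Step 4: integrate the conditional density over the grid and compare with 1.
area <- trapezoid_area(y_grid, cond_density(y_grid, row))
integrates_to_one <- abs(area - 1) < 0.01

print(area)
print(integrates_to_one)
```

The tolerance of 0.01 is an arbitrary choice for this sketch; a real test would pick it to match the grid resolution.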

In my case, I found that this was fine for the models I fit on the mtcars dataset, but using the MASS::Boston dataset, I ran into lots of situations where this test failed. I've included an example from each dataset in the tests I wrote.

Core issue: in some cases, Lrnr_density_semiparametric produces density models that don't have area-under-the-curve = 1.

Diagnosis: The problem is that min_obs_error can get arbitrarily close to 0 (while being positive). When min_obs_error is used to replace values in the var_preds vector, those values are then sqrt()'d and used as the divisor in the `errors / sd_preds` step, which sends the rescaled errors towards +/- Infinity inappropriately. This causes `approx(density(...))` to yield density curves whose area is not approximately equal to 1.

The proposed bugfix is to replace var_preds that are too small with a tolerance based on 10% of the observed standard deviation of the errors from the conditional mean model.
ctesta01 added a commit to ctesta01/nadir that referenced this pull request Aug 6, 2025
Now that the bugfix related to area-under-the-curve=1 (see tlverse/sl3#434 ) has been fixed, the fix has been incorporated and some re-writing of the Density Learning article was needed.