
Implementation of Gradient Boosting Nearest Neighbors (GBNN)#99

Merged
grovduck merged 33 commits into main from gbnn
Feb 17, 2026

Conversation

@grovduck
Member

@grovduck grovduck commented Aug 29, 2025

This PR adds support for using ensembles of GradientBoostingRegressor and GradientBoostingClassifier models in a kNN context. As far as we're aware, this is the first implementation pairing gradient-boosting models with nearest neighbor imputation, but it is a natural extension of RFNN, first developed by Crookston and Finley and implemented in sknnr.

Separate gradient-boosting estimators are built for each target feature present in y (or y_fit) and the type of estimator (either regression or classification) is determined by the data type of each target (floating-point and integer use regression estimators, string and pd.Categorical use classification estimators). As with RFNN, node IDs from the reference samples that were used to fit the model are captured. New samples are run through the fit estimators and their node IDs are compared to reference node IDs using a weighted Hamming distance to identify nearest neighbors.
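
For anyone unfamiliar with the approach, the weighted Hamming distance step can be sketched with a toy example (the node IDs and tree weights below are invented for illustration, not taken from the actual implementation):

```python
import numpy as np

# Hypothetical node matrices: rows are samples, columns are trees.
ref_nodes = np.array([[3, 5, 2],
                      [3, 7, 2],
                      [4, 7, 6]])
new_nodes = np.array([[3, 7, 6]])  # one new sample

# Hypothetical per-tree weights (e.g. from delta-loss weighting).
tree_weights = np.array([0.5, 0.3, 0.2])

# Weighted Hamming distance: sum the weights of trees whose node IDs differ.
mismatch = ref_nodes != new_nodes              # shape (n_refs, n_trees)
distances = mismatch.astype(float) @ tree_weights
nearest = np.argmin(distances)                 # reference index 1 here
print(distances, nearest)
```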

Our intuition is that the individual (and successive) trees in each gradient-boosting model should have separate weights associated with them. @aazuspan first suggested weighting trees in an estimator proportional to the change in loss that is explained with each step. This is the default option for the tree_weighting_method parameter for GBNNRegressor, although uniform weighting is also available as an option.
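
As a rough sketch of the delta-loss idea (the loss values below are made up; the real implementation derives them from the fitted estimator's `train_score_`):

```python
import numpy as np

# Hypothetical per-iteration training losses, conceptually like
# GradientBoostingRegressor.train_score_: loss after each boosting step.
train_score = np.array([1.00, 0.70, 0.55, 0.50])
initial_loss = 1.50  # loss of the initial (constant) prediction

# Each tree's weight is proportional to the loss reduction it achieved.
losses = np.concatenate([[initial_loss], train_score])
loss_delta = -np.diff(losses)              # improvement at each step
weights = loss_delta / loss_delta.sum()    # normalize to sum to 1
print(weights)                             # early trees get larger weights
```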

@aazuspan, maybe a lot to wade through on this PR, but a few specific areas of interest/concern:

  • The tree weighting algorithm that you suggested is the bulk of the new code here. I decided to split that off into a separate function (as well as uniform weighting which can be used by both RF and GB). I've called your method delta_loss, but open to different names as well. What are your thoughts on accepting a function as a parameter to tree_weighting_method?
  • In testing (specifically on test_estimators::test_gridsearchcv), I ran into the situation where the GB models failed to fit. I don't think GridSearchCV fails on these fits, but instead the values for est.train_score_ seem to all be set to 0.0. This was causing the normalizing step (e.g. loss_delta / np.sum(loss_delta)) to fail, so for now I'm just putting in equal weights when this condition happens. My thought is that the model fitting would raise an exception, so it wouldn't typically get to this step, but that's not a good solution for users that want to use it in a grid search context.
  • I didn't write any new transformer tests other than to verify that the regression/classifier keywords get passed appropriately, but it feels like we need to do some checking on values coming out of delta_loss. We can't test on monotonicity based on your example, but is there anything we should be testing on for "correctness" here?
  • Similar to that, I also haven't handled the multi-classification weighting that we raised here. I'll continue to work on that.
  • As of now, I am using the squared-loss delta, but you mentioned that it might be better to sqrt that when calculating deltas. For that, we'd also need to factor in the loss parameter as you noted: "Generalizing that to work with other loss functions, and figuring out whether it actually works in practice would take some more effort."
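
A guarded normalization along the lines described in the second bullet might look like this (the helper name is hypothetical, not the PR's actual code):

```python
import numpy as np

def normalize_tree_weights(loss_delta):
    """Normalize loss deltas to weights, falling back to uniform weights
    when the deltas are degenerate (e.g. train_score_ all zeros)."""
    total = np.sum(loss_delta)
    if not np.isfinite(total) or total <= 0.0:
        return np.full(len(loss_delta), 1.0 / len(loss_delta))
    return loss_delta / total

print(normalize_tree_weights(np.zeros(4)))            # uniform fallback
print(normalize_tree_weights(np.array([0.5, 0.3, 0.2])))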

Closes #96

- Follows the same implementation pattern as `RFNodeTransformer`
- Uses @aazuspan's logic for setting gradient boosting model tree
  weights based on loss reduction
This commit also fixes one error and one warning in
`TreeNodeTransformer` found during testing.  In `transform`,
returning `X` from the call to `_validate_data` is necessary because
`GradientBoostingRegressor.apply` expects `X` to be a numpy array
(it accesses `.shape`).

In `_fit`, returning `X` from the call to `_validate_data` returns
`X` as a numpy array and removes the warning about fitting with
feature names.
- Follows the same implementation pattern as `RFNNRegressor`
- Does not yet have a `tree_weights` parameter that allows user control
  over setting gradient boosting model tree weights.
- Move `delta_loss` into separate function to calculate tree weights for
  a single gradient boosting model
- Create `ensemble_delta_loss` to iterate over `delta_loss`
- Create `tree_weighting_method` as argument to GBNodeTransformer with
  `delta_loss` and `uniform` choices
- Simplify `_set_tree_weights` to call a standalone function
@grovduck grovduck self-assigned this Aug 29, 2025
@grovduck grovduck added enhancement New feature or request estimator Related to one or more estimators labels Aug 29, 2025
@grovduck grovduck requested review from aazuspan and Copilot August 29, 2025 19:21
@grovduck grovduck marked this pull request as draft August 29, 2025 19:21

Copilot AI left a comment

Pull Request Overview

This PR implements Gradient Boosting Nearest Neighbors (GBNN), a novel approach that uses ensembles of gradient boosting models in a k-nearest neighbors context. The implementation extends the existing RFNN concept by using gradient boosting estimators instead of random forests.

  • Adds GBNNRegressor as a new estimator using gradient boosting models for nearest neighbor calculations
  • Implements GBNodeTransformer to capture node indexes from gradient boosting trees
  • Introduces tree weighting methods including delta loss weighting based on loss reduction between successive trees

Reviewed Changes

Copilot reviewed 10 out of 18 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/sknnr/_gbnn.py New GBNNRegressor implementation with full parameter support for gradient boosting
src/sknnr/transformers/_gbnode_transformer.py New transformer for capturing gradient boosting node indexes with delta loss weighting
src/sknnr/transformers/_tree_node_transformer.py Refactored base class to support uniform weighting and improved validation
src/sknnr/transformers/_rfnode_transformer.py Updated to use refactored uniform weighting function
src/sknnr/_weighted_trees.py Updated type hints to support multiple tree transformer types
tests/test_transformers.py Added comprehensive tests for GBNodeTransformer functionality
tests/test_estimators.py Integration of GBNNRegressor into existing test suite
tests/test_regressions.py Added GBNN to regression test mappings
src/sknnr/transformers/init.py Export new GBNodeTransformer
src/sknnr/init.py Export new GBNNRegressor


@grovduck
Member Author

Similar to that, I also haven't handled the multi-classification weighting that we raised here. I'll continue to work on that.

Currently, this won't even work; it slipped through the tests as I wasn't testing with multi-class data.

import numpy as np

from sklearn.datasets import make_classification

from sknnr import GBNNRegressor
from sknnr.transformers import GBNodeTransformer

# Create a sample dataset with three classes
X, y = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=42,
)

# Reclassify `y` as string to force `GradientBoostingClassifier`
y = np.reshape(y.astype(str), (-1, 1))

# Create a GBNodeTransformer
transformer = GBNodeTransformer().fit(X, y)
print(transformer.transform(X).shape) # Returns (200, 100, 3) - need it to return (200, 300)

# Create a GBNNRegressor
est = GBNNRegressor().fit(X, y)

This errors with:

ValueError: Found array with dim 3, while dim <= 2 is required by GBNNRegressor.

Fixing the node indices shouldn't be too bad; we will just need to reshape when an estimator has n_classes_ > 1. And the current delta_loss actually works in contexts with n_classes_ > 1, as train_score_ returns an array of shape (n_estimators_, 1). So if we simply broadcast that to a shape of (n_estimators_, n_classes_), we would get identical weights for each DecisionTreeRegressor (one per class) at each tree level, e.g.

[[0.57 0.57 0.57]
 [0.42 0.42 0.42]
 [0.32 0.32 0.32]
 ...
]

Whether or not this is what we want is a different story. We would then reshape to match the node indexes shape.
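
A minimal sketch of the reshape-and-broadcast idea (array sizes match the example above; the variable names are illustrative only, not the actual implementation):

```python
import numpy as np

n_samples, n_estimators, n_classes = 200, 100, 3

# Hypothetical node matrix from a multi-class GradientBoostingClassifier:
# one tree per class at each boosting iteration.
nodes = np.zeros((n_samples, n_estimators, n_classes), dtype=np.int64)
nodes_2d = nodes.reshape(n_samples, n_estimators * n_classes)

# Per-iteration loss deltas broadcast to identical weights for each
# class's tree at a given iteration.
loss_delta = np.linspace(0.57, 0.01, n_estimators)
tree_weights = np.broadcast_to(loss_delta[:, None], (n_estimators, n_classes))
print(nodes_2d.shape, tree_weights.shape)  # (200, 300) (100, 3)
```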

@aazuspan
Contributor

@grovduck, this looks awesome! I'm a little hung up on some of the questions around converting losses to weights, so these are just some quick responses on that topic - I still need to give everything else a proper read.

I've called your method delta_loss, but open to different names as well.

I like delta_loss on its own, but thinking about other potential weighting methods we might add in the future, I wonder if we might want to have an option that uses the oob_improvement_ attribute? If so, maybe we'd want something like train_improvement to match that?

As of now, I am using the squared-loss delta, but you mentioned that it might be better to sqrt that when calculating deltas.

In hindsight, I think trying to make automatic rules about converting losses to weights (e.g. square-root transforming some metrics but not others) seems like it opens a can of worms for us. I'm not sure if I like this suggestion, but maybe we add another weighting method sqrt_delta_loss / sqrt_train_improvement? I guess we could also add another parameter like loss_to_weight_func where you could pass a callable like np.sqrt to make it more flexible, but I don't love that either.

The simplest solution (which we could revisit later) would probably just be to clearly document that the losses are mapped directly to weights, so choose your loss function accordingly (i.e., maybe use absolute_error instead of squared_error).

What are your thoughts on accepting a function as a parameter to tree_weighting_method?

That seems like it's probably a good idea! Any thoughts on what would need to be passed to the weighting function?

@grovduck
Member Author

grovduck commented Sep 9, 2025

@aazuspan, I'm feeling a bit high-centered on how to keep this PR moving and hoping you can help unblock us. Based on your feedback, I've got a few questions:

  • How are you feeling about leaning on some of those private attributes/methods in order to calculate loss difference based on training? Is that a worthwhile compromise that would get us unstuck or should we abandon that approach?
  • Related to that, I do like the idea of using the OOB sample to determine loss differences, but I'm not sure that we want to enforce users to use subsampling as you note. What's the possibility of using subsampling internally just to derive these weights even if the user specifies subsample=1.0? Basically fit the model as the user specifies, but also fit it again to derive the weights. I don't really love this approach, though.
  • You bring up some good points on how losses get transformed into weights and how much flexibility we should include there. I'm leaning toward your simple solution ("clearly document that the losses are mapped directly to weights") just because I'm afraid of a proliferation of weighting options and/or extra parameters to specify that transformation. Ultimately, I'm thinking that the forest/tree weights may not have a huge effect on neighbor finding and I'd hate to build in all this flexibility at this point without understanding that behavior.
  • Related to that complexity issue, I'm willing to punt on the custom function as a parameter to the tree_weighting_method for now. As you asked, I'm trying to think of what would need to be passed to such a function and the only thing I've come up with is the array of losses (or potentially loss differences) for each forest which would need to be a 2D array to accommodate the multi-classification issue. But then the same set of questions come up: should those be training set losses or OOB losses or something else?

Sorry that I have nothing productive to show for now - only questions. I think I can make some progress on the multi-classification issue as well as some renaming based on your recommendations.

@aazuspan
Contributor

aazuspan commented Sep 9, 2025

How are you feeling about leaning on some of those private attributes/methods in order to calculate loss difference based on training? Is that a worthwhile compromise that would get us unstuck or should we abandon that approach?

I'm happy with moving ahead with the private attributes to implement train loss. Looking at the git blame, those have both been in place and untouched for several years, so I imagine they're pretty stable. OOB loss could be saved for a future feature.

Related to that, I do like the idea of using the OOB sample to determine loss differences, but I'm not sure that we want to enforce users to use subsampling as you note. What's the possibility of using subsampling internally just to derive these weights even if the user specifies subsample=1.0? Basically fit the model as the user specifies, but also fit it again to derive the weights. I don't really love this approach, though.

I like the idea, but I'd be concerned about doubling the fitting time, and I wonder if you might have situations where the weights learned with subsampling wouldn't be representative of the trees that were fit without it.

Ultimately, I'm thinking that the forest/tree weights may not have a huge effect on neighbor finding and I'd hate to build in all this flexibility at this point without understanding that behavior.

Yeah, that's a great point. If we can get something usable and minimal here, that's probably smarter than trying to get things perfect on the first pass without fully understanding how/if this is actually going to work.

Gradient boosting classifiers with a multiclass target (i.e. more than
two distinct labels) behave differently than either continuous
regression or binary classification.  At each iteration, a separate tree
is built for each class such that the final node matrix for a
multi-class forest will have shape (n_samples, n_estimators, n_classes).

In order to fit the NN paradigm with Hamming distance finding, these
forests must be accommodated.  This commit makes the following changes:

- Adds a new estimator attribute called `n_trees_per_iteration_` which
  is a list of size `self.n_forests_` and captures the number of
  parallel trees created per iteration.  For GB multi-class forests,
  this is set to n_classes (> 2).  For all other forests this is set to
  1.
- Adjusts the data structure of the estimator attribute called
  `tree_weights_`.  Previously, this was a 2D numpy array of shape
  (`n_forests_`, `n_estimators`), but because multiclass forests will
  have (`n_estimators` * `n_classes`) trees, the weights shape may vary
  from forest to forest.  Now this is a list with size `n_forests_` of
  numpy arrays with shape (`n_estimators` *
  `n_trees_per_iteration_[i]`).
- Modified forest weights when applied to multi-class forests.  Because
  there are more trees in these forests, the forest weight needs to be
  adjusted (i.e. divided by `n_trees_per_iteration_[i]`) when
  calculating final weight.  This ensures that each forest continues to
  have the user-specified (or equal) weights.
- Removed `ensemble_delta_loss` and consolidated it into
  `GBNodeTransformer._set_tree_weights`
- Created a new test to ensure correct estimator attributes on
  `GBNodeTransformer` objects.
@grovduck
Member Author

grovduck commented Oct 7, 2025

@aazuspan, finally made a bit of progress on this PR associated with handling multi-class targets for gradient boosting classification. I tried to write a fairly complete commit note on what had to happen, but would appreciate you checking my logic on: 1) data structures I'm using to hold the tree weights; and 2) the decision (for now) not to include an option to weight individual class's trees at each iteration, mainly because I'm not sure how to extract this information on loss from sklearn. We still have ability to weight both the forests and the set of trees at each iteration. Here is the commit message for 416e0e3:


Gradient boosting classifiers with a multiclass target (i.e. more than two distinct labels) behave differently than either continuous regression or binary classification. At each iteration, a separate tree is built for each class such that the final node matrix for a multi-class forest will have shape (n_samples, n_estimators, n_classes).

In order to fit the NN paradigm with Hamming distance finding, these forests must be accommodated. This commit makes the following changes:

  • Adds a new estimator attribute called n_trees_per_iteration_ which is a list of size self.n_forests_ and captures the number of parallel trees created per iteration. For GB multi-class forests, this is set to n_classes (> 2). For all other forests this is set to 1.
  • Adjusts the data structure of the estimator attribute called tree_weights_. Previously, this was a 2D numpy array of shape (n_forests_, n_estimators), but because multiclass forests will have (n_estimators * n_classes) trees, the weights shape may vary from forest to forest. Now this is a list with size n_forests_ of numpy arrays with shape (n_estimators * n_trees_per_iteration_[i]).
  • Modified forest weights when applied to multi-class forests. Because there are more trees in these forests, the forest weight needs to be adjusted (i.e. divided by n_trees_per_iteration_[i]) when calculating final weight. This ensures that each forest continues to have the user-specified (or equal) weights.
  • Removed ensemble_delta_loss and consolidated it into GBNodeTransformer._set_tree_weights
  • Created a new test to ensure correct estimator attributes on GBNodeTransformer objects.
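
A toy illustration of the resulting data structures (the forest mix and sizes here are invented, purely to show the shapes):

```python
import numpy as np

# Hypothetical mixed ensemble: an RF forest, a GB regressor forest, and a
# multi-class GB classifier forest with 3 classes.
n_estimators = [500, 100, 100]
n_trees_per_iteration_ = [1, 1, 3]

# tree_weights_ is now a list (one array per forest) because the total
# number of trees varies: n_estimators * n_trees_per_iteration_[i].
tree_weights_ = [
    np.full(n * k, 1.0) for n, k in zip(n_estimators, n_trees_per_iteration_)
]
print([w.shape for w in tree_weights_])  # [(500,), (100,), (300,)]
```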

I still need to get to a few of your suggestions before it's probably ready for a full review.

@aazuspan
Contributor

@grovduck, it took me a while to wrap my head around the issue with multi-class classification targets generating an extra dimension of trees, and I'm probably still not fully understanding all the nuances (despite your detailed commit message), so take my responses here with a grain of salt!

  1. data structures I'm using to hold the tree weights;

Using a list of arrays for tree_weights_ seems like a good solution to me, given that each forest's trees can vary in shape in the multi-class case, and there are probably never enough forests that looping over them in Python would meaningfully impact performance.

I did wonder whether there would be any advantage/disadvantage to pre-normalizing tree_weights_ based on n_trees_per_iteration_ in the transformer (in _set_tree_weights), as opposed to normalizing the forests in the regressor (in _get_hamming_weights)? I suppose the weights aren't actually being used in the transformer itself, so there's no problem with having the trees of a multi-class estimator be "over-weighted" until they actually get to the regressor, but maybe it would simplify things slightly to apply that normalization earlier? There might be a very good reason why that doesn't work, though.

  2. the decision (for now) not to include an option to weight individual class's trees at each iteration, mainly because I'm not sure how to extract this information on loss from sklearn. We still have ability to weight both the forests and the set of trees at each iteration.

I might be completely misunderstanding, but is the idea that when a multi-class estimator has 3 classes, that the trees associated with each of those classes could be weighted differently? I could see that as a potential feature for manually weighting less important classes lower, but I'm not sure how/if that could be done automatically based on loss, unless maybe the loss function could be weighted?

@grovduck
Member Author

I did wonder whether there would be any advantage/disadvantage to pre-normalizing tree_weights_ based on n_trees_per_iteration_ in the transformer (in _set_tree_weights), as opposed to normalizing the forests in the regressor (in _get_hamming_weights)? I suppose the weights aren't actually being used in the transformer itself, so there's no problem with having the trees of a multi-class estimator be "over-weighted" until they actually get to the regressor, but maybe it would simplify things slightly to apply that normalization earlier? There might be a very good reason why that doesn't work, though.

I like this idea and it makes sense to me intuitively that these should be "scaled" as tree weights rather than when integrating with the forest weights. Should we further say that a forest's weights should sum to 1.0? Currently, we're setting weights for uniform to be 1.0, but we could easily divide by n_estimators in the case of RF forests and by n_estimators * n_trees_per_iteration_[i] in the case of GB forests. Even though it may not be completely necessary, it might be a nice property to have?

As I started working on this change, though, I thought I better set up a test that uses a multi-class GradientBoostingClassifier. When I did this, I realized that this line in delta_loss no longer works:

initial_loss = (
    est._loss(np.asarray(y, dtype="float64"), est._raw_predict_init(X)) * 2
)

y might be an array of strings, so I think we need to first crosswalk y to its encoded values. Not fully understanding how this works, I've started with this:

y = np.searchsorted(est.classes_, y) if hasattr(est, "classes_") else y
initial_loss = (
    est._loss(np.asarray(y, dtype="float64"), est._raw_predict_init(X)) * 2
)

but didn't know if that's still correct in the classification case? Do you have any insight on this? I'll admit to being a bit confused about whether the function stored in est._loss will return a comparable result in both regression and classification cases. I'm still digging into this.
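
For what it's worth, the np.searchsorted crosswalk behaves like this on a small made-up example (assuming classes_ is the sorted array of unique labels, as sklearn stores it):

```python
import numpy as np

# Hypothetical classes_ as a fitted classifier would store them
# (sorted unique labels).
classes_ = np.array(["0", "1", "2"])
y = np.array(["2", "0", "1", "1"])

# Crosswalk string labels to their integer encodings before passing
# them to the loss function.
y_encoded = np.searchsorted(classes_, y)
print(y_encoded)  # [2 0 1 1]
```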

I might be completely misunderstanding, but is the idea that when a multi-class estimator has 3 classes, that the trees associated with each of those classes could be weighted differently? I could see that as a potential feature for manually weighting less important classes lower, but I'm not sure how/if that could be done automatically based on loss, unless maybe the loss function could be weighted?

Yes, that was the idea and it was a random thought that I brought up in #96, and you thought that the situation I had envisioned might be able to be addressed with the existing sample_weight parameter in fit. I don't think I ever responded to that thought!

But I think we're in agreement on not allowing the concept of class weights in a multi-class problem because apportioning the loss to the different class's forest doesn't seem to be built-in functionality.

Sorry, this PR keeps getting more and more complicated ...

@aazuspan
Contributor

Should we further say that a forest's weights should sum to 1.0? Currently, we're setting weights for uniform to be 1.0, but we could easily divide by n_estimators in the case of RF forests and by n_estimators * n_trees_per_iteration_[i] in the case of GB forests. Even though it may not be completely necessary, it might be a nice property to have?

Yeah, I like this idea! Probably not necessary as you said, but it does seem like it makes the weights easier to interpret and compare.
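
A sketch of the two-level standardization under discussion (all the numbers here are invented):

```python
import numpy as np

# Hypothetical setup: 2 forests with user-supplied forest weights, the
# second being a 3-class GB forest with 4 boosting iterations.
forest_weights = np.array([2.0, 1.0])
forest_weights = forest_weights / forest_weights.sum()  # sums to 1.0

tree_weights = [
    np.ones(5),       # RF-style forest with 5 trees
    np.ones(4 * 3),   # multi-class GB forest: 4 iterations x 3 classes
]

# Standardize each forest's tree weights to sum to its forest weight,
# so all tree weights across all forests sum to 1.0.
final = [fw * tw / tw.sum() for fw, tw in zip(forest_weights, tree_weights)]
total = sum(w.sum() for w in final)
print(total)  # ~1.0 (up to float error)
```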

y might be an array of strings, so I think we need to first crosswalk y to its encoded values. Not fully understanding how this works, I've started with this:

Ah, good catch. It looks like GradientBoostingClassifier uses _encode_y for the same purpose, but once classes_ are defined your solution looks a lot simpler.

but didn't know if that's still correct in the classification case? Do you have any insight on this? I'll admit to being a bit confused about whether the function stored in est._loss will return a comparable result in both regression and classification cases.

I think this is equivalent to how GradientBoostingClassifier sets its train_score_. I tried fitting a GradientBoostingClassifier with a string y and added a debug print where the train_loss_ is updated here, and the encoded labels passed to loss_ match your encodings in the modified delta_loss.

The only potential differences that I noticed are

  1. Their encodings are floats rather than ints. That might just be a side effect of using LabelEncoder in _encode_y, and seems like it should be irrelevant.
  2. They set their multiplication factor based on the loss function rather than hard-coding to 2.
  3. They pass the optional sample_weight to the loss function. I haven't wrapped my head around how that might affect the loss. We could store it during fit and pass it on to delta_loss, but I'm not sure whether that's necessary or not.

As a result of a merge conflict when updating to `main`, I chose the
incorrect weighting on `_get_hamming_weights` from `main` rather than
this branch.  Corrected it and all tests now pass.
@aazuspan
Contributor

aazuspan commented Feb 6, 2026

Thanks for the detailed update, @grovduck! I don't think either of us could have foreseen all the related issues this was going to surface, but from an observer's perspective it seems like it's gone about as smoothly as it could have, all things considered!

It feels a bit wrong to put so much effort into making this one test pass, so if you have other suggestions, I am all ears!

Your testing changes all make sense to me. I certainly trust your approach, and I'm not worried about aggregating classes or loosening some tolerances to get things passing, especially since you have a good understanding of why those changes were needed.

Don't let me derail the PR, but just to raise this point for future discussion, I noticed that the new GBNN tests have slowed down the test suite pretty considerably: running all tests with 6 cores on my laptop went from 42.71s on main to 99.12s on gbnn. The fact that GBNN is slower than the other estimators is probably unavoidable and not a surprise, so I'm not suggesting any changes here, but we may want to think about strategies to speed up testing in general in the future.

I think the roadmap going forward to completing this issue is:

Your list makes sense, and I don't think I have anything to add.

Honestly, we probably have some additional work on getting this all to work as we would want, but I want to get it to the point where we can test whether the whole system of tree weighting is impactful. For that reason, I'll probably hold off on writing a section for the usage guide that goes into detail about tree/forest weights, but that would be a future PR should the weights prove useful.

Yeah, getting a minimal working version merged before diving too deep into documentation or fine-tuning seems like the best strategy. It would be hard to write good documentation without really understanding how the estimator works in practice.

There are two levels of standardization in this commit: first, forest
weights are standardized such that their sum is 1.0 and second, tree
weights within each forest are standardized such that their sum is equal
to their corresponding (standardized) forest weight.  In the case of a
multi-class GBNN classifier, each class's tree weights will be
(forest_weight / n_classes).  Additionally:

- add tests to confirm weighting logic for both RFNN and GBNN
- fix transposed weights for GBNN (using `np.tile` instead of
  `np.repeat`).  This required regeneration of the tests with GBNN
  multiclass estimators
This commit replaces the hard-coded multiplicative factor used to scale
the initial loss calculation.  Previously, we used a value of 2, but now
base the logic on scikit-learn's implementation in
`BaseGradientBoosting._fit_stages`.  This required regenerating
regression files for two test instances because they use a
`HalfMultinomialLoss`, which should have a factor of 1.
@grovduck
Member Author

grovduck commented Feb 10, 2026

@aazuspan, I think I'm ready to mark this as ready to review. I have no illusions that this will be quick, so please take your time with all the moving parts associated with this PR.

A few follow-up notes based on the roadmap that I laid out:

Rename delta_loss to train_improvement

I've changed the function name and the parameter name in both GBNNRegressor and GBNodeTransformer. I've duplicated the same "Notes" section for both of those docstrings which hopefully captures the warning you noted.

Use sklearn logic in calculating initial loss rather than simply multiplying by 2

I've done this as a ternary expression directly in train_improvement. I initially had some failing regression tests that I couldn't quite pinpoint, but realized that in test_estimators_with_mixed_type_forests, we run both kinds of GB and each has a different loss function (HalfSquaredError for regression that has a factor of 2, HalfMultinomialLoss for multi-class classification that has a factor of 1). It's a bit strange to me that a loss function with Half in its name would have a factor of 1, but I just followed the sklearn code on this one.
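
As a sketch (not the exact code), the factor logic looks something like this; it mirrors my reading of scikit-learn's `BaseGradientBoosting._fit_stages`, and the loss-class names come from sklearn's private loss module, so they may change between versions:

```python
# Hedged sketch: choose the multiplicative factor for the initial loss
# based on the fitted estimator's loss class, following sklearn's
# _fit_stages logic rather than hard-coding 2.
def initial_loss_factor(loss_class_name: str) -> float:
    # HalfSquaredError (GB regression) needs its loss doubled to match
    # train_score_; HalfMultinomialLoss (multi-class) uses a factor of 1.
    return 2.0 if loss_class_name == "HalfSquaredError" else 1.0

print(initial_loss_factor("HalfSquaredError"))     # 2.0
print(initial_loss_factor("HalfMultinomialLoss"))  # 1.0
```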

Add documentation section for GBNN

Added sections for GBNN documentation and added a fairly vague section on tree weights for now under the RFNN and GBNN Distance Metric section.

Write tests to ensure that Hamming weights are standardized (sum to 1.0 across all forests and trees)

On this, I actually had a bug in the code, where the weights array and the node arrays were the same size, but I had used np.repeat rather than np.tile to repeat weights in a multi-class problem. Once I corrected this, the tests line up to show the relationship between forest and tree weights and that they all sum to one. This led to new regression files being pushed (again). Particular tests to note are:

  • test_estimators.py/test_tree_estimators_handle_forest_weights
  • test_estimators.py/test_gbnn_multiclass_weights
  • test_estimators.py/test_hamming_weights_sum_to_one

Failing tests (still!)

The test_estimators_with_mixed_type_forests test is going to be the death of me. You'll see that we still have some very slight differences in the latest CI run for this test. I can replicate this locally between Windows and WSL and, again, it is actually different trees being built on each platform. So rather than trying to hack a fix yet again, I'm wondering if we should xfail this particular test or continue to try to figure out how to fix it. One thing I read was that specifying min_samples_leaf to be > 1 might be more forgiving on lumping multiple reference plots into the same terminal node ID which might be a way to fix it. I'd value your opinion on the path we should take here.

Don't let me derail the PR, but just to raise this point for future discussion, I noticed that the new GBNN tests have slowed down the test suite pretty considerably ... we may want to think about strategies to speed up testing in general in the future.

I totally agree. The multiclass classification tests are especially brutal because each has to build n_classes forests. The other thing I noticed (not specifically with our test suite, but in a case where I was using lots of plots) is that getting independent_prediction_ and independent_score_ during fit could be slowing things down as well. It may be worth thinking about making those properties (calculated when needed) rather than estimator attributes? But, as you say, for another PR!

(If you want to hold off review for a bit, I'll probably ask for another Copilot review just to find some stupid issues).

@grovduck grovduck marked this pull request as ready for review February 10, 2026 00:33
@grovduck grovduck requested a review from Copilot February 10, 2026 00:34

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 29 changed files in this pull request and generated 4 comments.



Contributor

@aazuspan aazuspan left a comment


Thanks for the detailed rundown, @grovduck! I made a first review pass, although I shied away from the Copilot comments for now. Most of my comments are the same minor performance nit in different spots, but I did find what I think is a bug in the Hamming distance weighting and algorithm selection, which I'll detail at the bottom. Other responses are below.

I initially had some failing regression tests that I couldn't quite pinpoint, but realized that in test_estimators_with_mixed_type_forests, we run both kinds of GB and each has a different loss function (HalfSquaredError for regression that has a factor of 2, HalfMultinomialLoss for multi-class classification that has a factor of 1). It's a bit strange to me that a loss function with Half in its name would have a factor of 1, but I just followed the sklearn code on this one.

That is odd, but I agree that following sklearn is safe. It seems like the important thing is just that we calculate our initial loss the same way they calculate train_score_ to avoid accidentally doubling/halving the loss delta.
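As a sketch of that relationship (my own reconstruction, not the sknnr code): with the default squared_error loss, sklearn's train_score_ is computed with HalfSquaredError, so an initial loss computed the same way lets the per-stage deltas be turned into weights:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=100, random_state=0)
gb = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X, y)

# The default init predicts the mean, and HalfSquaredError carries a
# factor of 1/2 -- matching how sklearn computes train_score_.
init_loss = 0.5 * np.mean((y - y.mean()) ** 2)

# train_score_[i] is the training loss after stage i; per-stage deltas
# credit each tree with the loss it removed.
losses = np.concatenate([[init_loss], gb.train_score_])
deltas = np.clip(-np.diff(losses), 0.0, None)  # guard against worsening stages
weights = deltas / deltas.sum()
```

If the initial loss were doubled or halved relative to train_score_, the first tree's delta (and hence its weight) would be silently wrong, which is the accidental doubling/halving risk mentioned above.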

So rather than trying to hack a fix yet again, I'm wondering if we should xfail this particular test or continue to try to figure out how to fix it. One thing I read was that specifying min_samples_leaf to be > 1 might be more forgiving on lumping multiple reference plots into the same terminal node ID which might be a way to fix it. I'd value your opinion on the path we should take here.

I'm totally fine with xfailing this for now. It looks like we could even do a platform-specific xfail using a condition if this is reliably passing in Windows.
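A conditional xfail along those lines might look like this (a sketch; the condition and reason text are assumptions):

```python
import sys

import pytest

# Hypothetical platform-conditional xfail: expect failure everywhere
# except Windows, where the test has been passing reliably.
@pytest.mark.xfail(
    sys.platform != "win32",
    reason="different trees are built on non-Windows platforms",
)
def test_estimators_with_mixed_type_forests():
    ...
```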

min_samples_leaf sounds worth exploring as well, but that could be saved for a future PR. This test seems like it's been a massive headache, so I fully understand if you want to put it on the backburner for a while!

It may be worth thinking about making those properties (calculated when needed) rather than estimator attributes? But, as you say, for another PR!

Cached properties for these sounds like a great idea!

Weighting bug

Here's code to reproduce the bug I ran into. It looks like an incompatibility where the fit method gets set automatically to ball tree which doesn't want a w parameter, but that's as far as I dug.

```python
from sklearn.datasets import make_classification
from sknnr import GBNNRegressor

X, y = make_classification()
est = GBNNRegressor(n_estimators=10).fit(X, y)
# TypeError: __init__() got an unexpected keyword argument 'w'
```

The small n_estimators seems to be critical - I'm assuming that's used as a heuristic when selecting the fit method.

EDIT: I marked this as "request changes", but if this turns out to be a tricky bug I'm fine leaving it for a future PR.

@grovduck
Member Author

Here's code to reproduce the bug I ran into. It looks like an incompatibility where the fit method gets set automatically to ball tree which doesn't want a w parameter, but that's as far as I dug.

@aazuspan, great catch. That would have been bad to put out there!

The good news is that I think it's a pretty simple fix, and we discussed this previously when I was having all kinds of issues with Hamming weights. I think the fix is to force algorithm=brute for tree-based methods and take away that parameter (like we do with the metric parameter). I don't necessarily see a downside to that, given that Hamming distance doesn't play well with kd_tree or ball_tree. Let me know if you think differently.
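For context, the incompatibility is that sklearn's tree-based indexes don't accept a `w` metric parameter, while brute force (which routes through scipy's weighted Hamming) does. A sketch under those assumptions, with made-up node IDs standing in for the real fitted forests:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Stand-ins for terminal node IDs: rows are samples, columns are trees.
ref_nodes = rng.integers(0, 50, size=(20, 10))
new_nodes = rng.integers(0, 50, size=(5, 10))
tree_weights = rng.random(10)
tree_weights /= tree_weights.sum()

# With algorithm="auto", small datasets can select a BallTree/KDTree
# index whose Hamming metric rejects `w`; forcing brute avoids this.
nn = NearestNeighbors(
    n_neighbors=3,
    algorithm="brute",
    metric="hamming",
    metric_params={"w": tree_weights},
).fit(ref_nodes)
dist, idx = nn.kneighbors(new_nodes)
```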

@grovduck
Member Author

I'm totally fine with xfailing this for now. It looks like we could even do a platform-specific xfail using a condition if this is reliably passing in Windows.

I took a bit of a different tack on this one and changed the test to include different precision values for each combination. Let me know if you think this is too funky and/or if it doesn't pass on your (linux) system.

```python
@pytest.mark.parametrize(
    ("estimator", "reference", "precision"),
    [
        # Set a higher precision threshold for GBNN with target-based forests
        # because the tree weights are less accurate when using a
        # multi-class classification forest
        (RFNNRegressor, True, 1e-8),
        (RFNNRegressor, False, 1e-8),
        (GBNNRegressor, True, 1e-8),
        (GBNNRegressor, False, 1e-2),
    ],
    ids=[
        "reference_randomForest",
        "target_randomForest",
        "reference_gbnn",
        "target_gbnn",
    ],
)
def test_estimators_with_mixed_type_forests(
    ndarrays_regression,
    moscow_stjoes_test_data,
    estimator,
    reference,
    precision,
):
```

@aazuspan
Contributor

The good news is that I think it's a pretty simple fix and we discussed this previously (#100 (comment)) when I was having all kinds of issues with Hamming weights. I think the fix is to force algorithm=brute for tree-based methods and taking away that parameter (like we do with the metric parameter). I don't necessarily see a downside to that given that Hamming distance doesn't play well with kd_tree or ball_tree. Let me know if you think differently.

It sounded familiar, but I couldn't remember where we landed - thanks for digging up the conversation. Hard-coding algorithm sounds like the right choice to me.

Let me know if you think this is too funky and/or if it doesn't pass on your (linux) system.

Clever solution, and it passes on my setup!

Hamming distance is best suited to brute-force neighbor search, and other
algorithms (e.g. `BallTree`, `KDTree`) are not compatible.  Remove
`algorithm` and `leaf_size` (only applicable to `BallTree` and `KDTree`)
as user parameters.
@grovduck
Member Author

@aazuspan, I've taken out algorithm (and leaf_size which was a BallTree/KDTree specific parameter) from both tree-based estimators and the WeightedTreesNNRegressor superclass. I think this is ready for another review when you get the chance. Thanks!

Contributor

@aazuspan aazuspan left a comment


This is looking great, @grovduck! I just had a few more comments, but all pretty minor.

Contributor

@aazuspan aazuspan left a comment


Ready to merge whenever you are! This was a huge lift with all the tricky edge cases and reproducibility/performance hurdles, and it turned out great. I'm excited to start testing out GBNN for real 🎉

@grovduck grovduck merged commit 967c39b into main Feb 17, 2026
11 checks passed
@grovduck grovduck deleted the gbnn branch February 17, 2026 01:21
@grovduck
Member Author

Whoo, this one was an absolute beast (started in August!). Thank you so much for persevering through all the pain. I'm not sure I'm ready to call it "great" 😄, but it's a solid start. Onward!


Labels

enhancement New feature or request estimator Related to one or more estimators


Development

Successfully merging this pull request may close these issues.

Use GradientBoosting in a KNearestNeighbors context

3 participants