Conversation
- Follows the same implementation pattern as `RFNodeTransformer`
- Uses @aazuspan's logic for setting gradient boosting model tree weights based on loss reduction
This commit also fixes one error and one warning in `TreeNodeTransformer` found during testing. In `transform`, returning `X` from the call to `_validate_data` is necessary because `GradientBoostingRegressor.apply` expects that `X` is a numpy array (it accesses `.shape`). In `_fit`, returning `X` from the call to `_validate_data` returns `X` as a numpy array and removes the warning about fitting with feature names.
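For context, the conversion described here can be illustrated with `sklearn.utils.check_array` (which `_validate_data` uses under the hood): returning the validated input guarantees downstream code sees a numpy array with a `.shape` attribute rather than a list or DataFrame. A minimal sketch:

```python
import numpy as np
from sklearn.utils import check_array

# check_array converts list (or DataFrame) input into a numpy array, so
# downstream calls like GradientBoostingRegressor.apply can rely on
# ndarray attributes such as .shape.
X_list = [[0.0, 1.0], [2.0, 3.0]]
X = check_array(X_list)

assert isinstance(X, np.ndarray)
assert X.shape == (2, 2)
```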
- Follows the same implementation pattern as `RFNNRegressor`
- Does not yet have a `tree_weights` parameter that allows user control over setting gradient boosting model tree weights
- Move `delta_loss` into a separate function to calculate tree weights for a single gradient boosting model
- Create `ensemble_delta_loss` to iterate over `delta_loss`
- Create `tree_weighting_method` as an argument to GBNodeTransformer with `delta_loss` and `uniform` choices
- Simplify `_set_tree_weights` to call a standalone function
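The delta-loss weighting can be sketched directly from a fitted model, assuming `train_score_` holds the training loss after each boosting stage and approximating the pre-fit loss with the constant-mean baseline. The variable names here are illustrative, not the PR's actual functions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, random_state=0)
est = GradientBoostingRegressor(n_estimators=25, random_state=0).fit(X, y)

# Loss of the constant baseline prediction (mean of y for squared error);
# the exact scaling would depend on the loss function used.
initial_loss = np.mean((y - y.mean()) ** 2)

# Per-tree loss reduction: difference between successive training losses,
# with the baseline loss prepended for the first tree.
losses = np.concatenate([[initial_loss], est.train_score_])
delta = -np.diff(losses)       # positive when a tree reduced the loss
weights = delta / delta.sum()  # normalize so the weights sum to 1

assert weights.shape == (25,)
assert np.isclose(weights.sum(), 1.0)
```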
Pull Request Overview
This PR implements Gradient Boosting Nearest Neighbors (GBNN), a novel approach that uses ensembles of gradient boosting models in a k-nearest neighbors context. The implementation extends the existing RFNN concept by using gradient boosting estimators instead of random forests.
- Adds `GBNNRegressor` as a new estimator using gradient boosting models for nearest neighbor calculations
- Implements `GBNodeTransformer` to capture node indexes from gradient boosting trees
- Introduces tree weighting methods including delta loss weighting based on loss reduction between successive trees
Reviewed Changes
Copilot reviewed 10 out of 18 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/sknnr/_gbnn.py | New GBNNRegressor implementation with full parameter support for gradient boosting |
| src/sknnr/transformers/_gbnode_transformer.py | New transformer for capturing gradient boosting node indexes with delta loss weighting |
| src/sknnr/transformers/_tree_node_transformer.py | Refactored base class to support uniform weighting and improved validation |
| src/sknnr/transformers/_rfnode_transformer.py | Updated to use refactored uniform weighting function |
| src/sknnr/_weighted_trees.py | Updated type hints to support multiple tree transformer types |
| tests/test_transformers.py | Added comprehensive tests for GBNodeTransformer functionality |
| tests/test_estimators.py | Integration of GBNNRegressor into existing test suite |
| tests/test_regressions.py | Added GBNN to regression test mappings |
| src/sknnr/transformers/`__init__.py` | Export new GBNodeTransformer |
| src/sknnr/`__init__.py` | Export new GBNNRegressor |
Currently, this won't even work - slipped through the tests as I wasn't testing with multi-class data. This errors with: Fixing the node indices shouldn't be too bad, we will just need to reshape when an estimator has Whether or not this is what we want is a different story. We would then reshape to match the node indexes shape.
@grovduck, this looks awesome! I'm a little hung up on some of the questions around converting losses to weights, so these are just some quick responses on that topic - I still need to give everything else a proper read.
I like
In hindsight, I think trying to make automatic rules about converting losses to weights (e.g. square-root transforming some metrics but not others) opens a can of worms for us. I'm not sure if I like this suggestion, but maybe we add another weighting method. The simplest solution (which we could revisit later) would probably just be to clearly document that the losses are mapped directly to weights, so choose your loss function accordingly (i.e., maybe use
That seems like it's probably a good idea! Any thoughts on what would need to be passed to the weighting function?
@aazuspan, I'm feeling a bit high-centered on how to keep this PR moving and hoping you can help unblock us. Based on your feedback, I've got a few questions:
Sorry that I have nothing productive to show for now - only questions. I think I can make some progress on the multi-classification issue as well as some renaming based on your recommendations.
I'm happy with moving ahead with the private attributes to implement train loss. Looking at the git blame, those have both been in place and untouched for several years, so I imagine they're pretty stable. OOB loss could be saved for a future feature.
I like the idea, but I'd be concerned about doubling the fitting time, and I wonder if you might have situations where the weights learned with subsampling wouldn't be representative of the trees that were fit without it.
Yeah, that's a great point. If we can get something usable and minimal here, that's probably smarter than trying to get things perfect on the first pass without fully understanding how/if this is actually going to work.
Gradient boosting classifiers with a multiclass target (i.e. more than two distinct labels) behave differently than either continuous regression or binary classification. At each iteration, a separate tree is built for each class, such that the final node matrix for a multi-class forest will have shape (n_samples, n_estimators, n_classes). In order to fit the NN paradigm with Hamming distance finding, these forests must be accommodated. This commit makes the following changes:
- Adds a new estimator attribute called `n_trees_per_iteration_`, which is a list of size `self.n_forests_` and captures the number of parallel trees created per iteration. For GB multi-class forests, this is set to n_classes (> 2). For all other forests it is set to 1.
- Adjusts the data structure of the estimator attribute `tree_weights_`. Previously, this was a 2D numpy array of shape (`n_forests_`, `n_estimators`), but because multiclass forests will have (`n_estimators` * `n_classes`) trees, the weights shape may vary from forest to forest. Now it is a list of size `n_forests_` of numpy arrays with shape (`n_estimators` * `n_trees_per_iteration_[i]`).
- Modifies forest weights when applied to multi-class forests. Because there are more trees in these forests, the forest weight needs to be adjusted (i.e. divided by `n_trees_per_iteration_[i]`) when calculating the final weight. This ensures that each forest continues to have the user-specified (or equal) weight.
- Removes `ensemble_delta_loss` and consolidates it into `GBNodeTransformer._set_tree_weights`.
- Creates a new test to ensure correct estimator attributes on `GBNodeTransformer` objects.
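The extra class dimension described above can be confirmed directly from `GradientBoostingClassifier.apply`, whose multi-class output has one leaf index per (sample, iteration, class). A sketch of the reshape needed before Hamming-distance comparison:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=100, n_classes=3, n_informative=4,
                           random_state=0)
est = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

# apply() returns one leaf index per (sample, iteration, class) when the
# target has more than two classes.
nodes = est.apply(X)
assert nodes.shape == (100, 10, 3)

# Flatten the per-class trees so each sample becomes a 1-D vector of node
# IDs, as needed for Hamming-distance neighbor finding.
flat = nodes.reshape(100, 10 * 3)
assert flat.shape == (100, 30)
```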
@aazuspan, finally made a bit of progress on this PR associated with handling multi-class targets for gradient boosting classification. I tried to write a fairly complete commit note on what had to happen, but would appreciate you checking my logic on: 1) the data structures I'm using to hold the tree weights; and 2) the decision (for now) not to include an option to weight each individual class's trees at each iteration, mainly because I'm not sure how to extract this information on loss from
I still need to get to a few of your suggestions before it's probably ready for a full review.
@grovduck, it took me a while to wrap my head around the issue with multi-class classification targets generating an extra dimension of trees, and I'm probably still not fully understanding all the nuances (despite your detailed commit message), so take my responses here with a grain of salt!
Using a list of arrays for I did wonder whether there would be any advantage/disadvantage to pre-normalizing
I might be completely misunderstanding, but is the idea that when a multi-class estimator has 3 classes, the trees associated with each of those classes could be weighted differently? I could see that as a potential feature for manually weighting less important classes lower, but I'm not sure how/if that could be done automatically based on loss, unless maybe the loss function could be weighted?
I like this idea and it makes sense to me intuitively that these should be "scaled" as tree weights rather than when integrating with the forest weights. Should we further say that a forest's weights should sum to 1.0? Currently, we're setting weights for uniform to be 1.0, but we could easily divide by As I started working on this change, though, I thought I better set up a test that uses a multi-class
but didn't know if that's still correct in the classification case? Do you have any insight on this? I'll admit to being a bit confused about whether the function stored in
Yes, that was the idea - it was a random thought that I brought up in #96, and you thought that the situation I had envisioned might be able to be addressed with the existing But I think we're in agreement on not allowing the concept of class weights in a multi-class problem, because apportioning the loss to each class's forest doesn't seem to be built-in functionality. Sorry, this PR keeps getting more and more complicated ...
Yeah, I like this idea! Probably not necessary as you said, but it does seem like it makes the weights easier to interpret and compare.
Ah, good catch. It looks like
I think this is equivalent to how The only potential differences that I noticed are
As a result of a merge conflict when updating to `main`, I chose the incorrect version of `_get_hamming_weights` from `main` rather than this branch. Corrected it and all tests now pass.
Thanks for the detailed update, @grovduck! I don't think either of us could have foreseen all the related issues this was going to surface, but from an observer's perspective it seems like it's gone about as smoothly as it could have, all things considered!
Your testing changes all make sense to me. I certainly trust your approach, and I'm not worried about aggregating classes or loosening some tolerances to get things passing, especially since you have a good understanding of why those changes were needed. Don't let me derail the PR, but just to raise this point for future discussion, I noticed that the new GBNN tests have slowed down the test suite pretty considerably: running all tests with 6 cores on my laptop went from 42.71s on
Your list makes sense, and I don't think I have anything to add.
Yeah, getting a minimal working version merged before diving too deep into documentation or fine-tuning seems like the best strategy. It would be hard to write good documentation without really understanding how the estimator works in practice.
There are two levels of standardization in this commit: first, forest weights are standardized such that their sum is 1.0; second, tree weights within each forest are standardized such that their sum equals the corresponding (standardized) forest weight. In the case of a multi-class GBNN classifier, each class's tree weights will be (forest_weight / n_classes). Additionally:
- add tests to confirm weighting logic for both RFNN and GBNN
- fix transposed weights for GBNN (using `np.tile` instead of `np.repeat`); this required regenerating the tests with GBNN multiclass estimators
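A minimal numpy sketch of the two-level standardization, using hypothetical forest and tree counts (two forests, the second a 3-class GB model whose per-iteration trees triple its tree count):

```python
import numpy as np

# Hypothetical raw weights: one forest weight per forest, one tree-weight
# array per forest (the second forest has 5 iterations x 3 classes).
forest_weights = np.array([2.0, 1.0])
raw_tree_weights = [np.ones(5), np.ones(5 * 3)]

# Level 1: forest weights standardized to sum to 1.0.
forest_weights = forest_weights / forest_weights.sum()

# Level 2: each forest's tree weights standardized to sum to its
# (standardized) forest weight.
tree_weights = [
    fw * tw / tw.sum() for fw, tw in zip(forest_weights, raw_tree_weights)
]

assert np.isclose(forest_weights.sum(), 1.0)
assert np.isclose(sum(tw.sum() for tw in tree_weights), 1.0)
```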
This commit replaces the multiplicative factor used to scale the initial loss calculation. Previously, we used a hard-coded value of 2; now the logic follows scikit-learn's implementation in `BaseGradientBoosting._fit_stages`. This required regenerating regression files for two test instances because they use a `HalfMultinomialLoss`, which should have a factor of 1.
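The factor of 2 follows from the loss definitions: scikit-learn's `HalfSquaredError` evaluates to half the mean squared error, so doubling it recovers the conventional squared-error scale, while the multinomial loss needs no such correction. A quick numeric check:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])
pred = np.array([1.5, 2.0, 2.5])

mse = np.mean((y - pred) ** 2)
half_squared_error = np.mean(0.5 * (y - pred) ** 2)

# Multiplying the half loss by 2 recovers the full squared-error scale,
# which is why a factor of 2 is applied for HalfSquaredError.
assert np.isclose(2 * half_squared_error, mse)
```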
@aazuspan, I think I'm ready to mark this as ready to review. I have no illusions that this will be quick, so please take your time with all the moving parts associated with this PR. A few follow-up notes based on the roadmap that I laid out: Rename
Pull request overview
Copilot reviewed 15 out of 29 changed files in this pull request and generated 4 comments.
Thanks for the detailed rundown, @grovduck! I made a first review pass, although I shied away from the Copilot comments for now. Most of my comments are the same minor performance nit in different spots, but I did find what I think is a bug in the Hamming distance weighting and algorithm selection, which I'll detail at the bottom. Other responses are below.
I initially had some failing regression tests that I couldn't quite pinpoint, but realized that in test_estimators_with_mixed_type_forests, we run both kinds of GB and each has a different loss function (HalfSquaredError for regression that has a factor of 2, HalfMultinomialLoss for multi-class classification that has a factor of 1). It's a bit strange to me that a loss function with Half in its name would have a factor of 1, but I just followed the sklearn code on this one.
That is odd, but I agree that following sklearn is safe. It seems like the important thing is just that we calculate our initial loss the same way they calculate train_score_ to avoid accidentally doubling/halving the loss delta.
So rather than trying to hack a fix yet again, I'm wondering if we should xfail this particular test or continue to try to figure out how to fix it. One thing I read was that specifying min_samples_leaf to be > 1 might be more forgiving on lumping multiple reference plots into the same terminal node ID which might be a way to fix it. I'd value your opinion on the path we should take here.
I'm totally fine with xfailing this for now. It looks like we could even do a platform-specific xfail using a condition if this is reliably passing in Windows.
min_samples_leaf sounds worth exploring as well, but that could be saved for a future PR. This test seems like it's been a massive headache, so I fully understand if you want to put it on the backburner for a while!
It may be worth thinking about making those properties (calculated when needed) rather than estimator attributes? But, as you say, for another PR!
Cached properties for these sounds like a great idea!
Weighting bug
Here's code to reproduce the bug I ran into. It looks like an incompatibility where the fit method gets set automatically to ball tree, which doesn't accept a `w` parameter, but that's as far as I dug.
```python
from sklearn.datasets import make_classification

from sknnr import GBNNRegressor

X, y = make_classification()
est = GBNNRegressor(n_estimators=10).fit(X, y)
# TypeError: __init__() got an unexpected keyword argument 'w'
```

The small `n_estimators` seems to be critical - I'm assuming that's used as a heuristic when selecting the fit method.
EDIT: I marked this as "request changes", but if this turns out to be a tricky bug I'm fine leaving it for a future PR.
@aazuspan, great catch. That would have been bad to put out there! The good news is that I think it's a pretty simple fix and we discussed this previously when I was having all kinds of issues with Hamming weights. I think the fix is to force
I took a bit different tack on this one and changed the test around to include different precision values for each combination. Let me know if you think this is too funky and/or if it doesn't pass on your (linux) system.

```python
@pytest.mark.parametrize(
    ("estimator", "reference", "precision"),
    [
        # Set a higher precision threshold for GBNN with target-based forests
        # because the tree weights are less accurate when using a
        # multi-class classification forest
        (RFNNRegressor, True, 1e-8),
        (RFNNRegressor, False, 1e-8),
        (GBNNRegressor, True, 1e-8),
        (GBNNRegressor, False, 1e-2),
    ],
    ids=[
        "reference_randomForest",
        "target_randomForest",
        "reference_gbnn",
        "target_gbnn",
    ],
)
def test_estimators_with_mixed_type_forests(
    ndarrays_regression,
    moscow_stjoes_test_data,
    estimator,
    reference,
    precision,
):
```
It sounded familiar, but I couldn't remember where we landed - thanks for digging up the conversation. Hard-coding
Clever solution, and it passes on my setup!
Hamming distance is best suited to brute-force neighbor finding; other algorithms (e.g. `BallTree`, `KDTree`) are not compatible. Remove `algorithm` and `leaf_size` (only applicable to `BallTree` and `KDTree`) as user parameters.
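For illustration, weighted Hamming distance is straightforward with brute-force pairwise computation (e.g. `scipy.spatial.distance.cdist`, which accepts per-position weights via `w`), whereas the tree-based index structures have no equivalent weighted variant. The node-index values below are hypothetical:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical node-index matrices: 3 reference and 2 target samples
# across 4 trees, plus per-tree weights.
ref_nodes = np.array([[1, 4, 2, 7],
                      [1, 5, 2, 7],
                      [3, 4, 6, 7]])
tgt_nodes = np.array([[1, 4, 2, 7],
                      [3, 5, 6, 7]])
tree_weights = np.array([0.4, 0.3, 0.2, 0.1])

# Brute-force weighted Hamming distances between every target/reference pair.
dists = cdist(tgt_nodes, ref_nodes, metric="hamming", w=tree_weights)
nearest = np.argmin(dists, axis=1)  # index of nearest reference per target
```

Here the first target matches its reference exactly (distance 0.0), and the second is closest to the third reference, which differs only in the lower-weighted second tree.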
@aazuspan, I've taken out
aazuspan left a comment
Ready to merge whenever you are! This was a huge lift with all the tricky edge cases and reproducibility/performance hurdles, and it turned out great. I'm excited to start testing out GBNN for real 🎉
Whoo, this one was an absolute beast (started in August!). Thank you so much for persevering through all the pain. I'm not sure I'm ready to call it "great" 😄, but it's a solid start. Onward!
This PR adds support for using ensembles of `GradientBoostingRegressor` and `GradientBoostingClassifier` models in a kNN context. As far as we're aware, this is the first implementation of using gradient-boosting models paired with nearest neighbor imputation, but it is a natural extension of `RFNN`, first developed by Crookston and Finley and implemented in `sknnr`.

Separate gradient-boosting estimators are built for each target feature present in `y` (or `y_fit`), and the type of estimator (either regression or classification) is determined by the data type of each target (floating-point and integer use regression estimators; string and `pd.Categorical` use classification estimators). As with RFNN, node IDs from the reference samples that were used to fit the model are captured. New samples are run through the fit estimators and their node IDs are compared to reference node IDs using a weighted Hamming distance to identify nearest neighbors.

Our intuition is that the individual (and successive) trees in each gradient-boosting model should have separate weights associated with them. @aazuspan first suggested weighting trees in an estimator proportional to the change in loss that is explained with each step. This is the default option for the
`tree_weighting_method` parameter for `GBNNRegressor`, although `uniform` weighting is also available as an option.

@aazuspan, maybe a lot to wade through on this PR, but a few specific areas of interest/concern:

- `uniform` weighting which can be used by both RF and GB). I've called your method `delta_loss`, but open to different names as well. What are your thoughts on accepting a function as a parameter to `tree_weighting_method`?
- `test_estimators::test_gridsearchcv`), I ran into the situation where the GB models failed to fit. I don't think `GridSearchCV` fails on these fits, but instead the values for `est.train_score_` seem to all be set to 0.0. This was causing the normalizing step (e.g. `loss_delta / np.sum(loss_delta)`) to fail, so for now I'm just putting in equal weights when this condition happens. My thought is that the model fitting would raise an exception, so it wouldn't typically get to this step, but that's not a good solution for users that want to use it in a grid search context.
- `delta_loss`. We can't test on monotonicity based on your example, but is there anything we should be testing on for "correctness" here?
- `loss` parameter, as you noted: "Generalizing that to work with other loss functions, and figuring out whether it actually works in practice would take some more effort."

Closes #96
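Regarding the degenerate case where `train_score_` values are all 0.0 and `loss_delta / np.sum(loss_delta)` fails: a hedged sketch of the equal-weights fallback described above (`safe_normalize` is a hypothetical name, not the PR's actual function):

```python
import numpy as np

def safe_normalize(loss_delta, n_trees):
    """Normalize loss deltas into tree weights, falling back to uniform
    weights when the deltas are degenerate (e.g. all train_score_ values
    were 0.0 after a failed fit inside a grid search)."""
    total = np.sum(loss_delta)
    if not np.isfinite(total) or total <= 0:
        # Degenerate fit: assign equal weight to every tree.
        return np.full(n_trees, 1.0 / n_trees)
    return np.asarray(loss_delta) / total

assert np.allclose(safe_normalize(np.zeros(4), 4), 0.25)
assert np.allclose(safe_normalize(np.array([3.0, 1.0]), 2), [0.75, 0.25])
```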