Merged
Changes from all commits (33 commits)
3a0d5ac
Change signature on `_set_tree_weights` to accommodate `X` and `y`
grovduck Aug 25, 2025
589aa67
Add `GBNodeTransformer`
grovduck Aug 25, 2025
6b4ef5a
Include transformer tests for `GBNodeTransformer`
grovduck Aug 25, 2025
8858c59
Add `GBNNRegressor`
grovduck Aug 25, 2025
5ff6bb1
Move uniform tree weighting into separate function
grovduck Aug 29, 2025
d0cba48
Reorganize tree weighting functions in GBNodeTransformer
grovduck Aug 29, 2025
64d0362
Add `tree_weighting_method` parameter to `GBNNRegressor`
grovduck Aug 29, 2025
5dc3718
Add `GBNNRegressor` to tests with regression data
grovduck Aug 29, 2025
733b08c
Change type hint information to superclass rather than union
grovduck Aug 29, 2025
416e0e3
Handle weights and nodes for multi-class GB classifiers
grovduck Oct 7, 2025
c8f3fff
Alter test for loss_delta
grovduck Oct 7, 2025
4e18211
Accommodate non-numeric types in GB classification targets in loss ca…
grovduck Jan 8, 2026
62bc70d
Merge branch 'main' into gbnn
grovduck Jan 19, 2026
2607b03
Update to correct logic for Hamming weights
grovduck Jan 19, 2026
eca6ddb
Merge branch 'main' into gbnn
grovduck Jan 31, 2026
c46cfcb
Update regression tests
grovduck Feb 2, 2026
fd3d8f5
Coarsen classes in regression test to avoid numerical imprecision
grovduck Feb 5, 2026
8ff13e3
Ease tolerance on distances in mixed-type forest test
grovduck Feb 5, 2026
8bd80ed
Further ease tolerance on distances in mixed-type forest test
grovduck Feb 5, 2026
4a95a13
Merge branch 'main' into gbnn
grovduck Feb 5, 2026
d45c9ad
Standardize Hamming weights to sum to 1.0 across all forests
grovduck Feb 7, 2026
437faa0
Replace hard-coded value for factor with sklearn logic
grovduck Feb 9, 2026
8cdf51a
Rename `delta_loss` to `train_improvement` and document intended use
grovduck Feb 9, 2026
39a7cbc
Add documentation pages for `GBNN` / `GBNodeTransformer`
grovduck Feb 9, 2026
119f31c
Swap order of `n_estimators` and `n_classes` axes from `apply`
grovduck Feb 14, 2026
6be7f8c
Simplify stacking of `y` array in test
grovduck Feb 14, 2026
ee51728
Fix feature names to accommodate classes
grovduck Feb 14, 2026
3f3f872
Better error checking on user-supplied forest weights
grovduck Feb 16, 2026
cac1a06
Respond to comments, change "float64" to `np.float64`
grovduck Feb 16, 2026
5d82730
Specialize precision on `atol` for tree-based tests
grovduck Feb 16, 2026
dc5b05d
Force `algorithm` to `brute` for tree-based methods.
grovduck Feb 16, 2026
3ca82d8
More explicit parameter naming in test function
grovduck Feb 16, 2026
770c841
Apply fixes
grovduck Feb 16, 2026
1 change: 1 addition & 0 deletions docs/abbreviations.md
@@ -2,5 +2,6 @@
*[MSN]: Most Similar Neighbor
*[kNN]: k-nearest neighbor
*[RFNN]: Random Forest Nearest Neighbor
*[GBNN]: Gradient Boosting Nearest Neighbor
*[CCorA]: Canonical Correlation Analysis
*[CCA]: Canonical Correspondence Analysis
2 changes: 2 additions & 0 deletions docs/mkdocs.yml
@@ -15,12 +15,14 @@ nav:
- GNNRegressor: api/estimators/gnn.md
- MSNRegressor: api/estimators/msn.md
- RFNNRegressor: api/estimators/rfnn.md
- GBNNRegressor: api/estimators/gbnn.md
- Transformers:
- StandardScalerWithDOF: api/transformers/standardscalerwithdof.md
- MahalanobisTransformer: api/transformers/mahalanobis.md
- CCATransformer: api/transformers/cca.md
- CCorATransformer: api/transformers/ccora.md
- RFNodeTransformer: api/transformers/rfnode.md
- GBNodeTransformer: api/transformers/gbnode.md
- Datasets:
- Dataset: api/datasets/dataset.md
- "Moscow Mountain / St. Joes": api/datasets/moscow_stjoes.md
1 change: 1 addition & 0 deletions docs/pages/api/estimators/gbnn.md
@@ -0,0 +1 @@
::: sknnr.GBNNRegressor
1 change: 1 addition & 0 deletions docs/pages/api/transformers/gbnode.md
@@ -0,0 +1 @@
::: sknnr.transformers.GBNodeTransformer
16 changes: 10 additions & 6 deletions docs/pages/usage.md
@@ -1,13 +1,14 @@
## Estimators

`sknnr` provides six estimators that are fully compatible, drop-in replacements for `scikit-learn` estimators:
`sknnr` provides seven estimators that are fully compatible, drop-in replacements for `scikit-learn` estimators:

- [RawKNNRegressor](api/estimators/raw.md)
- [EuclideanKNNRegressor](api/estimators/euclidean.md)
- [MahalanobisKNNRegressor](api/estimators/mahalanobis.md)
- [GNNRegressor](api/estimators/gnn.md)
- [MSNRegressor](api/estimators/msn.md)
- [RFNNRegressor](api/estimators/rfnn.md)
- [GBNNRegressor](api/estimators/gbnn.md)

These estimators can be used like any other `sklearn` regressor (or [classifier](#regression-and-classification))[^sklearn-docs].

@@ -128,11 +129,11 @@ print(y.loc[neighbor_ids[0]])

### Y-Fit Data

The [GNNRegressor](api/estimators/gnn.md), [MSNRegressor](api/estimators/msn.md), and [RFNNRegressor](api/estimators/rfnn.md) estimators can be fit with `X` and `y` data, but they also accept an optional `y_fit` parameter. If provided, `y_fit` is used to fit the ordination transformer while `y` is used to fit the kNN regressor.
The [GNNRegressor](api/estimators/gnn.md), [MSNRegressor](api/estimators/msn.md), [RFNNRegressor](api/estimators/rfnn.md), and [GBNNRegressor](api/estimators/gbnn.md) estimators can be fit with `X` and `y` data, but they also accept an optional `y_fit` parameter. If provided, `y_fit` is used to fit the transformer while `y` is used to fit the kNN regressor.

In forest attribute estimation, the underlying ordination transformations for two of these estimators (CCA for GNN and CCorA for MSN) typically use a matrix of species abundances or presence/absence information to relate the species data to environmental covariates, but often the user wants predictions based not on these features, but rather attributes that describe forest structure (e.g. biomass) or composition (e.g. species richness). In this case, the species matrix would be specified as `y_fit` and the stand attributes would be specified as `y`.
In forest attribute estimation, the underlying transformations for two of these estimators (CCA for GNN and CCorA for MSN) typically use a matrix of species abundances or presence/absence information to relate the species data to environmental covariates, but often the user wants predictions based not on these features, but rather attributes that describe forest structure (e.g. biomass) or composition (e.g. species richness). In this case, the species matrix would be specified as `y_fit` and the stand attributes would be specified as `y`.

For RFNN, the `y_fit` parameter can be used to specify the attributes for which individual random forests will be created (one forest per feature). As with GNN and MSN, the `y` parameter can then be used to specify the attributes that will be predicted by the nearest neighbors.
For RFNN and GBNN, the `y_fit` parameter can be used to specify the attributes for which individual forests will be created (one forest per feature). As with GNN and MSN, the `y` parameter can then be used to specify the attributes that will be predicted by the nearest neighbors.

```python
from sknnr import GNNRegressor
@@ -152,9 +153,11 @@ est = GNNRegressor(n_components=3).fit(X, y)

The maximum number of components depends on the input data and the estimator. Specifying `n_components` greater than the maximum number of components will raise an error.

### RFNN Distance Metric
### RFNN and GBNN Distance Metric

For all estimators other than [RFNNRegressor](api/estimators/rfnn.md), the distance metric used to determine nearest neighbors is the Euclidean distance between samples in the transformed space. RFNN, on the other hand, first builds a random forest for each feature in the `y` (or `y_fit`) matrix and then captures the node IDs (_not_ values) for each sample on every forest and tree. The distance between samples is calculated using [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance), which captures the number of node IDs that are different between the target and reference samples and then divided by the total number of nodes. Therefore, a target and reference sample that share _all_ node IDs would have a distance of 0, whereas a target and reference sample that share _no_ node IDs would have a distance of 1.
For all estimators other than [RFNNRegressor](api/estimators/rfnn.md) and [GBNNRegressor](api/estimators/gbnn.md), the distance metric used to determine nearest neighbors is the Euclidean distance between samples in the transformed space. RFNN and GBNN, on the other hand, first build a forest for each feature in the `y` (or `y_fit`) matrix and then capture the node IDs (_not_ values) for each sample on every forest and tree. The distance between samples is calculated using [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance): the number of node IDs that differ between the target and reference samples, divided by the total number of nodes. Therefore, a target and reference sample that share _all_ node IDs would have a distance of 0, whereas a target and reference sample that share _no_ node IDs would have a distance of 1.
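As a plain-NumPy illustration (a sketch of the metric described above, not the package's internals — the node IDs here are hypothetical), the unweighted Hamming distance between two samples' node-ID vectors is simply the fraction of positions that disagree:

```python
import numpy as np

# Hypothetical node IDs for one target and one reference sample,
# flattened across all forests and trees (one entry per tree).
target_nodes = np.array([3, 7, 2, 9, 4, 4])
reference_nodes = np.array([3, 7, 5, 9, 1, 4])

# Hamming distance: fraction of positions where the node IDs differ.
distance = np.mean(target_nodes != reference_nodes)
print(distance)  # 2 of 6 node IDs differ -> 0.333...
```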

Additionally, GBNN allows users to specify the `tree_weighting_method` parameter, which weights each tree's contribution to the Hamming distance calculation according to the importance of that boosting stage during training. When `tree_weighting_method` is set to `"train_improvement"`, tree stages that contribute more to reducing training loss are weighted more heavily. When `tree_weighting_method` is set to `"uniform"`, all trees are weighted equally.
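A minimal sketch of how per-tree weights might enter the distance (illustrative only — the exact weighting logic lives in `GBNodeTransformer`, and the improvement values below are made up): weights are normalized to sum to 1, and disagreements are summed with those weights, so trees that reduced training loss more count more toward the distance.

```python
import numpy as np

target_nodes = np.array([3, 7, 2, 9])
reference_nodes = np.array([3, 7, 5, 9])

# Hypothetical per-tree weights, e.g. proportional to each stage's
# reduction in training loss, normalized to sum to 1.
raw_improvements = np.array([4.0, 2.0, 1.0, 1.0])
tree_weights = raw_improvements / raw_improvements.sum()

# Weighted Hamming distance: sum of weights at positions where
# the node IDs disagree.
weighted_distance = np.sum(tree_weights * (target_nodes != reference_nodes))
print(weighted_distance)  # only the third tree disagrees -> 1/8 = 0.125
```

With uniform weights (all 0.25 here), this reduces to the unweighted Hamming distance.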

### Custom Transformers

@@ -174,6 +177,7 @@ print(cca.fit_transform(X, y))
- [CCATransformer](api/transformers/cca.md)
- [CCorATransformer](api/transformers/ccora.md)
- [RFNodeTransformer](api/transformers/rfnode.md)
- [GBNodeTransformer](api/transformers/gbnode.md)

## Datasets

2 changes: 2 additions & 0 deletions src/sknnr/__init__.py
@@ -1,6 +1,7 @@
from .__about__ import __version__ # noqa: F401
from ._base import RawKNNRegressor
from ._euclidean import EuclideanKNNRegressor
from ._gbnn import GBNNRegressor
from ._gnn import GNNRegressor
from ._mahalanobis import MahalanobisKNNRegressor
from ._msn import MSNRegressor
@@ -13,4 +14,5 @@
"MSNRegressor",
"GNNRegressor",
"RFNNRegressor",
"GBNNRegressor",
]
246 changes: 246 additions & 0 deletions src/sknnr/_gbnn.py
@@ -0,0 +1,246 @@
from __future__ import annotations

from typing import Callable, Literal

from numpy.typing import ArrayLike
from sklearn.base import BaseEstimator, TransformerMixin

from ._weighted_trees import WeightedTreesNNRegressor
from .transformers import GBNodeTransformer


class GBNNRegressor(WeightedTreesNNRegressor):
"""
Regression using Gradient Boosting Nearest Neighbors (GBNN) imputation.

New data is predicted by similarity of its node indexes to training
set node indexes when run through multiple univariate gradient boosting
models. A gradient boosting model is fit to each target in the training
set and node indexes are captured for each tree in each forest for each
training sample. Node indexes are then captured for inference data and
distance is calculated as the dissimilarity between node indexes.

Gradient boosting models are constructed using either scikit-learn's
`GradientBoostingRegressor` or `GradientBoostingClassifier` classes based on
the data type of each target (`y` or `y_fit`) in the training set. If the
target is numeric (e.g. `int` or `float`), a `GradientBoostingRegressor` is
used. If the target is categorical (e.g. `str` or `pd.Categorical`), a
`GradientBoostingClassifier` is used. The
`sknnr.transformers.GBNodeTransformer` class is responsible for constructing
the gradient boosting models and capturing the node indexes.

See `sklearn.neighbors.KNeighborsRegressor` for more detail on
parameters associated with nearest neighbors. See
`sklearn.ensemble.GradientBoostingRegressor` and
`sklearn.ensemble.GradientBoostingClassifier` for more detail on parameters
associated with gradient boosting. Note that some parameters (e.g.
`loss` and `alpha`) are specified separately for regression and
classification and have `_reg` and `_clf` suffixes.

Parameters
----------
loss_reg : {"squared_error", "absolute_error", "huber", "quantile"},
default="squared_error"
Loss function to be optimized for regression.
loss_clf : {"log_loss", "exponential"}, default="log_loss"
The loss function to be used for classification.
learning_rate : float, default=0.1
Learning rate shrinks the contribution of each tree by `learning_rate`.
n_estimators : int, default=100
The number of boosting stages to perform.
subsample : float, default=1.0
The fraction of samples to be used for fitting the individual base
learners.
criterion : {"friedman_mse", "squared_error"}, default="friedman_mse"
The function to measure the quality of a split.
min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node.
min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node.
min_weight_fraction_leaf : float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the
input samples) required to be at a leaf node.
max_depth : int or None, default=3
Maximum depth of the individual regression estimators.
min_impurity_decrease : float, default=0.0
A node will be split if this split induces a decrease of the impurity
greater than or equal to this value.
init : estimator, "zero" or None, default=None
An estimator object that is used to compute the initial predictions.
random_state : int, RandomState instance or None, default=None
Controls the random seed given to each Tree estimator at each boosting
iteration.
max_features : {"sqrt", "log2"}, int or float, default=None
The number of features to consider when looking for the best split.
alpha_reg : float, default=0.9
The alpha-quantile of the huber loss function and the quantile loss
function.
verbose : int, default=0
Enable verbose output.
max_leaf_nodes : int or None, default=None
Grow trees with `max_leaf_nodes` in best-first fashion.
warm_start : bool, default=False
When set to `True`, reuse the solution of the previous call to fit and
add more estimators to the ensemble, otherwise, just erase the previous
solution.
validation_fraction : float, default=0.1
The proportion of training data to set aside as validation set for
early stopping.
n_iter_no_change : int or None, default=None
`n_iter_no_change` is used to decide if early stopping will be used to
terminate training when validation score is not improving.
tol : float, default=1e-4
Tolerance for the early stopping.
ccp_alpha : non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning.
forest_weights : {"uniform"} or array-like of shape (n_targets,), default="uniform"
Weights assigned to each target in the training set when calculating
Hamming distance between node indexes. This allows for differential
weighting of targets when calculating distances. Note that all trees
associated with a target will receive the same weight. If "uniform",
each tree is assigned equal weight.
tree_weighting_method : {"train_improvement", "uniform"}, default="train_improvement"
The method used to weight the trees in each gradient boosting model.
n_neighbors : int, default=5
Number of neighbors to use by default for `kneighbors` queries.
weights : {"uniform", "distance"}, callable or None, default="uniform"
Weight function used in prediction.
n_jobs : int or None, default=None
The number of jobs to run in parallel.

Attributes
----------
effective_metric_ : str
Always set to 'hamming'.
effective_metric_params_ : dict
Always empty.
hamming_weights_ : np.ndarray
When fitted, provides the weights on each tree in each forest used when
calculating the Hamming distance.
independent_prediction_ : np.ndarray
When fitted, provides the prediction for training data without allowing
self-assignment during neighbor search.
independent_score_ : float
When fitted, the mean coefficient of determination of the independent
prediction across all features.
n_features_in_ : int
Number of features that the transformer outputs. This is equal to the
number of features in `y` (or `y_fit`) * `n_estimators_per_forest`.
n_samples_fit_ : int
Number of samples in the fitted data.
transformer_ : GBNodeTransformer
The fitted transformer which holds the built gradient boosting models
for each feature.
y_fit_ : np.ndarray or pd.DataFrame
When `y_fit` is passed to `fit`, the data used to construct the
individual gradient boosting models. Note that all `y` data is used
for prediction.

Notes
-----
The `tree_weighting_method` parameter determines how the trees in each
forest are weighted when calculating distances between node indexes.
If `tree_weighting_method` is set to "train_improvement", tree weights are
calculated as a function of the change in loss between successive trees
in the gradient boosting estimator. As such, weights are directly
proportional to the loss function specified and the user may want to
choose the appropriate loss function (i.e. `loss_reg` or `loss_clf`)
for their task.

If `tree_weighting_method` is set to "uniform", all trees are weighted
equally.
"""

def __init__(
self,
*,
loss_reg: Literal[
"squared_error", "absolute_error", "huber", "quantile"
] = "squared_error",
loss_clf: Literal["log_loss", "exponential"] = "log_loss",
learning_rate: float = 0.1,
n_estimators: int = 100,
subsample: float = 1.0,
criterion: Literal["friedman_mse", "squared_error"] = "friedman_mse",
min_samples_split: int | float = 2,
min_samples_leaf: int | float = 1,
min_weight_fraction_leaf: float = 0.0,
max_depth: int | None = 3,
min_impurity_decrease: float = 0.0,
init: BaseEstimator | Literal["zero"] | None = None,
random_state: int | None = None,
max_features: Literal["sqrt", "log2"] | int | float | None = None,
alpha_reg: float = 0.9,
verbose: int = 0,
max_leaf_nodes: int | None = None,
warm_start: bool = False,
validation_fraction: float = 0.1,
n_iter_no_change: int | None = None,
tol: float = 0.0001,
ccp_alpha: float = 0.0,
forest_weights: Literal["uniform"] | ArrayLike[float] = "uniform",
tree_weighting_method: Literal[
"train_improvement", "uniform"
] = "train_improvement",
n_neighbors: int = 5,
weights: Literal["uniform", "distance"] | Callable = "uniform",
n_jobs: int | None = None,
):
self.loss_reg = loss_reg
self.loss_clf = loss_clf
self.learning_rate = learning_rate
self.n_estimators = n_estimators
self.subsample = subsample
self.criterion = criterion
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.min_weight_fraction_leaf = min_weight_fraction_leaf
self.max_depth = max_depth
self.min_impurity_decrease = min_impurity_decrease
self.init = init
self.random_state = random_state
self.max_features = max_features
self.alpha_reg = alpha_reg
self.verbose = verbose
self.max_leaf_nodes = max_leaf_nodes
self.warm_start = warm_start
self.validation_fraction = validation_fraction
self.n_iter_no_change = n_iter_no_change
self.tol = tol
self.ccp_alpha = ccp_alpha
self.forest_weights = forest_weights
self.tree_weighting_method = tree_weighting_method

super().__init__(
n_neighbors=n_neighbors,
weights=weights,
n_jobs=n_jobs,
)

def _get_transformer(self) -> TransformerMixin:
return GBNodeTransformer(
loss_reg=self.loss_reg,
loss_clf=self.loss_clf,
learning_rate=self.learning_rate,
n_estimators=self.n_estimators,
subsample=self.subsample,
criterion=self.criterion,
min_samples_split=self.min_samples_split,
min_samples_leaf=self.min_samples_leaf,
min_weight_fraction_leaf=self.min_weight_fraction_leaf,
max_depth=self.max_depth,
min_impurity_decrease=self.min_impurity_decrease,
init=self.init,
random_state=self.random_state,
max_features=self.max_features,
alpha_reg=self.alpha_reg,
verbose=self.verbose,
max_leaf_nodes=self.max_leaf_nodes,
warm_start=self.warm_start,
validation_fraction=self.validation_fraction,
n_iter_no_change=self.n_iter_no_change,
tol=self.tol,
ccp_alpha=self.ccp_alpha,
tree_weighting_method=self.tree_weighting_method,
)
8 changes: 0 additions & 8 deletions src/sknnr/_rfnn.py
@@ -109,10 +109,6 @@ class RFNNRegressor(WeightedTreesNNRegressor):
Number of neighbors to use by default for `kneighbors` queries.
weights : {"uniform", "distance"}, callable or None, default="uniform"
Weight function used in prediction.
algorithm : {"auto", "ball_tree", "kd_tree", "brute"}, default="auto"
Algorithm used to compute the nearest neighbors.
leaf_size : int, default=30
Leaf size passed to `BallTree` or `KDTree`.

Attributes
----------
@@ -184,8 +180,6 @@ def __init__(
forest_weights: Literal["uniform"] | ArrayLike[float] = "uniform",
n_neighbors: int = 5,
weights: Literal["uniform", "distance"] | Callable = "uniform",
algorithm: Literal["auto", "ball_tree", "kd_tree", "brute"] = "auto",
leaf_size: int = 30,
):
self.n_estimators = n_estimators
self.criterion_reg = criterion_reg
@@ -213,8 +207,6 @@ def __init__(
super().__init__(
n_neighbors=n_neighbors,
weights=weights,
algorithm=algorithm,
leaf_size=leaf_size,
n_jobs=self.n_jobs,
)
