1 change: 1 addition & 0 deletions docs/pages/installation.md
## From PyPI

!!! info

`sknnr` is currently in pre-release on PyPI, so you'll need to use the `--pre` flag to install it.

```bash
88 changes: 74 additions & 14 deletions docs/pages/usage.md
In addition to their core functionality of fitting, predicting, and scoring, `sknnr` estimators offer a number of additional features, described below.

### Regression and Classification

The estimators in `sknnr` are all initialized with an optional parameter `n_neighbors` that determines how many plots a target plot's attributes will be predicted from. When `n_neighbors` > 1, a plot's attributes are calculated as optionally-weighted averages of each of its _k_ nearest neighbors. Predicted values can fall anywhere between the observed plot values, making this "regression mode" suitable for continuous attributes (e.g. basal area). To maintain categorical attributes (e.g. dominant species type), the estimators can be run in "classification mode" with `n_neighbors` = 1, where each attribute is imputed directly from its nearest neighbor. To predict a combination of continuous and categorical attributes, it's possible to use two estimators and concatenate their predictions manually.
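As a minimal sketch of the two modes — using scikit-learn's `KNeighborsRegressor`, which `sknnr`'s estimators build on, with hypothetical attribute values:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_cont = np.array([10.0, 20.0, 30.0, 40.0])  # continuous, e.g. basal area
y_cat = np.array([1.0, 1.0, 2.0, 2.0])       # categorical codes, e.g. species type

# "Regression mode": k > 1 averages the values of the k nearest neighbors
reg = KNeighborsRegressor(n_neighbors=2).fit(X, y_cont)

# "Classification mode": k = 1 imputes directly from the single nearest neighbor
clf = KNeighborsRegressor(n_neighbors=1).fit(X, y_cat)

# Concatenate continuous and categorical predictions manually
X_query = np.array([[1.4]])
combined = np.column_stack([reg.predict(X_query), clf.predict(X_query)])
print(combined)  # [[25.  1.]]
```

The averaged prediction (25.0) falls between observed values, while the k = 1 prediction preserves an observed class code (1.0).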

### Independent Scores and Predictions

When an independent test set is not available, the accuracy of a kNN regressor can be estimated by comparing each sample in the training set to its second-nearest neighbor, i.e. the closest point _excluding itself_. All `sknnr` estimators set `independent_prediction_` and `independent_score_` attributes when they are fit, which store the predictions and scores of this independent evaluation.

```python
print(est.independent_score_)
# 0.10243925752772305
```
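This independent evaluation can be sketched manually with scikit-learn's `KNeighborsRegressor` and toy data (`sknnr` estimators compute it automatically at fit time):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [3.0], [6.0]])
y = np.array([10.0, 20.0, 35.0, 40.0])

est = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# When the training set is queried, each sample's nearest neighbor is itself,
# so column 1 holds the closest point *excluding itself*.
_, ind = est.kneighbors(X)
independent_prediction = y[ind[:, 1]]
print(independent_prediction)  # [20. 10. 20. 35.]
```

Scoring `independent_prediction` against `y` (e.g. with `r2_score`) corresponds to `independent_score_`.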

### Deterministic Neighbor Ordering

`scikit-learn`'s `KNeighborsRegressor` [warns](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) that:

> in case of multiple neighbors being at the same distance, the result will depend on the order of the samples in the training data.

In `sknnr`, the `use_deterministic_ordering` parameter of `kneighbors` allows the user to enforce strict ordering of neighbors with deterministic tie-breaking. When this parameter is `True`, neighbors are sorted using the following logical order:

1. **Scaled and rounded distances**: Neighbors are first sorted by their scaled and rounded distances. Scaling is performed per query point: each neighbor's distance is normalized by dividing by the maximum distance for that query point (or 1.0 if the maximum distance is less than 1.0), then rounded to `RawKNNRegressor.DISTANCE_PRECISION_DECIMALS` decimal places (currently set to 10). Some floating point operations in distance calculation (notably `numpy.dot`) can introduce very small numerical differences across platforms; this rounding effectively absorbs them.
2. **Difference between query point row index and neighbor indexes**: If two or more neighbors have identical rounded distances, they are further sorted by the absolute difference between their row index in the training data and the row index of the query point. This ensures that when a sample is its own nearest neighbor, it will always be selected first.
3. **Neighbor index**: If two or more neighbors are still tied based on the two above criteria, they are finally sorted by their row index in the training data.
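Under the assumption that query rows align with training rows (as when querying the training set itself), the three keys can be sketched with `numpy.lexsort`; note that `lexsort` treats its *last* key as the primary one, so the keys are passed in reverse:

```python
import numpy as np

def deterministic_order(neigh_dist, neigh_ind, decimals=10):
    """Reorder neighbors by (1) rounded scaled distance, (2) index
    difference, (3) neighbor index — a sketch of the ordering above."""
    # (1) Scale each query's distances by its max (or 1.0), then round.
    row_scale = np.maximum(neigh_dist.max(axis=1, keepdims=True), 1.0)
    rounded = np.round(neigh_dist / row_scale, decimals=decimals)
    # (2) Absolute difference between query row index and neighbor index.
    idx_diff = np.abs(neigh_ind - np.arange(len(neigh_ind))[:, None])
    # (3) lexsort breaks remaining ties by the neighbor index itself.
    order = np.lexsort((neigh_ind, idx_diff, rounded), axis=1)
    return (
        np.take_along_axis(neigh_dist, order, axis=1),
        np.take_along_axis(neigh_ind, order, axis=1),
    )

# Query row 0 is equidistant from samples 0 and 2: the smaller index
# difference puts sample 0 (the query itself) first.
_, ind = deterministic_order(np.array([[0.0, 0.0]]), np.array([[2, 0]]))
print(ind)  # [[0 2]]
```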

As an example, consider the following training data with three samples:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([
[1, 2, 3],
[4, 5, 6],
[1, 2, 3]
])
y = np.array([10, 20, 30])
est = KNeighborsRegressor(n_neighbors=2).fit(X, y)

print(est.kneighbors(X, return_distance=False))
# [[0 2]
# [1 0]
# [0 2]] - Not returning itself as first neighbor
```

Using `sknnr`'s `RawKNNRegressor` with deterministic ordering:

```python
from sknnr import RawKNNRegressor
est = RawKNNRegressor(n_neighbors=2).fit(X, y)
print(est.kneighbors(X, return_distance=False, use_deterministic_ordering=True))
# [[0 2]
# [1 0]
# [2 0]] - Returning itself as first neighbor
```

The `use_deterministic_ordering` parameter defaults to `True`, but `scikit-learn`'s default behavior can be restored by setting it to `False` when calling `kneighbors`:

```python
distances, neighbors = est.kneighbors(
X_test,
use_deterministic_ordering=False
)
```

!!! warning

Meaningful precision may be lost when rounding distances, especially with datasets that include samples with very large distances. In these situations, we suggest either increasing `RawKNNRegressor.DISTANCE_PRECISION_DECIMALS` or disabling `use_deterministic_ordering` (at the expense of cross-platform reproducibility).
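A small numpy illustration of the concern, with hypothetical distances: after per-query scaling, an absolute difference of 0.001 between two very large distances falls below the tenth decimal place and is rounded away:

```python
import numpy as np

# Two genuinely different (hypothetical) distances for one query point
d = np.array([[123456789.0, 123456789.001]])

# Per-query scaling and rounding as described above
scale = np.maximum(d.max(axis=1, keepdims=True), 1.0)
rounded = np.round(d / scale, decimals=10)

print(rounded[0, 0] == rounded[0, 1])  # True: the 0.001 difference is lost
```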

### Retrieving Dataframe Indexes

In `sklearn`, the `KNeighborsRegressor.kneighbors` method can identify the array index of the nearest neighbor to a given sample. Estimators in `sknnr` offer an additional parameter, `return_dataframe_index`, that allows neighbor samples to be identified directly by their dataframe index.
```python
distances, neighbor_ids = est.kneighbors(X.iloc[:1], return_dataframe_index=True)
print(y.loc[neighbor_ids[0]])
```

| | ABAM_COV | ABGRC_COV | ABPRSH_COV | ACMA3_COV | ALRH2_COV |
| ----: | -------: | --------: | ---------: | --------: | --------: |
| 52481 | 0 | 0 | 39.3469 | 0 | 0 |
| 60089 | 0 | 0 | 22.1199 | 0 | 0 |
| 56253 | 0 | 0 | 22.8948 | 0 | 0 |

!!! warning

An estimator must be fit with a `DataFrame` in order to use `return_dataframe_index=True`.

!!! tip

In forestry applications, users typically store a unique inventory plot identification number as the index in the dataframe.

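For example, a hypothetical plot table keyed by a unique plot identification number can be built with `pandas`:

```python
import pandas as pd

# Hypothetical inventory plots keyed by a unique plot ID
plots = pd.DataFrame(
    {
        "PLOT_ID": [52481, 60089, 56253],
        "ABPRSH_COV": [39.3469, 22.1199, 22.8948],
    }
).set_index("PLOT_ID")

print(plots.index.tolist())  # [52481, 60089, 56253]
```

Fitting an estimator with a dataframe indexed this way lets `kneighbors(..., return_dataframe_index=True)` report neighbors by plot ID.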
### Y-Fit Data
```python
est = GNNRegressor(n_components=3).fit(X, y)
```

!!! warning

The maximum number of components depends on the input data and the estimator. Specifying `n_components` greater than the maximum number of components will raise an error.
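As an illustration with a stand-in transformer (scikit-learn's `PCA`, not one of `sknnr`'s own transformers), requesting more components than the data supports raises a `ValueError`:

```python
import numpy as np
from sklearn.decomposition import PCA

# 4 samples x 3 features: at most min(n_samples, n_features) = 3 components
X = np.random.default_rng(0).normal(size=(4, 3))

raised = False
try:
    PCA(n_components=5).fit(X)
except ValueError as err:
    raised = True
    print(f"Too many components: {err}")
```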

### RFNN Distance Metric
```python
X_df, y_df = load_swo_ecoplot(return_X_y=True, as_frame=True)
print(X_df.head())
```

| | ANNPRE | ANNTMP | AUGMAXT | CONTPRE | CVPRE | DECMINT | DIFTMP | SMRTMP | SMRTP | ASPTR | DEM | PRR | SLPPCT | TPI450 | TC1 | TC2 | TC3 | NBR |
| ----: | ------: | ------: | ------: | ------: | ------: | -------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | -------: | ------: | ------: | -------: | ------: |
| 52481 | 740 | 514.667 | 2315 | 517.667 | 8971.67 | -583.111 | 2899.11 | 1136.11 | 212.222 | 197.667 | 1870.11 | 13196.7 | 48.3333 | 33.7778 | 218.778 | 68.5556 | -86.2222 | 343.556 |
| 52482 | 742 | 563.556 | 2354.33 | 502 | 9124.33 | -543.556 | 2898.89 | 1179.44 | 221.111 | 190.222 | 1713.11 | 16355.8 | 5.4444 | 6.4444 | 210.222 | 60.3333 | -96.6667 | 261.667 |
| 52484 | 738.556 | 639.111 | 2468.89 | 545.889 | 8897.22 | -479.111 | 2949 | 1266.22 | 236 | 194.556 | 1612.11 | 15132.6 | 15.5556 | -1.2222 | 157 | 110.222 | -17.4444 | 721 |
| 52485 | 730.333 | 622.667 | 2405.33 | 555 | 8829.78 | -481.222 | 2887.56 | 1244.22 | 234 | 196.444 | 1682.33 | 15146.7 | 19.8889 | -16.8889 | 152.556 | 86.1111 | -31.6667 | 597.111 |
| 52494 | 720 | 778.556 | 2678.11 | 658.556 | 8638 | -386.667 | 3065.78 | 1396 | 262 | 191.778 | 1345.67 | 16672.1 | 2 | 0.4444 | 214.667 | 58.5556 | -88.1111 | 294.222 |

!!! note

`pandas` must be installed to use `as_frame=True`.
108 changes: 106 additions & 2 deletions src/sknnr/_base.py
class RawKNNRegressor(

Attributes
----------
DISTANCE_PRECISION_DECIMALS : int, class attribute
Number of decimal places used when rounding scaled distances to ensure
deterministic neighbor ordering. Default is 10.
effective_metric_ : str
The distance metric to use. It will be same as the metric parameter
or a synonym of it, e.g. 'euclidean' if the metric parameter set to
Number of samples in the fitted data.
"""

DISTANCE_PRECISION_DECIMALS = 10

def fit(self, X, y):
"""Override fit to set attributes using mixins."""
self._set_dataframe_index_in(X)
def kneighbors(
n_neighbors=None,
return_distance=True,
return_dataframe_index=False,
use_deterministic_ordering=True,
):
"""Override kneighbors to optionally return dataframe indexes."""
"""
Find the K-neighbors of a point or points in the dataset and optionally
return dataframe indexes rather than array indices when the model was
fitted with a dataframe.

Parameters
----------
X : array-like of shape (n_queries, n_features), default=None
The query point or points. If not provided, neighbors of each
indexed point are returned. In this case, the query point is not
considered its own neighbor.
n_neighbors : int, default=None
Number of neighbors required for each sample. The default is the
value passed to the constructor.
return_distance : bool, default=True
Whether or not to return the distances.
return_dataframe_index : bool, default=False
Whether or not to return dataframe indexes instead of array indices.
Only applicable if the model was fitted with a dataframe.
use_deterministic_ordering : bool, default=True
Whether to use deterministic ordering of neighbors when distances
are nearly identical. If True, neighbors with nearly identical
distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are
ordered lexicographically by:
(1) their scaled and rounded distances,
(2) the absolute difference between a query point's row index
and the neighbor index (so that a query sample present in the
training data is returned before other equally distant samples), and
(3) the neighbor index itself.
If False, use the default ordering from
`KNeighborsRegressor.kneighbors`. See the
[usage guide](../../../usage/#deterministic-neighbor-ordering)
for more details.

Returns
-------
neigh_dist : array-like of shape (n_queries, n_neighbors)
Array representing the lengths to points, only present if
return_distance=True.
neigh_ind : array-like of shape (n_queries, n_neighbors)
Array indices or dataframe indexes of the nearest points in the
population matrix.
"""
neigh_dist, neigh_ind = super().kneighbors(
X=X, n_neighbors=n_neighbors, return_distance=True
)

if use_deterministic_ordering:
row_scale = np.maximum(neigh_dist.max(axis=1, keepdims=True), 1.0)
rounded = np.round(
neigh_dist / row_scale, decimals=self.DISTANCE_PRECISION_DECIMALS
)
neigh_ind_diff = np.abs(neigh_ind - np.arange(len(neigh_ind))[:, None])
sorted_indices = np.lexsort((neigh_ind, neigh_ind_diff, rounded), axis=1)

neigh_dist = np.take_along_axis(neigh_dist, sorted_indices, axis=1)
neigh_ind = np.take_along_axis(neigh_ind, sorted_indices, axis=1)

if return_dataframe_index:
msg = "Dataframe indexes can only be returned when fitted with a dataframe."
check_is_fitted(self, "dataframe_index_in_", msg=msg)
def kneighbors(
n_neighbors=None,
return_distance=True,
return_dataframe_index=False,
use_deterministic_ordering=True,
):
"""Return neighbor indices and distances using transformed feature data."""
"""
Find the K-neighbors of a point or points of transformed feature data
and optionally return dataframe indexes rather than array indices when
the model was fitted with a dataframe.

Parameters
----------
X : array-like of shape (n_queries, n_features), default=None
The query point or points. Points are first transformed using the
fitted transformer. If not provided, neighbors of each indexed
point are returned. In this case, the query point is not
considered its own neighbor.
n_neighbors : int, default=None
Number of neighbors required for each sample. The default is the
value passed to the constructor.
return_distance : bool, default=True
Whether or not to return the distances.
return_dataframe_index : bool, default=False
Whether or not to return dataframe indexes instead of array indices.
Only applicable if the model was fitted with a dataframe.
use_deterministic_ordering : bool, default=True
Whether to use deterministic ordering of neighbors when distances
are nearly identical. If True, neighbors with nearly identical
distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are
ordered lexicographically by:
(1) their scaled and rounded distances,
(2) the absolute difference between a query point's row index
and the neighbor index (so that a query sample present in the
training data is returned before other equally distant samples), and
(3) the neighbor index itself.
If False, use the default ordering from
`KNeighborsRegressor.kneighbors`. See the
[usage guide](../../../usage/#deterministic-neighbor-ordering)
for more details.

Returns
-------
neigh_dist : array-like of shape (n_queries, n_neighbors)
Array representing the lengths to points, only present if
return_distance=True.
neigh_ind : array-like of shape (n_queries, n_neighbors)
Array indices or dataframe indexes of the nearest points in the
population matrix.
"""
X_transformed = self._transform_X(X)
return self.regressor_.kneighbors(
X=X_transformed,
n_neighbors=n_neighbors,
return_distance=return_distance,
return_dataframe_index=return_dataframe_index,
use_deterministic_ordering=use_deterministic_ordering,
)

def predict(self, X):
75 changes: 75 additions & 0 deletions tests/test_estimators.py
def test_n_features_in(estimator, X_y_yfit):
assert est.n_features_in_ == len(transformed_features)


@pytest.mark.parametrize(
("use_deterministic_ordering", "expected_idx_order"),
[(False, [1, 0]), (True, [0, 1])],
)
def test_kneighbors_deterministic_ordering(
use_deterministic_ordering, expected_idx_order
):
"""
Test that the use_deterministic_ordering parameter affects the order
of neighbors when distances are nearly identical.
"""
X = np.array([1e-11, 1e-12, 1.0]).reshape(-1, 1)
y = np.array([0, 1, 2])

X_query = np.array([[0.0]])

_, idx = (
RawKNNRegressor(n_neighbors=2)
.fit(X, y)
.kneighbors(X_query, use_deterministic_ordering=use_deterministic_ordering)
)
assert_array_equal(idx[0], expected_idx_order)


def test_kneighbors_uses_index_difference():
"""
Test that when distances are considered to be identical, the absolute index
difference is used before indexes to order neighbors.
"""
X = np.array([1e-11, 1e-12, 1.0]).reshape(-1, 1)
y = np.array([0, 1, 2])

# Use two identical query points which should have different
# neighbors due to their row indexes
X_query = np.array([[0.0], [0.0]])

_, idx = (
RawKNNRegressor(n_neighbors=2)
.fit(X, y)
.kneighbors(X_query, use_deterministic_ordering=True)
)
assert_array_equal(idx[0], [0, 1])
assert_array_equal(idx[1], [1, 0])


@pytest.mark.parametrize(
("precision_decimals", "expected_idx_order"),
[(8, [2, 1, 0]), (5, [1, 2, 0]), (2, [0, 1, 2])],
)
def test_kneighbors_precision_decimals(
monkeypatch, precision_decimals, expected_idx_order
):
"""
Test that changing DISTANCE_PRECISION_DECIMALS affects the order
of neighbors on small precision differences.
"""
monkeypatch.setattr(
RawKNNRegressor, "DISTANCE_PRECISION_DECIMALS", precision_decimals
)

# Create features that differ by small amounts such that
# precision_decimals falls between them
X = np.array([1e-3, 1e-6, 1e-9, 1.0]).reshape(-1, 1)
y = np.array([0, 1, 2, 3])

X_query = np.array([[0.0]])

_, idx = (
RawKNNRegressor(n_neighbors=3)
.fit(X, y)
.kneighbors(X_query, use_deterministic_ordering=True)
)
assert_array_equal(idx[0], expected_idx_order)


@pytest.mark.parametrize("forest_weights", ["uniform", [0.5, 1.5], (1.0, 2.0)])
def test_rfnn_handles_forest_weights(forest_weights):
"""Test that RFNNRegressor handles forest weights correctly."""
Binary file modified tests/test_regressions/test_mixed_rfnn_forests_reference_.npz
Binary file modified tests/test_regressions/test_mixed_rfnn_forests_target_.npz