1 change: 1 addition & 0 deletions docs/pages/installation.md
## From PyPI

!!! info

`sknnr` is currently in pre-release on PyPI, so you'll need to use the `--pre` flag to install it.

```bash
88 changes: 74 additions & 14 deletions docs/pages/usage.md
In addition to their core functionality of fitting, predicting, and scoring, `sknnr` estimators offer a number of additional features, described below.

### Regression and Classification

The estimators in `sknnr` are all initialized with an optional parameter `n_neighbors` that determines how many plots a target plot's attributes will be predicted from. When `n_neighbors` > 1, a plot's attributes are calculated as optionally-weighted averages of each of its _k_ nearest neighbors. Predicted values can fall anywhere between the observed plot values, making this "regression mode" suitable for continuous attributes (e.g. basal area). To maintain categorical attributes (e.g. dominant species type), the estimators can be run in "classification mode" with `n_neighbors` = 1, where each attribute is imputed directly from its nearest neighbor. To predict a combination of continuous and categorical attributes, it's possible to use two estimators and concatenate their predictions manually.
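As a minimal sketch of the two modes — using scikit-learn's `KNeighborsRegressor`, which `sknnr`'s estimators build on, with hypothetical attribute values:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_cont = np.array([10.0, 20.0, 30.0, 40.0])  # continuous, e.g. basal area
y_cat = np.array([1.0, 1.0, 2.0, 2.0])       # categorical codes, e.g. species type

# "Regression mode": k > 1 averages the values of the k nearest neighbors
reg = KNeighborsRegressor(n_neighbors=2).fit(X, y_cont)

# "Classification mode": k = 1 imputes directly from the single nearest neighbor
clf = KNeighborsRegressor(n_neighbors=1).fit(X, y_cat)

# Concatenate continuous and categorical predictions manually
X_query = np.array([[1.4]])
combined = np.column_stack([reg.predict(X_query), clf.predict(X_query)])
print(combined)  # [[25.  1.]]
```

The averaged prediction (25.0) falls between observed values, while the k = 1 prediction preserves an observed class code (1.0).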

### Independent Scores and Predictions

When an independent test set is not available, the accuracy of a kNN regressor can be estimated by comparing each sample in the training set to its second-nearest neighbor, i.e. the closest point _excluding itself_. All `sknnr` estimators set `independent_prediction_` and `independent_score_` attributes when they are fit, which store the predictions and scores of this independent evaluation.

```python
print(est.independent_score_)
# 0.10243925752772305
```
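This independent evaluation can be sketched manually with scikit-learn's `KNeighborsRegressor` and toy data (`sknnr` estimators compute it automatically at fit time):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [3.0], [6.0]])
y = np.array([10.0, 20.0, 35.0, 40.0])

est = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# When the training set is queried, each sample's nearest neighbor is itself,
# so column 1 holds the closest point *excluding itself*.
_, ind = est.kneighbors(X)
independent_prediction = y[ind[:, 1]]
print(independent_prediction)  # [20. 10. 20. 35.]
```

Scoring `independent_prediction` against `y` (e.g. with `r2_score`) corresponds to `independent_score_`.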

### Deterministic Neighbor Ordering

`scikit-learn`'s `KNeighborsRegressor` [warns](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) that:

> in case of multiple neighbors being at the same distance, the result will depend on the order of the samples in the training data.

In `sknnr`, the `use_deterministic_ordering` parameter of `kneighbors` allows the user to enforce strict ordering of neighbors with deterministic tie-breaking. When this parameter is `True`, neighbors are sorted using the following logical order:

1. **Scaled and rounded distances**: Neighbors are first sorted by their scaled and rounded distances. Scaling is performed per query point: each neighbor's distance is normalized by dividing by the maximum distance for that query point (or 1.0 if the maximum distance is less than 1.0), then rounded to `RawKNNRegressor.DISTANCE_PRECISION_DECIMALS` decimal places (currently set to 10). Some floating point operations in distance calculation (notably `numpy.dot`) can introduce very small numerical differences across platforms; this rounding effectively absorbs them.
2. **Difference between query point row index and neighbor indexes**: If two or more neighbors have identical rounded distances, they are further sorted by the absolute difference between their row index in the training data and the row index of the query point. This ensures that when a sample is its own nearest neighbor, it will always be selected first.
3. **Neighbor index**: If two or more neighbors are still tied based on the two above criteria, they are finally sorted by their row index in the training data.
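Under the assumption that query rows align with training rows (as when querying the training set itself), the three keys can be sketched with `numpy.lexsort`; note that `lexsort` treats its *last* key as the primary one, so the keys are passed in reverse:

```python
import numpy as np

def deterministic_order(neigh_dist, neigh_ind, decimals=10):
    """Reorder neighbors by (1) rounded scaled distance, (2) index
    difference, (3) neighbor index — a sketch of the ordering above."""
    # (1) Scale each query's distances by its max (or 1.0), then round.
    row_scale = np.maximum(neigh_dist.max(axis=1, keepdims=True), 1.0)
    rounded = np.round(neigh_dist / row_scale, decimals=decimals)
    # (2) Absolute difference between query row index and neighbor index.
    idx_diff = np.abs(neigh_ind - np.arange(len(neigh_ind))[:, None])
    # (3) lexsort breaks remaining ties by the neighbor index itself.
    order = np.lexsort((neigh_ind, idx_diff, rounded), axis=1)
    return (
        np.take_along_axis(neigh_dist, order, axis=1),
        np.take_along_axis(neigh_ind, order, axis=1),
    )

# Query row 0 is equidistant from samples 0 and 2: the smaller index
# difference puts sample 0 (the query itself) first.
_, ind = deterministic_order(np.array([[0.0, 0.0]]), np.array([[2, 0]]))
print(ind)  # [[0 2]]
```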

As an example, consider the following training data with three samples:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([
[1, 2, 3],
[4, 5, 6],
[1, 2, 3]
])
y = np.array([10, 20, 30])
est = KNeighborsRegressor(n_neighbors=2).fit(X, y)

print(est.kneighbors(X, return_distance=False))
# [[0 2]
# [1 0]
# [0 2]] - Not returning itself as first neighbor
```

Using `sknnr`'s `RawKNNRegressor` with deterministic ordering:

```python
from sknnr import RawKNNRegressor
est = RawKNNRegressor(n_neighbors=2).fit(X, y)
print(est.kneighbors(X, return_distance=False, use_deterministic_ordering=True))
# [[0 2]
# [1 0]
# [2 0]] - Returning itself as first neighbor
```

The `use_deterministic_ordering` parameter defaults to `True`, but `scikit-learn`'s default behavior can be restored by setting it to `False` when calling `kneighbors`:

```python
distances, neighbors = est.kneighbors(
X_test,
use_deterministic_ordering=False
)
```

!!! warning

Meaningful precision may be lost when rounding distances, especially with datasets that include samples with very large distances. In these situations, we suggest either increasing `RawKNNRegressor.DISTANCE_PRECISION_DECIMALS` or disabling `use_deterministic_ordering` (at the expense of cross-platform reproducibility).
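A small numpy illustration of the concern, with hypothetical distances: after per-query scaling, an absolute difference of 0.001 between two very large distances falls below the tenth decimal place and is rounded away:

```python
import numpy as np

# Two genuinely different (hypothetical) distances for one query point
d = np.array([[123456789.0, 123456789.001]])

# Per-query scaling and rounding as described above
scale = np.maximum(d.max(axis=1, keepdims=True), 1.0)
rounded = np.round(d / scale, decimals=10)

print(rounded[0, 0] == rounded[0, 1])  # True: the 0.001 difference is lost
```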

### Retrieving Dataframe Indexes

In `sklearn`, the `KNeighborsRegressor.kneighbors` method can identify the array index of the nearest neighbor to a given sample. Estimators in `sknnr` offer an additional parameter, `return_dataframe_index`, that allows neighbor samples to be identified directly by their dataframe index.
```python
distances, neighbor_ids = est.kneighbors(X.iloc[:1], return_dataframe_index=True)
print(y.loc[neighbor_ids[0]])
```

| | ABAM_COV | ABGRC_COV | ABPRSH_COV | ACMA3_COV | ALRH2_COV |
| ----: | -------: | --------: | ---------: | --------: | --------: |
| 52481 | 0 | 0 | 39.3469 | 0 | 0 |
| 60089 | 0 | 0 | 22.1199 | 0 | 0 |
| 56253 | 0 | 0 | 22.8948 | 0 | 0 |

!!! warning

An estimator must be fit with a `DataFrame` in order to use `return_dataframe_index=True`.

!!! tip

In forestry applications, users typically store a unique inventory plot identification number as the index in the dataframe.

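For example, a hypothetical plot table keyed by a unique plot identification number can be built with `pandas`:

```python
import pandas as pd

# Hypothetical inventory plots keyed by a unique plot ID
plots = pd.DataFrame(
    {
        "PLOT_ID": [52481, 60089, 56253],
        "ABPRSH_COV": [39.3469, 22.1199, 22.8948],
    }
).set_index("PLOT_ID")

print(plots.index.tolist())  # [52481, 60089, 56253]
```

Fitting an estimator with a dataframe indexed this way lets `kneighbors(..., return_dataframe_index=True)` report neighbors by plot ID.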
### Y-Fit Data
```python
est = GNNRegressor(n_components=3).fit(X, y)
```

!!! warning

The maximum number of components depends on the input data and the estimator. Specifying `n_components` greater than the maximum number of components will raise an error.
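As an illustration with a stand-in transformer (scikit-learn's `PCA`, not one of `sknnr`'s own transformers), requesting more components than the data supports raises a `ValueError`:

```python
import numpy as np
from sklearn.decomposition import PCA

# 4 samples x 3 features: at most min(n_samples, n_features) = 3 components
X = np.random.default_rng(0).normal(size=(4, 3))

raised = False
try:
    PCA(n_components=5).fit(X)
except ValueError as err:
    raised = True
    print(f"Too many components: {err}")
```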

### RFNN Distance Metric
```python
X_df, y_df = load_swo_ecoplot(return_X_y=True, as_frame=True)
print(X_df.head())
```

| | ANNPRE | ANNTMP | AUGMAXT | CONTPRE | CVPRE | DECMINT | DIFTMP | SMRTMP | SMRTP | ASPTR | DEM | PRR | SLPPCT | TPI450 | TC1 | TC2 | TC3 | NBR |
| ----: | ------: | ------: | ------: | ------: | ------: | -------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | -------: | ------: | ------: | -------: | ------: |
| 52481 | 740 | 514.667 | 2315 | 517.667 | 8971.67 | -583.111 | 2899.11 | 1136.11 | 212.222 | 197.667 | 1870.11 | 13196.7 | 48.3333 | 33.7778 | 218.778 | 68.5556 | -86.2222 | 343.556 |
| 52482 | 742 | 563.556 | 2354.33 | 502 | 9124.33 | -543.556 | 2898.89 | 1179.44 | 221.111 | 190.222 | 1713.11 | 16355.8 | 5.4444 | 6.4444 | 210.222 | 60.3333 | -96.6667 | 261.667 |
| 52484 | 738.556 | 639.111 | 2468.89 | 545.889 | 8897.22 | -479.111 | 2949 | 1266.22 | 236 | 194.556 | 1612.11 | 15132.6 | 15.5556 | -1.2222 | 157 | 110.222 | -17.4444 | 721 |
| 52485 | 730.333 | 622.667 | 2405.33 | 555 | 8829.78 | -481.222 | 2887.56 | 1244.22 | 234 | 196.444 | 1682.33 | 15146.7 | 19.8889 | -16.8889 | 152.556 | 86.1111 | -31.6667 | 597.111 |
| 52494 | 720 | 778.556 | 2678.11 | 658.556 | 8638 | -386.667 | 3065.78 | 1396 | 262 | 191.778 | 1345.67 | 16672.1 | 2 | 0.4444 | 214.667 | 58.5556 | -88.1111 | 294.222 |

!!! note

`pandas` must be installed to use `as_frame=True`.
108 changes: 106 additions & 2 deletions src/sknnr/_base.py
class RawKNNRegressor(

Attributes
----------
DISTANCE_PRECISION_DECIMALS : int, class attribute
Number of decimal places used when rounding scaled distances to ensure
deterministic neighbor ordering. Default is 10.
effective_metric_ : str
The distance metric to use. It will be same as the metric parameter
or a synonym of it, e.g. 'euclidean' if the metric parameter set to
Number of samples in the fitted data.
"""

DISTANCE_PRECISION_DECIMALS = 10

def fit(self, X, y):
"""Override fit to set attributes using mixins."""
self._set_dataframe_index_in(X)
def kneighbors(
n_neighbors=None,
return_distance=True,
return_dataframe_index=False,
use_deterministic_ordering=True,
):
"""Override kneighbors to optionally return dataframe indexes."""
"""
Find the K-neighbors of a point or points in the dataset and optionally
return dataframe indexes rather than array indices when the model was
fitted with a dataframe.

Parameters
----------
X : array-like of shape (n_queries, n_features), default=None
The query point or points. If not provided, neighbors of each
indexed point are returned. In this case, the query point is not
considered its own neighbor.
n_neighbors : int, default=None
Number of neighbors required for each sample. The default is the
value passed to the constructor.
return_distance : bool, default=True
Whether or not to return the distances.
return_dataframe_index : bool, default=False
Whether or not to return dataframe indexes instead of array indices.
Only applicable if the model was fitted with a dataframe.
use_deterministic_ordering : bool, default=True
Whether to use deterministic ordering of neighbors when distances
are nearly identical. If True, neighbors with nearly identical
distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are
ordered lexicographically by:
(1) their scaled and rounded distances,
(2) the absolute difference between a query point's row index
and the neighbor index (so that a query sample present in the
training data is returned before other equally distant samples), and
(3) the neighbor index itself.
If False, use the default ordering from
`KNeighborsRegressor.kneighbors`. See the
[usage guide](../../../usage/#deterministic-neighbor-ordering)
for more details.

Returns
-------
neigh_dist : array-like of shape (n_queries, n_neighbors)
Array representing the lengths to points, only present if
return_distance=True.
neigh_ind : array-like of shape (n_queries, n_neighbors)
Array indices or dataframe indexes of the nearest points in the
population matrix.
"""
neigh_dist, neigh_ind = super().kneighbors(
X=X, n_neighbors=n_neighbors, return_distance=True
)

if use_deterministic_ordering:
row_scale = np.maximum(neigh_dist.max(axis=1, keepdims=True), 1.0)
rounded = np.round(
neigh_dist / row_scale, decimals=self.DISTANCE_PRECISION_DECIMALS
)
neigh_ind_diff = np.abs(neigh_ind - np.arange(len(neigh_ind))[:, None])
sorted_indices = np.lexsort((neigh_ind, neigh_ind_diff, rounded), axis=1)

neigh_dist = np.take_along_axis(neigh_dist, sorted_indices, axis=1)
neigh_ind = np.take_along_axis(neigh_ind, sorted_indices, axis=1)

if return_dataframe_index:
msg = "Dataframe indexes can only be returned when fitted with a dataframe."
check_is_fitted(self, "dataframe_index_in_", msg=msg)
def kneighbors(
n_neighbors=None,
return_distance=True,
return_dataframe_index=False,
use_deterministic_ordering=True,
):
"""Return neighbor indices and distances using transformed feature data."""
"""
Find the K-neighbors of a point or points of transformed feature data
and optionally return dataframe indexes rather than array indices when
the model was fitted with a dataframe.

Parameters
----------
X : array-like of shape (n_queries, n_features), default=None
The query point or points. Points are first transformed using the
fitted transformer. If not provided, neighbors of each indexed
point are returned. In this case, the query point is not
considered its own neighbor.
n_neighbors : int, default=None
Number of neighbors required for each sample. The default is the
value passed to the constructor.
return_distance : bool, default=True
Whether or not to return the distances.
return_dataframe_index : bool, default=False
Whether or not to return dataframe indexes instead of array indices.
Only applicable if the model was fitted with a dataframe.
use_deterministic_ordering : bool, default=True
Whether to use deterministic ordering of neighbors when distances
are nearly identical. If True, neighbors with nearly identical
distances (up to DISTANCE_PRECISION_DECIMALS decimal places) are
ordered lexicographically by:
(1) their scaled and rounded distances,
(2) the absolute difference between a query point's row index
and the neighbor index (so that a query sample present in the
training data is returned before other equally distant samples), and
(3) the neighbor index itself.
If False, use the default ordering from
`KNeighborsRegressor.kneighbors`. See the
[usage guide](../../../usage/#deterministic-neighbor-ordering)
for more details.

Returns
-------
neigh_dist : array-like of shape (n_queries, n_neighbors)
Array representing the lengths to points, only present if
return_distance=True.
neigh_ind : array-like of shape (n_queries, n_neighbors)
Array indices or dataframe indexes of the nearest points in the
population matrix.
"""
X_transformed = self._transform_X(X)
return self.regressor_.kneighbors(
X=X_transformed,
n_neighbors=n_neighbors,
return_distance=return_distance,
return_dataframe_index=return_dataframe_index,
use_deterministic_ordering=use_deterministic_ordering,
)

def predict(self, X):
75 changes: 75 additions & 0 deletions tests/test_estimators.py
def test_n_features_in(estimator, X_y_yfit):
assert est.n_features_in_ == len(transformed_features)


@pytest.mark.parametrize(
("use_deterministic_ordering", "expected_idx_order"),
[(False, [1, 0]), (True, [0, 1])],
)
def test_kneighbors_deterministic_ordering(
use_deterministic_ordering, expected_idx_order
):
"""
Test that the use_deterministic_ordering parameter affects the order
of neighbors when distances are nearly identical.
"""
X = np.array([1e-11, 1e-12, 1.0]).reshape(-1, 1)
y = np.array([0, 1, 2])

X_query = np.array([[0.0]])

_, idx = (
RawKNNRegressor(n_neighbors=2)
.fit(X, y)
.kneighbors(X_query, use_deterministic_ordering=use_deterministic_ordering)
)
assert_array_equal(idx[0], expected_idx_order)


def test_kneighbors_uses_index_difference():
"""
Test that when distances are considered to be identical, the absolute index
difference is used before indexes to order neighbors.
"""
X = np.array([1e-11, 1e-12, 1.0]).reshape(-1, 1)
y = np.array([0, 1, 2])

# Use two identical query points which should have different
# neighbors due to their row indexes
X_query = np.array([[0.0], [0.0]])

_, idx = (
RawKNNRegressor(n_neighbors=2)
.fit(X, y)
.kneighbors(X_query, use_deterministic_ordering=True)
)
assert_array_equal(idx[0], [0, 1])
assert_array_equal(idx[1], [1, 0])


@pytest.mark.parametrize(
("precision_decimals", "expected_idx_order"),
[(8, [2, 1, 0]), (5, [1, 2, 0]), (2, [0, 1, 2])],
)
def test_kneighbors_precision_decimals(
monkeypatch, precision_decimals, expected_idx_order
):
"""
Test that changing DISTANCE_PRECISION_DECIMALS affects the order
of neighbors on small precision differences.
"""
monkeypatch.setattr(
RawKNNRegressor, "DISTANCE_PRECISION_DECIMALS", precision_decimals
)

# Create features that differ by small amounts such that
# precision_decimals falls between them
X = np.array([1e-3, 1e-6, 1e-9, 1.0]).reshape(-1, 1)
y = np.array([0, 1, 2, 3])

X_query = np.array([[0.0]])

_, idx = (
RawKNNRegressor(n_neighbors=3)
.fit(X, y)
.kneighbors(X_query, use_deterministic_ordering=True)
)
assert_array_equal(idx[0], expected_idx_order)


@pytest.mark.parametrize("forest_weights", ["uniform", [0.5, 1.5], (1.0, 2.0)])
def test_rfnn_handles_forest_weights(forest_weights):
"""Test that RFNNRegressor handles forest weights correctly."""
Binary file modified tests/test_regressions/test_mixed_rfnn_forests_reference_.npz
Binary file modified tests/test_regressions/test_mixed_rfnn_forests_target_.npz