Scale NN distances and sort by distance and index #105
Conversation
Due to floating-point differences between operating systems when using `np.dot`, very close neighbor distances can swap position, leading to differences in neighbor indexes and, thus, predictions. This commit first scales and discretizes distances such that very similar or exact distances will share the same value, in which case they are subsequently sorted by neighbor index.
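The scale-discretize-sort approach described above can be sketched roughly as follows. This is a minimal illustration assuming sklearn-style `(distances, indices)` arrays and the 1e-10 epsilon mentioned later in the review; `stable_order` is a hypothetical helper, not the library's actual implementation:

```python
import numpy as np

EPS = 1e-10  # tolerance discussed in the PR; assumed here for illustration

def stable_order(distances, indices, eps=EPS):
    """Reorder each row of neighbors so near-tied distances fall back to index order."""
    # Discretize: distances within ~eps of each other collapse to the same
    # integer, so platform-level floating-point noise can no longer flip them.
    scaled = np.round(distances / eps).astype(np.int64)
    # Per-row lexicographic sort: the last key passed to lexsort is primary,
    # so sort by scaled distance first, then by neighbor index.
    order = np.lexsort((indices, scaled), axis=-1)
    rows = np.arange(distances.shape[0])[:, None]
    return distances[rows, order], indices[rows, order]

# Two distances that differ only past the 10th decimal place sort by index:
d = np.array([[0.30, 0.10 + 1e-13, 0.10]])
i = np.array([[7, 5, 2]])
print(stable_order(d, i)[1])  # [[2 5 7]]
```

With exact ties (or ties within epsilon), the smaller neighbor index wins, which is the deterministic behavior the commit is after.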
Pull request overview
This PR addresses floating-point precision issues in nearest neighbor distance calculations by implementing a deterministic sorting approach. When distances are very similar due to platform-specific floating-point differences in np.dot, the new implementation scales and discretizes distances, then sorts first by scaled distance and then by neighbor index to ensure consistent ordering across platforms.
Changes:
- Modified `RawKNNRegressor.kneighbors` to scale distances by epsilon (1e-10) and use lexicographic sorting by distance then index
- Updated four RFNN regression test files to reflect the new deterministic neighbor ordering
Reviewed changes
Copilot reviewed 1 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/sknnr/_base.py` | Added distance scaling and lexicographic sorting to ensure deterministic neighbor ordering when distances are very close |
| `tests/test_regressions/test_mixed_rfnn_forests_target_.npz` | Updated expected test outputs for RFNN with new sorting behavior |
| `tests/test_regressions/test_mixed_rfnn_forests_reference_.npz` | Updated expected test outputs for RFNN reference data with new sorting behavior |
| `tests/test_regressions/test_kneighbors_reference_full_randomForest_k5_index_.npz` | Updated expected test outputs for kneighbors with new sorting behavior |
| `tests/test_regressions/test_kneighbors_reference_full_randomForest_k5_ids_.npz` | Updated expected test outputs for kneighbors with dataframe ids and new sorting behavior |
@grovduck, I am slightly hesitant about the fix, but I'm more than willing to defer to your experience here if you think this is the right solution. I think most of my reservations are because this seems like we're changing behavior of the library code for the sake of the test code, but maybe that's off-base and it would be valuable more generally to improve reproducibility across platforms? In any case, I'm also not sure what harm this could cause, aside from a negligible loss in precision in neighbor comparisons (presumably the difference between two neighbors past 10 decimal places is effectively meaningless).
If we had some kind of configuration API, that might be the place for it; alternatively, you could consider adding a parameter to enable/disable it, but I'm not sure whether that's worth it either.
It looks nicely vectorized to me. I was curious about the performance impact and ran some quick tests with 4000 samples, 10 features, and 10 neighbors, and the added code for scaling, rounding, sorting, and re-indexing only increased the
Yeah, I agree that being consistent across estimators makes sense.
I don't think I'm totally following... Does this mean that you might somehow compare a feature against itself during cross-validation, whereas you'd otherwise get an independent feature? Or is it the difference between which of two identical features you'd identify as the first neighbor?
@aazuspan, I can completely understand your hesitancy here and I appreciate you making me elaborate on why I think it's needed. Definitely the impetus for introducing the fix was failing tests, and I concede that point. But I think the bigger issue is that we need consistency across platforms for neighbor order, which impacts predictions and scores. My thought is that the neighbor assignment needs to be deterministic across all platforms/users, and the tests exposed that this wasn't the case. We're obviously seeing this behavior of tied distances much more in the estimators that use Hamming distance because it's simply a proportion of matched nodes between target and reference (as we've discussed before). If we use equally weighted trees (the default in RFNN and an option for GBNN), the probability of tied distances is even higher.
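To make the tie-proneness concrete, here is a rough sketch of why a Hamming-style distance over tree-node matches produces exact ties so readily: with equally weighted trees the distance can only take `n_trees + 1` distinct values. The membership vectors and tree count below are invented for illustration; this is not the library's actual code:

```python
import numpy as np

# Node-match indicator vectors: entry t is 1.0 if the target and reference
# land in the same node of tree t. Values here are made up for illustration.
target_vs_a = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
target_vs_b = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
ones = np.ones(5)
n_trees = 5  # hypothetical count of equally weighted trees

# Hamming-style distance: proportion of trees whose nodes do not match.
dist_a = 1.0 - np.dot(target_vs_a, ones) / n_trees
dist_b = 1.0 - np.dot(target_vs_b, ones) / n_trees
print(dist_a, dist_b)  # 0.6 0.6 -- an exact tie; only 6 values are possible
```

Two different references that match the target in the same *number* of trees, even if different trees, are exactly tied, and platform-dependent rounding in `np.dot` then decides which sorts first.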
Looking at this example with two identical features:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [1, 2, 3]
])
y = np.array([10, 20, 30])
est = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# Independent neighbors
print(est.kneighbors()[1])
# [[2 1]
#  [0 2]
#  [0 1]]

# Dependent neighbors - not returning itself as first neighbor
print(est.kneighbors(X)[1])
# [[0 2]
#  [1 0]
#  [0 2]]
```

I feel what I'm proposing here is relaxing the criterion on what makes a tied distance to correct for numerical imprecision. I think my main argument is that, given the exact same inputs and parameters, estimators should be deterministic. Perhaps what you are thinking is that the platform, package versions, etc. are part of those inputs/parameters and that they are deterministic in that sense. What I'm proposing in this fix is a "broadening" of that determinism to account for some of these differences (and, of course, to make tests pass!) But please push back if you don't think this is the right way to go about it. I just haven't come up with other options.
Yes, I think this is what I'd argue as well. Distances past x decimal places are meaningless, so we may as well set neighbors in a way that will be reproducible.
Your second thought here is intriguing. That would allow tests to pass and allay my concerns about reproducibility (and I think I'd make the proposed fix the default), but would still allow a user to use the true calculated distances. We'd have to surface the parameter on
Thanks for going to the trouble of doing that testing. Good to know that the effect is negligible.
Sorry, that explanation wasn't totally clear, and using cross-validation as a term in my explanation didn't help. But it's the latter clause that could cause issues. I'll try to give a clear example that isn't necessarily related to cross-validation. In our GNN modeling, we capture dependent neighbors/predictions as well as independent neighbors/predictions. Dependent predictions are when a plot can use itself as a neighbor. As one of our tests to identify plots for screening, we want to know how "deep" in the neighbor stack a plot had to go to self-impute¹. In our experience, it's very rare to have samples that share all feature values, so this wouldn't come up often. There are a couple of other corner cases that come up, but I'm recognizing that my explanation is pretty rambling at this point and it may be easier to talk it through. Footnotes
Thanks for the details on the motivation and how you see this fitting into
That was my thought. I imagine the parameter won't have any effect in most cases, but as a user I think I would feel more comfortable knowing it was there just in case there was something special about my data that caused issues (the only thing I can think of would be a dataset where most of the distances are very small with a few very large distances that lead to a huge scaler and a loss of meaningful precision at the low end, but I'm not sure that would ever arise in reality). Having to pass the parameter through
Thanks, I think I'm mostly understanding now. Just as a thought, what if we broke distance ties based on the absolute "distance" between their indices, so that if plot ID 3000 had identical features to 1000 and 4000, the nearest neighbors would be [3000, 4000, 1000]? I had to tweak the example there since there's a chance to tie with the distance between indices as well, so I suppose we'd need a third tie-breaker with the index itself. That may be too targeted of a fix for such a specific issue and moves further from
When ordering neighbors, distances are used as the first screen. If distances are considered to be equal and `use_deterministic_ordering` is `True`, use the absolute neighbor index difference as the second-level ordering. This ensures that when `kneighbors` is called with the `X` used to fit the estimator (i.e. non-independent cross-validation), the first neighbor returned for each query point will be itself.
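The second-level ordering described in that commit message can be sketched per query. `tie_break` below is a hypothetical helper for illustration, not the actual `_base.py` code; it assumes the neighbors passed in are already known to be distance-tied:

```python
import numpy as np

def tie_break(query_idx, tied_indices):
    """Order a group of distance-tied neighbors by absolute index difference
    from the query, with the raw index itself as the final tie-breaker."""
    tied = np.asarray(tied_indices)
    # lexsort's last key is primary: sort by |index - query|, then by index.
    order = np.lexsort((tied, np.abs(tied - query_idx)))
    return tied[order].tolist()

# Query plot 3000 ties with itself, 1000, and 4000: itself comes first
# (difference 0), then 4000 (difference 1000), then 1000 (difference 2000).
print(tie_break(3000, [1000, 3000, 4000]))  # [3000, 4000, 1000]
```

Because a query point's difference from its own index is always 0, this guarantees that when `kneighbors` is called with the fit `X`, each sample's first neighbor is itself.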
@aazuspan, I've taken another hack at this to address the three remaining issues:
Thanks for your continued patience.
aazuspan
left a comment
@grovduck, this looks great! I agree with all your decisions and naming choices, and the new tests are very thorough.
The only change I'd propose is that it might be worth adding a new subheading to the Usage guide to explain how the deterministic ordering works and where it might differ from the default sklearn behavior, since that seems like a notable new feature. The parameter docstring could then repeat some of that information and/or link to the usage guide for more info.
This is a great idea. I've now added a section to the user's guide and linked from the docstrings.
Pull request overview
Copilot reviewed 3 out of 7 changed files in this pull request and generated 6 comments.
@aazuspan, thanks for the final review. I think I've addressed your helpful comments and will merge when you've had a chance to quickly review.
aazuspan
left a comment
LGTM! This was a sticky issue, so it's great to have it solved :)
Due to floating-point differences between operating systems when using `np.dot`, very close neighbor distances can swap position, leading to differences in neighbor indexes and, thus, predictions. This PR first scales and discretizes distances such that very similar or exact distances will share the same value, in which case they are subsequently sorted by neighbor index.

This change affected the neighbor sorting of four RFNN tests such that, in the case of very close or tied distances, the neighbor with the smaller index was chosen as the nearer neighbor. The fix is the result of using `np.dot` to calculate Hamming distances. No other estimators' regression tests were affected by this change.

@aazuspan, let me know what you think of this fix. I went back and forth on surfacing the `EPS` value as a parameter. For now, it is not configurable, but I don't have a strong feeling as to whether a user would want this flexibility. Also, if you have a sense whether this is an inefficient calculation, please let me know. I didn't notice that tests slowed down much, but there may be some optimization that I'm not aware of.

Lastly, even though I think this should only impact estimators that use Hamming distance, I decided to implement it in `RawKNNRegressor.kneighbors`, thinking that this defined sorting behavior would be nice for all estimators. The one place I can see where this would trip users up is in cross-validation if one or more samples have identical feature values. The sample with the lower index will still be its own nearest neighbor, whereas the sample with the higher index would now be at best the second nearest neighbor. In our experience, co-located plots can sometimes exhibit this behavior.

PS: Even though we were surprised not to see this bug when working on #101, it did come back to bite us in current work on #99. I'd like to get this fix in place before going back to that PR.