Add NearestNeighbors SPMD API #2557
Conversation
Due to time constraints I have tried to push most of my changes into follow-on PRs. There are some nagging issues we should talk through, but I see no reason to hold off getting this in.
# Run each estimator without an input to kneighbors() and ensure functionality and equivalence
for CurrentEstimator in [KNeighborsClassifier, KNeighborsRegressor, NearestNeighbors]:
I see why this was done, but it is a bit painful to analyze if there is a failure. Ideally it would be parametrized over, but that really isn't possible given the way the classes are imported. It would be worth adding some sort of message identifying which class CurrentEstimator is (rather than having to dig through the pytest log to find what the current value of CurrentEstimator was).
Yeah - I am pretty open to ideas on this one. The loop is great because it runs the exact same test on all 3 classes, but you are correct that analyzing a failure is trickier. I think scikit-learn may do things like this; I could check how they handle it.
I guess it's easier there because in sklearn they import at the top of the file.
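A minimal sketch of what such a message could look like (standalone, using stock sklearn classes and made-up data rather than the PR's actual spmd test): the loop stays, but the estimator name is attached to the assertion so a failure immediately says which class broke.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor, NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = rng.integers(0, 2, size=100)

for CurrentEstimator in [KNeighborsClassifier, KNeighborsRegressor, NearestNeighbors]:
    name = CurrentEstimator.__name__
    # NearestNeighbors simply ignores y, so the same fit call works for all three
    model = CurrentEstimator(n_neighbors=3).fit(X, y)
    dists, indcs = model.kneighbors(X)
    assert dists.shape == (X.shape[0], 3), f"unexpected kneighbors output for {name}"
```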
spmd_dists, spmd_indcs = spmd_model.kneighbors(local_dpt_X_train)
batch_dists, batch_indcs = batch_model.kneighbors(X_train)

tol = 0.005 if dtype == np.float32 else 1e-6
Yikes on this float32 setting. Any info on it? Especially since there is a skip associated with it above (does that mean an even worse value occurs?).
It's true, and a good observation. It's pretty tricky because this assert_allclose check fails if even a single element is outside the threshold, which is why the tolerance is so loose; it would be nice if there were some way to customize that.
It's possible that we could still run the indices check for this case, but the distances are more fragile.
This is not the only place in the spmd test scope where drastically loose thresholds are needed for float32 tests to pass, though.
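One possible shape for that customization (a hypothetical helper, not something in the PR or in numpy): tighten the per-element tolerance but tolerate a small fraction of outliers, and keep comparing the indices exactly.

```python
import numpy as np

def assert_mostly_close(actual, desired, rtol=1e-4, atol=1e-6, max_mismatch_frac=0.01):
    """Assert that all but a small fraction of elements are within tolerance."""
    mismatched = ~np.isclose(np.asarray(actual), np.asarray(desired), rtol=rtol, atol=atol)
    frac = mismatched.mean()
    assert frac <= max_mismatch_frac, (
        f"{frac:.2%} of elements outside rtol={rtol}, atol={atol} "
        f"(allowed {max_mismatch_frac:.2%})"
    )

# e.g. with the arrays from the test above:
# assert_mostly_close(spmd_dists, batch_dists, rtol=1e-4)
# np.testing.assert_array_equal(spmd_indcs, batch_indcs)
```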
Description
Adds an SPMD NearestNeighbors API, with the necessary modifications to sklearnex/spmd and onedal/spmd, along with minor revisions in onedal4py and oneDAL itself (uxlfoundation/oneDAL#3262, which is a prerequisite for merging this). A test is also added for validation.
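A hedged usage sketch of the new estimator (assuming it is exposed as sklearnex.spmd.neighbors.NearestNeighbors, mirroring the existing KNeighbors* SPMD estimators; the device selection and data sizes are illustrative). It would be launched under MPI, e.g. mpirun -n 2 python example.py:

```python
import numpy as np
import dpctl
import dpctl.tensor as dpt
from mpi4py import MPI
from sklearnex.spmd.neighbors import NearestNeighbors

rank = MPI.COMM_WORLD.Get_rank()

# Each rank holds only its local slice of the training data on a SYCL device
queue = dpctl.SyclQueue("gpu")
rng = np.random.default_rng(rank)
local_X = rng.random((1000, 8), dtype=np.float32)
local_dpt_X = dpt.asarray(local_X, usm_type="device", sycl_queue=queue)

model = NearestNeighbors(n_neighbors=5).fit(local_dpt_X)
dists, indcs = model.kneighbors(local_dpt_X)
```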
Full list of changes:
PR completeness and readability
Testing
Performance