Improve DBSCAN performance #325
Conversation
    # During cluster expansion, we store points to be visited on
    # the stack. Each point can be on the stack at most once, so
    # the number of points is an upper bound on the stack size.
    stack = Nx.broadcast(-1, {Nx.axis_size(indices, 0)})
The max stack size was heavily overestimated. The stack is effectively a set of points and should never exceed N points. We just need to make sure we never add duplicates to the stack, which the other changed line ensures.
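For intuition, here is a minimal plain-Elixir sketch of why the expansion stack stays bounded by N: a point is pushed only the first time it is seen, so each of the N points can occupy a slot at most once. This is not the Nx `defn` code from this PR; `neighbors_fun`, `StackBoundSketch`, and the sample data are made up for illustration.

```elixir
defmodule StackBoundSketch do
  # Illustrative only: expand a cluster from a seed point, pushing each
  # neighbor onto the stack at most once. Because a point is pushed only
  # when first seen, the stack can never hold more than n distinct points.
  def expand(seed, neighbors_fun, n) do
    do_expand([seed], MapSet.new([seed]), neighbors_fun, n, [])
  end

  defp do_expand([], _seen, _neighbors_fun, _n, acc), do: Enum.reverse(acc)

  defp do_expand([point | stack], seen, neighbors_fun, n, acc) do
    # Only keep neighbors that have never been seen (and thus never pushed).
    new_points =
      point
      |> neighbors_fun.()
      |> Enum.reject(&MapSet.member?(seen, &1))

    seen = Enum.reduce(new_points, seen, &MapSet.put(&2, &1))
    stack = new_points ++ stack

    # Invariant: the stack never exceeds n, since every point enters at most once.
    true = length(stack) <= n

    do_expand(stack, seen, neighbors_fun, n, [point | acc])
  end
end

# Usage: points 0..4 in a chain; a point's neighbors are the adjacent indices.
neighbors = fn i -> Enum.filter(0..4, fn j -> abs(j - i) == 1 end) end
StackBoundSketch.expand(0, neighbors, 5)
#=> [0, 1, 2, 3, 4]
```

In the Nx `defn` version touched by this PR, the same guarantee is what allows preallocating the stack tensor with only `Nx.axis_size(indices, 0)` slots, as in the diff above.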
My first attempt was to remove the innermost while loop (putting all relevant indices onto the stack at once), and I managed to do that with a trick. But I started to think it might break in an edge case where the stack gets close to full. Then, analyzing the stack size, I realised it should be way smaller, and that was the actual fix :p
Looks good to me.
One thing we might want is more unit tests for these algorithms. For example, ten points at one location (duplicates) and ten points at another one far away. I could do this as I expect to have some time at the end of the year.
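A rough sketch of what such a test could look like, assuming the `Scholar.Cluster.DBSCAN.fit/2` API with `:eps` and `:min_samples` options and a `labels` field on the returned struct (treat those names and values as assumptions, not the final test):

```elixir
defmodule Scholar.Cluster.DBSCANDuplicatesTest do
  use ExUnit.Case, async: true

  test "two tight groups of duplicate points form two separate clusters" do
    # Ten copies of a point near the origin and ten copies of a point far away.
    group_a = Nx.broadcast(Nx.tensor([0.0, 0.0]), {10, 2})
    group_b = Nx.broadcast(Nx.tensor([100.0, 100.0]), {10, 2})
    x = Nx.concatenate([group_a, group_b], axis: 0)

    model = Scholar.Cluster.DBSCAN.fit(x, eps: 1.0, min_samples: 5)

    labels = Nx.to_flat_list(model.labels)
    {labels_a, labels_b} = Enum.split(labels, 10)

    # Each group should share a single label, and the two labels should differ.
    assert length(Enum.uniq(labels_a)) == 1
    assert length(Enum.uniq(labels_b)) == 1
    assert Enum.uniq(labels_a) != Enum.uniq(labels_b)
  end
end
```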
josevalim left a comment:
Awesome job!
I came up with a rewrite that parallelizes much better. I am merging this for reference, but will submit another PR soon :)
Definitely, PRs for that would be great!
Closes #324.
For the reproduction from #324 (comment):