Conversation

@jonatanklosko
Member

Closes #324.

For the reproduction from #324 (comment):

| before | after |
| --- | --- |
| 0.140113 s | 0.148038 s |
| 0.08597 s | 0.070765 s |
| 0.663867 s | 0.083935 s |
| 16.370221 s | 0.084186 s |
| 109.480284 s | 0.087279 s |
| 381.106067 s | 0.1175 s |
| | 0.147327 s |
| | 0.174654 s |

# During cluster expansion, we store points to be visited on
# the stack. Each point can be on the stack at most once, so
# the number of points is an upper bound on the stack size.
stack = Nx.broadcast(-1, {Nx.axis_size(indices, 0)})
Member Author

@jonatanklosko jonatanklosko Dec 18, 2025


The max stack size was heavily overestimated. The stack is effectively a set of points and should never exceed N points. We just need to make sure to never add duplicates into the stack, which the other line change does.
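The invariant described here (each point enters the stack at most once, so N bounds the stack size) can be sketched in plain Python. This is an illustrative model only, not the Nx implementation; `expand_cluster` and the neighbor-list representation are made up for the sketch:

```python
def expand_cluster(neighbors, seed):
    """Visit every point reachable from `seed` via the neighbor lists.

    `neighbors` maps each point index to a list of neighbor indices.
    Returns the visited points and the peak stack size observed.
    """
    n = len(neighbors)
    on_stack_or_visited = [False] * n  # guard: never push a point twice
    stack = [seed]
    on_stack_or_visited[seed] = True
    visited = []
    peak = 1
    while stack:
        peak = max(peak, len(stack))
        point = stack.pop()
        visited.append(point)
        for nb in neighbors[point]:
            if not on_stack_or_visited[nb]:
                on_stack_or_visited[nb] = True
                stack.append(nb)
    return visited, peak

# A fully connected 5-point neighborhood: without the duplicate guard
# the stack could grow combinatorially; with it, the peak stays <= 5.
neighbors = {i: [j for j in range(5) if j != i] for i in range(5)}
visited, peak = expand_cluster(neighbors, 0)
assert sorted(visited) == [0, 1, 2, 3, 4]
assert peak <= 5
```

Without the guard, the same dense neighborhood would push each point once per neighbor, which is what made the previous stack-size estimate blow up.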

Member Author

@jonatanklosko jonatanklosko Dec 18, 2025


My first attempt was to remove the innermost while loop (putting all relevant indices onto the stack at once), and I was able to do so with a trick. But I started to think that it might break in an edge case if the stack got close to full. Then, analyzing the stack size, I realised it should be way smaller, and that was the actual fix :p

@krstopro krstopro self-requested a review December 18, 2025 20:31
Member

@krstopro krstopro left a comment


Looks good to me.

One thing we might want is more unit tests for these algorithms. For example, ten points at one location (duplicates) and ten points at another one far away. I could do this as I expect to have some time at the end of the year.
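The suggested test could look roughly like this, written here against a toy brute-force pure-Python DBSCAN as a stand-in for the library under test (the `dbscan` function and its parameters are illustrative, not Scholar's API):

```python
import math

def dbscan(points, eps, min_samples):
    """Naive O(n^2) DBSCAN. Returns one label per point; -1 means noise."""
    n = len(points)

    def region(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_samples:
            labels[i] = -1  # noise (may be relabeled as a border point later)
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point; do not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbs = region(j)
            if len(nbs) >= min_samples:
                seeds.extend(nbs)
    return labels

# Ten duplicate points at one location and ten at another, far away:
points = [(0.0, 0.0)] * 10 + [(100.0, 100.0)] * 10
labels = dbscan(points, eps=0.5, min_samples=5)
assert labels[:10] == [0] * 10
assert labels[10:] == [1] * 10
```

Duplicates are a good stress test here precisely because every point in the group is in every other point's neighborhood, which is the dense case the stack-size fix targets.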

Contributor

@josevalim josevalim left a comment


Awesome job!

@jonatanklosko
Member Author

I came up with a rewrite that parallelizes much better. I am merging this for reference, but will submit another PR soon :)

@jonatanklosko
Member Author

> One thing we might want is more unit tests for these algorithms. For example, ten points at one location (duplicates) and ten points at another one far away. I could do this as I expect to have some time at the end of the year.

Definitely, PRs for that would be great!

@jonatanklosko jonatanklosko merged commit 0bce6e4 into main Dec 19, 2025
2 checks passed
@jonatanklosko jonatanklosko deleted the jk-dbscan branch December 19, 2025 17:31


Development

Successfully merging this pull request may close these issues.

Slow DBSCAN runtimes
