-
In this gist, I compare a JAX-based top-k against a NumPy-based one. Note that my implementation of top-k relies on np.argpartition, which uses the introselect algorithm. Here are the timing results on my local machine:
Notice that the NumPy implementation of top-k (based on argpartition with kth= …). So given that np.argpartition is written in C++ and uses a fairly efficient two-pass algorithm in the best case, what makes JAX so much faster than NumPy?

EDIT: Based on Jake's reply, I've added a publicly reproducible notebook version of the gist above: https://www.kaggle.com/xhlulu/numpy-argpartition-vs-jax-lax-top-k When running on a …

So it is interesting to see that the gap is not as "clear cut" as above, but there is still a significant difference (38.5 vs 48.7).
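For reference, the NumPy-based approach described above can be sketched roughly as follows (`numpy_topk` is my own naming for this sketch, not the exact code from the gist; the JAX counterpart is `jax.lax.top_k(x, k)`):

```python
import numpy as np

def numpy_topk(x, k):
    # np.argpartition uses introselect: average O(n) to move the indices
    # of the k largest values into the last k slots (unordered).
    idx = np.argpartition(x, -k)[-k:]
    # Order those k hits by value, descending: O(k log k).
    idx = idx[np.argsort(-x[idx])]
    return x[idx], idx

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)
vals, idx = numpy_topk(x, 5)
# vals matches the top 5 of a full descending sort
assert np.array_equal(vals, np.sort(x)[::-1][:5])
```

So the total cost is O(n + k log k), versus O(n log n) for a full sort.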
Replies: 2 comments 5 replies
-
Side note: I've looked at the JAX source code, but all I was able to find is that jax.lax binds a …
-
Can you say more about the environment where you're running this? For example, if JAX is using a GPU and NumPy is using a CPU, that could easily account for the kind of performance difference you're seeing. Also, for JAX benchmarks, keep in mind the tips in the JAX FAQ entry "Benchmarking JAX code". In particular, your benchmarks don't account for JIT compilation or asynchronous dispatch, so on a backend that supports them you may just be measuring compile and dispatch time rather than actual runtime.
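As a sketch of the benchmarking pattern those tips suggest (the NumPy side is timed for real here; the JAX side is shown only in comments, since the key points are warming up the JIT outside the timer and calling `.block_until_ready()` to wait past asynchronous dispatch):

```python
import timeit
import numpy as np

def topk_np(x, k):
    # NumPy top-k via argpartition, as in the gist above (approximately).
    idx = np.argpartition(x, -k)[-k:]
    return np.sort(x[idx])[::-1]

x = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)

# For a JAX function you would instead do roughly:
#   f = jax.jit(lambda x: jax.lax.top_k(x, 5))
#   f(x)[0].block_until_ready()   # warm-up: compile outside the timed region
#   timeit.timeit(lambda: f(x)[0].block_until_ready(), number=100)
# NumPy needs no warm-up, so we can time it directly:
t = timeit.timeit(lambda: topk_np(x, 5), number=20) / 20
print(f"numpy topk: {t * 1e3:.2f} ms per call")
```

Without the warm-up call, the first invocation's compile time gets folded into the measurement; without `block_until_ready()`, the timer can stop before the computation has actually finished.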
If it's CPU you're curious about, you can find the implementation here: https://github.com/openxla/xla/blob/f868730d8fc557f9e26c983a015f6b63d5b241b4/xla/service/cpu/runtime_topk.cc#L27-L69
It looks like it's implemented via C++ `std::partial_sort`.
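For context, `std::partial_sort(first, middle, last)` rearranges a range so that the smallest `middle - first` elements sit sorted at the front, leaving the rest in unspecified order (for top-k largest, a descending comparator flips this). A rough NumPy analogue of those semantics, not the actual XLA code:

```python
import numpy as np

def partial_sort(xs, m):
    # Mimics C++ std::partial_sort semantics: the m smallest elements
    # end up sorted at the front; the tail's order is unspecified.
    xs = np.asarray(xs)
    part = np.partition(xs, m - 1)  # m smallest moved to the front, unordered
    return np.concatenate([np.sort(part[:m]), part[m:]])

print(partial_sort([5, 2, 8, 1, 9, 3], 3)[:3])  # the 3 smallest, in order: [1 2 3]
```

Like argpartition, this only fully orders the k elements you care about, which is why both approaches beat a full sort.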