New JMH benchmark method - vdot8s that implement int8 dotProduct in C… #13572
Conversation
|
Do we even need to use intrinsics? The function is so simple that the compiler seems to do the right thing and use the dot-product instructions, e.g. https://godbolt.org/z/KG1dPnrqn https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions- |
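(For reference, a minimal sketch of the kind of plain-C kernel being discussed. The name and signature follow the PR's dot8s, but this is illustrative, not the PR's exact code; with GCC at -O3 and a dotprod-capable -march, a loop like this is what the godbolt link shows being auto-vectorized.)

```c
#include <stdint.h>

// Plain scalar int8 dot product; GCC's auto-vectorizer can lower this loop
// to SDOT instructions when the target enables the dot-product extension
// (e.g. -O3 -march=armv8.2-a+dotprod).
int32_t dot8s(const int8_t *a, const int8_t *b, int32_t limit) {
  int32_t result = 0;
  for (int32_t i = 0; i < limit; i++) {
    result += a[i] * b[i];
  }
  return result;
}
```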
|
I haven't benchmarked, just seems my cheap 2021 phone has it.

You can use it directly via intrinsic, too, no need to use the add/multiply intrinsics: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#dot-product But unless it is really faster than what GCC does with simple C, no need. |
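(A minimal sketch of what calling the dot-product intrinsic directly looks like on NEON, assuming arm_neon.h and a +dotprod target; the name vdot8s_neon mirrors the PR but this is illustrative, not its exact code.)

```c
#include <arm_neon.h>  // vdotq_s32 requires __ARM_FEATURE_DOTPROD (e.g. -march=armv8.2-a+dotprod)
#include <stdint.h>

int32_t vdot8s_neon(const int8_t *a, const int8_t *b, int32_t limit) {
  int32x4_t acc = vdupq_n_s32(0);
  int32_t i = 0;
  // Each SDOT consumes 16 int8 pairs and accumulates into 4 int32 lanes.
  for (; i + 16 <= limit; i += 16) {
    acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
  }
  int32_t result = vaddvq_s32(acc);  // horizontal sum of the 4 lanes
  for (; i < limit; i++) {           // scalar tail for the leftover elements
    result += a[i] * b[i];
  }
  return result;
}
```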
With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). Interesting thing to note is that both C-based implementations have |
This seems correct to me. The java Vector API is not performant for the integer case. Hotspot doesn't much understand ARM at all and definitely doesn't even have instructions such as |
lucene/core/build.gradle
Outdated
cCompiler.withArguments { args ->
    args << "--shared"
         << "-O3"
         << "-march=armv8.2-a+dotprod"
for simple case of benchmarking i would just try -march=native to compile the best it can for the specific cpu/microarch. Then you could test your dot8s on ARM 256-bit SVE (graviton3, graviton4) which might be interesting...
I tried -march=native on an m7g.4xlarge (Graviton3) instance type; OS, JDK, and GCC were unchanged. dot8s and vdot8s are now ~3.5x faster than the Java implementation, compared to being ~10x faster on m6g.4xlarge (Graviton2)
| Benchmark | (size) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| VectorUtilBenchmark.binaryDotProductVector | 768 | thrpt | 15 | 10.570 | ± 0.003 | ops/us |
| VectorUtilBenchmark.dot8s | 768 | thrpt | 15 | 37.659 | ± 0.422 | ops/us |
| VectorUtilBenchmark.vdot8s | 768 | thrpt | 15 | 37.123 | ± 0.237 | ops/us |
I think this happens because with GCC 10 + SVE, the implementation is not really unrolled. It should be possible to be 10x faster, too.
Oh, the other likely explanation for the performance is that the integer dot product in Java is not AS HORRIBLE on the 256-bit SVE as it is on the 128-bit NEON. It more closely resembles the logic of how it behaves on AVX-256: two 8x8-bit integer vectors ("64-bit vectors") are multiplied into an intermediate 8x16-bit result (128-bit vector) and added to an 8x32-bit accumulator (256-bit vector). Of course, it does not use the SDOT instruction, which is sad, as that is the CPU instruction intended precisely for this purpose.
On the 128-bit NEON there is no possibility with Java's Vector API to process 4x8-bit integers ("32-bit vectors") like the SDOT instruction does: https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
Nor is it even performant to take a 64-bit vector and process "part 0" then "part 1". The situation is really sad, and the performance reflects that.
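(As a scalar reference for the SDOT semantics described above, an illustrative sketch rather than code from the PR: each 32-bit accumulator lane absorbs a dot product of four int8 pairs.)

```c
#include <stdint.h>

// Reference semantics of one 128-bit SDOT: 16 int8 pairs feed 4 int32 lanes.
void sdot_reference(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
  for (int lane = 0; lane < 4; lane++) {
    for (int k = 0; k < 4; k++) {
      acc[lane] += (int32_t) a[4 * lane + k] * b[4 * lane + k];
    }
  }
}
```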
Same. The performance benefit is large here. We do something similar in Elasticsearch, and have impls for both AArch64 and x64. I remember @rmuir making a similar comment before about the auto-vectorization - which is correct. I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate. https://github.com/elastic/elasticsearch/blob/main/libs/simdvec/native/src/vec/c/aarch64/vec.c
++ Yes. This is not a good state of affairs. I'll make sure to get an issue filed with OpenJDK for it. |
|
Could Lucene ever have this directly in one of its modules? We currently use the …

Note: we currently do dot product and square distance on int7, since this gives a bit more flexibility on x64, and Lucene's scalar quantization is in the range of 0 - 127. (#13336) |
It should work well with any modern gcc (@goankur uses gcc 10 here). If using clang, I think you need to add some pragmas for it, that's all; see https://llvm.org/docs/Vectorizers.html and there are also some hints in the ARM docs. One issue would be which targets to compile for/detect. IMO for ARM there would be two good ones:
For x86 probably at most three?:
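(A hedged sketch of what compile-time selection among such variants could look like. The feature macros are standard compiler predefines; the variant names beyond dot8s/vdot8s_neon/vdot8s_sve, e.g. vdot8s_x86, are hypothetical, and runtime CPU detection across several pre-built variants would be the alternative approach.)

```c
#include <stdint.h>

int32_t dot8s(const int8_t *a, const int8_t *b, int32_t limit);  // portable, autovectorized

static int32_t dot8s_best(const int8_t *a, const int8_t *b, int32_t limit) {
#if defined(__ARM_FEATURE_SVE)
  extern int32_t vdot8s_sve(const int8_t *, const int8_t *, int32_t);
  return vdot8s_sve(a, b, limit);   // e.g. Graviton3/4
#elif defined(__ARM_FEATURE_DOTPROD)
  extern int32_t vdot8s_neon(const int8_t *, const int8_t *, int32_t);
  return vdot8s_neon(a, b, limit);  // e.g. Graviton2, Apple M-series
#elif defined(__AVX512VNNI__) || defined(__AVX2__)
  extern int32_t vdot8s_x86(const int8_t *, const int8_t *, int32_t);  // hypothetical x86 variant
  return vdot8s_x86(a, b, limit);
#else
  return dot8s(a, b, limit);        // scalar fallback
#endif
}
```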
|
|
Here is my proposal visually: https://godbolt.org/z/6fcjPWojf As you can see, by passing … (these variants should all be benchmarked first, of course).

Edit: I modified it to use the same GCC compiler version across all variants. This newer version unrolls the SVE, too. |
|
And I see, from playing around with compiler versions, the advantage of the intrinsics approach (although I worry how many variants we'd maintain): it would give stability across Lucene releases without worrying about changes to the underlying C compiler, no crazy aggressive compiler flags needed, etc. But we should at least START with the autovectorizer and look at/benchmark the various approaches it uses, such as the VNNI one in the example above. |
|
I definitely want to play around more with @goankur 's PR here and see what performance looks like across machines, but will be out of town for a bit. There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh I want to tweak https://github.com/apache/lucene/blob/main/dev-tools/aws-jmh/group_vars/all.yml to add the graviton4 (or any other new instance types since it was last touched) and also add c-compiler for testing this out, but we should be able to quickly iterate with it and make progress. |
|
Just to expand a little on a previous comment I made above.
An alternative option to putting this in |
There are a few issues with distributing native-built methods that I can see. First, building becomes more complicated - you need a compiler, the build becomes more different on different platforms (can we support windows?), and when building you might have complex needs like targeting a different platform (or more platforms) than you are building on. Aside from building, there is the question of distributing the binaries and linking them into the runtime. I guess some command-line option is required in order to locate the .so (on linux) or .dll or whatever? And there can be environment-variable versions of this. So we would have to decide how much hand-holding to do. Finally we'd want to build the java code so it can fall back in a reasonable way when the native impl is not present. But none of this seems insurmountable. I believe Lucene previously had some native code libraries in its distribution and we could do so again. I'm not sure about the core vs misc distinction, I'd be OK either way, although I expect we want to keep core "clean"? |
Thanks Chris. Please share the link to OpenJDK issue when you get a chance. |
Let me try to squeeze some cycles out next week and see if I can make progress on this front. |
lucene/core/src/c/dotProduct.c
Outdated
 * Looks like Apple M3 does not implement SVE and Apple's official documentation
 * is not explicit about this or at least I could not find it.
 *
 */
I think we should remove the ifdef; this does not happen with -march=native, correct? The problem is only when you try to "force" SVE? AFAIK, M3 doesn't support it, and so -march=native should automatically take the NEON path.
Simply having the -march=native compiler flag and no ifdef directive makes clang (15.0) throw a compilation error: "SVE support not enabled". So I think guarding the SVE code with an ifdef is necessary. That said, there is no need for the extra check on __APPLE__ and I have removed that part from the ifdef.
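(For illustration, the shape of the guard being described; a sketch, not the PR's exact file layout. The SVE section only exists when the compiler itself defines __ARM_FEATURE_SVE, so a clang -march=native build on Apple M3 never sees the SVE intrinsics.)

```c
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
/* SVE implementation (svld1_s8, svdot_s32, svaddv_s32, ...) is compiled only here */
#endif

#include <arm_neon.h>
/* NEON implementation is always available on aarch64 and compiles everywhere */
```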
lucene/core/src/c/dotProduct.h
Outdated
int32_t vdot8s_sve(int8_t* vec1[], int8_t* vec2, int32_t limit);
int32_t vdot8s_neon(int8_t* vec1[], int8_t* vec2[], int32_t limit);
int32_t dot8s(int8_t* a, int8_t* b, int32_t limit);
can we fix these prototypes to all be the same? Can we include the header from the .c file? Maybe also adding -Wall -Werror will help keep the code tidy?
Prototypes fixed and dotProduct.h included in dotProduct.c. I did not add -Wall -Werror yet, as after including changes from your patch the compilation of dotProduct.c fails for me on Graviton boxes. Let's continue that discussion in a separate thread.
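(For reference, what the "same prototypes" version of dotProduct.h could look like; a sketch with an include guard, not necessarily the committed header.)

```c
// dotProduct.h -- one consistent signature for every variant
#ifndef DOT_PRODUCT_H
#define DOT_PRODUCT_H

#include <stdint.h>

int32_t dot8s(const int8_t *a, const int8_t *b, int32_t limit);        // plain C / autovectorized
int32_t vdot8s_neon(const int8_t *a, const int8_t *b, int32_t limit);  // NEON intrinsics
int32_t vdot8s_sve(const int8_t *a, const int8_t *b, int32_t limit);   // SVE intrinsics

#endif // DOT_PRODUCT_H
```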
lucene/core/src/c/dotProduct.c
Outdated
// REDUCE: Add every vector element in target and write result to scalar
result = svaddv_s32(svptrue_b8(), acc1);

// Scalar tail. TODO: Use FMA
I think you can remove the TODO, since aarch64 "mul" is really "madd"; I expect it already emits a single instruction. Look at the assembler if you are curious.
This has been replaced with Vector tail as suggested in a later comment.
lucene/core/src/c/dotProduct.c
Outdated
acc2 = svdot_s32(acc2, va2, vb2);
acc3 = svdot_s32(acc3, va3, vb3);
acc4 = svdot_s32(acc4, va4, vb4);
}
maybe consider a "vector tail", since 4 * vector length can be significant and vector dimensions may not be an exact multiple of that. It limits the worst-case processing that the "scalar tail" must do. Example from the Java vector code: https://github.com/apache/lucene/blob/main/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java#L153-L158
Converted the scalar tail to a vector tail and the throughput score on Graviton3 dropped from 45.5 -> 44.5, which is a bit surprising. My guess is that the auto-vectorized and unrolled code for the scalar tail is more efficient than the vector tail. Comparing scalar and vector tails in godbolt (https://godbolt.org/z/sjo6KMnP7), the whilelo instruction generated by auto-vectorization stands out to me, but I don't understand why that would cause a meaningful drop in performance. Let me try and dig a little deeper.
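(For context, a self-contained sketch of the unrolled-main-loop-plus-vector-tail shape being compared here, using svwhilelt-predicated loads for the tail; it is illustrative and not the PR's exact code.)

```c
#include <arm_sve.h>   // requires an SVE target, e.g. -march=armv8.2-a+sve
#include <stdint.h>

int32_t vdot8s_sve(const int8_t *vec1, const int8_t *vec2, int32_t limit) {
  svint32_t acc1 = svdup_n_s32(0), acc2 = svdup_n_s32(0);
  svint32_t acc3 = svdup_n_s32(0), acc4 = svdup_n_s32(0);
  const svbool_t all8 = svptrue_b8();
  const int32_t vl = (int32_t) svcntb();   // int8 elements per SVE vector
  int32_t i = 0;
  // Unrolled main loop: 4 independent accumulators hide the SDOT latency.
  for (; i + 4 * vl <= limit; i += 4 * vl) {
    acc1 = svdot_s32(acc1, svld1_s8(all8, vec1 + i),          svld1_s8(all8, vec2 + i));
    acc2 = svdot_s32(acc2, svld1_s8(all8, vec1 + i + vl),     svld1_s8(all8, vec2 + i + vl));
    acc3 = svdot_s32(acc3, svld1_s8(all8, vec1 + i + 2 * vl), svld1_s8(all8, vec2 + i + 2 * vl));
    acc4 = svdot_s32(acc4, svld1_s8(all8, vec1 + i + 3 * vl), svld1_s8(all8, vec2 + i + 3 * vl));
  }
  // Vector tail: whilelt builds a predicate for the remaining elements, and
  // predicated loads zero the inactive lanes, so they contribute nothing.
  for (; i < limit; i += vl) {
    svbool_t pred = svwhilelt_b8_s32(i, limit);
    acc1 = svdot_s32(acc1, svld1_s8(pred, vec1 + i), svld1_s8(pred, vec2 + i));
  }
  const svbool_t all32 = svptrue_b32();
  acc1 = svadd_s32_x(all32, svadd_s32_x(all32, acc1, acc2), svadd_s32_x(all32, acc3, acc4));
  return (int32_t) svaddv_s32(all32, acc1);
}
```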
Graviton 3

| Benchmark | (size) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| VectorUtilBenchmark.binaryDotProductVector | 768 | thrpt | 15 | 10.570 | ± 0.002 | ops/us |
| VectorUtilBenchmark.dot8s | 768 | thrpt | 15 | 44.562 | ± 0.448 | ops/us |
lucene/core/src/c/dotProduct.c
Outdated
// Scalar tail. TODO: Use FMA
for (; i < limit; i++) {
    result += vec1[i] * vec2[i];
}
for any "tails" like this where we manually unroll and vectorize the main loop, we can add pragmas to prevent GCC/LLVM from trying to unroll and vectorize the tail. It is not strictly necessary but will lead to tighter code.
and maybe it's just enough to disable autovectorization, as unrolling might still be a win for the scalar tail: can't remember what hotspot is doing on the java vector code for this.
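(A sketch of the kind of pragmas meant here. The clang loop pragma is longstanding; #pragma GCC novector needs a recent GCC (14+), and per the comment above unrolling is deliberately left alone. Treat the exact pragma set as an assumption to verify against the toolchain in use.)

```c
#include <stdint.h>

int32_t scalar_tail(const int8_t *a, const int8_t *b, int32_t i, int32_t limit) {
  int32_t result = 0;
  // The main loop is already hand-vectorized, so keep the tail as plain scalar
  // code instead of letting the compiler vectorize the few leftover elements.
#if defined(__clang__)
#pragma clang loop vectorize(disable)
#elif defined(__GNUC__)
#pragma GCC novector
#endif
  for (; i < limit; i++) {
    result += a[i] * b[i];
  }
  return result;
}
```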
|
go @goankur, awesome progress here. It is clear we gotta do something :) I left comments just to try to help. Do you mind avoiding rebase for updates? I am going to take a stab at the x86 side of the house. |
lucene/core/src/c/dotProduct.c
Outdated
for (i = 0; i + 4 * vec_length <= limit; i += 4 * vec_length) {
    // Load vectors into the Z registers which can range from 128-bit to 2048-bit wide
    // The predicate register - P determines which bytes are active
    // svptrue_b8() returns a predicate in which every element is true
With gcc autovectorization, I see use of SVE whilelo (predicate as counter) instead. I think it basically works on the loop counter; maybe look at the autogenerated assembly with godbolt for inspiration? Sorry, I'm not well-versed in SVE but trying to help improve the performance, as it would be important for Graviton3/Graviton4/etc.
Let me look into this tomorrow.
@rmuir -- I tried using the svwhilelt_b8_u32(i, limit) intrinsic for generating the predicate in both the unrolled loop and the vector tail, but the performance was actually worse :-(.
To give you an idea of what I did, here is a link to the ARM documentation with a code sample:
https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics
|
We have measured performance using knnPerfTest.py in luceneutil with this PR commit as the candidate branch. The latency has dropped from 0.333 ms to 0.26 ms. |
|
nice improvement! I do see the index time increased and wonder if it is due to creating too many heavyweight RandomVectorScorers. Maybe we can make this easier by simplifying the whole RandomVectorScorer creation pathway. Although that is out of scope here, it might be good to try and work around it if possible as I suggested above |
* MemorySegment backing on-heap memory to native code we get
* "java.lang.UnsupportedOperationException: Not a native address"
*
* <p>Stack overflow thread:
Thanks this is helpful, but let's not permanently record the SO thread in the code comments
ok! removed in the next revision.
* https://stackoverflow.com/questions/69521289/jep-412-pass-a-on-heap-byte-array-to-native-code-getting-unsupportedoperatione
* explains the issue in more detail.
*
* <p>Q1. Why did we enable the native code path here if its inefficient ? A1. So that it can be
I don't think the unit-testing use case is a good one -- unit tests should be testing the actual production code path. I see that the required test setup would be complex, but IMO it's better to have simpler production code and complex unit tests than complex unused production code and simple unit tests!
makes sense! The changes here were undone and TestVectorUtilSupport, BaseVectorizationTestCase were modified in the next revision to exercise the native code path by obtaining and invoking method handle for dotProduct(MemorySegment a, MemorySegment b) where 'a' and 'b' are constructed as off-heap MemorySegments.
* heap. For target-vector, copying to off-heap memory will still be needed but allocation can
* happen once per scorer.
*
* <p>Q3. Should JMH benchmarks measure the performance of this method? A3. No, because they would
These comments are helpful for the PR, but I think in the actual code we would want to simplify and maybe say something like: do not call in production. Indeed we could possibly even add an assert false: "inefficient implementation, do not use" ? And in production fall back to the non-native impl
The assert false:... won't be necessary after undoing the changes, as the old code wraps the input byte[] into an on-heap MemorySegment and the native implementation is not exercised in that case.
* @param numBytes Dimension of the byte vector
* @return offHeap memory segment
*/
public static MemorySegment offHeapByteVector(int numBytes) {
maybe put "random" in the name so we can easily tell this is creating a random value
I changed the implementation slightly to simply copy the input byte[] to off-heap MemorySegment. The process of populating the byte[] with random bytes will happen outside this method in the next revision.
try {
    int limit = (int) a.byteSize();
    return (int) NativeMethodHandles.DOT_PRODUCT_IMPL.invokeExact(a, b, limit);
} catch (Throwable ex$) {
what are we trying to catch here?
WrongMethodTypeException and anything else (Throwable) propagated by underlying method handle. WrongMethodTypeException is a subclass of Throwable.
lmk
…lement int8 dotProduct in C using Neon and SVE intrinsics respectively. Fallback to Neon if SVE instructions are not supported by target platform
…if native dot-product is enabled. Simplify JMH benchmark code that tests native dot product. Incorporate other review feedback
Quick Update
What's next

A lot of code in … It is for this (and performance) reason that a MemorySegment slice containing the vector bytes is obtained from the underlying MemorySegmentAccessInput and passed to the native dotProduct. A flip side is that the query vector has to be copied to an off-heap buffer. I don't like the fact that so much code had to be duplicated to make this work, so I am going to take a stab at seeing if there is an opportunity to minimize the duplication. I will also try to improve the coverage of unit tests, although the changes in |
|
Hi, we see some good performance gains with this PR in the recent tests we performed (in line with the above conversations). I ran the …

Setup:
For the candidate, I had to add the …

Summary:
Baseline (with
|
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.873 | 10.045 | 10.024 | 0.998 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 641.24 | 779.74 | 3 | 1870.95 | 1832.962 | 368.118 | HNSW |
Candidate (with baseline index and -numSearchThread as 1) :
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.873 | 5.323 | 5.159 | 0.969 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 0.00 | Infinity | 3 | 1870.95 | 1832.962 | 368.118 | HNSW |
Baseline (with -reindex and -numSearchThread as CPU cores) :
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.875 | 5.696 | 10.052 | 1.765 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 633.99 | 788.66 | 3 | 1871.02 | 1832.962 | 368.118 | HNSW |
Candidate (with baseline index and -numSearchThread as CPU cores) :
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.875 | 3.262 | 5.242 | 1.607 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 0.00 | Infinity | 3 | 1871.02 | 1832.962 | 368.118 | HNSW |
Some other runs :
Candidate (with -reindex and -numSearchThread as 1) :
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.870 | 4.348 | 4.208 | 0.968 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 808.15 | 618.70 | 2 | 1871.61 | 1832.962 | 368.118 | HNSW |
Candidate (with -reindex and -numSearchThread as CPU cores) :
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.872 | 3.355 | 4.434 | 1.322 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 802.49 | 623.06 | 2 | 1872.37 | 1832.962 | 368.118 | HNSW |
I'm curious if there is any way forward with this change here in Lucene, obviously not in core (as Uwe also mentioned), but are we ok or not ok with having it in the misc module? Thanks!
|
Thanks for bringing this up again! Now that the Lucene main branch requires JDK 24, we should be able to simplify a lot of the Java interface (no more need for the mrjar, I think?) and we should try to find a way to go forward. The performance improvement is just too good; if we don't include this in Lucene, consumers are all going to build local forks to shoehorn it in. It seems to me the main concerns from a user/consumer perspective are:
From a maintainer's POV I think we basically don't want to maintain a complex system of fallbacks and conditional loading that goes beyond what we already have (SPI, mrjars). Hopefully w/JDK24 this is now possible? As a side note, why are we maintaining lucene/core/src/java24 if JDK24 is now the required version? |
|
@shubhamvishu are you interested to get this merged up to the latest on main branch? It will be a bit of a project since things have moved away from lucene/core/src/java21 and there may be other changes |
|
Thanks @shubhamvishu -- these results look incredible, if they hold up. Odd that Panama Vector API (which Lucene uses for …

Oh, actually, in one run (baseline) you had 3 segments, and then later with the candidate 2 segments, odd. I really want the simple Python tool that I can run in my prod env that tells me "yes, Lucene HNSW is using optimal SIMD instructions in your JDK, Lucene version, OS, CPU architecture/revision, virtualized environment, etc." -- I opened a luceneutil issue to try to make progress on this: mikemccand/luceneutil#421 ... maybe |
Thanks for reviewing @msokolov! I agree the performance benefits look too huge to ignore relative to the extra burden of maintaining native code in some module etc., and yes, I'm looking to get this merged into main if the community agrees to do so (hence more eyes welcome).
I'm happy to rebase these changes and open a separate PR or update this one against the current main branch code |
@mikemccand I think we can rely on what Uwe pointed out about the info message; we can take that as a definitive indicator (as it mentions whether it's enabled or not and also the bit size for that machine). I opened mikemccand/luceneutil#423 to add the functionality to get back the disassembled code using …

Baseline:

Candidate: |
WHOA! This looks awesome! So it uses |
|
Hi, here at Amazon (customer-facing product search), we've been testing this native dot product implementation in our production environment (ARM - Graviton 2 and 3) and we see 5-14x faster dot product computations in JMH benchmarks, and we observed semantic latency improving from 62 msec to 28 msec (avg) for 4K embeddings (4.5 MM). Overall we saw 10-60% improvement on end-to-end avg search latencies in different scenarios (different sized vectors, vector-focused search vs search combined with other workloads). We haven't tested all other CPU types yet. I'm working on a draft PR on top of this PR with the following changes and planning to raise it soon:
We kept the native code isolated in the misc package and did not put it in the core module, which we know is highly discouraged. Additionally, PR #15285 would later help eliminate some code duplication and enable a cleaner implementation similar to …

Our benchmarking suggests substantial optimization potential for ARM-based deployments, and we believe this could benefit the broader Lucene community. We hope to make it easy for any Lucene user to opt in to this alternative vector implementation, ideally. We're committed to refining this implementation based on community feedback and addressing any concerns during the review process. I'm eager to hear the community's thoughts on this change, as there appears to be significant optimization potential for ARM architectures that could benefit many users. Thank you! |
|
Thanks Shubham for discovering …

As an introductory note, both Shubham and I work at Amazon on the team that designs, builds and operates Amazon's product search engine. Given the compelling performance improvements we are observing in production, especially with large (1K+ dimension) vectors, we'd love to work with the community to contribute this back so that the larger Lucene community can benefit from this change. As Shubham mentioned, we are committed to supporting this alongside the rest of the Lucene community. Please advise how to proceed. To reiterate, the credit for the original idea goes to this blog post from Elastic
msokolov
left a comment
This issue has been open for some time now; thanks for refreshing it, @shubhamvishu, and sharing the test results. For one thing -- the issue title is confusing -- this has morphed from a new benchmark to support for native dot-product in vector search.
There is a lot of change here, but most of it has been in review for some time now. I think we should go forward soon and merge this improvement. It looks like the PR here has some merge conflicts making it hard to tell what's new. Could you work on resolving those so we can have a clean change set here?
curious - is this part of the PR? Why did we have to remove this assert?
This is not required actually. Maybe just some stale change.
|
Hi @ChrisHegarty, following up on your comment about opening a bug issue for this in OpenJDK. I heard from an OpenJDK contributor that there isn't a bug filed for this yet. Should we create one now? |
Credit: https://www.elastic.co/search-labs/blog/vector-similarity-computations-ludicrous-speed
Description (WIP - needs to be updated)
Implements vectorized dot product for byte[] vectors in native C code using SVE and Neon intrinsics.

TODOs
- native module and tests in core now depend on it. Ideally I'd prefer to only enable this dependency on supported platforms but not sure how to do this in a clean way.
- aarch64 architecture and applicable only to byte[] vectors. Adding additional conditions to enable/disable need to be added.
- int main(...) in dotProduct.c. This code should be converted to unit tests exercised using CUnit support in Gradle.
- Graviton2 and Apple M3 need to be redone as some bugs in native code were fixed.

NOTE
I had to export environment variable CC in my shell environment for Gradle to pick up the gcc-10 compiler toolchain: export CC=/usr/bin/gcc10-cc

Build Lucene
Generates the compiled shared library ${PWD}/lucene/native/build/libs/dotProduct/shared/libdotProduct.so and the JMH benchmarking JAR
Run Benchmark JAR
IGNORE -- These need to be redone
Summary
int8 dot-product speedups:

| Platform | Speedup | Notes |
|---|---|---|
| Graviton2 | 10X (NEON intrinsics) | NEON intrinsics is equivalent to auto-vectorized/auto-unrolled code. SVE intrinsics are not available. |
| Graviton3 | 4.3X (SVE intrinsics) | SVE intrinsics provides +9.8% (score: 45.635) throughput gain on top of auto-vectorized/auto-unrolled code (score: 41.729). |
| Graviton4 | TBD | Graviton4 results look suspicious and need to be re-evaluated. |
| Apple M3 | 8X (NEON intrinsics) | NEON intrinsics is 2.38X FASTER than auto-vectorization/auto-unrolled code. SVE intrinsics are not available. |
Test environment: compile flags --shared -O3 -march=native -funroll-loops; instance types Graviton2 (m6g.4xlarge), Graviton3 (m7g.4xlarge), Graviton4 (r8g.4xlarge).

Results
Graviton 2 (m6g.4xlarge)
Graviton 3 (m7g.4xlarge)
[TBD] - Graviton 4 (r8g.4xlarge)
Apple M3 Pro