New JMH benchmark method - vdot8s that implement int8 dotProduct in C… #13572
Conversation
|
Do we even need to use intrinsics? The function is so simple that the compiler seems to do the right thing and use the dot-product instructions, e.g. https://godbolt.org/z/KG1dPnrqn https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions- |
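(For reference, a minimal sketch of the kind of plain-C kernel being discussed. The name and signature follow the PR's dot8s, but this is illustrative, not the PR's exact code; with GCC at -O3 and a dotprod-capable -march, a loop like this is what the godbolt link shows being auto-vectorized.)

```c
#include <stdint.h>

// Plain scalar int8 dot product; GCC's auto-vectorizer can lower this loop
// to SDOT instructions when the target enables the dot-product extension
// (e.g. -O3 -march=armv8.2-a+dotprod).
int32_t dot8s(const int8_t *a, const int8_t *b, int32_t limit) {
  int32_t result = 0;
  for (int32_t i = 0; i < limit; i++) {
    result += a[i] * b[i];
  }
  return result;
}
```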
|
I haven't benchmarked, just seems my cheap 2021 phone has it.

You can use it directly via intrinsic, too, no need to use the add/multiply intrinsics: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#dot-product But unless it is really faster than what GCC does with simple C, no need. |
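(A minimal sketch of what calling the dot-product intrinsic directly looks like on NEON, assuming arm_neon.h and a +dotprod target; the name vdot8s_neon mirrors the PR but this is illustrative, not its exact code.)

```c
#include <arm_neon.h>  // vdotq_s32 requires __ARM_FEATURE_DOTPROD (e.g. -march=armv8.2-a+dotprod)
#include <stdint.h>

int32_t vdot8s_neon(const int8_t *a, const int8_t *b, int32_t limit) {
  int32x4_t acc = vdupq_n_s32(0);
  int32_t i = 0;
  // Each SDOT consumes 16 int8 pairs and accumulates into 4 int32 lanes.
  for (; i + 16 <= limit; i += 16) {
    acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
  }
  int32_t result = vaddvq_s32(acc);  // horizontal sum of the 4 lanes
  for (; i < limit; i++) {           // scalar tail for the leftover elements
    result += a[i] * b[i];
  }
  return result;
}
```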
With the updated compile flags, the performance of auto-vectorized code is slightly better than explicitly vectorized code (see results). Interesting thing to note is that both C-based implementations have |
This seems correct to me. The java Vector API is not performant for the integer case. Hotspot doesn't much understand ARM at all and definitely doesn't even have instructions such as |
lucene/core/build.gradle
Outdated
cCompiler.withArguments { args ->
    args << "--shared"
         << "-O3"
         << "-march=armv8.2-a+dotprod"
for simple case of benchmarking i would just try -march=native to compile the best it can for the specific cpu/microarch. Then you could test your dot8s on ARM 256-bit SVE (graviton3, graviton4) which might be interesting...
I tried -march=native on an m7g.4xlarge (Graviton3) instance type; OS, JDK, and GCC were unchanged. dot8s and vdot8s are now ~3.5x faster than the Java implementation, compared to being ~10x faster on m6g.4xlarge (Graviton2)
| Benchmark | (size) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| VectorUtilBenchmark.binaryDotProductVector | 768 | thrpt | 15 | 10.570 | ± 0.003 | ops/us |
| VectorUtilBenchmark.dot8s | 768 | thrpt | 15 | 37.659 | ± 0.422 | ops/us |
| VectorUtilBenchmark.vdot8s | 768 | thrpt | 15 | 37.123 | ± 0.237 | ops/us |
I think this happens because with GCC 10 + SVE, the implementation is not really unrolled. It should be possible to be 10x faster, too.
Oh, the other likely explanation for the performance is that the integer dot product in Java is not AS HORRIBLE on the 256-bit SVE as it is on the 128-bit NEON. It more closely resembles the logic of how it behaves on AVX-256: two 8x8-bit integer vectors ("64-bit vectors") are multiplied into an intermediate 8x16-bit result (128-bit vector) and added to an 8x32-bit accumulator (256-bit vector). Of course, it does not use the SDOT instruction, which is sad, as that is the CPU instruction intended precisely for this purpose.
On the 128-bit NEON there is no possibility with Java's Vector API to process 4x8-bit integers ("32-bit vectors") like the SDOT instruction does: https://developer.arm.com/documentation/102651/a/What-are-dot-product-intructions-
Nor is it even performant to take a 64-bit vector and process "part 0" then "part 1". The situation is really sad, and the performance reflects that.
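(As a scalar reference for the SDOT semantics described above, an illustrative sketch rather than code from the PR: each 32-bit accumulator lane absorbs a dot product of four int8 pairs.)

```c
#include <stdint.h>

// Reference semantics of one 128-bit SDOT: 16 int8 pairs feed 4 int32 lanes.
void sdot_reference(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
  for (int lane = 0; lane < 4; lane++) {
    for (int k = 0; k < 4; k++) {
      acc[lane] += (int32_t) a[4 * lane + k] * b[4 * lane + k];
    }
  }
}
```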
Same. The performance benefit is large here. We do something similar in Elasticsearch, and have impls for both AArch64 and x64. I remember @rmuir making a similar comment before about the auto-vectorization - which is correct. I avoided it at the time given the toolchain that we were using, but it's a good option which I'll reevaluate. https://github.com/elastic/elasticsearch/blob/main/libs/simdvec/native/src/vec/c/aarch64/vec.c
++ Yes. This is not a good state of affairs. I'll make sure to get an issue filed with OpenJDK for it. |
|
Could Lucene ever have this directly in one of its modules? We currently use the …

Note: we currently do dot product and square distance on int7, since this gives a bit more flexibility on x64, and Lucene's scalar quantization is in the range of 0 - 127. (#13336) |
It should work well with any modern gcc (@goankur uses gcc 10 here). If using clang, I think you need to add some pragmas for it, that's all; see https://llvm.org/docs/Vectorizers.html and there are also some hints in the ARM docs. One issue would be which targets to compile for/detect. IMO for ARM there would be two good ones:
For x86 probably at most three?:
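(A hedged sketch of what compile-time selection among such variants could look like. The feature macros are standard compiler predefines; the variant names beyond dot8s/vdot8s_neon/vdot8s_sve, e.g. vdot8s_x86, are hypothetical, and runtime CPU detection across several pre-built variants would be the alternative approach.)

```c
#include <stdint.h>

int32_t dot8s(const int8_t *a, const int8_t *b, int32_t limit);  // portable, autovectorized

static int32_t dot8s_best(const int8_t *a, const int8_t *b, int32_t limit) {
#if defined(__ARM_FEATURE_SVE)
  extern int32_t vdot8s_sve(const int8_t *, const int8_t *, int32_t);
  return vdot8s_sve(a, b, limit);   // e.g. Graviton3/4
#elif defined(__ARM_FEATURE_DOTPROD)
  extern int32_t vdot8s_neon(const int8_t *, const int8_t *, int32_t);
  return vdot8s_neon(a, b, limit);  // e.g. Graviton2, Apple M-series
#elif defined(__AVX512VNNI__) || defined(__AVX2__)
  extern int32_t vdot8s_x86(const int8_t *, const int8_t *, int32_t);  // hypothetical x86 variant
  return vdot8s_x86(a, b, limit);
#else
  return dot8s(a, b, limit);        // scalar fallback
#endif
}
```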
|
|
Here is my proposal visually: https://godbolt.org/z/6fcjPWojf As you can see, by passing … (these variants should all be benchmarked first, of course).

Edit: I modified it to use the same GCC compiler version across all variants. This newer version unrolls the SVE, too. |
|
And I see, from playing around with compiler versions, the advantage of the intrinsics approach (although I worry how many variants we'd maintain): it would give stability across Lucene releases without worrying about changes to the underlying C compiler, no crazy aggressive compiler flags needed, etc. But we should at least START with the autovectorizer and look at/benchmark the various approaches it uses, such as the VNNI one in the example above. |
|
I definitely want to play around more with @goankur 's PR here and see what performance looks like across machines, but will be out of town for a bit. There is a script to run the benchmarks across every aws instance type in parallel and gather the results: https://github.com/apache/lucene/tree/main/dev-tools/aws-jmh I want to tweak https://github.com/apache/lucene/blob/main/dev-tools/aws-jmh/group_vars/all.yml to add the graviton4 (or any other new instance types since it was last touched) and also add c-compiler for testing this out, but we should be able to quickly iterate with it and make progress. |
|
Just to expand a little on a previous comment I made above.
An alternative option to putting this in |
There are a few issues with distributing native-built methods that I can see. First, building becomes more complicated - you need a compiler, the build becomes more different on different platforms (can we support windows?), and when building you might have complex needs like targeting a different platform (or more platforms) than you are building on. Aside from building, there is the question of distributing the binaries and linking them into the runtime. I guess some command-line option is required in order to locate the .so (on linux) or .dll or whatever? And there can be environment-variable versions of this. So we would have to decide how much hand-holding to do. Finally we'd want to build the java code so it can fall back in a reasonable way when the native impl is not present. But none of this seems insurmountable. I believe Lucene previously had some native code libraries in its distribution and we could do so again. I'm not sure about the core vs misc distinction, I'd be OK either way, although I expect we want to keep core "clean"? |
Thanks Chris. Please share the link to OpenJDK issue when you get a chance. |
Let me try to squeeze some cycles out next week and see if I can make progress on this front. |
lucene/core/src/c/dotProduct.c
Outdated
 * Looks like Apple M3 does not implement SVE and Apple's official documentation
 * is not explicit about this or at least I could not find it.
 *
 */
I think we should remove the ifdef; this does not happen with -march=native, correct? The problem is only when you try to "force" SVE? AFAIK, M3 doesn't support it, and so -march=native should automatically take the NEON path.
Simply having the -march=native compiler flag and no ifdef directive makes clang (15.0) throw a compilation error: "SVE support not enabled". So I think guarding the SVE code with an ifdef is necessary. That said, there is no need for the extra check on __APPLE__ and I have removed that part from the ifdef.
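(For illustration, the shape of the guard being described; a sketch, not the PR's exact file layout. The SVE section only exists when the compiler itself defines __ARM_FEATURE_SVE, so a clang -march=native build on Apple M3 never sees the SVE intrinsics.)

```c
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
/* SVE implementation (svld1_s8, svdot_s32, svaddv_s32, ...) is compiled only here */
#endif

#include <arm_neon.h>
/* NEON implementation is always available on aarch64 and compiles everywhere */
```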
lucene/core/src/c/dotProduct.h
Outdated
int32_t vdot8s_sve(int8_t* vec1[], int8_t* vec2, int32_t limit);
int32_t vdot8s_neon(int8_t* vec1[], int8_t* vec2[], int32_t limit);
int32_t dot8s(int8_t* a, int8_t* b, int32_t limit);
can we fix these prototypes to all be the same? Can we include the header from the .c file? Maybe also adding -Wall -Werror will help keep the code tidy?
Prototypes fixed and dotProduct.h included in dotProduct.c. I did not add -Wall -Werror yet, as after including changes from your patch the compilation of dotProduct.c fails for me on Graviton boxes. Let's continue that discussion in a separate thread.
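(For reference, what the "same prototypes" version of dotProduct.h could look like; a sketch with an include guard, not necessarily the committed header.)

```c
// dotProduct.h -- one consistent signature for every variant
#ifndef DOT_PRODUCT_H
#define DOT_PRODUCT_H

#include <stdint.h>

int32_t dot8s(const int8_t *a, const int8_t *b, int32_t limit);        // plain C / autovectorized
int32_t vdot8s_neon(const int8_t *a, const int8_t *b, int32_t limit);  // NEON intrinsics
int32_t vdot8s_sve(const int8_t *a, const int8_t *b, int32_t limit);   // SVE intrinsics

#endif // DOT_PRODUCT_H
```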
lucene/core/src/c/dotProduct.c
Outdated
// REDUCE: Add every vector element in target and write result to scalar
result = svaddv_s32(svptrue_b8(), acc1);

// Scalar tail. TODO: Use FMA
I think you can remove the TODO, since aarch64 "mul" is really "madd"; I expect it already emits a single instruction. Look at the assembler if you are curious.
This has been replaced with Vector tail as suggested in a later comment.
lucene/core/src/c/dotProduct.c
Outdated
acc2 = svdot_s32(acc2, va2, vb2);
acc3 = svdot_s32(acc3, va3, vb3);
acc4 = svdot_s32(acc4, va4, vb4);
}
maybe consider a "vector tail", since 4 * vector length can be significant and vector dimensions may not be an exact multiple of that. It limits the worst-case processing that the "scalar tail" must do. Example from the Java vector code: https://github.com/apache/lucene/blob/main/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java#L153-L158
Converted the scalar tail to a vector tail and the throughput score on Graviton3 dropped from 45.5 -> 44.5, which is a bit surprising. My guess is that the auto-vectorized and unrolled code for the scalar tail is more efficient than the vector tail. Comparing scalar and vector tails in godbolt (https://godbolt.org/z/sjo6KMnP7), the whilelo instruction generated by auto-vectorization stands out to me, but I don't understand why that would cause a meaningful drop in performance. Let me try and dig a little deeper.
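(For context, a self-contained sketch of the unrolled-main-loop-plus-vector-tail shape being compared here, using svwhilelt-predicated loads for the tail; it is illustrative and not the PR's exact code.)

```c
#include <arm_sve.h>   // requires an SVE target, e.g. -march=armv8.2-a+sve
#include <stdint.h>

int32_t vdot8s_sve(const int8_t *vec1, const int8_t *vec2, int32_t limit) {
  svint32_t acc1 = svdup_n_s32(0), acc2 = svdup_n_s32(0);
  svint32_t acc3 = svdup_n_s32(0), acc4 = svdup_n_s32(0);
  const svbool_t all8 = svptrue_b8();
  const int32_t vl = (int32_t) svcntb();   // int8 elements per SVE vector
  int32_t i = 0;
  // Unrolled main loop: 4 independent accumulators hide the SDOT latency.
  for (; i + 4 * vl <= limit; i += 4 * vl) {
    acc1 = svdot_s32(acc1, svld1_s8(all8, vec1 + i),          svld1_s8(all8, vec2 + i));
    acc2 = svdot_s32(acc2, svld1_s8(all8, vec1 + i + vl),     svld1_s8(all8, vec2 + i + vl));
    acc3 = svdot_s32(acc3, svld1_s8(all8, vec1 + i + 2 * vl), svld1_s8(all8, vec2 + i + 2 * vl));
    acc4 = svdot_s32(acc4, svld1_s8(all8, vec1 + i + 3 * vl), svld1_s8(all8, vec2 + i + 3 * vl));
  }
  // Vector tail: whilelt builds a predicate for the remaining elements, and
  // predicated loads zero the inactive lanes, so they contribute nothing.
  for (; i < limit; i += vl) {
    svbool_t pred = svwhilelt_b8_s32(i, limit);
    acc1 = svdot_s32(acc1, svld1_s8(pred, vec1 + i), svld1_s8(pred, vec2 + i));
  }
  const svbool_t all32 = svptrue_b32();
  acc1 = svadd_s32_x(all32, svadd_s32_x(all32, acc1, acc2), svadd_s32_x(all32, acc3, acc4));
  return (int32_t) svaddv_s32(all32, acc1);
}
```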
Graviton 3

| Benchmark | (size) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| VectorUtilBenchmark.binaryDotProductVector | 768 | thrpt | 15 | 10.570 | ± 0.002 | ops/us |
| VectorUtilBenchmark.dot8s | 768 | thrpt | 15 | 44.562 | ± 0.448 | ops/us |
lucene/core/src/c/dotProduct.c
Outdated
// Scalar tail. TODO: Use FMA
for (; i < limit; i++) {
    result += vec1[i] * vec2[i];
}
for any "tails" like this where we manually unroll and vectorize the main loop, we can add pragmas to prevent GCC/LLVM from trying to unroll and vectorize the tail. It is not strictly necessary but will lead to tighter code.
and maybe it's just enough to disable autovectorization, as unrolling might still be a win for the scalar tail: can't remember what hotspot is doing on the java vector code for this.
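(A sketch of the kind of pragmas meant here. The clang loop pragma is longstanding; #pragma GCC novector needs a recent GCC (14+), and per the comment above unrolling is deliberately left alone. Treat the exact pragma set as an assumption to verify against the toolchain in use.)

```c
#include <stdint.h>

int32_t scalar_tail(const int8_t *a, const int8_t *b, int32_t i, int32_t limit) {
  int32_t result = 0;
  // The main loop is already hand-vectorized, so keep the tail as plain scalar
  // code instead of letting the compiler vectorize the few leftover elements.
#if defined(__clang__)
#pragma clang loop vectorize(disable)
#elif defined(__GNUC__)
#pragma GCC novector
#endif
  for (; i < limit; i++) {
    result += a[i] * b[i];
  }
  return result;
}
```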
|
go @goankur, awesome progress here. It is clear we gotta do something :) I left comments just to try to help. Do you mind avoiding rebase for updates? I am going to take a stab at the x86 side of the house. |
lucene/core/src/c/dotProduct.c
Outdated
for (i = 0; i + 4 * vec_length <= limit; i += 4 * vec_length) {
    // Load vectors into the Z registers which can range from 128-bit to 2048-bit wide
    // The predicate register - P determines which bytes are active
    // svptrue_b8() returns a predicate in which every element is true
With gcc autovectorization, I see use of SVE whilelo (predicate as counter) instead. I think it basically works on the loop counter; maybe look at the autogenerated assembly with godbolt for inspiration? Sorry, I'm not well-versed in SVE but trying to help improve the performance, as it would be important for Graviton3/Graviton4/etc.
Let me look into this tomorrow.
@rmuir -- I tried using the svwhilelt_b8_u32(i, limit) intrinsic for generating the predicate in both the unrolled loop and the vector tail, but the performance was actually worse :-(.
To give you an idea of what I did, here is a link to the ARM documentation with a code sample:
https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics
|
We have measured performance using knnPerfTest.py in luceneutil with this PR commit as the candidate branch. The latency has dropped from 0.333 ms to 0.26 ms. |
|
nice improvement! I do see the index time increased and wonder if it is due to creating too many heavyweight RandomVectorScorers. Maybe we can make this easier by simplifying the whole RandomVectorScorer creation pathway. Although that is out of scope here, it might be good to try and work around it if possible as I suggested above |
* MemorySegment backing on-heap memory to native code we get
* "java.lang.UnsupportedOperationException: Not a native address"
*
* <p>Stack overflow thread:
Thanks this is helpful, but let's not permanently record the SO thread in the code comments
ok! removed in the next revision.
* https://stackoverflow.com/questions/69521289/jep-412-pass-a-on-heap-byte-array-to-native-code-getting-unsupportedoperatione
* explains the issue in more detail.
*
* <p>Q1. Why did we enable the native code path here if its inefficient ? A1. So that it can be
I don't think the unit-testing use case is a good one -- unit tests should be testing the actual production code path. I see that the required test setup would be complex, but IMO it's better to have simpler production code and complex unit tests than complex unused production code and simple unit tests!
makes sense! The changes here were undone and TestVectorUtilSupport, BaseVectorizationTestCase were modified in the next revision to exercise the native code path by obtaining and invoking method handle for dotProduct(MemorySegment a, MemorySegment b) where 'a' and 'b' are constructed as off-heap MemorySegments.
* heap. For target-vector, copying to off-heap memory will still be needed but allocation can
* happen once per scorer.
*
* <p>Q3. Should JMH benchmarks measure the performance of this method? A3. No, because they would
These comments are helpful for the PR, but I think in the actual code we would want to simplify and maybe say something like: do not call in production. Indeed we could possibly even add an assert false: "inefficient implementation, do not use" ? And in production fall back to the non-native impl
The assert false:... won't be necessary after undoing the changes, as the old code wraps the input byte[] into an on-heap MemorySegment and the native implementation is not exercised in that case.
* @param numBytes Dimension of the byte vector
* @return offHeap memory segment
*/
public static MemorySegment offHeapByteVector(int numBytes) {
maybe put "random" in the name so we can easily tell this is creating a random value
I changed the implementation slightly to simply copy the input byte[] to off-heap MemorySegment. The process of populating the byte[] with random bytes will happen outside this method in the next revision.
try {
    int limit = (int) a.byteSize();
    return (int) NativeMethodHandles.DOT_PRODUCT_IMPL.invokeExact(a, b, limit);
} catch (Throwable ex$) {
what are we trying to catch here?
WrongMethodTypeException and anything else (Throwable) propagated by underlying method handle. WrongMethodTypeException is a subclass of Throwable.
lmk
…lement int8 dotProduct in C using Neon and SVE intrinsics respectively. Fallback to Neon if SVE instructions are not supported by target platform
…if native dot-product is enabled. Simplify JMH benchmark code that tests native dot product. Incorporate other review feedback
Quick Update
What's next

A lot of code in … It is for this (and performance) reason that a MemorySegment slice containing the vector bytes is obtained from the underlying MemorySegmentAccessInput and passed to the native dotProduct. A flip side is that the query vector has to be copied to an off-heap buffer. I don't like the fact that so much code had to be duplicated to make this work, so I am going to take a stab at seeing if there is an opportunity to minimize the duplication. I will also try to improve the coverage of unit tests, although the changes in |
|
Hi, we see some good performance gains with this PR in the recent tests we performed (in line with the above conversations). I ran the …

Setup:
For the candidate, I had to add the …

Summary:
Baseline (with
|
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.873 | 10.045 | 10.024 | 0.998 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 641.24 | 779.74 | 3 | 1870.95 | 1832.962 | 368.118 | HNSW |
Candidate (with baseline index and -numSearchThread as 1) :
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.873 | 5.323 | 5.159 | 0.969 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 0.00 | Infinity | 3 | 1870.95 | 1832.962 | 368.118 | HNSW |
Baseline (with -reindex and -numSearchThread as CPU cores) :
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.875 | 5.696 | 10.052 | 1.765 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 633.99 | 788.66 | 3 | 1871.02 | 1832.962 | 368.118 | HNSW |
Candidate (with baseline index and -numSearchThread as CPU cores) :
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.875 | 3.262 | 5.242 | 1.607 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 0.00 | Infinity | 3 | 1871.02 | 1832.962 | 368.118 | HNSW |
Some other runs :
Candidate (with -reindex and -numSearchThread as 1) :
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.870 | 4.348 | 4.208 | 0.968 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 808.15 | 618.70 | 2 | 1871.61 | 1832.962 | 368.118 | HNSW |
Candidate (with -reindex and -numSearchThread as CPU cores) :
| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | index(s) | index_docs/s | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.872 | 3.355 | 4.434 | 1.322 | 500000 | 100 | 50 | 64 | 250 | 7 bits | 802.49 | 623.06 | 2 | 1872.37 | 1832.962 | 368.118 | HNSW |
I'm curious if there is any way forward with this change here in Lucene, obviously not in core (as Uwe also mentioned), but are we ok or not ok with having it in the misc module? Thanks!
|
Thanks for bringing this up again! Now that the Lucene main branch requires JDK 24, we should be able to simplify a lot of the Java interface (no more need for the mrjar, I think?) and we should try to find a way to go forward. The performance improvement is just too good; if we don't include this in Lucene, consumers are all going to build local forks to shoehorn it in. It seems to me the main concerns from a user/consumer perspective are:
From a maintainer's POV I think we basically don't want to maintain a complex system of fallbacks and conditional loading that goes beyond what we already have (SPI, mrjars). Hopefully w/JDK24 this is now possible? As a side note, why are we maintaining lucene/core/src/java24 if JDK24 is now the required version? |
|
@shubhamvishu are you interested to get this merged up to the latest on main branch? It will be a bit of a project since things have moved away from lucene/core/src/java21 and there may be other changes |
|
Thanks @shubhamvishu -- these results look incredible, if they hold up. Odd that Panama Vector API (which Lucene uses for …

Oh, actually, in one run (baseline) you had 3 segments, and then later with the candidate 2 segments, odd. I really want the simple Python tool that I can run in my prod env that tells me "yes, Lucene HNSW is using optimal SIMD instructions in your JDK, Lucene version, OS, CPU architecture/revision, virtualized environment, etc." -- I opened a luceneutil issue to try to make progress on this: mikemccand/luceneutil#421 ... maybe |
Thanks for reviewing @msokolov! I agree the performance benefits look too huge to ignore relative to the extra burden of maintaining native code in some module etc., and yes, I'm looking to get this merged into main if the community agrees to do so (hence more eyes welcome).
I'm happy to rebase these changes and open a separate PR or update this one against the current main branch code |
@mikemccand I think we can rely on what Uwe pointed out about the info message; we can take that as a definitive indicator (as it mentions whether it's enabled or not and also the bit size for that machine). I opened mikemccand/luceneutil#423 to add the functionality to get back the disassembled code using …

Baseline:

Candidate: |
WHOA! This looks awesome! So it uses |
|
Hi, here at Amazon (customer-facing product search), we've been testing this native dot product implementation in our production environment (ARM - Graviton 2 and 3) and we see 5-14x faster dot product computations in JMH benchmarks, and we observed semantic latency improving from 62 msec to 28 msec (avg) for 4K embeddings (4.5 MM). Overall we saw 10-60% improvement on end-to-end avg search latencies in different scenarios (different sized vectors, vector-focused search vs search combined with other workloads). We haven't tested all other CPU types yet. I'm working on a draft PR on top of this PR with the following changes and planning to raise it soon:
We kept the native code isolated in the misc package and did not put it in the core module, which we know is highly discouraged. Additionally, PR #15285 would later help eliminate some code duplication and enable a cleaner implementation similar to …

Our benchmarking suggests substantial optimization potential for ARM-based deployments, and we believe this could benefit the broader Lucene community. We hope to make it easy for any Lucene user to opt in to this alternative vector implementation, ideally. We're committed to refining this implementation based on community feedback and addressing any concerns during the review process. I'm eager to hear the community's thoughts on this change, as there appears to be significant optimization potential for ARM architectures that could benefit many users. Thank you! |
|
Thanks Shubham for discovering …

As an introductory note, both Shubham and I work at Amazon on the team that designs, builds and operates Amazon's product search engine. Given the compelling performance improvements we are observing in production, especially with large (1K+ dimension) vectors, we'd love to work with the community to contribute this back so that the larger Lucene community can benefit from this change. As Shubham mentioned, we are committed to supporting this alongside the rest of the Lucene community. Please advise how to proceed. To reiterate, the credit for the original idea goes to this blog post from Elastic
msokolov
left a comment
This issue has been open for some time now; thanks for refreshing it, @shubhamvishu, and sharing the test results. For one thing -- the issue title is confusing -- this has morphed from a new benchmark to support for native dot-product in vector search.
There is a lot of change here, but most of it has been in review for some time now. I think we should go forward soon and merge this improvement. It looks like the PR here has some merge conflicts making it hard to tell what's new. Could you work on resolving those so we can have a clean change set here?
curious - is this part of the PR? Why did we have to remove this assert?
This is not required actually. Maybe just some stale change.
|
Hi @ChrisHegarty, following up on your comment about opening a bug issue for this in OpenJDK. I heard from an OpenJDK contributor that there isn't a bug filed for this yet. Should we create one now? |
Credit: https://www.elastic.co/search-labs/blog/vector-similarity-computations-ludicrous-speed
Description (WIP - needs to be updated)
Implements vectorized dot product for byte[] vectors in native C code using SVE and Neon intrinsics.

TODOs
- native module and tests in core now depend on it. Ideally I'd prefer to only enable this dependency on supported platforms but not sure how to do this in a clean way.
- aarch64 architecture and applicable only to byte[] vectors. Adding additional conditions to enable/disable need to be added.
- int main(...) in dotProduct.c. This code should be converted to unit tests exercised using CUnit support in Gradle.
- Graviton2 and Apple M3 need to be redone as some bugs in native code were fixed.

NOTE
I had to export environment variable CC in my shell environment for Gradle to pick up the gcc-10 compiler toolchain: export CC=/usr/bin/gcc10-cc

Build Lucene
Generates the compiled shared library ${PWD}/lucene/native/build/libs/dotProduct/shared/libdotProduct.so and the JMH benchmarking JAR
Run Benchmark JAR
IGNORE -- These need to be redone
Summary
int8 dot-product speedups:

| Platform | Speedup | Notes |
|---|---|---|
| Graviton2 | 10X (NEON intrinsics) | NEON intrinsics is equivalent to auto-vectorized/auto-unrolled code. SVE intrinsics are not available. |
| Graviton3 | 4.3X (SVE intrinsics) | SVE intrinsics provides +9.8% (score: 45.635) throughput gain on top of auto-vectorized/auto-unrolled code (score: 41.729). |
| Graviton4 | TBD | Graviton4 results look suspicious and need to be re-evaluated. |
| Apple M3 | 8X (NEON intrinsics) | NEON intrinsics is 2.38X FASTER than auto-vectorization/auto-unrolled code. SVE intrinsics are not available. |
Test environment: compile flags --shared -O3 -march=native -funroll-loops; instance types Graviton2 (m6g.4xlarge), Graviton3 (m7g.4xlarge), Graviton4 (r8g.4xlarge).

Results
Graviton 2 (m6g.4xlarge)
Graviton 3 (m7g.4xlarge)
[TBD] - Graviton 4 (r8g.4xlarge)
Apple M3 Pro