
Conversation

@benwtrent
Member

When scalar quantizing for bbq, optimizing the quantization intervals can take time, especially at higher bit sizes.

While the impact for bbq will be marginal, it's still a frustrating bottleneck, especially at query time, where the bit size is larger (e.g. 4 bits).

Here are some results from the new JMH benchmark, run on my laptop. The Panama vector implementation is 3-4x faster, and it can likely be made faster still; my Panama vector skills aren't the absolute best.

Benchmark                                 (bits)  (dims)   Mode  Cnt    Score    Error   Units
OptimizedScalarQuantizerBenchmark.scalar       1     384  thrpt   15  131.443 ± 17.037  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       1     702  thrpt   15   78.247 ± 13.923  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       1    1024  thrpt   15   50.635 ±  9.605  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       4     384  thrpt   15  143.211 ± 35.947  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       4     702  thrpt   15   69.438 ± 10.760  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       4    1024  thrpt   15   44.546 ±  5.081  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       7     384  thrpt   15  146.597 ± 17.915  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       7     702  thrpt   15   79.901 ±  7.855  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       7    1024  thrpt   15   53.202 ±  8.419  ops/ms
OptimizedScalarQuantizerBenchmark.vector       1     384  thrpt   15  470.615 ± 62.610  ops/ms
OptimizedScalarQuantizerBenchmark.vector       1     702  thrpt   15  259.081 ± 27.011  ops/ms
OptimizedScalarQuantizerBenchmark.vector       1    1024  thrpt   15  180.375 ± 17.070  ops/ms
OptimizedScalarQuantizerBenchmark.vector       4     384  thrpt   15  443.701 ± 60.452  ops/ms
OptimizedScalarQuantizerBenchmark.vector       4     702  thrpt   15  234.729 ± 13.969  ops/ms
OptimizedScalarQuantizerBenchmark.vector       4    1024  thrpt   15  170.735 ± 22.461  ops/ms
OptimizedScalarQuantizerBenchmark.vector       7     384  thrpt   15  499.138 ± 49.826  ops/ms
OptimizedScalarQuantizerBenchmark.vector       7     702  thrpt   15  274.890 ± 28.438  ops/ms
OptimizedScalarQuantizerBenchmark.vector       7    1024  thrpt   15  171.937 ± 11.087  ops/ms
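
For context on how numbers like these are produced: the benchmark parameterizes over bit size and dimension and compares a plain scalar implementation against one using the Panama vector API (jdk.incubator.vector). Below is only a minimal, self-contained sketch of such a JMH harness, not the actual OptimizedScalarQuantizerBenchmark; the class name, fields, and measurement settings are illustrative, and it exercises just a centering-plus-norm kernel rather than full interval optimization.

// Hypothetical sketch only; names and measurement settings are illustrative and
// this is not the benchmark used for the table above. Requires a JDK with the
// incubating vector API (--add-modules jdk.incubator.vector).
import java.util.Random;
import java.util.concurrent.TimeUnit;

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;
import static jdk.incubator.vector.VectorOperators.ADD;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(3)
@State(Scope.Benchmark)
public class CenterAndStatsBench {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    @Param({ "384", "702", "1024" })
    public int dims;

    public float[] vec;
    public float[] centroid;
    public float[] centered;

    @Setup
    public void setup() {
        Random random = new Random(42);
        vec = new float[dims];
        centroid = new float[dims];
        centered = new float[dims];
        for (int i = 0; i < dims; i++) {
            vec[i] = random.nextFloat();
            centroid[i] = random.nextFloat();
        }
    }

    @Benchmark
    public float scalar() {
        // plain scalar centering plus sum of squares
        float norm2 = 0;
        for (int i = 0; i < dims; i++) {
            centered[i] = vec[i] - centroid[i];
            norm2 = Math.fma(centered[i], centered[i], norm2);
        }
        return norm2;
    }

    @Benchmark
    public float vector() {
        // Panama-vectorized centering plus sum of squares, with a scalar tail
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        for (; i < SPECIES.loopBound(dims); i += SPECIES.length()) {
            FloatVector c = FloatVector.fromArray(SPECIES, vec, i)
                .sub(FloatVector.fromArray(SPECIES, centroid, i));
            c.intoArray(centered, i);
            acc = c.fma(c, acc);
        }
        float norm2 = acc.reduceLanes(ADD);
        for (; i < dims; i++) {
            centered[i] = vec[i] - centroid[i];
            norm2 = Math.fma(centered[i], centered[i], norm2);
        }
        return norm2;
    }
}

With 3 forks and 5 measurement iterations, a sketch like this produces the same Cnt of 15 seen in the table; the real benchmark additionally parameterizes over bits (1, 4, 7) because the quantization step depends on it.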

@benwtrent added the >enhancement, auto-backport, :Search Relevance/Vectors, v8.19.0, and v9.1.0 labels on Apr 21, 2025
@benwtrent requested a review from ChrisHegarty on April 21, 2025 16:20
@elasticsearchmachine added the Team:Search Relevance label on Apr 21, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Hi @benwtrent, I've created a changelog YAML for you.

@benwtrent requested a review from john-wagster on April 21, 2025 16:21
Comment on lines +120 to +165
if (vector.length > 2 * FLOAT_SPECIES.length()) {
    FloatVector vecMeanVec = FloatVector.zero(FLOAT_SPECIES);
    FloatVector m2Vec = FloatVector.zero(FLOAT_SPECIES);
    FloatVector norm2Vec = FloatVector.zero(FLOAT_SPECIES);
    FloatVector minVec = FloatVector.broadcast(FLOAT_SPECIES, Float.MAX_VALUE);
    FloatVector maxVec = FloatVector.broadcast(FLOAT_SPECIES, -Float.MAX_VALUE);
    int count = 0;
    // per-lane online (Welford) running mean and sum of squared deviations
    for (; i < FLOAT_SPECIES.loopBound(vector.length); i += FLOAT_SPECIES.length()) {
        ++count;
        FloatVector v = FloatVector.fromArray(FLOAT_SPECIES, vector, i);
        FloatVector c = FloatVector.fromArray(FLOAT_SPECIES, centroid, i);
        FloatVector centeredVec = v.sub(c);
        FloatVector deltaVec = centeredVec.sub(vecMeanVec);
        norm2Vec = fma(centeredVec, centeredVec, norm2Vec);
        vecMeanVec = vecMeanVec.add(deltaVec.div(count));
        FloatVector delta2Vec = centeredVec.sub(vecMeanVec);
        m2Vec = fma(deltaVec, delta2Vec, m2Vec);
        minVec = minVec.min(centeredVec);
        maxVec = maxVec.max(centeredVec);
        centeredVec.intoArray(centered, i);
    }
    // reduce the per-lane accumulators into scalar stats
    min = minVec.reduceLanes(MIN);
    max = maxVec.reduceLanes(MAX);
    norm2 = norm2Vec.reduceLanes(ADD);
    vecMean = vecMeanVec.reduceLanes(ADD) / FLOAT_SPECIES.length();
    FloatVector d2Mean = vecMeanVec.sub(vecMean);
    m2Vec = fma(d2Mean, d2Mean, m2Vec);
    vectCount = count * FLOAT_SPECIES.length();
    vecVar = m2Vec.reduceLanes(ADD);
}

float tailMean = 0;
float tailM2 = 0;
int tailCount = 0;
// handle the tail with a scalar Welford pass over the remaining elements
for (; i < vector.length; i++) {
    centered[i] = vector[i] - centroid[i];
    float delta = centered[i] - tailMean;
    ++tailCount;
    tailMean += delta / tailCount;
    float delta2 = centered[i] - tailMean;
    tailM2 = fma(delta, delta2, tailM2);
    min = Math.min(min, centered[i]);
    max = Math.max(max, centered[i]);
    norm2 = fma(centered[i], centered[i], norm2);
}
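
For reference on the combine step after the vectorized loop above: the loop keeps a Welford running mean and sum of squared deviations in each SIMD lane, and those per-lane accumulators must then be merged into a single mean and variance. With L lanes that have each processed the same count n of elements, the standard merge identity (Chan et al.) is

$$
\mu = \frac{1}{L}\sum_{j=1}^{L} \mu_j,
\qquad
M_2 = \sum_{j=1}^{L} M_{2,j} \;+\; n \sum_{j=1}^{L} \left(\mu_j - \mu\right)^2
$$

where $\mu_j$ and $M_{2,j}$ are lane j's running mean and sum of squared deviations; the variance over the vectorized prefix is then $M_2$ divided by the total element count.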
Member Author

@tveasey could you take a look here? I think I did the variance calculation correctly, but I might have missed something.

Contributor

Formulas look correct to me. Of course, checking vector lengths 10-100 against a slow calculation and confirming agreement to within a few epsilon will prove there are no errors. Specifically, I would add a test of the vectorised stats calculation against the super simple versions, i.e. compute the mean, then compute the mean of squared residuals, etc., so there is no chance of errors. If you're very close to that for a bunch of random-length vectors, you're good. (This may be what the Lucene reference does, but if it uses an online calculation, my inclination would be to simplify further so there is no chance of errors.)
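
To make that concrete, here is a self-contained sketch of such a check: it compares a scalar online (Welford) mean/variance against the simple two-pass reference on random-length vectors. In the actual test the online side would instead be the vectorized centerAndCalculateOSQStats* implementation; the class name, tolerance, and stand-in Welford loop below are illustrative only.

// Self-contained sketch of the suggested check; the class name, tolerance, and the
// scalar Welford loop standing in for the vectorized implementation are illustrative.
import java.util.Random;

public class StatsReferenceCheck {
    public static void main(String[] args) {
        Random random = new Random(42);
        for (int iter = 0; iter < 1_000; iter++) {
            int dims = 10 + random.nextInt(91); // random lengths in [10, 100]
            float[] centered = new float[dims];
            for (int i = 0; i < dims; i++) {
                // stands in for vector[i] - centroid[i]
                centered[i] = random.nextFloat() - random.nextFloat();
            }

            // online (Welford) mean and sum of squared deviations
            float mean = 0;
            float m2 = 0;
            for (int i = 0; i < dims; i++) {
                float delta = centered[i] - mean;
                mean += delta / (i + 1);
                m2 += delta * (centered[i] - mean);
            }
            float onlineVar = m2 / dims;

            // "super simple" two-pass reference: mean, then mean of squared residuals
            double refMean = 0;
            for (float v : centered) {
                refMean += v;
            }
            refMean /= dims;
            double refVar = 0;
            for (float v : centered) {
                refVar += (v - refMean) * (v - refMean);
            }
            refVar /= dims;

            float eps = 1e-5f * dims;
            if (Math.abs(mean - refMean) > eps || Math.abs(onlineVar - refVar) > eps) {
                throw new AssertionError("stats mismatch at dims=" + dims);
            }
        }
        System.out.println("online stats match the two-pass reference");
    }
}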

@ChrisHegarty (Contributor) left a comment

This looks good to me.

I will run the benchmark on my Linux machine, but that can be done post-merge as a follow-up.

@john-wagster (Contributor) left a comment

lgtm

centerAndCalculateOSQStatsEuclidean and centerAndCalculateOSQStatsDp were a bit difficult to follow in places, but reading through each of them they make sense, and I didn't see anything obviously wrong.

@tveasey (Contributor) left a comment

Statistics calculations look correct to me. I think you can make the tail handling a bit cleaner and a bit faster, but functionally it looks good to me.

float min = Float.MAX_VALUE;
float max = -Float.MAX_VALUE;
int i = 0;
int vectCount = 0;
Contributor

nit for consistency

Suggested change
- int vectCount = 0;
+ int vecCount = 0;

FloatVector d2Mean = vecMeanVec.sub(vecMean);
m2Vec = fma(d2Mean, d2Mean, m2Vec);
vectCount = count * FLOAT_SPECIES.length();
vecVar = m2Vec.reduceLanes(ADD);
@tveasey (Contributor) Apr 23, 2025

My inclination is to add the tail handling directly on the reduced vector stats; it simplifies matters...

// Note i will be equal to vector.length if it is a multiple of FLOAT_SPECIES.length().
for (; i < vector.length; i++) {
  centered[i] = vector[i] - centroid[i];
  float delta = centered[i] - vecMean;
  ++vecCount;
  vecMean += delta / vecCount;
  float delta2 = centered[i] - vecMean;
  vecVar = fma(delta, delta2, vecVar);
  min = Math.min(min, centered[i]);
  max = Math.max(max, centered[i]);
  norm2 = fma(centered[i], centered[i], norm2);
}

and the job is done, so there is no need for extra steps to combine.


@benwtrent added the auto-merge-without-approval label on Apr 23, 2025
@benwtrent merged commit 059f91c into elastic:main on Apr 23, 2025
17 checks passed
@benwtrent deleted the feature/panama-vector-accelerated-osq branch on April 23, 2025 16:51
@elasticsearchmachine
Collaborator

💔 Backport failed

Status   Branch   Result
         8.x      Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 127118

@benwtrent
Member Author

💚 All backports created successfully

Status Branch Result
8.x

Questions? Please refer to the Backport tool documentation.

elasticsearchmachine pushed a commit that referenced this pull request Apr 24, 2025
… (#127269)

* Panama vector accelerated optimized scalar quantization (#127118)

* Accelerates optimized scalar quantization with vectorized functions

* Adding benchmark

* Update docs/changelog/127118.yaml

* adjusting benchmark and delta

(cherry picked from commit 059f91c)

* fixing compilation

* reverting unnecessary change
