
Conversation

@benwtrent
Member

When scalar quantizing for bbq, optimizing the quantization intervals can take time, especially at higher bit sizes.

While the impact for bbq will be marginal, it's still a frustrating bottleneck, especially at query time, where the bit size is larger (e.g. 4 bits).

Here are some results from the new JMH benchmark, run on my laptop. The Panama vector implementation is 3-4x faster, and it can likely be made faster still; my Panama vector skills aren't the absolute best.

Benchmark                                 (bits)  (dims)   Mode  Cnt    Score    Error   Units
OptimizedScalarQuantizerBenchmark.scalar       1     384  thrpt   15  131.443 ± 17.037  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       1     702  thrpt   15   78.247 ± 13.923  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       1    1024  thrpt   15   50.635 ±  9.605  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       4     384  thrpt   15  143.211 ± 35.947  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       4     702  thrpt   15   69.438 ± 10.760  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       4    1024  thrpt   15   44.546 ±  5.081  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       7     384  thrpt   15  146.597 ± 17.915  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       7     702  thrpt   15   79.901 ±  7.855  ops/ms
OptimizedScalarQuantizerBenchmark.scalar       7    1024  thrpt   15   53.202 ±  8.419  ops/ms
OptimizedScalarQuantizerBenchmark.vector       1     384  thrpt   15  470.615 ± 62.610  ops/ms
OptimizedScalarQuantizerBenchmark.vector       1     702  thrpt   15  259.081 ± 27.011  ops/ms
OptimizedScalarQuantizerBenchmark.vector       1    1024  thrpt   15  180.375 ± 17.070  ops/ms
OptimizedScalarQuantizerBenchmark.vector       4     384  thrpt   15  443.701 ± 60.452  ops/ms
OptimizedScalarQuantizerBenchmark.vector       4     702  thrpt   15  234.729 ± 13.969  ops/ms
OptimizedScalarQuantizerBenchmark.vector       4    1024  thrpt   15  170.735 ± 22.461  ops/ms
OptimizedScalarQuantizerBenchmark.vector       7     384  thrpt   15  499.138 ± 49.826  ops/ms
OptimizedScalarQuantizerBenchmark.vector       7     702  thrpt   15  274.890 ± 28.438  ops/ms
OptimizedScalarQuantizerBenchmark.vector       7    1024  thrpt   15  171.937 ± 11.087  ops/ms
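
For context on how numbers like these are produced: the benchmark parameterizes over bit size and dimension and compares a plain scalar implementation against one using the Panama vector API (jdk.incubator.vector). Below is only a minimal, self-contained sketch of such a JMH harness, not the actual OptimizedScalarQuantizerBenchmark; the class name, fields, and measurement settings are illustrative, and it exercises just a centering-plus-norm kernel rather than full interval optimization.

// Hypothetical sketch only; names and measurement settings are illustrative and
// this is not the benchmark used for the table above. Requires a JDK with the
// incubating vector API (--add-modules jdk.incubator.vector).
import java.util.Random;
import java.util.concurrent.TimeUnit;

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;
import static jdk.incubator.vector.VectorOperators.ADD;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(3)
@State(Scope.Benchmark)
public class CenterAndStatsBench {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    @Param({ "384", "702", "1024" })
    public int dims;

    public float[] vec;
    public float[] centroid;
    public float[] centered;

    @Setup
    public void setup() {
        Random random = new Random(42);
        vec = new float[dims];
        centroid = new float[dims];
        centered = new float[dims];
        for (int i = 0; i < dims; i++) {
            vec[i] = random.nextFloat();
            centroid[i] = random.nextFloat();
        }
    }

    @Benchmark
    public float scalar() {
        // plain scalar centering plus sum of squares
        float norm2 = 0;
        for (int i = 0; i < dims; i++) {
            centered[i] = vec[i] - centroid[i];
            norm2 = Math.fma(centered[i], centered[i], norm2);
        }
        return norm2;
    }

    @Benchmark
    public float vector() {
        // Panama-vectorized centering plus sum of squares, with a scalar tail
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        for (; i < SPECIES.loopBound(dims); i += SPECIES.length()) {
            FloatVector c = FloatVector.fromArray(SPECIES, vec, i)
                .sub(FloatVector.fromArray(SPECIES, centroid, i));
            c.intoArray(centered, i);
            acc = c.fma(c, acc);
        }
        float norm2 = acc.reduceLanes(ADD);
        for (; i < dims; i++) {
            centered[i] = vec[i] - centroid[i];
            norm2 = Math.fma(centered[i], centered[i], norm2);
        }
        return norm2;
    }
}

With 3 forks and 5 measurement iterations, a sketch like this produces the same Cnt of 15 seen in the table; the real benchmark additionally parameterizes over bits (1, 4, 7) because the quantization step depends on it.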

@benwtrent added the >enhancement, auto-backport, :Search Relevance/Vectors, v8.19.0, and v9.1.0 labels on Apr 21, 2025
@benwtrent requested a review from ChrisHegarty on April 21, 2025 16:20
@elasticsearchmachine added the Team:Search Relevance label on Apr 21, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Hi @benwtrent, I've created a changelog YAML for you.

@benwtrent requested a review from john-wagster on April 21, 2025 16:21
Comment on lines +120 to +165
if (vector.length > 2 * FLOAT_SPECIES.length()) {
    FloatVector vecMeanVec = FloatVector.zero(FLOAT_SPECIES);
    FloatVector m2Vec = FloatVector.zero(FLOAT_SPECIES);
    FloatVector norm2Vec = FloatVector.zero(FLOAT_SPECIES);
    FloatVector minVec = FloatVector.broadcast(FLOAT_SPECIES, Float.MAX_VALUE);
    FloatVector maxVec = FloatVector.broadcast(FLOAT_SPECIES, -Float.MAX_VALUE);
    int count = 0;
    // per-lane online (Welford) running mean and sum of squared deviations
    for (; i < FLOAT_SPECIES.loopBound(vector.length); i += FLOAT_SPECIES.length()) {
        ++count;
        FloatVector v = FloatVector.fromArray(FLOAT_SPECIES, vector, i);
        FloatVector c = FloatVector.fromArray(FLOAT_SPECIES, centroid, i);
        FloatVector centeredVec = v.sub(c);
        FloatVector deltaVec = centeredVec.sub(vecMeanVec);
        norm2Vec = fma(centeredVec, centeredVec, norm2Vec);
        vecMeanVec = vecMeanVec.add(deltaVec.div(count));
        FloatVector delta2Vec = centeredVec.sub(vecMeanVec);
        m2Vec = fma(deltaVec, delta2Vec, m2Vec);
        minVec = minVec.min(centeredVec);
        maxVec = maxVec.max(centeredVec);
        centeredVec.intoArray(centered, i);
    }
    // reduce the per-lane accumulators into scalar stats
    min = minVec.reduceLanes(MIN);
    max = maxVec.reduceLanes(MAX);
    norm2 = norm2Vec.reduceLanes(ADD);
    vecMean = vecMeanVec.reduceLanes(ADD) / FLOAT_SPECIES.length();
    FloatVector d2Mean = vecMeanVec.sub(vecMean);
    m2Vec = fma(d2Mean, d2Mean, m2Vec);
    vectCount = count * FLOAT_SPECIES.length();
    vecVar = m2Vec.reduceLanes(ADD);
}

float tailMean = 0;
float tailM2 = 0;
int tailCount = 0;
// handle the tail with a scalar Welford pass over the remaining elements
for (; i < vector.length; i++) {
    centered[i] = vector[i] - centroid[i];
    float delta = centered[i] - tailMean;
    ++tailCount;
    tailMean += delta / tailCount;
    float delta2 = centered[i] - tailMean;
    tailM2 = fma(delta, delta2, tailM2);
    min = Math.min(min, centered[i]);
    max = Math.max(max, centered[i]);
    norm2 = fma(centered[i], centered[i], norm2);
}
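
For reference on the combine step after the vectorized loop above: the loop keeps a Welford running mean and sum of squared deviations in each SIMD lane, and those per-lane accumulators must then be merged into a single mean and variance. With L lanes that have each processed the same count n of elements, the standard merge identity (Chan et al.) is

$$
\mu = \frac{1}{L}\sum_{j=1}^{L} \mu_j,
\qquad
M_2 = \sum_{j=1}^{L} M_{2,j} \;+\; n \sum_{j=1}^{L} \left(\mu_j - \mu\right)^2
$$

where $\mu_j$ and $M_{2,j}$ are lane j's running mean and sum of squared deviations; the variance over the vectorized prefix is then $M_2$ divided by the total element count.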
Member Author

@tveasey could you take a look here? I think I did the variance calculation correctly, but I might have missed something.

Contributor

Formulas look correct to me. Of course, checking vector lengths 10-100 against a slow calculation and confirming agreement to within a few epsilon will prove there are no errors. Specifically, I would add a test of the vectorised stats calculation against the super simple versions, i.e. compute the mean, then compute the mean of squared residuals, etc., so there is no chance of errors. If you're very close to that for a bunch of random-length vectors, you're good. (This may be what the Lucene reference does, but if it uses an online calculation, my inclination would be to simplify further so there is no chance of errors.)
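
To make that concrete, here is a self-contained sketch of such a check: it compares a scalar online (Welford) mean/variance against the simple two-pass reference on random-length vectors. In the actual test the online side would instead be the vectorized centerAndCalculateOSQStats* implementation; the class name, tolerance, and stand-in Welford loop below are illustrative only.

// Self-contained sketch of the suggested check; the class name, tolerance, and the
// scalar Welford loop standing in for the vectorized implementation are illustrative.
import java.util.Random;

public class StatsReferenceCheck {
    public static void main(String[] args) {
        Random random = new Random(42);
        for (int iter = 0; iter < 1_000; iter++) {
            int dims = 10 + random.nextInt(91); // random lengths in [10, 100]
            float[] centered = new float[dims];
            for (int i = 0; i < dims; i++) {
                // stands in for vector[i] - centroid[i]
                centered[i] = random.nextFloat() - random.nextFloat();
            }

            // online (Welford) mean and sum of squared deviations
            float mean = 0;
            float m2 = 0;
            for (int i = 0; i < dims; i++) {
                float delta = centered[i] - mean;
                mean += delta / (i + 1);
                m2 += delta * (centered[i] - mean);
            }
            float onlineVar = m2 / dims;

            // "super simple" two-pass reference: mean, then mean of squared residuals
            double refMean = 0;
            for (float v : centered) {
                refMean += v;
            }
            refMean /= dims;
            double refVar = 0;
            for (float v : centered) {
                refVar += (v - refMean) * (v - refMean);
            }
            refVar /= dims;

            float eps = 1e-5f * dims;
            if (Math.abs(mean - refMean) > eps || Math.abs(onlineVar - refVar) > eps) {
                throw new AssertionError("stats mismatch at dims=" + dims);
            }
        }
        System.out.println("online stats match the two-pass reference");
    }
}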

@ChrisHegarty (Contributor) left a comment

This looks good to me.

I will run the benchmark on my Linux machine, but that can be done post-merge as a follow-up.

@john-wagster (Contributor) left a comment

lgtm

centerAndCalculateOSQStatsEuclidean and centerAndCalculateOSQStatsDp were a bit difficult to follow in places, but reading through each of them they make sense, and I didn't see anything obviously wrong.

@tveasey (Contributor) left a comment

Statistics calculations look correct to me. I think you can make the tail handling a bit cleaner and a bit faster, but functionally it looks good to me.

float min = Float.MAX_VALUE;
float max = -Float.MAX_VALUE;
int i = 0;
int vectCount = 0;
Contributor

nit for consistency

Suggested change
- int vectCount = 0;
+ int vecCount = 0;

FloatVector d2Mean = vecMeanVec.sub(vecMean);
m2Vec = fma(d2Mean, d2Mean, m2Vec);
vectCount = count * FLOAT_SPECIES.length();
vecVar = m2Vec.reduceLanes(ADD);
@tveasey (Contributor) Apr 23, 2025

My inclination is to add the tail handling directly on the reduced vector stats; it simplifies matters...

// Note i will be equal to vector.length if it is a multiple of FLOAT_SPECIES.length().
for (; i < vector.length; i++) {
  centered[i] = vector[i] - centroid[i];
  float delta = centered[i] - vecMean;
  ++vecCount;
  vecMean += delta / vecCount;
  float delta2 = centered[i] - vecMean;
  vecVar = fma(delta, delta2, vecVar);
  min = Math.min(min, centered[i]);
  max = Math.max(max, centered[i]);
  norm2 = fma(centered[i], centered[i], norm2);
}

and the job is done, so there is no need for extra steps to combine.


@benwtrent added the auto-merge-without-approval label on Apr 23, 2025
@benwtrent merged commit 059f91c into elastic:main on Apr 23, 2025
17 checks passed
@benwtrent deleted the feature/panama-vector-accelerated-osq branch on April 23, 2025 16:51
@elasticsearchmachine
Collaborator

💔 Backport failed

Status   Branch   Result
         8.x      Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 127118

@benwtrent
Member Author

💚 All backports created successfully

Status Branch Result
8.x

Questions? Please refer to the Backport tool documentation.

elasticsearchmachine pushed a commit that referenced this pull request Apr 24, 2025
… (#127269)

* Panama vector accelerated optimized scalar quantization (#127118)

* Accelerates optimized scalar quantization with vectorized functions

* Adding benchmark

* Update docs/changelog/127118.yaml

* adjusting benchmark and delta

(cherry picked from commit 059f91c)

* fixing compilation

* reverting unnecessary change
