Conversation

benwtrent (Member)

This adds `maxSim` functions, specifically dotProduct and InvHamming. Why these two, you might ask? Well, they are the best approximations of what's possible with Col* late-interaction models. Effectively, you want a similarity metric where "greater == better". Regular hamming isn't exactly that, but inverting it (just like our `element_type: bit` index for dense_vectors) is a nice approximation with bit vectors and multi-vector scoring.
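
To make the inversion concrete, here is a minimal sketch of the two ideas, with hypothetical helper names (this is not the actual Painless implementation): inverted hamming counts matching bits instead of differing ones, and the maxSim aggregation sums each query vector's best score across all stored vectors.

```java
// Sketch only: inverted hamming plus max-sim aggregation.
// Method names are hypothetical; this is not the Elasticsearch code.
static int invHamming(byte[] a, byte[] b) {
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
        // xor, masked back to 8 bits, counts differing bit positions
        distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    // invert so that greater == better
    return a.length * Byte.SIZE - distance;
}

static int maxSimInvHamming(byte[][] queryVectors, byte[][] docVectors) {
    int score = 0;
    for (byte[] q : queryVectors) {
        int best = Integer.MIN_VALUE;
        for (byte[] d : docVectors) {
            best = Math.max(best, invHamming(q, d));
        }
        // late interaction: each query vector contributes its best match
        score += best;
    }
    return score;
}
```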

Then, of course, dotProduct is another usage. We will allow dot-product between like elements (bytes -> bytes, floats -> floats) and also `floats -> bit`, where the stored `bit` elements are applied as a "mask" over the float query vectors. This allows for some nice asymmetric interactions.
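
A minimal sketch of that asymmetric case (hypothetical name; the MSB-first bit ordering is an assumption): each set bit in the stored vector masks in the corresponding float query element.

```java
// Sketch only: float query vector against a stored bit vector.
// Assumes query.length == docBits.length * 8 and MSB-first bit order.
static float maskedDotProduct(float[] query, byte[] docBits) {
    float sum = 0f;
    for (int i = 0; i < query.length; i++) {
        // select the i-th bit of the stored vector
        if (((docBits[i >> 3] >> (7 - (i & 7))) & 1) == 1) {
            sum += query[i]; // set bits mask in the float element
        }
    }
    return sum;
}
```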

This is all behind a feature flag, and I need to write a mountain of docs in a separate PR.

@benwtrent benwtrent added the >non-issue, auto-backport, :Search Relevance/Vectors, v9.0.0, and v8.17.0 labels Nov 18, 2024
@benwtrent benwtrent requested a review from a team as a code owner November 18, 2024 21:42
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label Nov 18, 2024
@benwtrent (Member, Author)

@joshdevins you might be interested in this. Anything significant missing?

@jimczi (Contributor) left a comment:

That's amazing @benwtrent!
I left a minor suggestion and a question, LGTM otherwise.


```java
@Override
public Iterator<byte[]> copy() {
    return new ByteVectorIterator(vectorValues, buffer, size);
}
```
Contributor:

Is it ok to reuse the buffer instead of creating a new one for each copy?

benwtrent (Member, Author):

@jimczi my concern is for users who have an iterator they are iterating, but WHILE iterating, they conduct a maxSimDotProduct.

@jdconrad could you comment here? Basically, I don't know if the same class instance for a field is usable in a script, and thus whether my concern here is warranted.

benwtrent (Member, Author):

OK, doing the following:

```
def field = field('vector').get();
def vvs = field.getVectors();
def vvs2 = field.getVectors();
def v1 = vvs.next();
def otherV1 = vvs2.next();
return v1[1];
```

will actually produce the second vector twice. Meaning, iterators obtained from `getVectors()` without a copy underneath share state and a buffer, so `v1`'s value is transformed by the `vvs2.next()` call.

Copying the buffer to keep users from shooting themselves in the foot with two iterators seems like a simple thing to do. This would only affect expert users regardless.

benwtrent (Member, Author):

In short, yes @jimczi, reusing the buffer is a bad idea. Fixed that; `copy` will now create a new buffer. This way all iterators have their own buffer, but for typical maxSim scoring we can still reuse the buffer.
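
For reference, the fixed `copy` presumably looks something like this sketch (paraphrased from the diff above; not the exact committed code):

```java
@Override
public Iterator<byte[]> copy() {
    // allocate a fresh buffer so two live iterators never share mutable state
    return new ByteVectorIterator(vectorValues, new byte[buffer.length], size);
}
```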

Contributor:

Sorry for the delay in commenting. I agree that we should not re-use the buffer: while it's probably safe for expert users, I worry about future code changes that affect the re-usability.

@joshdevins (Member) left a comment:

Super minor comments about error messages. I think people could be confused at times since we're dealing with 2D tensors. I continue to wish we had tensor types in Elasticsearch to keep some of these kinds of issues from arising in the first place.

@benwtrent benwtrent added the auto-merge-without-approval label Nov 20, 2024
@mayya-sharipova (Contributor) left a comment:

Thanks Ben, great addition

@benwtrent (Member, Author)

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit e68f317 into elastic:main Nov 20, 2024
16 checks passed
@benwtrent benwtrent deleted the feature/add-max-sim branch November 20, 2024 21:42
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.x

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Nov 20, 2024
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Nov 20, 2024
elasticsearchmachine pushed a commit that referenced this pull request Nov 20, 2024
@benwtrent benwtrent added v8.18.0 and removed v8.17.0 labels Nov 25, 2024
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
@webeng commented Feb 19, 2025

Hi guys, good work on this. Do you know in what version this will be released? Keen to try it out.

@benwtrent (Member, Author)

@webeng 8.18.0; I am not sure of the exact date for that release.

It's in https://www.elastic.co/cloud/serverless right now.

Note, the mapping is now called rank_vectors. #118804
