
Conversation

@benwtrent
Member

This adds `maxSim` functions, specifically `dotProduct` and `invHamming`. Why these two, you might ask? Well, they are the best approximations of what's possible with Col* late-interaction-type models. Effectively, you want a similarity metric where "greater == better". Regular `hamming` isn't exactly that, but inverting it (just like our `element_type: bit` index for dense_vectors) is a nice approximation for bit vectors and multi-vector scoring.
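
To make that concrete, here is a minimal sketch of maxSim scoring with inverted hamming over packed bit vectors. This is illustrative plain Java with hypothetical names, not the actual Painless implementation:

```java
/** Minimal sketch (not the Elasticsearch implementation) of maxSim
 *  late-interaction scoring with inverted hamming over packed bit vectors. */
final class MaxSimSketch {

    /** Inverted hamming: total bits minus differing bits, so that
     *  "greater == better", unlike raw hamming distance. */
    static int invHamming(byte[] a, byte[] b) {
        int diff = 0;
        for (int i = 0; i < a.length; i++) {
            diff += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return a.length * 8 - diff;
    }

    /** ColBERT-style maxSim: for each query vector, take the best-matching
     *  document vector, then sum those per-query maxima. */
    static int maxSimInvHamming(byte[][] queryVectors, byte[][] docVectors) {
        int total = 0;
        for (byte[] q : queryVectors) {
            int best = Integer.MIN_VALUE;
            for (byte[] d : docVectors) {
                best = Math.max(best, invHamming(q, d));
            }
            total += best;
        }
        return total;
    }
}
```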

Then, of course, `dotProduct` is another usage. We will allow dot products between like elements (bytes -> bytes, floats -> floats) and, of course, allow `floats -> bit`, where the stored `bit` elements are applied as a "mask" over the float queries. This allows for some nice asymmetric interactions.
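
A hedged sketch of that asymmetric case, where each stored bit selects whether the corresponding float query component contributes to the score (the bit packing order and all names here are assumptions, not the shipped behavior):

```java
/** Sketch only: a float query scored against a bit-packed document vector.
 *  Each set bit in the doc vector "masks in" the matching query component. */
final class FloatBitDotSketch {
    static float dotProductFloatBit(float[] query, byte[] docBits) {
        // Assumes 8 dimensions packed per byte: query.length == docBits.length * 8.
        float sum = 0f;
        for (int i = 0; i < query.length; i++) {
            // Assumption: most-significant-bit-first packing within each byte.
            int bit = (docBits[i >> 3] >> (7 - (i & 7))) & 1;
            if (bit == 1) {
                sum += query[i];
            }
        }
        return sum;
    }
}
```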

This is all behind a feature flag, and I need to write a mountain of docs in a separate PR.

@benwtrent added the >non-issue, auto-backport, :Search Relevance/Vectors, v9.0.0, and v8.17.0 labels on Nov 18, 2024
@benwtrent requested a review from a team as a code owner on November 18, 2024 21:42
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine added the Team:Search Relevance label on Nov 18, 2024
@benwtrent
Member Author

@joshdevins you might be interested in this. Anything significant missing?

Contributor

@jimczi left a comment


That's amazing @benwtrent !
I left a minor suggestion and a question, LGTM otherwise.


@Override
public Iterator<byte[]> copy() {
    return new ByteVectorIterator(vectorValues, buffer, size);
}
Contributor


Is it ok to reuse the buffer instead of creating a new one for each copy?

Member Author


@jimczi my concern is the case where users have an iterator they are iterating over, but WHILE iterating, they call a maxSimDotProduct.

@jdconrad could you comment here? Basically, I don't know if the same class instance for a field is usable in a script and thus my concern here is warranted.

Member Author


OK, doing the following:

def field = field('vector').get();
def vvs = field.getVectors();
def vvs2 = field.getVectors();
def v1 = vvs.next(); 
def otherV1 = vvs2.next();
return v1[1];

This will actually cause v1 and otherV1 to both hold the second vector. Meaning, calling getVectors() without a copy underneath returns iterators that share state & buffer, so v1's value is transformed by the vvs2.next() call.

Copying the buffer to keep users from shooting themselves in the foot with two iterators seems like a simple thing to do. This would only be for expert users regardless.
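
For readers outside the thread, here is a small self-contained Java illustration of the hazard described above; the iterator here is a stand-in, not the real ByteVectorIterator:

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical demo of the shared-buffer hazard: both iterators write into
 *  the SAME buffer, so advancing one clobbers values read through the other. */
final class SharedBufferDemo {
    static Iterator<byte[]> iterator(byte[][] vectors, byte[] buffer) {
        return new Iterator<byte[]>() {
            int i = 0;
            public boolean hasNext() { return i < vectors.length; }
            public byte[] next() {
                System.arraycopy(vectors[i++], 0, buffer, 0, buffer.length);
                return buffer; // caller gets a live view of the shared buffer
            }
        };
    }

    public static void main(String[] args) {
        byte[][] vectors = { {1, 1}, {2, 2} };
        byte[] shared = new byte[2];
        Iterator<byte[]> it1 = iterator(vectors, shared);
        Iterator<byte[]> it2 = iterator(vectors, shared);
        byte[] v1 = it1.next();                  // v1 "is" vector 1...
        it2.next();                              // ...until the second iterator advances
        System.out.println(Arrays.toString(v1)); // prints [2, 2], not [1, 1]
    }
}
```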

Member Author


In short, yes @jimczi, reusing the buffer is a bad idea. Fixed that, and now copy() will create a new buffer. This way all iterators have their own buffer, but for typical maxSim scoring we can still reuse the buffer there.
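
Sketched, the fix amounts to allocating a fresh buffer per copy (a fragment mirroring the snippet above, with the same hypothetical field names, not the exact merged code):

```java
@Override
public Iterator<byte[]> copy() {
    // Fresh buffer per copy: iterators no longer clobber each other's values.
    return new ByteVectorIterator(vectorValues, new byte[buffer.length], size);
}
```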

Contributor


Sorry for the delay in commenting. I agree that we should not re-use the buffer: while it's probably safe for expert users, I worry about future code changes that affect the re-usability.

Member

@joshdevins left a comment


Super minor comments about error messages. I think people could be confused at times since we're dealing with 2D tensors. I continue to wish we had tensor types in Elasticsearch to keep some of these kinds of issues from arising in the first place.

@benwtrent added the auto-merge-without-approval label on Nov 20, 2024
Contributor

@mayya-sharipova left a comment


Thanks Ben, great addition

@benwtrent
Member Author

@elasticmachine update branch

@elasticsearchmachine merged commit e68f317 into elastic:main on Nov 20, 2024
16 checks passed
@benwtrent deleted the feature/add-max-sim branch on November 20, 2024 21:42
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.x

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Nov 20, 2024
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Nov 20, 2024
elasticsearchmachine pushed a commit that referenced this pull request Nov 20, 2024
@benwtrent added v8.18.0 and removed v8.17.0 labels on Nov 25, 2024
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
@webeng

webeng commented Feb 19, 2025

Hi guys, good work on this. Do you know in what version this will be released? Keen to try it out.

@benwtrent
Copy link
Member Author

@webeng 8.18.0, I am not sure of the exact date for that release.

It's in https://www.elastic.co/cloud/serverless right now.

Note, the mapping is now called `rank_vectors`. #118804
