Conversation

benwtrent (Member)

This adds `maxSim` functions, specifically dotProduct and InvHamming. Why these two, you might ask? Well, they are the best approximations of what's possible with Col* late-interaction models. Effectively, you want a similarity metric where "greater == better". Regular hamming isn't exactly that, but inverting it (just like our `element_type: bit` index for dense_vectors) is a nice approximation with bit vectors and multi-vector scoring.
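
To make the inversion concrete, here is a minimal sketch of the two ideas, with hypothetical helper names (this is not the actual Painless implementation): inverted hamming counts matching bits instead of differing ones, and the maxSim aggregation sums each query vector's best score across all stored vectors.

```java
// Sketch only: inverted hamming plus max-sim aggregation.
// Method names are hypothetical; this is not the Elasticsearch code.
static int invHamming(byte[] a, byte[] b) {
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
        // xor, masked back to 8 bits, counts differing bit positions
        distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    // invert so that greater == better
    return a.length * Byte.SIZE - distance;
}

static int maxSimInvHamming(byte[][] queryVectors, byte[][] docVectors) {
    int score = 0;
    for (byte[] q : queryVectors) {
        int best = Integer.MIN_VALUE;
        for (byte[] d : docVectors) {
            best = Math.max(best, invHamming(q, d));
        }
        // late interaction: each query vector contributes its best match
        score += best;
    }
    return score;
}
```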

Then, of course, dotProduct is another usage. We will allow dot-product between like elements (bytes -> bytes, floats -> floats) and also `floats -> bit`, where the stored `bit` elements are applied as a "mask" over the float query vectors. This allows for some nice asymmetric interactions.
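
A minimal sketch of that asymmetric case (hypothetical name; the MSB-first bit ordering is an assumption): each set bit in the stored vector masks in the corresponding float query element.

```java
// Sketch only: float query vector against a stored bit vector.
// Assumes query.length == docBits.length * 8 and MSB-first bit order.
static float maskedDotProduct(float[] query, byte[] docBits) {
    float sum = 0f;
    for (int i = 0; i < query.length; i++) {
        // select the i-th bit of the stored vector
        if (((docBits[i >> 3] >> (7 - (i & 7))) & 1) == 1) {
            sum += query[i]; // set bits mask in the float element
        }
    }
    return sum;
}
```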

This is all behind a feature flag, and I need to write a mountain of docs in a separate PR.

@benwtrent benwtrent added the >non-issue, auto-backport, :Search Relevance/Vectors, v9.0.0, and v8.17.0 labels Nov 18, 2024
@benwtrent benwtrent requested a review from a team as a code owner November 18, 2024 21:42
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label Nov 18, 2024
@benwtrent (Member, Author)

@joshdevins you might be interested in this. Anything significant missing?

@jimczi (Contributor) left a comment:

That's amazing @benwtrent!
I left a minor suggestion and a question, LGTM otherwise.


```java
@Override
public Iterator<byte[]> copy() {
    return new ByteVectorIterator(vectorValues, buffer, size);
}
```
Contributor:

Is it ok to reuse the buffer instead of creating a new one for each copy?

benwtrent (Member, Author):

@jimczi my concern is for users who have an iterator they are iterating, but WHILE iterating, they conduct a maxSimDotProduct.

@jdconrad could you comment here? Basically, I don't know if the same class instance for a field is usable in a script, and thus whether my concern here is warranted.

benwtrent (Member, Author):

OK, doing the following:

```
def field = field('vector').get();
def vvs = field.getVectors();
def vvs2 = field.getVectors();
def v1 = vvs.next();
def otherV1 = vvs2.next();
return v1[1];
```

will actually produce the second vector twice. Meaning, iterators obtained from `getVectors()` without a copy underneath share state and a buffer, so `v1`'s value is transformed by the `vvs2.next()` call.

Copying the buffer to keep users from shooting themselves in the foot with two iterators seems like a simple thing to do. This would only affect expert users regardless.

benwtrent (Member, Author):

In short, yes @jimczi, reusing the buffer is a bad idea. Fixed that; `copy` will now create a new buffer. This way all iterators have their own buffer, but for typical maxSim scoring we can still reuse the buffer.
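
For reference, the fixed `copy` presumably looks something like this sketch (paraphrased from the diff above; not the exact committed code):

```java
@Override
public Iterator<byte[]> copy() {
    // allocate a fresh buffer so two live iterators never share mutable state
    return new ByteVectorIterator(vectorValues, new byte[buffer.length], size);
}
```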

Contributor:

Sorry for the delay in commenting. I agree that we should not re-use the buffer: while it's probably safe for expert users, I worry about future code changes that affect the re-usability.

@joshdevins (Member) left a comment:

Super minor comments about error messages. I think people could be confused at times since we're dealing with 2D tensors. I continue to wish we had tensor types in Elasticsearch to keep some of these kinds of issues from arising in the first place.

@benwtrent benwtrent added the auto-merge-without-approval label Nov 20, 2024
@mayya-sharipova (Contributor) left a comment:

Thanks Ben, great addition

@benwtrent (Member, Author)

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit e68f317 into elastic:main Nov 20, 2024
16 checks passed
@benwtrent benwtrent deleted the feature/add-max-sim branch November 20, 2024 21:42
@elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.x

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Nov 20, 2024
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Nov 20, 2024
elasticsearchmachine pushed a commit that referenced this pull request Nov 20, 2024
@benwtrent benwtrent added v8.18.0 and removed v8.17.0 labels Nov 25, 2024
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
@webeng commented Feb 19, 2025

Hi guys, good work on this. Do you know in what version this will be released? Keen to try it out.

@benwtrent (Member, Author)

@webeng 8.18.0; I am not sure of the exact date for that release.

It's in https://www.elastic.co/cloud/serverless right now.

Note, the mapping is now called rank_vectors. #118804
