Remove soar duplicate checking #132617
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance)
server/src/main/java/org/elasticsearch/index/codec/vectors/IVFVectorsReader.java
I like that the change removes the need for the visitedDocs bitset, but it worries me that some of the calls to common APIs will no longer be correct. For example, the following call:
if (scoredDocs > 0) {
    knnCollector.incVisitedCount(scoredDocs);
}
It won't be correct because we are counting some documents twice. Is that a problem?
We may visit the same doc twice, but I think that is ok. We are using "visited" as a stand-in for "number of vector ops", which is correct and exposed via profiling. The top-hit count is still just being exposed as the … What do you think?
I saw in another PR that we might move from visiting x nProbes to a visited ratio (which I think is the right approach, as clusters are not balanced). This change will have an effect on that ratio. Anyway, I do like removing the visitedDocs BitSet, so I am good with this and with trying to make the ratio work considering that we might visit a document twice.
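To make the trade-off discussed above concrete, here is a minimal, self-contained sketch of the two counting strategies. All names are illustrative assumptions, not the actual Elasticsearch or Lucene code: a simplified Collector interface stands in for the real KnnCollector.

import java.util.BitSet;

class VisitedCountSketch {
    interface Collector {
        void collect(int doc, float score);
        void incVisitedCount(int count);
    }

    // Before (sketch): a bitset skips overspilled duplicates, so the visited
    // count equals the number of unique documents scored.
    static void scoreDeduplicated(int[] docs, float[] scores, Collector collector, BitSet visitedDocs) {
        int scoredDocs = 0;
        for (int i = 0; i < docs.length; i++) {
            if (visitedDocs.get(docs[i])) {
                continue; // skip a doc already scored via another cluster
            }
            visitedDocs.set(docs[i]);
            collector.collect(docs[i], scores[i]);
            scoredDocs++;
        }
        if (scoredDocs > 0) {
            collector.incVisitedCount(scoredDocs);
        }
    }

    // After (sketch): every vector in the block is scored, duplicates included,
    // so "visited" counts vector ops rather than unique documents.
    static void scoreWholeBlock(int[] docs, float[] scores, Collector collector) {
        for (int i = 0; i < docs.length; i++) {
            collector.collect(docs[i], scores[i]); // an overspilled doc may be scored twice
        }
        if (docs.length > 0) {
            collector.incVisitedCount(docs.length);
        }
    }
}

In the second variant, "visited" measures scoring work (vector ops) rather than unique documents, which matches how the count is surfaced in profiling.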
LGTM
Through our various benchmarking runs, I have noticed that we do a silly amount of work just handling duplicate vectors for overspill. When it comes to block scoring, it is likely much better to just score the duplicates and deduplicate later. This is indeed the case, and the performance gain grows as the number of vector ops increases.
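As a rough illustration of why this can win (a sketch under assumed names, not the PR's code): filtering duplicates inside the scoring loop adds a per-vector branch and bitset lookup, whereas scoring the whole block keeps the hot loop tight and defers deduplication to the much smaller candidate set.

import java.util.HashMap;
import java.util.Map;

class BlockScoringSketch {
    // Score every vector in a contiguous block, duplicates included. The inner
    // loop stays branch-free and bulk-friendly (amenable to SIMD in practice).
    static float[] scoreBlock(float[][] blockVectors, float[] query) {
        float[] scores = new float[blockVectors.length];
        for (int i = 0; i < blockVectors.length; i++) {
            float dot = 0f;
            for (int d = 0; d < query.length; d++) {
                dot += blockVectors[i][d] * query[d];
            }
            scores[i] = dot;
        }
        return scores;
    }

    // Deduplicate afterwards, over the candidate hits rather than every scored
    // vector: keep the best score per doc ID.
    static Map<Integer, Float> dedupe(int[] docs, float[] scores) {
        Map<Integer, Float> best = new HashMap<>();
        for (int i = 0; i < docs.length; i++) {
            best.merge(docs[i], scores[i], Math::max);
        }
        return best;
    }
}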
Multi-segment Cohere-wiki-768 8M
I ran every nProbe setting 5 times and picked the fastest run.
[Benchmark tables: CANDIDATE vs. BASELINE]
Single segment Cohere-wiki-1024 1M
My thinking was that larger vectors might make block scoring more expensive, so picking individual vectors would be better. Same methodology as above.
[Benchmark tables: Candidate vs. Baseline]