[DiskBBQ] Replace n_probe, related to the number of centroids with visit_percentage, related to the number of documents #132722
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance)
I will benchmark, but this seems like the right way to me.
It feels good to me too. It's not perfect, but it feels like it can work this way.
```diff
- if (Math.min(knnCollector.k(), floatVectorValues.size()) == 0) {
+ if (floatVectorValues.size() == 0) {
      return NO_RESULTS;
  }
  KnnSearchStrategy strategy = searchStrategy;
  if (searchStrategy.getVisitRatio() == 0.0f) {
      // dynamically set the percentage
      float expected = (float) Math.round(1.75f * Math.log10(numCands) * Math.log10(numCands) * (numCands));
      float ratio = expected / floatVectorValues.size();
      strategy = new IVFKnnSearchStrategy(ratio);
  }
  KnnCollector knnCollector = knnCollectorManager.newCollector(visitedLimit, strategy, context);
  if (knnCollector == null) {
```
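As a worked example of the dynamic formula above: with numCands = 100, log10(100) = 2, so expected = round(1.75 × 2 × 2 × 100) = 700 vectors; on a segment holding 1,000,000 vectors that gives a visit ratio of 700 / 1,000,000 = 0.0007, i.e. 0.07% of the segment.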
@iverase you rightly called out that maybe we shouldn't do this.
What we SHOULD do is calculate this percentage across the sum of all segments (potentially having to adjust the auto calculation) and then push that down to each leaf.
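A minimal sketch of that idea, assuming a hypothetical helper named globalVisitRatio: IndexReader#leaves and LeafReader#getFloatVectorValues are real Lucene APIs, the IVFKnnSearchStrategy float-ratio constructor follows the diff above, and everything else is illustrative rather than the actual change.

```java
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;

// Sketch only: compute one dynamic visit ratio over all segments,
// then push the same ratio down to each leaf via the search strategy.
static float globalVisitRatio(IndexReader reader, String field, int numCands) throws java.io.IOException {
    long totalVectors = 0;
    for (LeafReaderContext leaf : reader.leaves()) {
        FloatVectorValues values = leaf.reader().getFloatVectorValues(field);
        if (values != null) {
            totalVectors += values.size(); // vectors in this segment
        }
    }
    if (totalVectors == 0) {
        return 0f;
    }
    // same dynamic budget as the per-segment version above, but over the whole index
    float expected = (float) Math.round(1.75f * Math.log10(numCands) * Math.log10(numCands) * numCands);
    return expected / totalVectors;
}
// usage (illustrative): new IVFKnnSearchStrategy(globalVisitRatio(reader, field, numCands))
```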
Done in ab36139. We might wait for the work on segment affinity before adjusting the ratio per leaf.
I have that done in #132396 already and took a similar approach to yours here; I think it looks good. I'll merge over your changes from today shortly.
@john-wagster I looked into what you did and copied it over here 😉
john-wagster left a comment
lgtm
server/src/main/java/org/elasticsearch/search/vectors/AbstractIVFKnnVectorQuery.java
```java
// dynamically set the percentage
float expected = (float) Math.round(1.75f * Math.log10(numCands) * Math.log10(numCands) * (numCands));
visitRatio = expected / totalVectors;
```
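A quick worked check of this version: numCands = 100 always gives expected = round(1.75 × 2 × 2 × 100) = 700 vectors, whether totalVectors is one hundred thousand or one hundred million; the numerator never grows with the index, so the visit ratio simply decays as 700 / totalVectors. That is the static-numerator concern discussed below.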
Need to benchmark this to make sure it's still sane.
My feeling is that expected should depend on the number of documents; it still only depends on numCands.
@iverase I think so too. Otherwise, we end up with a static scale in the numerator, no matter the number of vectors.
I moved expected to depend on the number of vectors again:

```java
float expected = (float) Math.round(Math.log10(totalVectors) * Math.log10(totalVectors) * (numCands));
```
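As a worked example of the updated formula: with totalVectors = 1,000,000 and numCands = 100, log10(1,000,000) = 6, so expected = round(6 × 6 × 100) = 3,600 vectors and the visit ratio is 3,600 / 1,000,000 = 0.0036, i.e. 0.36%. At totalVectors = 10,000,000 the same numCands gives round(7 × 7 × 100) = 4,900, so the budget now grows with index size instead of staying flat.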
benwtrent left a comment
I tested over multiple data sets (single segment and multi). We end up with pretty good recall curves for different num_candidates and k.
It's obviously not the same as HNSW (just completely different scaling laws), but it's pretty close. I think this is good.
This PR proposes changing the way we budget how many vectors we are going to score during search.
We are currently using the number of centroids to search, which can be defined by a static value, aka n_probe, or computed dynamically if the value is -1. In this PR we change it so we can define a percentage of vectors we want to visit, which can be defined statically or computed dynamically if the value is 0.
The motivation for this change is that centroids are not balanced, so some centroids can contain many more vectors than others, and the search budget is not equal between runs. Using a percentage of vector operations feels much easier to understand.
The only complexity here is that we might be visiting documents more than once due to spilled assignments, so we need to make clear that this is not a percentage of the documents but a percentage of the vector operations (2 * numVectors).
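For example, on an index holding 1,000,000 vectors the budget is taken as a fraction of 2 × 1,000,000 = 2,000,000 vector operations, so a visit_percentage of 1% allows roughly 20,000 scoring operations, some of which may land on the same document via spilled assignments.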
Note that the configuration file for our checkVec utility needs to change: the "n_probe" entry needs to be replaced with "visit_percentage", like:
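The original example isn't preserved in this thread; a minimal sketch of the intent, assuming a key/value-style config entry (the surrounding checkVec config layout and the sample values 10 and 1.0 are illustrative; only the key rename comes from the PR):

```diff
- n_probe: 10
+ visit_percentage: 1.0
```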