-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add support for nested queries for ivf indices #128782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for nested queries for ivf indices #128782
Conversation
|
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
| * This collects the nearest children vectors. Diversifying the results over the provided parent | ||
| * filter. This means the nearest children vectors are returned, but only one per parent | ||
| */ | ||
| class DiversifyingNearestChildrenKnnCollector extends AbstractKnnCollector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly copied from Lucene, its package private there, so we cannot use it wholesale. We may end up mutating it to support ivf more directly. But this is just the first step.
| LeafReader reader = context.reader(); | ||
| FloatVectorValues floatVectorValues = reader.getFloatVectorValues(field); | ||
| if (floatVectorValues == null) { | ||
| if (floatVectorValues == null || knnCollector == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a null collector is now possible if the parent bit set is invalid.
| import org.apache.lucene.search.Query; | ||
| import org.apache.lucene.search.join.BitSetProducer; | ||
|
|
||
| public class DiversifyingChildrenIVFKnnFloatVectorQueryTests extends AbstractDiversifyingChildrenIVFKnnVectorQueryTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the abstract sub-class assuming we will have byte support in the future.
| // TODO do we need to handle nested doc counts similarly to how we handle | ||
| // filtering? E.g. keep exploring until we hit an expected number of parent documents vs. child vectors? | ||
| while (centroidQueue.size() > 0 && centroidsVisited < nProbe && knnCollectorImpl.numCollected() < knnCollector.k()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider doing something similar to our filtering logic, by treating the number of visited vectors, vs. the number of visited parent docs, but I am not 100% sure its absolutely necessary.
If it is necessary, we will need to add some bit set logic to the collector to keep track of the visited parent docs as we cannot do a simple incremental count as we might visit the same parent document multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to keep going until we collected k documents and we visited at least nProbe centroids, shouldn't the condition be:
centroidQueue.size() > 0 && (centroidsVisited < nProbe || knnCollectorImpl.numCollected() < knnCollector.k()))
| ) throws IOException { | ||
| KnnCollector knnCollector = knnCollectorManager.newCollector(visitedLimit, searchStrategy, context); | ||
| LeafReader reader = context.reader(); | ||
| FloatVectorValues floatVectorValues = reader.getFloatVectorValues(field); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the collector is null, we might not want to do this, it is not free.
| @Before | ||
| public void setUp() throws Exception { | ||
| super.setUp(); | ||
| format = new IVFVectorsFormat(128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to randomize the number of vectors per cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iverase I can do that
john-wagster
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
iverase
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This does a first pass at adding nested query support for bbq_ivf indices. The support is pretty simple right now, basically, we keep exploring until we at least get `k` results to cover the case when the nested docs are all tightly clustered and the typical `nprobe` explores too few clusters to actually get `k` docs. I have some weird test failures I need to debug, so opening as draft for now.
This does a first pass at adding nested query support for bbq_ivf indices. The support is pretty simple right now, basically, we keep exploring until we at least get `k` results to cover the case when the nested docs are all tightly clustered and the typical `nprobe` explores too few clusters to actually get `k` docs. I have some weird test failures I need to debug, so opening as draft for now.
This does a first pass at adding nested query support for bbq_ivf indices.
The support is pretty simple right now, basically, we keep exploring until we at least get
kresults to cover the case when the nested docs are all tightly clustered and the typicalnprobeexplores too few clusters to actually getkdocs.I have some weird test failures I need to debug, so opening as draft for now.