@benwtrent
Member

This does a first pass at adding nested query support for bbq_ivf indices.

The support is pretty simple right now: we keep exploring until we collect at least `k` results. This covers the case where the nested docs are all tightly clustered and the typical `nprobe` explores too few clusters to actually yield `k` docs.

I have some weird test failures I need to debug, so opening as draft for now.

@benwtrent benwtrent marked this pull request as ready for review June 4, 2025 15:20
@elasticsearchmachine added the Team:Search Relevance label Jun 4, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

* This collects the nearest children vectors. Diversifying the results over the provided parent
* filter. This means the nearest children vectors are returned, but only one per parent
*/
class DiversifyingNearestChildrenKnnCollector extends AbstractKnnCollector {
Member Author

This is mostly copied from Lucene. It's package-private there, so we cannot use it directly. We may end up modifying it to support IVF more directly, but this is just the first step.
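
To make the "only one per parent" behaviour concrete, here is a minimal sketch of the idea, not the Lucene or Elasticsearch implementation. It assumes the usual block-join layout in which children are indexed before their parent, so a child's parent is the next set bit in the parent bit set. The class `BestChildPerParent` and its methods are hypothetical, it uses `java.util.BitSet` rather than Lucene's bit set types, and it leaves out the top-k heap a real collector would need.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a diversifying collector:
// it remembers only the best-scoring child seen for each parent.
final class BestChildPerParent {
    private final BitSet parentBits;                      // one bit set per parent doc ID
    private final Map<Integer, float[]> bestByParent = new HashMap<>(); // parent -> {childDoc, score}

    BestChildPerParent(BitSet parentBits) {
        this.parentBits = parentBits;
    }

    /** Returns true if this child became the best one seen so far for its parent. */
    boolean collect(int childDoc, float score) {
        // Assumption: children precede their parent, so the parent is the next set bit.
        int parent = parentBits.nextSetBit(childDoc);
        if (parent < 0) {
            return false; // no parent found for this child
        }
        float[] current = bestByParent.get(parent);
        if (current == null || score > current[1]) {
            bestByParent.put(parent, new float[] { childDoc, score });
            return true;
        }
        return false;
    }

    /** Number of distinct parents that have at least one collected child. */
    int numCollected() {
        return bestByParent.size();
    }

    /** Best child doc ID for the given parent, or -1 if none was collected. */
    int bestChild(int parent) {
        float[] entry = bestByParent.get(parent);
        return entry == null ? -1 : (int) entry[0];
    }
}
```

With parents at doc IDs 3 and 6, collecting children 0, 1, 2 and 4 leaves two entries: the highest-scoring child among 0 to 2 for parent 3, and child 4 for parent 6.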

LeafReader reader = context.reader();
FloatVectorValues floatVectorValues = reader.getFloatVectorValues(field);
- if (floatVectorValues == null) {
+ if (floatVectorValues == null || knnCollector == null) {
Member Author

A null collector is now possible if the parent bit set is invalid.

import org.apache.lucene.search.Query;
import org.apache.lucene.search.join.BitSetProducer;

public class DiversifyingChildrenIVFKnnFloatVectorQueryTests extends AbstractDiversifyingChildrenIVFKnnVectorQueryTestCase {
Member Author

I added the abstract base class assuming we will have byte support in the future.

Comment on lines 287 to 289
// TODO do we need to handle nested doc counts similarly to how we handle
// filtering? E.g. keep exploring until we hit an expected number of parent documents vs. child vectors?
while (centroidQueue.size() > 0 && centroidsVisited < nProbe && knnCollectorImpl.numCollected() < knnCollector.k()) {
Member Author

I did consider doing something similar to our filtering logic, comparing the number of visited vectors against the number of visited parent docs, but I am not 100% sure it's necessary.

If it is necessary, we will need to add some bit set logic to the collector to track the visited parent docs. A simple incrementing counter won't work because we might visit the same parent document multiple times.
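
For illustration, the bit set bookkeeping described above could look roughly like the following. This is only a sketch under assumptions: `VisitedParentTracker` is a hypothetical name, it uses `java.util.BitSet` instead of Lucene's bit set classes, and it assumes a child's parent is the next set bit after the child's doc ID.

```java
import java.util.BitSet;

// Hypothetical helper: counts distinct parent docs reached via visited child vectors.
final class VisitedParentTracker {
    private final BitSet parentBits;                    // one bit set per parent doc ID
    private final BitSet visitedParents = new BitSet(); // parents already counted
    private int distinctParents;

    VisitedParentTracker(BitSet parentBits) {
        this.parentBits = parentBits;
    }

    /** Records a visited child; the counter only grows the first time its parent is seen. */
    void visitChild(int childDoc) {
        // Assumption: children precede their parent in the doc ID space.
        int parent = parentBits.nextSetBit(childDoc);
        if (parent >= 0 && !visitedParents.get(parent)) {
            visitedParents.set(parent);
            distinctParents++;
        }
    }

    int distinctParentsVisited() {
        return distinctParents;
    }
}
```

Visiting the same child twice, or two children of the same parent, leaves the count unchanged, which is the property a plain incrementing counter cannot provide.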

Contributor

@iverase Jun 5, 2025

If we want to keep going until we have collected k documents and visited at least nProbe centroids, shouldn't the condition be:

centroidQueue.size() > 0 && (centroidsVisited < nProbe || knnCollectorImpl.numCollected() < knnCollector.k())
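
The difference between the two loop conditions can be seen in a small simulation. This is a hedged sketch, not the PR's code: `ProbeLoopSketch` and `hitsPerCentroid` are hypothetical, each array entry stands for the number of matching docs a centroid contributes, and the collected count is a raw tally rather than a top-k collector capped at `k`.

```java
// Hypothetical simulation of the centroid exploration loop under both conditions.
final class ProbeLoopSketch {

    /** Stops once nProbe centroids are visited OR k docs are collected. Returns {visited, collected}. */
    static int[] withAnd(int[] hitsPerCentroid, int nProbe, int k) {
        int visited = 0;
        int collected = 0;
        while (visited < hitsPerCentroid.length && visited < nProbe && collected < k) {
            collected += hitsPerCentroid[visited++];
        }
        return new int[] { visited, collected };
    }

    /** Stops only once nProbe centroids are visited AND k docs are collected. Returns {visited, collected}. */
    static int[] withOr(int[] hitsPerCentroid, int nProbe, int k) {
        int visited = 0;
        int collected = 0;
        while (visited < hitsPerCentroid.length && (visited < nProbe || collected < k)) {
            collected += hitsPerCentroid[visited++];
        }
        return new int[] { visited, collected };
    }
}
```

With `hitsPerCentroid = {1, 1, 1, 5}`, `nProbe = 2` and `k = 4`, the `&&` form stops after two centroids with only two docs, while the `||` form continues through all four centroids until at least `k` docs are collected.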

) throws IOException {
KnnCollector knnCollector = knnCollectorManager.newCollector(visitedLimit, searchStrategy, context);
LeafReader reader = context.reader();
FloatVectorValues floatVectorValues = reader.getFloatVectorValues(field);
Contributor

If the collector is null, we might not want to do this; it is not free.

@Before
public void setUp() throws Exception {
super.setUp();
format = new IVFVectorsFormat(128);
Contributor

Would it make sense to randomize the number of vectors per cluster?

Member Author

@iverase I can do that

Contributor

@john-wagster left a comment

lgtm

Contributor

@iverase left a comment

LGTM

@benwtrent added the auto-merge-without-approval label Jun 6, 2025
@elasticsearchmachine elasticsearchmachine merged commit b5d5229 into elastic:main Jun 9, 2025
18 checks passed
@benwtrent benwtrent deleted the nested-ivf-queries branch June 9, 2025 17:01
benchaplin pushed a commit to benchaplin/elasticsearch that referenced this pull request Jun 9, 2025
valeriy42 pushed a commit to valeriy42/elasticsearch that referenced this pull request Jun 12, 2025
Labels: auto-merge-without-approval, >non-issue, :Search Relevance/Vectors, Team:Search Relevance, v9.1.0
