Skip to content

Conversation

mayya-sharipova
Copy link
Contributor

@mayya-sharipova mayya-sharipova commented Oct 2, 2024

Current knn search over nested vectors only supports filtering on parent's metadata.
This adds support for filtering over nested metadata.

Closes #106994
Closes #128803

Current knn search over nested vectors only supports filtering
on parent's meatadata. This adds support for filtering over
nested metadata.

Closes elastic#106994
Copy link
Contributor

github-actions bot commented Oct 2, 2024

Documentation preview:

@elasticsearchmachine elasticsearchmachine added v9.0.0 Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels Oct 2, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Copy link
Collaborator

Hi @mayya-sharipova, I've created a changelog YAML for you.

@benwtrent benwtrent self-requested a review October 2, 2024 14:09
Comment on lines 503 to 552
// If filter matches non-nested docs, we assume this is a filter over parents docs,
// so we will modify it accordingly: matching parents docs with join to its child docs
if (nestedHelper.mightMatchNonNestedDocs(filterQuery, parentPath)) {
// Ensure that the query only returns parent documents matching `filterQuery`
filterQuery = Queries.filtered(filterQuery, parentFilter);
filterQuery = new ToChildBlockJoinQuery(filterQuery, parentBitSet);
}
// Now join the filterQuery & parentFilter to provide the matching blocks of children
filterQuery = new ToChildBlockJoinQuery(filterQuery, parentBitSet);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work when filtering over both a nested & top level metadata?

Something like:

{ "bool": { "filter" : [{"nested" : {"path": ..., {...}}}, {"term": {"top-level-thing": "foo"}}]}}

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we run a query like this where this a filter on both parent and nested metadata, it will NOT work, and return 0 results:

{
    "query": {
        "nested": {
            "path": "paragraphs",
            "query": {
                "knn": {
                    "field": "paragraphs.vector",
                    "query_vector": [
                        1,
                        1,
                        1
                    ],
                    "k": 2,
                    "filter": [
                        {
                            "match": {
                                "paragraphs.language": "FR"
                            }
                        },
                        {
                            "match": {
                                "title": "title2"
                            }
                        }
                    ]
                }
            },
            "inner_hits": {
                "size": 2
            }
        }
    }
}

I can not think of anyway to support both filters like this. I suggest we can document this as a limitation, and advice to use a filter for parent's metadata as a post-filter. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its weird to me that this doesn't work:

POST test/_search 
{
    "query": {
        "nested": {
            "path": "nested",
            "query": {
                "knn": {
                    "field": "nested.vector",
                    "query_vector": [
                        1,
                        1,
                        1,
                        1,
                        1
                    ],
                    "k": 2,
                    "filter": [
                        {
							"nested": {
								"path": "nested",
								"query": {
									"match": {
										"nested.language": "EN"
									}
								}
							}
                        }
                    ]
                }
            },
            "inner_hits": {
                "size": 2
            }
        }
    }
}

It seems to me that filter should have a "top level view" of the docs and allow you to have a separate query that gets executed that just works and joins to the appropriate knn level. I am not sure if that is possible given the current API. If its truly a hard restriction, we should indicate it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benwtrent Sorry, I got confused with your last comment.

Why is it weird for you that the query you provided above doesn't work? I think it is mal-formed query, we already provided "nested" context on top, so knn filter should not contain "nested". That's how other nested queries work: inside nested query, we query only on nested metadata, and to my mind , it was wrong to allow filter on parents' metadata.


It seems to me that filter should have a "top level view" of the docs and allow you to have a separate query that gets executed that just works and joins to the appropriate knn level.

Can you please clarify what it means.

Copy link
Member

@benwtrent benwtrent Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mayya-sharipova I need to think a bit more. Its difficult to fully resolve the user experience between the top-level kNN object (which has NO signal that its nested) and a query which obviously has a nested context as its inside a nested query.

This & simplicity was why I thought (and mostly still do), think filtering on top-level metadata by default is most useful.

I am not even sure how script_score actually works in these various scenarios.

Originally, Jim proposed a new query type that would allow a nested context to join back to the parent docs and then provide a parent level filter (reverse_join_parent type thing). Maybe that is the only way we can do this.

But I do not think we should ship this without the ability to filter on both (parent & nested) at the same time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is that old POC of jim's: main...jimczi:elasticsearch:nested_dense_vector

I honestly am not sure how to square this well.

But, I do think one of the following is going to be necessary:

  • The filter for kNN's need to be treated like a top-level query, meaning it completely ignores its nested context and allows for a completely separate set filters. This is admittedly difficult (maybe impossible with how we do parent ID joining while search the graph).
  • We detect that somebody is doing nested search and try to rewrite it so that it works :/ (yeah, I am not sure this is actually possible, again)
  • We keep the current behavior, but add a reverse_nested query so that folks can do a complex combination of nested & higher level filters. I am unsure if this is possible while maintaining current BWC.

These are some ideas, I realize that they are various levels of impossible/difficult. We can talk synchronously to brain storm further if you want.

@telendt
Copy link
Contributor

telendt commented May 21, 2025

@mayya-sharipova
What's missing here?
I understand that @benwtrent has some concerns, and while I don't understand all of them, I would like to give some user context.

Yes, KNN query still feels weird in many cases. I understand that there are many reasons for that, and that there is a desire to find a better, more intuitive API. Yet, looking at this MR, it looks like it works, improves on some limitations of current building blocks, and potentially solves some real problems your users experience now.

I work on enterprise search, and we want to filter on the parent document level as well as fields of nested paragraphs, holding things like detected entities or sentiment. And from what I see (looking at the test), this change solves exactly this problem.

I understand that a more complex filtering logic, like mixing parent and nested child filters (e.g., wrapped in a Boolean query and placed inside the should clause) won't work. But that is fine.

I'm pretty sure you'll find a better way of expressing such queries one day. But to me, it looks like enough time was spent, and I don't think you should wait any longer here.

@tihom88
Copy link

tihom88 commented Jun 3, 2025

@mayya-sharipova, with this change, are we focussing on queries based on same level data?
We also have a case where we want to filter documents based on metadata in nested sibling document. Can we enhance this functionality to support it also. #128803
I think @benwtrent's ask was to make filters generic enough and align with filter queries supported for text.

Comment on lines +186 to +189
// filter can be both over parents or nested docs, so add them as should clauses to a filter
BoolQueryBuilder adjustedFilter = new BoolQueryBuilder().should(filter)
.should(new ToChildBlockJoinQueryBuilder(filter));
boolQuery.filter(adjustedFilter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lucene doesn't break here?

I guess we just do the right thing? I know Lucene will complain if we attempt to query things that are or are not child docs vs. parent docs. etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, Lucene ToChildBlockJoinQuery doesn't allow a filter that can also match child docs. But in our ToChildBlockJoinQueryBuilder::doToQuery, I've added extra parentFilter to ensure that only parents docs can match:

// ensure that parentQuery only applies to parent docs by adding parentFilter
return new ToChildBlockJoinQuery(Queries.filtered(parentQuery, parentFilter), parentBitSet);



---
"Test filter on nested fields with filter on both nested and parent metadata is not allowed":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a test/scenario where we are doing a filter on another nested field?

so, assume we have

{
  "nested_1": {"properties": {"vector"....}},
  "nested_2": {"properties": {"product_name":....}}
}

Then the filter would be:

filter: [{ nested: {path: nested_2, query: {match: { nested_2.product_name: "FR" }}] or something (unsure if I got all the nesting right...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, I think our "auto-nested check" should only apply in the case of the filter matching the vectors nested context. Any other filter should be applied as a "top level" filter and return the relevant parent doc ids (I am unsure if this is exactly the case, but it seems to be right now).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, replicate such a test in the knn query tests if possible.

Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Jul 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up.

In acbfccd I've added a test for nested sibling fields. Currently:

  • For nested knn query, we can NOT filter on nested sibling fields, as the query DSL itself doesn't allow it.
  • For top level knn search, we can filter on nested sibling fields, and effectively addressing the 128803 issue.

Are you not happy with this behaviour, are you saying we should NOT allow filter on sibling nested fields in nested knn search and throw an exception, which would mean we should close 128803 as not going to implement?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For nested knn query, we can NOT filter on nested sibling fields, as the query DSL itself doesn't allow it.

This is weird. A separate nested field should be able to join up to the parent level, indicate the parents that are filtered and we join back down to the child level to filter on the vectors. I don't know how to express that in the DSL, but we should document it as a limitation.

nested2.value: "domestic"
inner_hits: { size: 3, "fields": [ "nested.paragraph_id", "nested.language"], _source: false }

- match: { hits.total.value: 0 }
Copy link
Contributor Author

@mayya-sharipova mayya-sharipova Jul 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benwtrent Notice how for nested query the DSL itself doesn't allow to filter on nested sibling fields, as the top level query:nested:path sets the parent context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mayya-sharipova thank you.

I was thinking it would be possible via boolean with two separate nested contexts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They key issue is that with kNN search, we need to be clear that its within a SINGLE nested context (either top level, or its own context).

That gets tricky, and maybe we can make it better in the future. But the current compromise unblocks some folks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can not think of any way to support pre-filtering on nested sibling fields in knn query. We can do the following, but it would be post-filter:

  - do:
      search:
        index: test
        body:
          _source: false
          query:
            bool:
              must:
                - nested:
                    path: nested
                    query:
                      knn:
                        field: nested.vector
                        query_vector: [-0.5, 90.0, -10, 14.8, -156.0]
                        k: 10
              filter:
                - nested:
                    path: nested2
                    query:
                      bool:
                        filter:
                          - match:
                              nested2.key: "category"
                          - match:
                              nested2.value: "domestic"

nested2.key: "category"
- match:
nested2.value: "domestic"
- match: { hits.total.value: 2}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here with the current implementation we can query nested sibling fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was sort of hoping that we could do something similar in the nested query since we look to the parent level. But it gets too complicated given the nested context?

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Jul 14, 2025

@benwtrent Thanks for your latest feedback, but I am not sure what is the path forward:

  • Are we ok with current implementation where knn search supports filtering on sibling nested fields, and knn query doesn't have DSL support for it?

@benwtrent
Copy link
Member

Are we ok with current implementation where knn search supports filtering on sibling nested fields, and knn query doesn't have DSL support for it?

I think this is OK. In the query DSL, we declare the KNN query to be particularly within a given nested context.

For the top level knn object, we infer the nested context and are able to have more flexibility.

it makes me wonder WHY it works...it likely has to do with the nested context...

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current limitations are understandable. We just need to be clear in documentation and testing whats actually doable.

Thank you for iterating on this.

nested2.key: "category"
- match:
nested2.value: "domestic"
- match: { hits.total.value: 2}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was sort of hoping that we could do something similar in the nested query since we look to the parent level. But it gets too complicated given the nested context?

@elasticsearchmachine
Copy link
Collaborator

Hi @mayya-sharipova, I've updated the changelog YAML for you.

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented Jul 29, 2025

@elasticsearchmachine run "elasticsearch-ci/8.17.10 / Part 2 / bwc-snapshots"

@mayya-sharipova
Copy link
Contributor Author

@elasticsearchmachine run elasticsearch-ci/bwc-snapshots-part2

@mayya-sharipova mayya-sharipova added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jul 29, 2025
mayya-sharipova added a commit to mayya-sharipova/docs-content that referenced this pull request Jul 29, 2025
This is to update our documentation that we now can support
knn filter on nested metadata.

Related to elastic/elasticsearch#113949
@elasticsearchmachine elasticsearchmachine merged commit 0e63c90 into elastic:main Jul 30, 2025
33 checks passed
@mayya-sharipova mayya-sharipova deleted the knn_query_nested_filter branch July 30, 2025 12:46
mayya-sharipova added a commit to elastic/docs-content that referenced this pull request Jul 30, 2025
This is to update our documentation that we now can support knn filter
on nested metadata.

Related to elastic/elasticsearch#113949

---------

Co-authored-by: Liam Thompson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement :Search Relevance/Vectors Vector search Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KNN filters on metadata of sibling nested documents Pre-filter nested fields in knn queries

6 participants