-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[POC] Multi Match Implementation to Support semantic_text #132516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[POC] Multi Match Implementation to Support semantic_text #132516
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass POC review - I didn't focus on an in-depth code review here, but high level functionality. I think this is a good start, but I'd like to see us do some refactoring here and not rely on copy-pasta as much. Thanks for the work here @Samiul-TheSoccerFan !
...ava/org/elasticsearch/xpack/inference/queries/SemanticMultiMatchQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/queries/SemanticMultiMatchQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/queries/SemanticMultiMatchQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/queries/SemanticMultiMatchQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/queries/SemanticMultiMatchQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/queries/SemanticMultiMatchQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/queries/SemanticMultiMatchQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/queries/SemanticMultiMatchQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
...ava/org/elasticsearch/xpack/inference/queries/SemanticMultiMatchQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kderusso's suggestion to implement SemanticQueryRewriteInterceptor is a good one. We can get interceptAndRewrite to work with both single and multi field queries, it just takes some rework that I missed initially.
The key is to make InferenceIndexInformationForField a record for every field in the query, not just a single field. Kind of like what you do with MultiFieldInferenceInfo. We can then pass that to buildCombinedInferenceAndNonInferenceQuery and buildInferenceQuery to build the query we want.
As far as how to build the query, the concept is to have two query groups: inference fields and non-inference fields. Inference fields should be queried using semantic queries. Non-inference fields should be queried using a multi_match query. Then stitch the two query groups together with a dismax query configured appropriately for the multi-match query type.
See MultiFieldsInnerRetrieverUtils as an example of how to create these query groups. In that case, we are combining them using retrievers, so that part will need to be adjusted, but the grouping concept is the same.
...ava/org/elasticsearch/xpack/inference/queries/SemanticMultiMatchQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/xpack/inference/queries/SemanticQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
|
@elasticmachine update branch |
|
merge conflict between base and head |
|
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look at the followup changes, and I have a few questions. Not sure if they're things you've already talked about with others.
| public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_1 = def(8_841_0_65); | ||
| public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_2 = def(8_841_0_66); | ||
| public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_3 = def(8_841_0_67); | ||
| public static final TransportVersion MULTI_MATCH_SEMANTIC_TEXT_SUPPORT_8_19 = def(8_841_0_68); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to backport this to 8.19? We only want this to be in 9.2 right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AIU, it would be only 9.2 but I have not discussed that yet. I will remove this in the original PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't backport new features, only bug fixes. 8.19 was a special case because we were developing two sibling releases (8.19 & 9.1) at the same time.
| public static final int DEFAULT_MAX_EXPANSIONS = FuzzyQuery.defaultMaxExpansions; | ||
| public static final ZeroTermsQueryOption DEFAULT_ZERO_TERMS_QUERY = MatchQueryParser.DEFAULT_ZERO_TERMS_QUERY; | ||
| public static final boolean DEFAULT_FUZZY_TRANSPOSITIONS = FuzzyQuery.defaultTranspositions; | ||
| private boolean resolveInferenceFieldWildcards = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did we add this to the multi match builder, rather than keeping all logic in the inference plugin? That seems odd to me and a lot of extra stuff that we have to pass around? I may be missing a good reason to do this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resolveInferenceFieldWildcards parameter was added to MultiMatchQueryBuilder because both LinearRetrieverBuilder and RRFRetrieverBuilder directly create MultiMatchQueryBuilder instances and need to prevent double-querying of semantic_text fields when using their simplified syntax. The changes in MultiFieldsInnerRetrieverUtils.java, LinearRetrieverBuilderTests, and RRFRetrieverBuilderTests provide additional context for this approach. Please let me know if this does not answer your question and need more clarification. Also, I am open to suggestions if there is a better way to handle this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining your thought process around adding this.
If you look at the core details of what you added, though, you're only adding a flag to the MultiMatchQueryBuilder and not actually doing anything with that flag in this class itself. To me, that's a really big signal that this flag should not exist in this class. Furthermore, the linear and RRF retrievers are already performing wildcard resolution on inference fields in MultiFieldsInnerRetrieverUtils#generateInnerRetrieversForIndex. Please remove this and try to come up with a solution that is contained to the interceptors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed we need to adjust this. We need to separate the concepts of handling inference fields and resolving wildcards. AFAIU handling inference fields needs to be a multi-match query param so that the user can enable it (it should be false by default). In contrast, resolving wildcards is an immutable property of the query that is not configurable by the user.
| * @param queryBuilder {@link QueryBuilder} | ||
| * @return true if inference field wildcards should be resolved, false otherwise. | ||
| */ | ||
| protected abstract boolean isResolveInferenceFieldWildcardsRequired(QueryBuilder queryBuilder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without knowing more about why you added this change, this is setting off my spidey sense. Why does this have to be propagated to all interceptors like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multi_match queries control the resolveInferenceFieldWildcards parameter, while other query types like match or knn_vector don't support wildcards so they always return false. This follows the same pattern as other methods where each interceptor knows how to handle its own query type and field/s.
...src/main/java/org/elasticsearch/xpack/inference/queries/SemanticQueryRewriteInterceptor.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some clarifying comments on this PR per our discussion. Please let me know if you have any questions!
| public static final int DEFAULT_MAX_EXPANSIONS = FuzzyQuery.defaultMaxExpansions; | ||
| public static final ZeroTermsQueryOption DEFAULT_ZERO_TERMS_QUERY = MatchQueryParser.DEFAULT_ZERO_TERMS_QUERY; | ||
| public static final boolean DEFAULT_FUZZY_TRANSPOSITIONS = FuzzyQuery.defaultTranspositions; | ||
| private boolean resolveInferenceFieldWildcards = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining your thought process around adding this.
If you look at the core details of what you added, though, you're only adding a flag to the MultiMatchQueryBuilder and not actually doing anything with that flag in this class itself. To me, that's a really big signal that this flag should not exist in this class. Furthermore, the linear and RRF retrievers are already performing wildcard resolution on inference fields in MultiFieldsInnerRetrieverUtils#generateInnerRetrieversForIndex. Please remove this and try to come up with a solution that is contained to the interceptors.
| } | ||
|
|
||
| @Override | ||
| protected String getFieldName(QueryBuilder queryBuilder) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of the refactoring that I think you could do in a first PR.
| * @param queryBuilder {@link QueryBuilder} | ||
| * @return true if inference field wildcards should be resolved, false otherwise. | ||
| */ | ||
| protected abstract boolean shouldResolveInferenceFieldWildcards(QueryBuilder queryBuilder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't belong as an abstract method.
The first thing that jumps out to me here is that you're defining an abstract method that all interceptors have to implement - however it's going to always be false except in the case of the Multi-Match interceptor. So let's try to figure out a way to do what you need to do without adding this method to every class.
| fieldsToProcess = QueryParserHelper.parseFieldsAndWeights(defaultFields); | ||
| } | ||
|
|
||
| // Resolve wildcards for inference fields and multiply boosts when field matches multiple patterns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing that jumps out to me here, is that we're resolving wildcards here, but this code path will only ever be executed by the Multi-Match interceptor.
Instead of adding knobs for that here in the parent class, why not break this logic out here into its own specific method, and move the wildcard specific logic to an override method only executed by the Multi-Match interceptor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed we can clean this up. Not sure what the solution ultimately looks like, but it needs to account for all of the following:
- For
match,knn, andsparse_vectorqueries, we don't want any wildcard resolution - When
multi_match's inference field handling is disabled, we don't want to handle any inference fields (wildcard or not). - When
multi_match's inference field handling is enabled, we want to handle both explicitly queried and wildcard-resolved inference fields.
This implementation doesn't do that. @kderusso has a good suggestion about implementing wildcard resolution only in the multi-match interceptor. Not sure how that would work yet, but as an exercise maybe try creating two implementations:
- One for
match,knn, andsparse_vectorwith no wildcard resolution that always handles inference fields - Another for
multi_matchwith optional inference field and wildcard resolution
Then we can examine the two and look for opportunities to combine them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried multiple approach and this looks like better for now. However, as Jim suggested, if we decide to handle wildcard in a separate PR, this might get also changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stuck with a high-level review, so I may have missed things. There are some things to sort out regarding behavior differences between the existing query interceptors and the multi-match interceptor.
| public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_1 = def(8_841_0_65); | ||
| public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_2 = def(8_841_0_66); | ||
| public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_3 = def(8_841_0_67); | ||
| public static final TransportVersion MULTI_MATCH_SEMANTIC_TEXT_SUPPORT_8_19 = def(8_841_0_68); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't backport new features, only bug fixes. 8.19 was a special case because we were developing two sibling releases (8.19 & 9.1) at the same time.
| public static final int DEFAULT_MAX_EXPANSIONS = FuzzyQuery.defaultMaxExpansions; | ||
| public static final ZeroTermsQueryOption DEFAULT_ZERO_TERMS_QUERY = MatchQueryParser.DEFAULT_ZERO_TERMS_QUERY; | ||
| public static final boolean DEFAULT_FUZZY_TRANSPOSITIONS = FuzzyQuery.defaultTranspositions; | ||
| private boolean resolveInferenceFieldWildcards = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed we need to adjust this. We need to separate the concepts of handling inference fields and resolving wildcards. AFAIU handling inference fields needs to be a multi-match query param so that the user can enable it (it should be false by default). In contrast, resolving wildcards is an immutable property of the query that is not configurable by the user.
| fieldsToProcess = QueryParserHelper.parseFieldsAndWeights(defaultFields); | ||
| } | ||
|
|
||
| // Resolve wildcards for inference fields and multiply boosts when field matches multiple patterns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed we can clean this up. Not sure what the solution ultimately looks like, but it needs to account for all of the following:
- For
match,knn, andsparse_vectorqueries, we don't want any wildcard resolution - When
multi_match's inference field handling is disabled, we don't want to handle any inference fields (wildcard or not). - When
multi_match's inference field handling is enabled, we want to handle both explicitly queried and wildcard-resolved inference fields.
This implementation doesn't do that. @kderusso has a good suggestion about implementing wildcard resolution only in the multi-match interceptor. Not sure how that would work yet, but as an exercise maybe try creating two implementations:
- One for
match,knn, andsparse_vectorwith no wildcard resolution that always handles inference fields - Another for
multi_matchwith optional inference field and wildcard resolution
Then we can examine the two and look for opportunities to combine them.
| if (resolveInferenceFieldWildcards && Regex.isMatchAllPattern(field)) { | ||
| indexInferenceMetadata.keySet().forEach(f -> { | ||
| indexInferenceFields.put(f, indexInferenceMetadata.get(f)); | ||
| addToFieldBoostsMap(fieldBoosts, f, boost); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this logic is going to work. Say you're using a query like:
{
"multi_match": {
"fields": [ "*_semantic^2" ],
"query": "foo"
}
}
And you're querying two indices that both contain the field text_semantic.
The total boost should be 2. However, with your implementation it will be 4 because the field will be added to the boost map once per matching index.
You can leverage the fact that any given unique field name will match the same set of wildcards regardless of index to generate a global field boost map, but that approach requires different logic than this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
| if (minimumShouldMatch != null || isMostFields) { | ||
| // Use BoolQuery when minimum_should_match is specified or for MOST_FIELDS | ||
| BoolQueryBuilder boolQuery = new BoolQueryBuilder(); | ||
|
|
||
| // Add semantic queries | ||
| for (String fieldName : inferenceInfo.getAllInferenceFields()) { | ||
| Set<String> semanticIndices = inferenceInfo.getInferenceIndicesForField(fieldName); | ||
| if (semanticIndices.isEmpty() == false) { | ||
| boolQuery.should( | ||
| createSemanticSubQuery(semanticIndices, fieldName, queryValue).boost(inferenceInfo.getFieldBoost(fieldName)) | ||
| ); | ||
| } | ||
| } | ||
|
|
||
| // Add non-inference queries | ||
| addNonInferenceQueries(boolQuery::should, originalQuery, inferenceInfo); | ||
|
|
||
| // Set minimum_should_match - default to "1" for MOST_FIELDS when not specified | ||
| boolQuery.minimumShouldMatch(Objects.requireNonNullElse(minimumShouldMatch, "1")); | ||
| boolQuery.boost(originalQuery.boost()); | ||
| boolQuery.queryName(originalQuery.queryName()); | ||
| return boolQuery; | ||
| } else { | ||
| // Use DisMaxQuery for BEST_FIELDS without minimum_should_match | ||
| DisMaxQueryBuilder disMaxQuery = QueryBuilders.disMaxQuery(); | ||
|
|
||
| // Add semantic queries | ||
| for (String fieldName : inferenceInfo.getAllInferenceFields()) { | ||
| Set<String> semanticIndices = inferenceInfo.getInferenceIndicesForField(fieldName); | ||
| if (semanticIndices.isEmpty() == false) { | ||
| disMaxQuery.add( | ||
| createSemanticSubQuery(semanticIndices, fieldName, queryValue).boost(inferenceInfo.getFieldBoost(fieldName)) | ||
| ); | ||
| } | ||
| } | ||
|
|
||
| // Add non-inference queries | ||
| addNonInferenceQueries(disMaxQuery::add, originalQuery, inferenceInfo); | ||
|
|
||
| Float tieBreaker = originalQuery.tieBreaker(); | ||
| disMaxQuery.tieBreaker(Objects.requireNonNullElseGet(tieBreaker, () -> originalQuery.type().tieBreaker())); | ||
| disMaxQuery.boost(originalQuery.boost()); | ||
| disMaxQuery.queryName(originalQuery.queryName()); | ||
| return disMaxQuery; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't what I meant by combining the logic. You can examine the multi-match query implementation to see how both most-fields and best-fields cases can be written using a dismax query.
Additionally, the minimum-should-match parameter is applied incorrectly. If you refer to our documentation, you can see that that param is applied for the number of terms that match, not the number of fields that match. You implemented the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think about this approach?
| Map<String, Float> fieldsToProcess = fieldsWithWeights; | ||
| if (fieldsToProcess.isEmpty()) { | ||
| Settings settings = indexMetadata.getSettings(); | ||
| List<String> defaultFields = settings.getAsList(DEFAULT_FIELD_SETTING.getKey(), DEFAULT_FIELD_SETTING.getDefault(settings)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only want to do this for the multi-match query. Another thing to consider when adjusting the approach here to handle both match/knn/sparse_vector and multi_match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed this into MultiMatch interceptor.
| if (minimumShouldMatch != null) { | ||
| // Use BoolQuery when minimum_should_match is specified | ||
| BoolQueryBuilder boolQuery = new BoolQueryBuilder(); | ||
| for (String fieldName : inferenceFields) { | ||
| boolQuery.should(createSemanticQuery(fieldName, queryValue, indexInformation)); | ||
| } | ||
| boolQuery.minimumShouldMatch(minimumShouldMatch); | ||
| boolQuery.boost(originalQuery.boost()); | ||
| boolQuery.queryName(originalQuery.queryName()); | ||
| return boolQuery; | ||
| } else { | ||
| // Use DisMax for default behavior with tie_breaker | ||
| DisMaxQueryBuilder disMaxQuery = QueryBuilders.disMaxQuery(); | ||
| for (String fieldName : inferenceFields) { | ||
| disMaxQuery.add(createSemanticQuery(fieldName, queryValue, indexInformation)); | ||
| } | ||
| // Apply tie_breaker - use explicit value or fall back to type's default | ||
| Float tieBreaker = originalQuery.tieBreaker(); | ||
| disMaxQuery.tieBreaker(Objects.requireNonNullElseGet(tieBreaker, () -> originalQuery.type().tieBreaker())); | ||
| disMaxQuery.boost(originalQuery.boost()); | ||
| disMaxQuery.queryName(originalQuery.queryName()); | ||
| return disMaxQuery; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, we can use dismax for both cases and you are misapplying minimum-should-match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
Tested with the following queries