Skip to content

Conversation

@Samiul-TheSoccerFan
Copy link
Contributor

Tested with the following queries

PUT text-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "content": {
        "type": "text"
      }
    }
  }
}

PUT text-index/_doc/1
{ 
  "title": "Introduction to Machine Learning",
  "content": "This guide covers supervised and unsupervised learning." 
}

PUT text-index/_doc/2
{ 
  "title": "Solar Panels for Beginners",
  "content": "Learn how solar energy works and its benefits."
}

POST text-index/_search
{
  "query": {
    "multi_match": {
      "query": "learn solar energy",
      "fields": ["title", "content^2"],
      "boost": 3.0
    }
  }
}


PUT semantic-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "semantic_text"
      }
    }
  }
}

PUT semantic-index/_doc/1
{ 
  "title": "Introduction to Machine Learning"
}

PUT semantic-index/_doc/2
{ 
  "title": "Solar Panels for Beginners"
}

POST semantic-index/_search
{
  "query": {
    "multi_match": {
      "query": "learn solar energy",
      "fields": [
        "title^2"
      ],
      "boost": 3.0,
      "_name": "semantic_query"
    }
  }
}

PUT mixed-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "content": {
        "type": "semantic_text"
      }
    }
  }
}

PUT mixed-index/_doc/1
{ 
  "title": "Introduction to Machine Learning",
  "content": "This guide covers supervised and unsupervised learning." 
}

PUT mixed-index/_doc/2
{ 
  "title": "Solar Panels for Beginners",
  "content": "Learn how solar energy works and its benefits."
}

POST mixed-index/_search
{
  "query": {
    "multi_match": {
      "query": "learn solar energy",
      "fields": [
        "title",
        "content^2"
      ],
      "boost": 3,
      "_name": "mixed_query"
    }
  }
}

POST mixed-index,text-index/_search
{
  "query": {
    "multi_match": {
      "query": "learn solar energy",
      "fields": [
        "title",
        "content^2"
      ],
      "boost": 3,
      "_name": "mixed_query"
    }
  }
}

PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small"
  }
}

PUT dense-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": "my-e5-model"
      },
      "title": {
        "type": "text"
      }
    }
  }
}

PUT dense-index/_doc/1
{ 
  "title": "Introduction to Machine Learning",
  "content": "This guide covers supervised and unsupervised learning." 
}

PUT dense-index/_doc/2
{ 
  "title": "Solar Panels for Beginners",
  "content": "Learn how solar energy works and its benefits."
}

POST dense-index,text-index/_search
{
  "query": {
    "multi_match": {
      "query": "learn solar energy",
      "fields": [
        "title",
        "content"
      ]
    }
  }
}

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass POC review - I didn't focus on an in-depth code review here, but high level functionality. I think this is a good start, but I'd like to see us do some refactoring here and not rely on copy-pasta as much. Thanks for the work here @Samiul-TheSoccerFan !

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kderusso's suggestion to implement SemanticQueryRewriteInterceptor is a good one. We can get interceptAndRewrite to work with both single and multi field queries, it just takes some rework that I missed initially.

The key is to make InferenceIndexInformationForField a record for every field in the query, not just a single field. Kind of like what you do with MultiFieldInferenceInfo. We can then pass that to buildCombinedInferenceAndNonInferenceQuery and buildInferenceQuery to build the query we want.

As far as how to build the query, the concept is to have two query groups: inference fields and non-inference fields. Inference fields should be queried using semantic queries. Non-inference fields should be queried using a multi_match query. Then stitch the two query groups together with a dismax query configured appropriately for the multi-match query type.

See MultiFieldsInnerRetrieverUtils as an example of how to create these query groups. In that case, we are combining them using retrievers, so that part will need to be adjusted, but the grouping concept is the same.

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine update branch

@elasticmachine
Copy link
Collaborator

merge conflict between base and head

@Samiul-TheSoccerFan
Copy link
Contributor Author

@elasticmachine merge upstream

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look at the followup changes, and I have a few questions. Not sure if they're things you've already talked about with others.

public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_1 = def(8_841_0_65);
public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_2 = def(8_841_0_66);
public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_3 = def(8_841_0_67);
public static final TransportVersion MULTI_MATCH_SEMANTIC_TEXT_SUPPORT_8_19 = def(8_841_0_68);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to backport this to 8.19? We only want this to be in 9.2 right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AIU, it would be only 9.2 but I have not discussed that yet. I will remove this in the original PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't backport new features, only bug fixes. 8.19 was a special case because we were developing two sibling releases (8.19 & 9.1) at the same time.

public static final int DEFAULT_MAX_EXPANSIONS = FuzzyQuery.defaultMaxExpansions;
public static final ZeroTermsQueryOption DEFAULT_ZERO_TERMS_QUERY = MatchQueryParser.DEFAULT_ZERO_TERMS_QUERY;
public static final boolean DEFAULT_FUZZY_TRANSPOSITIONS = FuzzyQuery.defaultTranspositions;
private boolean resolveInferenceFieldWildcards = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we add this to the multi match builder, rather than keeping all logic in the inference plugin? That seems odd to me and a lot of extra stuff that we have to pass around? I may be missing a good reason to do this...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resolveInferenceFieldWildcards parameter was added to MultiMatchQueryBuilder because both LinearRetrieverBuilder and RRFRetrieverBuilder directly create MultiMatchQueryBuilder instances and need to prevent double-querying of semantic_text fields when using their simplified syntax. The changes in MultiFieldsInnerRetrieverUtils.java, LinearRetrieverBuilderTests, and RRFRetrieverBuilderTests provide additional context for this approach. Please let me know if this does not answer your question and need more clarification. Also, I am open to suggestions if there is a better way to handle this check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining your thought process around adding this.

If you look at the core details of what you added, though, you're only adding a flag to the MultiMatchQueryBuilder and not actually doing anything with that flag in this class itself. To me, that's a really big signal that this flag should not exist in this class. Furthermore, the linear and RRF retrievers are already performing wildcard resolution on inference fields in MultiFieldsInnerRetrieverUtils#generateInnerRetrieversForIndex. Please remove this and try to come up with a solution that is contained to the interceptors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed we need to adjust this. We need to separate the concepts of handling inference fields and resolving wildcards. AFAIU handling inference fields needs to be a multi-match query param so that the user can enable it (it should be false by default). In contrast, resolving wildcards is an immutable property of the query that is not configurable by the user.

* @param queryBuilder {@link QueryBuilder}
* @return true if inference field wildcards should be resolved, false otherwise.
*/
protected abstract boolean isResolveInferenceFieldWildcardsRequired(QueryBuilder queryBuilder);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without knowing more about why you added this change, this is setting off my spidey sense. Why does this have to be propagated to all interceptors like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multi_match queries control the resolveInferenceFieldWildcards parameter, while other query types like match or knn_vector don't support wildcards so they always return false. This follows the same pattern as other methods where each interceptor knows how to handle its own query type and field/s.

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some clarifying comments on this PR per our discussion. Please let me know if you have any questions!

public static final int DEFAULT_MAX_EXPANSIONS = FuzzyQuery.defaultMaxExpansions;
public static final ZeroTermsQueryOption DEFAULT_ZERO_TERMS_QUERY = MatchQueryParser.DEFAULT_ZERO_TERMS_QUERY;
public static final boolean DEFAULT_FUZZY_TRANSPOSITIONS = FuzzyQuery.defaultTranspositions;
private boolean resolveInferenceFieldWildcards = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining your thought process around adding this.

If you look at the core details of what you added, though, you're only adding a flag to the MultiMatchQueryBuilder and not actually doing anything with that flag in this class itself. To me, that's a really big signal that this flag should not exist in this class. Furthermore, the linear and RRF retrievers are already performing wildcard resolution on inference fields in MultiFieldsInnerRetrieverUtils#generateInnerRetrieversForIndex. Please remove this and try to come up with a solution that is contained to the interceptors.

}

@Override
protected String getFieldName(QueryBuilder queryBuilder) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the refactoring that I think you could do in a first PR.

* @param queryBuilder {@link QueryBuilder}
* @return true if inference field wildcards should be resolved, false otherwise.
*/
protected abstract boolean shouldResolveInferenceFieldWildcards(QueryBuilder queryBuilder);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't belong as an abstract method.

The first thing that jumps out to me here is that you're defining an abstract method that all interceptors have to implement - however it's going to always be false except in the case of the Multi-Match interceptor. So let's try to figure out a way to do what you need to do without adding this method to every class.

fieldsToProcess = QueryParserHelper.parseFieldsAndWeights(defaultFields);
}

// Resolve wildcards for inference fields and multiply boosts when field matches multiple patterns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing that jumps out to me here, is that we're resolving wildcards here, but this code path will only ever be executed by the Multi-Match interceptor.

Instead of adding knobs for that here in the parent class, why not break this logic out here into its own specific method, and move the wildcard specific logic to an override method only executed by the Multi-Match interceptor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed we can clean this up. Not sure what the solution ultimately looks like, but it needs to account for all of the following:

  • For match, knn, and sparse_vector queries, we don't want any wildcard resolution
  • When multi_match's inference field handling is disabled, we don't want to handle any inference fields (wildcard or not).
  • When multi_match's inference field handling is enabled, we want to handle both explicitly queried and wildcard-resolved inference fields.

This implementation doesn't do that. @kderusso has a good suggestion about implementing wildcard resolution only in the multi-match interceptor. Not sure how that would work yet, but as an exercise maybe try creating two implementations:

  • One for match, knn, and sparse_vector with no wildcard resolution that always handles inference fields
  • Another for multi_match with optional inference field and wildcard resolution

Then we can examine the two and look for opportunities to combine them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried multiple approach and this looks like better for now. However, as Jim suggested, if we decide to handle wildcard in a separate PR, this might get also changed.

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stuck with a high-level review, so I may have missed things. There are some things to sort out regarding behavior differences between the existing query interceptors and the multi-match interceptor.

public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_1 = def(8_841_0_65);
public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_2 = def(8_841_0_66);
public static final TransportVersion INITIAL_ELASTICSEARCH_8_19_3 = def(8_841_0_67);
public static final TransportVersion MULTI_MATCH_SEMANTIC_TEXT_SUPPORT_8_19 = def(8_841_0_68);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't backport new features, only bug fixes. 8.19 was a special case because we were developing two sibling releases (8.19 & 9.1) at the same time.

public static final int DEFAULT_MAX_EXPANSIONS = FuzzyQuery.defaultMaxExpansions;
public static final ZeroTermsQueryOption DEFAULT_ZERO_TERMS_QUERY = MatchQueryParser.DEFAULT_ZERO_TERMS_QUERY;
public static final boolean DEFAULT_FUZZY_TRANSPOSITIONS = FuzzyQuery.defaultTranspositions;
private boolean resolveInferenceFieldWildcards = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed we need to adjust this. We need to separate the concepts of handling inference fields and resolving wildcards. AFAIU handling inference fields needs to be a multi-match query param so that the user can enable it (it should be false by default). In contrast, resolving wildcards is an immutable property of the query that is not configurable by the user.

fieldsToProcess = QueryParserHelper.parseFieldsAndWeights(defaultFields);
}

// Resolve wildcards for inference fields and multiply boosts when field matches multiple patterns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed we can clean this up. Not sure what the solution ultimately looks like, but it needs to account for all of the following:

  • For match, knn, and sparse_vector queries, we don't want any wildcard resolution
  • When multi_match's inference field handling is disabled, we don't want to handle any inference fields (wildcard or not).
  • When multi_match's inference field handling is enabled, we want to handle both explicitly queried and wildcard-resolved inference fields.

This implementation doesn't do that. @kderusso has a good suggestion about implementing wildcard resolution only in the multi-match interceptor. Not sure how that would work yet, but as an exercise maybe try creating two implementations:

  • One for match, knn, and sparse_vector with no wildcard resolution that always handles inference fields
  • Another for multi_match with optional inference field and wildcard resolution

Then we can examine the two and look for opportunities to combine them.

if (resolveInferenceFieldWildcards && Regex.isMatchAllPattern(field)) {
indexInferenceMetadata.keySet().forEach(f -> {
indexInferenceFields.put(f, indexInferenceMetadata.get(f));
addToFieldBoostsMap(fieldBoosts, f, boost);
Copy link
Contributor

@Mikep86 Mikep86 Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this logic is going to work. Say you're using a query like:

{
  "multi_match": {
     "fields": [ "*_semantic^2" ],
     "query": "foo"
  }
}

And you're querying two indices that both contain the field text_semantic.

The total boost should be 2. However, with your implementation it will be 4 because the field will be added to the boost map once per matching index.

You can leverage the fact that any given unique field name will match the same set of wildcards regardless of index to generate a global field boost map, but that approach requires different logic than this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 205 to 249
if (minimumShouldMatch != null || isMostFields) {
// Use BoolQuery when minimum_should_match is specified or for MOST_FIELDS
BoolQueryBuilder boolQuery = new BoolQueryBuilder();

// Add semantic queries
for (String fieldName : inferenceInfo.getAllInferenceFields()) {
Set<String> semanticIndices = inferenceInfo.getInferenceIndicesForField(fieldName);
if (semanticIndices.isEmpty() == false) {
boolQuery.should(
createSemanticSubQuery(semanticIndices, fieldName, queryValue).boost(inferenceInfo.getFieldBoost(fieldName))
);
}
}

// Add non-inference queries
addNonInferenceQueries(boolQuery::should, originalQuery, inferenceInfo);

// Set minimum_should_match - default to "1" for MOST_FIELDS when not specified
boolQuery.minimumShouldMatch(Objects.requireNonNullElse(minimumShouldMatch, "1"));
boolQuery.boost(originalQuery.boost());
boolQuery.queryName(originalQuery.queryName());
return boolQuery;
} else {
// Use DisMaxQuery for BEST_FIELDS without minimum_should_match
DisMaxQueryBuilder disMaxQuery = QueryBuilders.disMaxQuery();

// Add semantic queries
for (String fieldName : inferenceInfo.getAllInferenceFields()) {
Set<String> semanticIndices = inferenceInfo.getInferenceIndicesForField(fieldName);
if (semanticIndices.isEmpty() == false) {
disMaxQuery.add(
createSemanticSubQuery(semanticIndices, fieldName, queryValue).boost(inferenceInfo.getFieldBoost(fieldName))
);
}
}

// Add non-inference queries
addNonInferenceQueries(disMaxQuery::add, originalQuery, inferenceInfo);

Float tieBreaker = originalQuery.tieBreaker();
disMaxQuery.tieBreaker(Objects.requireNonNullElseGet(tieBreaker, () -> originalQuery.type().tieBreaker()));
disMaxQuery.boost(originalQuery.boost());
disMaxQuery.queryName(originalQuery.queryName());
return disMaxQuery;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't what I meant by combining the logic. You can examine the multi-match query implementation to see how both most-fields and best-fields cases can be written using a dismax query.

Additionally, the minimum-should-match parameter is applied incorrectly. If you refer to our documentation, you can see that that param is applied for the number of terms that match, not the number of fields that match. You implemented the latter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think about this approach?

Map<String, Float> fieldsToProcess = fieldsWithWeights;
if (fieldsToProcess.isEmpty()) {
Settings settings = indexMetadata.getSettings();
List<String> defaultFields = settings.getAsList(DEFAULT_FIELD_SETTING.getKey(), DEFAULT_FIELD_SETTING.getDefault(settings));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only want to do this for the multi-match query. Another thing to consider when adjusting the approach here to handle both match/knn/sparse_vector and multi_match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed this into MultiMatch interceptor.

Comment on lines 167 to 189
if (minimumShouldMatch != null) {
// Use BoolQuery when minimum_should_match is specified
BoolQueryBuilder boolQuery = new BoolQueryBuilder();
for (String fieldName : inferenceFields) {
boolQuery.should(createSemanticQuery(fieldName, queryValue, indexInformation));
}
boolQuery.minimumShouldMatch(minimumShouldMatch);
boolQuery.boost(originalQuery.boost());
boolQuery.queryName(originalQuery.queryName());
return boolQuery;
} else {
// Use DisMax for default behavior with tie_breaker
DisMaxQueryBuilder disMaxQuery = QueryBuilders.disMaxQuery();
for (String fieldName : inferenceFields) {
disMaxQuery.add(createSemanticQuery(fieldName, queryValue, indexInformation));
}
// Apply tie_breaker - use explicit value or fall back to type's default
Float tieBreaker = originalQuery.tieBreaker();
disMaxQuery.tieBreaker(Objects.requireNonNullElseGet(tieBreaker, () -> originalQuery.type().tieBreaker()));
disMaxQuery.boost(originalQuery.boost());
disMaxQuery.queryName(originalQuery.queryName());
return disMaxQuery;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, we can use dismax for both cases and you are misapplying minimum-should-match

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@Samiul-TheSoccerFan Samiul-TheSoccerFan changed the title [POC] Multi Match Implementation to Support semantic_tex [POC] Multi Match Implementation to Support semantic_text Aug 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants