Skip to content

Conversation

@Mikep86
Copy link
Contributor

@Mikep86 Mikep86 commented Oct 28, 2025

This is a POC for adding semantic search CCS support with ccs_minimize_roundtrips=false. This is achieved through three high-level changes:

  • The addition of GetInferenceFieldsAction, which allows us to get inference fields (and their associated inference results for a given query) for a remote cluster
  • The ability to register async actions for remote clusters in QueryRewriteContext
  • Updating the semantic and intercepted queries to get remote cluster inference results during local cluster coordinator rewrite when ccs_minimize_roundtrips=false

These changes add semantic search CCS support when using:

  • ES|QL
  • Query DSL and ccs_minimize_roundtrips=false
  • (With some follow-up changes) Simplified linear/rrf retrievers

Note that as a POC, the code is a bit rough. Some things are over-complicated, unhandled edge cases still exist, and the few tests that exist are hacked together. Everything will get cleaned up as this is split into multiple PRs for a production implementation.

@Mikep86 Mikep86 added the :SearchOrg/Relevance Label for the Search (solution/org) Relevance team label Oct 28, 2025
@Mikep86
Copy link
Contributor Author

Mikep86 commented Oct 28, 2025

@elasticmachine update branch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is a total hack job, it exists only to show that this approach also addresses semantic search CCS in ES|QL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH the test doesn't look that bad to me, a little cleanup and I think it's fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably time to factor out the methods for getting inference results into a common place outside of SemanticQueryBuilder

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test can be disregarded for now. I added it to help characterize retriever behavior with CCS.

Comment on lines -339 to 347

// If we are handling a CCS request, always retain the intercepted query logic so that we can get inference results generated on
// the local cluster from the inference results map when rewriting on remote cluster data nodes. This can be necessary when:
// - A query specifies an inference ID override
// - Only non-inference fields are queried on the remote cluster
if (inferenceIds.isEmpty() && this.ccsRequest == false) {
// Not querying a semantic text field
boolean ccsRequest = this.ccsRequest || resolvedIndices.getRemoteClusterIndices().isEmpty() == false;
Boolean ccsMinimizeRoundTrips = queryRewriteContext.isCcsMinimizeRoundTrips();
if (inferenceIds.isEmpty() && (ccsRequest == false || Boolean.TRUE.equals(ccsMinimizeRoundTrips))) {
// Not querying a semantic text field locally and either:
// - no remote indices are specified
// - ccs_minimize_roundtrips: true, so the query will be re-intercepted (if necessary) on the remote cluster
return originalQuery;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes here are causing some regressions at the moment. Since we now want to handle CCS when ccs_minimize_roundtrips: false and we are querying an inference field only on a remote cluster, we have to use the intercepted query logic for an additional rewrite iteration to resolve remote cluster inference fields (if any). Thus, we have to adjust when we rewrite to the original query to allow for this additional rewrite iteration.

This logic isn't quite right yet, but I'm confident with a a little more time we'll work out the edge cases here.

Comment on lines +36 to 37
@AwaitsFix(bugUrl = "https://fake.url")
public void testKnnQuery() throws Exception {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is temporarily disabled due to regressions introduced in https://github.com/elastic/elasticsearch/pull/137247/files#r2469934806

Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very high level review as I see that there is a lot of planned cleanup. Overall the approach seems to make a lot of sense. I would be curious, you'd mentioned there were some alternate approaches that traded some efficiency for simplicity. Wondering if you could help the team understand a bit more about those tradeoffs in case we want to do future optimizations.

*/
public void executeAsyncActions(ActionListener<Void> listener) {
if (asyncActions.isEmpty()) {
if (asyncActions.isEmpty() && remoteAsyncActions.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would break out remote async actions into their own method.

Copy link
Contributor Author

@Mikep86 Mikep86 Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with breaking remote async actions out into their own method is that it adds considerable complexity to callers of QueryRewriteContext that want to ensure that all (i.e. local and remote) async actions are executed. IMO that should be the default, executing only local async actions could lead to strange edge cases.

Having a different method for executing remote async actions would have the following side effects:

  • Callers would need to remember to check both hasAsyncActions and hasRemoteAsyncActions
  • Callers would need to construct a GroupedActionListener that encapsulates calls to executeAsyncActions and executeRemoteAsyncActions to ensure that both are complete before asynchronously moving to the next rewrite iteration. See how executeAsyncActions is called in rewriteAndFetch for a concrete example of the additional complexity pushed to the caller.

testImplementation('org.webjars.npm:fontsource__roboto-mono:4.5.7')

internalClusterTestImplementation project(":modules:mapper-extras")
internalClusterTestImplementation project(xpackModule('inference'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I found out the hard way, that ES|QL will enter some weird jarhell state and tests won't pass. You're going to want to make sure to not have this inference dependency for your non-draft PR. I solved this in my PR by refactoring the classes I needed to core.

super(NAME);
}

public static class Request extends ActionRequest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to add BWC serialization tests for these (and the associated responses) & make sure the json docs are added too

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH the test doesn't look that bad to me, a little cleanup and I think it's fine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

++ on approach! Makes sense!

*/
public boolean hasAsyncActions() {
return asyncActions.isEmpty() == false;
return asyncActions.isEmpty() == false || remoteAsyncActions.isEmpty() == false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some interplay here with remote and non-remote actions. In registerRemoteAsyncAction we're adding the remote ones in the non-remote list too. Thus, the check on asyncActions should be enough? Otherwise, should we not be mixing remote/non-remote at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, remote and local async actions are stored separate lists. This is necessary because the remote async actions are mapped by cluster alias.

CountDown countDown = new CountDown(asyncActions.size());
int actionCount = asyncActions.size();
for (var remoteAsyncActionList : remoteAsyncActions.values()) {
actionCount += remoteAsyncActionList.size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Are we double-counting remote actions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

protected final T originalQuery;
protected final Map<FullyQualifiedInferenceId, InferenceResults> inferenceResultsMap;
protected final SetOnce<Map<FullyQualifiedInferenceId, InferenceResults>> inferenceResultsMapSupplier;
protected final SetOnce<Map<FullyQualifiedInferenceId, InferenceResults>> localInferenceResultsMapSupplier;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As these are not serialized and we don't have BWC issues, I think it might be nice to group them together in a single class that manages them.

listener.onResponse(null);
}, e -> {
Exception failure = e;
if (e.getCause() instanceof ActionNotFoundTransportException actionNotFoundTransportException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we check remote's transport version to know if the new action is supported before we sent the request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't get that information until the semantic query is serialized for transport to the remote cluster, which happens way later than query rewriting

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can check the minimum transport version in the cluster, in fact that's how we should determine if the query is eligible for the new path to handle ccs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi Can we get a remote cluster's min transport version from local cluster state?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ESQL at least, we know min transport version once initial index lookup is done, it's kept in PreAnalysisResult. It's still being worked on to make all the parts respect it properly, e.g. see #137431
Not sure when it comes into action on DSL side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look if it's possible to get remote cluster min transport version from something like cluster service

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See ClusterState#getMinTransportVersion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that only for the current cluster's min transport version?

Copy link
Contributor

@jimczi jimczi Nov 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry you meant the remote cluster and I read local for some reasons.

Copy link
Contributor Author

@Mikep86 Mikep86 Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To close the loop on this, I don't see a way to get the remote cluster transport version during query rewrite

Nevermind, we can use RemoteClusterClient#getConnection to get the remote cluster transport version here: https://github.com/Mikep86/elasticsearch/blob/9ee1813b95d3834194277a1603c424fef64cd7fc/server/src/main/java/org/elasticsearch/client/internal/RemoteClusterClient.java#L48-L53

@jimczi jimczi removed their request for review November 3, 2025 17:40
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 6, 2025
BASE=60406a6315bb9b1fc847e614175899a9161b2e82
HEAD=5aab46d5a38808333f5f4a432ca3057a015f9162
Branch=main
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 7, 2025
BASE=60406a6315bb9b1fc847e614175899a9161b2e82
HEAD=5aab46d5a38808333f5f4a432ca3057a015f9162
Branch=main
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 7, 2025
BASE=60406a6315bb9b1fc847e614175899a9161b2e82
HEAD=5aab46d5a38808333f5f4a432ca3057a015f9162
Branch=main
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>non-issue :SearchOrg/Relevance Label for the Search (solution/org) Relevance team v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants