-
Notifications
You must be signed in to change notification settings - Fork 25.7k
POC - Semantic search CCS support with ccs_minimize_roundtrips=false #137247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…eried when ccs_minimize_roundtrips=false
… only when ccs_minimize_roundtrips=false
|
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is a total hack job, it exists only to show that this approach also addresses semantic search CCS in ES|QL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH the test doesn't look that bad to me, a little cleanup and I think it's fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably time to factor out the methods for getting inference results into a common place outside of SemanticQueryBuilder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test can be disregarded for now. I added it to help characterize retriever behavior with CCS.
|
|
||
| // If we are handling a CCS request, always retain the intercepted query logic so that we can get inference results generated on | ||
| // the local cluster from the inference results map when rewriting on remote cluster data nodes. This can be necessary when: | ||
| // - A query specifies an inference ID override | ||
| // - Only non-inference fields are queried on the remote cluster | ||
| if (inferenceIds.isEmpty() && this.ccsRequest == false) { | ||
| // Not querying a semantic text field | ||
| boolean ccsRequest = this.ccsRequest || resolvedIndices.getRemoteClusterIndices().isEmpty() == false; | ||
| Boolean ccsMinimizeRoundTrips = queryRewriteContext.isCcsMinimizeRoundTrips(); | ||
| if (inferenceIds.isEmpty() && (ccsRequest == false || Boolean.TRUE.equals(ccsMinimizeRoundTrips))) { | ||
| // Not querying a semantic text field locally and either: | ||
| // - no remote indices are specified | ||
| // - ccs_minimize_roundtrips: true, so the query will be re-intercepted (if necessary) on the remote cluster | ||
| return originalQuery; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes here are causing some regressions at the moment. Since we now want to handle CCS when ccs_minimize_roundtrips: false and we are querying an inference field only on a remote cluster, we have to use the intercepted query logic for an additional rewrite iteration to resolve remote cluster inference fields (if any). Thus, we have to adjust when we rewrite to the original query to allow for this additional rewrite iteration.
This logic isn't quite right yet, but I'm confident with a a little more time we'll work out the edge cases here.
| @AwaitsFix(bugUrl = "https://fake.url") | ||
| public void testKnnQuery() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is temporarily disabled due to regressions introduced in https://github.com/elastic/elasticsearch/pull/137247/files#r2469934806
kderusso
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very high level review as I see that there is a lot of planned cleanup. Overall the approach seems to make a lot of sense. I would be curious, you'd mentioned there were some alternate approaches that traded some efficiency for simplicity. Wondering if you could help the team understand a bit more about those tradeoffs in case we want to do future optimizations.
| */ | ||
| public void executeAsyncActions(ActionListener<Void> listener) { | ||
| if (asyncActions.isEmpty()) { | ||
| if (asyncActions.isEmpty() && remoteAsyncActions.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would break out remote async actions into their own method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with breaking remote async actions out into their own method is that it adds considerable complexity to callers of QueryRewriteContext that want to ensure that all (i.e. local and remote) async actions are executed. IMO that should be the default, executing only local async actions could lead to strange edge cases.
Having a different method for executing remote async actions would have the following side effects:
- Callers would need to remember to check both
hasAsyncActionsandhasRemoteAsyncActions - Callers would need to construct a
GroupedActionListenerthat encapsulates calls toexecuteAsyncActionsandexecuteRemoteAsyncActionsto ensure that both are complete before asynchronously moving to the next rewrite iteration. See howexecuteAsyncActionsis called inrewriteAndFetchfor a concrete example of the additional complexity pushed to the caller.
| testImplementation('org.webjars.npm:fontsource__roboto-mono:4.5.7') | ||
|
|
||
| internalClusterTestImplementation project(":modules:mapper-extras") | ||
| internalClusterTestImplementation project(xpackModule('inference')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I found out the hard way, that ES|QL will enter some weird jarhell state and tests won't pass. You're going to want to make sure to not have this inference dependency for your non-draft PR. I solved this in my PR by refactoring the classes I needed to core.
| super(NAME); | ||
| } | ||
|
|
||
| public static class Request extends ActionRequest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember to add BWC serialization tests for these (and the associated responses) & make sure the json docs are added too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH the test doesn't look that bad to me, a little cleanup and I think it's fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
dimitris-athanasiou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
++ on approach! Makes sense!
| */ | ||
| public boolean hasAsyncActions() { | ||
| return asyncActions.isEmpty() == false; | ||
| return asyncActions.isEmpty() == false || remoteAsyncActions.isEmpty() == false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some interplay here with remote and non-remote actions. In registerRemoteAsyncAction we're adding the remote ones in the non-remote list too. Thus, the check on asyncActions should be enough? Otherwise, should we not be mixing remote/non-remote at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, remote and local async actions are stored separate lists. This is necessary because the remote async actions are mapped by cluster alias.
| CountDown countDown = new CountDown(asyncActions.size()); | ||
| int actionCount = asyncActions.size(); | ||
| for (var remoteAsyncActionList : remoteAsyncActions.values()) { | ||
| actionCount += remoteAsyncActionList.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Are we double-counting remote actions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| protected final T originalQuery; | ||
| protected final Map<FullyQualifiedInferenceId, InferenceResults> inferenceResultsMap; | ||
| protected final SetOnce<Map<FullyQualifiedInferenceId, InferenceResults>> inferenceResultsMapSupplier; | ||
| protected final SetOnce<Map<FullyQualifiedInferenceId, InferenceResults>> localInferenceResultsMapSupplier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As these are not serialized and we don't have BWC issues, I think it might be nice to group them together in a single class that manages them.
| listener.onResponse(null); | ||
| }, e -> { | ||
| Exception failure = e; | ||
| if (e.getCause() instanceof ActionNotFoundTransportException actionNotFoundTransportException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we check remote's transport version to know if the new action is supported before we sent the request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't get that information until the semantic query is serialized for transport to the remote cluster, which happens way later than query rewriting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can check the minimum transport version in the cluster, in fact that's how we should determine if the query is eligible for the new path to handle ccs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi Can we get a remote cluster's min transport version from local cluster state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For ESQL at least, we know min transport version once initial index lookup is done, it's kept in PreAnalysisResult. It's still being worked on to make all the parts respect it properly, e.g. see #137431
Not sure when it comes into action on DSL side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a look if it's possible to get remote cluster min transport version from something like cluster service
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See ClusterState#getMinTransportVersion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that only for the current cluster's min transport version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah sorry you meant the remote cluster and I read local for some reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To close the loop on this, I don't see a way to get the remote cluster transport version during query rewrite
Nevermind, we can use RemoteClusterClient#getConnection to get the remote cluster transport version here: https://github.com/Mikep86/elasticsearch/blob/9ee1813b95d3834194277a1603c424fef64cd7fc/server/src/main/java/org/elasticsearch/client/internal/RemoteClusterClient.java#L48-L53
BASE=60406a6315bb9b1fc847e614175899a9161b2e82 HEAD=5aab46d5a38808333f5f4a432ca3057a015f9162 Branch=main
BASE=60406a6315bb9b1fc847e614175899a9161b2e82 HEAD=5aab46d5a38808333f5f4a432ca3057a015f9162 Branch=main
BASE=60406a6315bb9b1fc847e614175899a9161b2e82 HEAD=5aab46d5a38808333f5f4a432ca3057a015f9162 Branch=main
This is a POC for adding semantic search CCS support with
ccs_minimize_roundtrips=false. This is achieved through three high-level changes:GetInferenceFieldsAction, which allows us to get inference fields (and their associated inference results for a given query) for a remote clusterQueryRewriteContextsemanticand intercepted queries to get remote cluster inference results during local cluster coordinator rewrite whenccs_minimize_roundtrips=falseThese changes add semantic search CCS support when using:
ccs_minimize_roundtrips=falselinear/rrfretrieversNote that as a POC, the code is a bit rough. Some things are over-complicated, unhandled edge cases still exist, and the few tests that exist are hacked together. Everything will get cleaned up as this is split into multiple PRs for a production implementation.