Skip to content

Conversation

n1v0lg
Copy link
Contributor

@n1v0lg n1v0lg commented Jul 13, 2025

This is a POC for cross-project search support. It covers integration points with UIAM (via transport interceptor changes) and updates to index resolution logic.

The main index resolution changes are:

The security interceptor rewrites CPS-capable requests and records resolution and error information for CPS error handling. It uses the AuthorizedProjectsSupplier to decide whether to handle such requests as CPS or CCS. The interceptor marks any CPS-capable request as in CPS mode or not, allowing downstream consumers to apply the correct logic (CPS vs CCS).

The end-to-end flow is demonstrated in ResolveIndexAction, which is a good starting point for understanding the changes.

Further details:

Indices.Replaceable now supports storing index resolution results via the new ReplacedIndexExpressions class. This class tracks the original expression and its resolved form (e.g., logs* → logs), as well as missing or unauthorized resources to support CPS error handling. This required generalizing IndexAbstractionResolver#resolveIndexAbstractions to record replacements and authorization status. The same logic is used both for CPS fan-out actions and for resolving local expressions in CPS requests.

CrossProjectSearchCapable is a new interface marking request classes that support CPS (in the POC only ResolveIndexAction.Request, but intended for SearchRequest and others). In this POC it implements CPS error handling—though that logic may move elsewhere—and provides a crossProjectMode method indicating whether the request should be handled under CPS rules or as a CCS request.

For now, Indices.Replaceable supports both ReplacedIndexExpressions and String[] indices side by side, but a broader refactor may consolidate on ReplacedIndexExpressions.

@n1v0lg n1v0lg self-assigned this Jul 13, 2025
@elasticsearchmachine elasticsearchmachine added v9.2.0 serverless-linked Added by automation, don't add manually labels Jul 13, 2025

boolean checkRemote(List<RemoteClusterService.RemoteTag> tags);

record RewrittenIndexExpression(String original, List<String> rewritten) {}
Copy link
Contributor

@quux00 quux00 Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to check (and track) whether each component of the rewritten list is flat or qualified? Is that left for a later impl or do you think we don't need to track that here?

(ns, key) -> boolSetting(key, true, new RemoteConnectionEnabled<>(ns, key), Setting.Property.Dynamic, Setting.Property.NodeScope)
);

public record RemoteTag(String key, String value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The local project also has tags, so my guess is that this doesn't belong in RemoteClusterService and shouldn't be called RemoteTag? Maybe MetadataTag or ProjectTag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ProjectTag makes sense though we might also go broader and just call it metadata in case we also want it to cover project alias and such.

if (routingTags != null) {
searchRequest.routingTags(
Arrays.stream(Strings.splitStringByCommaToArray(routingTags)).map(RemoteClusterService.RemoteTag::fromString).toList()
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example of the type/format of tags you are allowing here? And is there an implicit "ANDing" of those tags?

As a first impl that might be useful.

Of course to state the obvious (not a criticism)

  1. it doesn't handle ES|QL if that is embedded in the query
  2. it doesn't support the complex AND, OR, NOT and grouping logic we'll eventually have to implement
    so later we'll need some additional constructs/interfaces around parsing that more complex scenario.

@n1v0lg n1v0lg requested review from tvernum and ywangd September 2, 2025 07:42
for (var replacedExpression : replacedExpressionMap.values()) {
if (false == replacedExpression.authorized()) {
// TODO actual security exception with details here
replacedExpression.setAuthorizationError(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to do this dance (mark authorized=false first, then later fill in the error details) since we don't have the right context (authentication, role info) in the IndicesAndAliasesResolver. Maybe we can expose that info to IndicesAndAliasesResolver though, we would let us avoid this split here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the authentcation and role info not available in the threadContext?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are. What I stated above though is actually misleading. Let me clarify:

  • IndicesAndAliasesResolver is in xpack and we can change it to have access to authentication and role info via ThreadContext (I'm fairly confident)
  • The place where we set authorized however, is in IndexAbstractionResolver -- that's in core so that doesn't have access to the classes Authentication etc

So while we could expose this info to IndicesAndAliasesResolver, we can't expose it to IndexAbstractionResolver which is what we'd need to do to avoid the 2 step process of "set authorized", then "set exception".

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks promising! I had an initial read on the changes for index resolution and left some basic comments/questions. I'll let it sink a little bit and come back again for a closer look. Thanks!

Comment on lines 371 to 373
// TODO this behavior could be pluggable
if (indicesRequest instanceof IndicesRequest.CrossProjectSearchCapable crossProjectSearchCapableRequest
&& targetProjects != AuthorizedProjectsSupplier.AuthorizedProjects.NOT_CROSS_PROJECT) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the pluggable behaviour can already be achieved with different AuthorizedProjects implemenations, i.e. wrap the instanceof check and the code in this if block inside an AuthorizedProjects implementation provided by serverless?


replacedExpressionMap.put(
original,
new ReplacedIndexExpression(original, combined, local.authorized(), local.existsAndVisible(), null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means both authorized and existsAndVisible are applicable only to the local expression in the combined list. Might be ok. But it does feel a bit odd.

Copy link
Contributor Author

@n1v0lg n1v0lg Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes true, and yes it's odd. I don't think CrossProjectReplacedIndexExpressions is quite the right abstraction (or at least not the right shape) for the job still...

indicesRequest.includeDataStreams()
);
crossProjectReplacedIndexExpressions.replaceLocalExpressions(replacedExpressions);
crossProjectSearchCapableRequest.setReplacedIndexExpressions(crossProjectReplacedIndexExpressions);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expressions set here are unexpanded for remote indices, e.g. remote:foo*, which are then expanded by downstream actions, e.g. ResolveIndexAction. We will process the final expanded results for error handling. But do I understand it correctly that we will never store the final results back into the request itself and this is something completely up to downstream actions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, that was my thinking: keep 'em unexpanded for remotes, leave everything to the downstream action, don't store the final results on the request.

for (var replacedExpression : replacedExpressionMap.values()) {
if (false == replacedExpression.authorized()) {
// TODO actual security exception with details here
replacedExpression.setAuthorizationError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the authentcation and role info not available in the threadContext?

Comment on lines 635 to 650
private Request buildFanoutRequest(boolean crossProjectMode, OriginalIndices originalIndices) {
return new Request(originalIndices.indices(), getIndicesOptions(crossProjectMode, originalIndices), crossProjectMode);
}

private IndicesOptions getIndicesOptions(boolean crossProjectMode, OriginalIndices localIndices) {
return crossProjectMode ? lenientIndicesOptions(localIndices.indicesOptions()) : localIndices.indicesOptions();
}

private static IndicesOptions lenientIndicesOptions(IndicesOptions indicesOptions) {
return IndicesOptions.builder(indicesOptions)
.concreteTargetOptions(new IndicesOptions.ConcreteTargetOptions(true))
.wildcardOptions(
IndicesOptions.WildcardOptions.builder(indicesOptions.wildcardOptions()).allowEmptyExpressions(true).build()
)
.build();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to see these handled by the request itself. It likely require some refactoring of existing fanout logic in ResolveIndex. Is it too much?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean something like originalRequest.toRemoteFanoutRequest()? Indeed could work 👍

Comment on lines 84 to 85
// TODO probably makes more sense on a service class as opposed to the request itself
default <T extends ResponseWithReplacedIndexExpressions> void remoteFanoutErrorHandling(Map<String, T> remoteResults) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method is needed only on CrossProjectSearchCapable and not here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new changes. As discussed previously, I generally prefer a more service oriented approach. I had some mostly minor comments. Great work!

import java.util.List;
import java.util.Map;

public class CrossProjectSearchErrorHandler {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this class can also have different implementations and the one with actual error handling is provided by serverless.

Comment on lines +89 to 95
// Complete marker interface, can be folded into Replaceable if necessary
interface CrossProjectSearchCapable extends Replaceable {
@Override
default boolean allowsRemoteIndices() {
return true;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah unless we want to move getProjectRouting back here, the interface does not seem useful as is.

Comment on lines 73 to 76
// Could this be useful?
default IndexResolutionMode indexResolutionMode() {
return IndexResolutionMode.CANONICAL;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like another variant of the previous boolean crossProjectMode() method. I'd prefer we rely on the AuthorizedProjectSupplier if possible. I understand the intention is to explicitly tell no flat expansion is required. But I feel exploring other alternatives is more preferrable since this flag can in theory be configured contradictory to the setup and/or we can sometimes have 2 sources of truth.


public ReplacedIndexExpression(
String original,
List<String> replacedBy,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking the usage of this field, it does seem make sense to split it into remote and local.

  • On the origin cluster, it has both remote and local. Remote ones are sent to linked projects. Local ones are locally resolved with IndexAbstractionResolver. So the usages are separate.
  • On the linked cluster, only local ones make sense since there is no chaining.

If we represent remote and local with different objects, we can also fold ResolutionResult and exception into the object for local ones. The type for remote ones may simply be a List<String>.

Or maybe the split can be even higher up ReplacedIndexExpression which has a base version just for local and a subclass with the remote ones. The base one should be what we use mostly because they should be the ones used by the linked projects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I've pushed #134783 as a first pass on the data model. Fits quite cleanly IMO

final Runnable terminalHandler = () -> {
if (completionCounter.countDown()) {
try {
crossProjectSearchService.errorHandling(request, remoteResponses);
Copy link
Contributor

@quux00 quux00 Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposal: This method should only be called if this transport action is the primary one. Checking ThreadContext for its status in the TransportAction chain?

That applies to two scenarios:

  1. I'm the "remote" secondary where you don't have enough information to call this method
  2. I'm the "local" secondary and some other transport action will deal with checking the status of qualified resources

Otherwise, we have an opaque system of double checks and unintended "delegation" that is going to get confusing. ES|QL intentionally delegates some of that work to field-caps, but I don't think _search wants that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably still want to call the method but exactly as you say, we'll check ThreadContext for a marker telling us whether we are on a remote or not -- along with other conditions such as whether we're even in a CPS environment -- under the hood and make it a noop accordingly.

As @ywangd suggests here we might inject the error handling implementation via SPI in which case the call will be a noop outside of CPS, and for CPS will involve additional checks as outlined above.

Ideally, transport actions should have as little responsibility as possible w.r.t. to flat world error handling and instead make use of tooling. Otherwise, we end up with a brittle implementation where each action rolls its own error handling and runs the risk of not actually failing flows post fan-out that should have failed. How generic we can make this depends on the various transport actions we work on (ResolveIndex, Search, FieldCaps, ES|QL) but I think the following pattern holds:

  1. A CPS compatible action tracks the original expression, its rewrite, and local index resolution results
  2. It calls a fanout action for all remotes to perform remote index resolution (and potentially obtain additional metadata, in the case of FieldCaps)
  3. It uses the info from step 1 and 2 to decide whether to throw an error or not, then proceeds.

One thing we've touched on during our sync yesterday is that the CPS capable action -- in the case of e.g. Search -- needs to know the final list of remotes that actually have data on them, or more generally needs the output of the fanout actions that it called to make further decisions about its subsequent execution. Definitely something that the tooling around flat-world error should not get in the way of.

Lets iterate on this during the actual implementation. I think the way we structure this will evolve as we port more actions to CPS and get a clearer picture of the usage patterns.

cc @piergm

@n1v0lg
Copy link
Contributor Author

n1v0lg commented Sep 12, 2025

Hm well this was fun while it lasted, but I guess "1,8k files changed" probably makes this PR of limited value, at this point...

@n1v0lg n1v0lg closed this Sep 12, 2025
@n1v0lg n1v0lg deleted the poc-cps-e2e branch September 15, 2025 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>non-issue serverless-linked Added by automation, don't add manually v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants