-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Fix test fail on unavailable shards resolver #125096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix test fail on unavailable shards resolver #125096
Conversation
| } else { | ||
| l.onResponse(result.withIndexResolution(indexResolution)); | ||
| } | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change completes the listener early, immediately after index resolution with a failure if there are failures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach looks right to me. I think we are missing tracking and returning this failure to users when allow_partial_results=true, but we can address it in a follow-up. Can you minimize the format changes, please? Thanks @idegtiarenko.
| ); | ||
|
|
||
| FieldCapabilitiesFailure localResolutionFailure = null; | ||
| for (FieldCapabilitiesFailure failure : fieldCapsResponse.getFailures()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can mistakenly use failures from the remote clusters. Can we extend determineUnavailableRemoteClusters to include local failures and extract from there instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right. I have replaced this with filtering for NoShardAvailableActionException.
I would like to merge this fix as it fixes the handling of local unavailable shards, but likely we will need one more followup iteration with focus on CCS (does unavailable remote cluster means result is partial? can we detect remote cluster unavailable shards?)
| // all indices found by field-caps | ||
| private final Set<String> resolvedIndices; | ||
| @Nullable | ||
| private final FieldCapabilitiesFailure localResolutionFailure; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we merge this with unavailableClusters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think so. I just updated the implementation to look for unavailable shards rather than generic failures and I would like to keep unavailableClusters remote only as documented:
Line 63 in d185356
| // remote clusters included in the user's index expression that could not be connected to |
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a comment, but LGTM as we can address it in a follow-up.
|
|
||
| Set<NoShardAvailableActionException> unavailableShards = new HashSet<>(); | ||
| for (FieldCapabilitiesFailure failure : fieldCapsResponse.getFailures()) { | ||
| if (failure.getException() instanceof NoShardAvailableActionException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is restrictive and could be fragile. The field-caps API can fail for reasons other than a NoShardAvailableActionException. Should we handle all exceptions? I'm fine if you plan to address this in a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is happening when running testFailOnUnavailableShards,
but I agree, we should expand this list as we find more cases.
For now, do you believe there are other cases I could add?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should detect the index pattern for non-remote scenarios and handle all exceptions.
💔 Backport failed
You can use sqren/backport to manually backport by running |
This enables testFailOnUnavailableShards.
This was originally pointing to the issue that is closed.
After enabling it I noticed that we were silently searching only in available shards.
This change fixes the failure by surfacing the failures during the index resolution.