Conversation

@dnhatn
Member

@dnhatn dnhatn commented Jul 14, 2025

If all target shards, excluding skipped shards, fail, we should fail the entire query regardless of the partial_results configuration or skip_unavailable setting. This behavior does not fully align with the search API, where skip_unavailable ignores all failures from remote clusters and only fails the request when all shards in the local cluster fail. However, we believe the proposed behavior is more sensible than the existing behavior in the search API.

Closes #128994

@dnhatn dnhatn force-pushed the all-shards-failed branch from 6c358a7 to 6c34c05 July 15, 2025 19:31
@dnhatn dnhatn changed the title from "All shards failed" to "Fail request when all target shards fail in runtime" Jul 15, 2025
@dnhatn dnhatn marked this pull request as ready for review July 15, 2025 19:43
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jul 15, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@dnhatn dnhatn added the auto-backport Automatically create backport pull requests when merged label Jul 15, 2025
@dnhatn dnhatn requested a review from smalyshev July 15, 2025 20:39
@dnhatn
Member Author

dnhatn commented Jul 15, 2025

@smalyshev Thanks for your quick review!

@dnhatn dnhatn merged commit 8f6f763 into elastic:main Jul 15, 2025
33 checks passed
@dnhatn dnhatn deleted the all-shards-failed branch July 15, 2025 22:31
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jul 15, 2025
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
9.1
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 131177

@dnhatn
Member Author

dnhatn commented Jul 15, 2025

💚 All backports created successfully

Status Branch Result
8.19

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jul 15, 2025
(cherry picked from commit 8f6f763)

# Conflicts:
#	x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/action/EsqlNodeFailureIT.java
elasticsearchmachine pushed a commit that referenced this pull request Jul 15, 2025
elasticsearchmachine pushed a commit that referenced this pull request Jul 16, 2025
(cherry picked from commit 8f6f763)

# Conflicts:
#	x-pack/plugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/action/EsqlNodeFailureIT.java
// do not fail if any final result has results
if (finalResults.stream().anyMatch(p -> p.getPositionCount() > 0)) {
    return;
}
Contributor

@idegtiarenko idegtiarenko Jul 16, 2025

I am not sure if this will behave correctly.
I am imagining a case with a single index and no rows in it.

Member Author

The request will not fail as long as the single target shard does not fail.

* regardless of the partial_results configuration or skip_unavailable setting. This behavior doesn't fully align with the search API,
* which doesn't consider the failures from the remote clusters when skip_unavailable is true.
*/
static void failIfAllShardsFailed(EsqlExecutionInfo execInfo, List<Page> finalResults) {
Contributor

@idegtiarenko idegtiarenko Jul 16, 2025

Is this method concerning only remote shards/failures or local too?

Member Author

This method treats local and remote failures uniformly. This is where the behavior in ES|QL differs slightly from the search API. Here, we iterate over the execution info of every cluster (the local cluster and all remotes) to accumulate the successful shards (excluding skipped shards) and the failed shards. The request fails only if there are no successful shards, at least one failed shard, and no rows in the final results.
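
For readers following the thread, the decision rule described in this comment can be sketched roughly as below. This is a minimal illustration and not the code merged in the PR: `AllShardsFailedSketch`, `ClusterShards`, and `allTargetShardsFailed` are hypothetical names, the real method takes an `EsqlExecutionInfo` and a `List<Page>`, and the sketch assumes that a cluster's successful-shard count includes its skipped shards (so skipped shards are subtracted before counting successes).

```java
import java.util.List;

public class AllShardsFailedSketch {

    /** Hypothetical stand-in for the per-cluster shard counters kept in the execution info. */
    static final class ClusterShards {
        final String alias;
        final int successful; // assumption: this count includes skipped shards
        final int skipped;
        final int failed;

        ClusterShards(String alias, int successful, int skipped, int failed) {
            this.alias = alias;
            this.successful = successful;
            this.skipped = skipped;
            this.failed = failed;
        }
    }

    /**
     * Returns true when the request should fail: no rows were produced, no shard
     * (other than skipped ones) succeeded, and at least one shard failed.
     * Local and remote clusters are treated the same way.
     */
    static boolean allTargetShardsFailed(List<ClusterShards> clusters, List<Integer> rowsPerPage) {
        // Do not fail if any final page carries rows.
        if (rowsPerPage.stream().anyMatch(rows -> rows > 0)) {
            return false;
        }
        int successful = 0;
        int failed = 0;
        for (ClusterShards cluster : clusters) {
            successful += cluster.successful - cluster.skipped; // skipped shards are not successes
            failed += cluster.failed;
        }
        return successful == 0 && failed > 0;
    }

    public static void main(String[] args) {
        // Single index, single shard, and that shard fails: the request fails.
        System.out.println(allTargetShardsFailed(
            List.of(new ClusterShards("local", 0, 0, 1)), List.of())); // true

        // Single empty index whose shard responds normally: no rows, but no failure either.
        System.out.println(allTargetShardsFailed(
            List.of(new ClusterShards("local", 1, 0, 0)), List.of())); // false
    }
}
```

The second call in `main` corresponds to the empty-index case raised earlier in the review: an empty result alone does not fail the request, because the single target shard still counts as successful.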

@idegtiarenko
Contributor

> If all target shards, excluding skipped shards, fail, we should fail the entire query regardless of the partial_results configuration or skip_unavailable setting.

I do not think I follow this. Does this mean that the entire query will fail when we query a single index with a single shard using allow_partial_results=true and that shard fails?

@dnhatn
Member Author

dnhatn commented Jul 18, 2025

> I do not think I follow this. Does this mean that the entire query will fail when we query a single index with a single shard using allow_partial_results=true and that shard fails?

Yes, your understanding is correct. That is the proposal, and it matches the existing behavior of the search API when querying only the local cluster. I believe we should fail the request, rather than return partial results, when all target shards have failed and no results (rows) are produced.

Labels

:Analytics/ES|QL (AKA ESQL)
auto-backport (Automatically create backport pull requests when merged)
>non-issue
Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo))
v8.19.1
v9.1.1
v9.2.0

Development

Successfully merging this pull request may close these issues.

Examine the behavior of allow_partial_results with ES|QL for 500 status codes

4 participants