Skip to content

Conversation

joshua-adams-1
Copy link
Contributor

@joshua-adams-1 joshua-adams-1 commented Oct 2, 2025

This is the fourth part of a series of PRs fixing how the FailedToCommitClusterStateException is used in ElasticSearch. As per #135017, FailedToCommitClusterStateException is defined as:

Thrown when a cluster state publication fails to commit the new cluster state. If publication fails then a new master is elected but the
update might or might not take effect, depending on whether the newly-elected master accepted the published state that failed to
be committed. This exception should only be used when there is <i>ambiguity</i> whether a state update took effect or not.

Currently, FailedToCommitClusterStateException is used as a 'catch-all' exception thrown at multiple places throughout the Coordinator and MasterService during the publication process. Semantically however, it doesn't make sense to throw this exception before the cluster state update is actually sent over the wire, since at this point, we know for certain that the cluster state update failed. FailedToCommitClusterStateException is intended to display ambiguity.

This work is a pre-requisite to #134213.


Changes

Removes a check on FailedToCommitClusterStateException with a message "node closed" inside QueryableBuiltInRolesSynchronizer.isExpectedFailure since FailedToCommitClusterStateExceptions with this error message were replaced with NotMasterExceptions inside #135008, and NotMasterExceptions is already included inside this OR expression

As a note, this is not blocking and is not blocked by, any other change


Next Steps

The goal of this work is to fix up all erroneously used FailedToCommitClusterStateException.

Done:

Todo:

  • Replace the FailedToCommitClusterStateException exception inside MasterService.BatchingTaskQueue.submitTask, (here) with a NotMasterException.
  • Any final uses of FailedToCommitClusterStateException that need to be updated

Relates to: ES-13061

Removes a check on `FailedToCommitClusterStateException`
with a message "node closed" inside QueryableBuiltInRolesSynchronizer
.isExpectedFailure since this exception was changed to a
NotMasterException inside elastic#135008

Relates to: ES-13061
@joshua-adams-1 joshua-adams-1 self-assigned this Oct 2, 2025
@joshua-adams-1 joshua-adams-1 added >refactoring :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. labels Oct 2, 2025
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the other changes have landed.

@joshua-adams-1
Copy link
Contributor Author

@DaveCTurner The pre-requisite PR for this was #135008 which is merged. It's not blocked by any other changes

@DaveCTurner
Copy link
Contributor

Ah right I thought this was blocked on #135706 too but I guess not.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review October 6, 2025 10:21
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Oct 6, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. >refactoring Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants