joshua-adams-1
Contributor

@joshua-adams-1 joshua-adams-1 commented Sep 26, 2025

This is the second in a series of PRs fixing how FailedToCommitClusterStateException is used in Elasticsearch. As per #135017, FailedToCommitClusterStateException is defined as:

Thrown when a cluster state publication fails to commit the new cluster state. If publication fails then a new master is elected but the
update might or might not take effect, depending on whether the newly-elected master accepted the published state that failed to
be committed. This exception should only be used when there is <i>ambiguity</i> whether a state update took effect or not.

Currently, FailedToCommitClusterStateException is used as a 'catch-all' exception thrown at multiple places throughout the Coordinator and MasterService during the publication process. Semantically, however, it doesn't make sense to throw this exception before the cluster state update is actually sent over the wire, since at that point we know for certain that the cluster state update failed. FailedToCommitClusterStateException is intended to signal ambiguity about whether the update took effect.
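To illustrate why the distinction matters to a caller, here is a minimal sketch of how a failure handler might interpret the two exception types under the semantics this PR aims for. The exception classes below are stand-ins defined only for this example (the real Elasticsearch classes have different constructors and superclasses), and `Outcome` and `classify` are hypothetical names.

```java
// Illustration only: these are stand-in types, not the real Elasticsearch classes,
// whose constructors and superclasses differ.
class UpdateOutcomeSketch {
    static class NotMasterException extends RuntimeException {
        NotMasterException(String message) { super(message); }
    }

    static class FailedToCommitClusterStateException extends RuntimeException {
        FailedToCommitClusterStateException(String message) { super(message); }
    }

    // Hypothetical classification of what a caller can conclude from a failure.
    enum Outcome { DEFINITELY_NOT_APPLIED, POSSIBLY_APPLIED, UNKNOWN_FAILURE }

    static Outcome classify(Exception e) {
        if (e instanceof NotMasterException) {
            // Thrown before anything was sent over the wire: the update did not happen.
            return Outcome.DEFINITELY_NOT_APPLIED;
        }
        if (e instanceof FailedToCommitClusterStateException) {
            // The update was sent but not committed: a new master may or may not have accepted it.
            return Outcome.POSSIBLY_APPLIED;
        }
        return Outcome.UNKNOWN_FAILURE;
    }
}
```

A caller that sees `DEFINITELY_NOT_APPLIED` can safely treat the update as not having happened, whereas `POSSIBLY_APPLIED` means it would need to re-read the cluster state before deciding what to do next.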

This work is a pre-requisite to #134213.


Changes

As explained above, any exception thrown prior to the publish(...) call means that the cluster state update definitely failed. Since there is no ambiguity, it should not be a FailedToCommitClusterStateException. In this instance, I changed the code to throw a NotMasterException, since this is semantically correct.
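The shape of the change can be sketched roughly as follows. This is not the actual MasterService code: the exception classes are stand-ins, and `Publisher` and `publishIfMaster` are hypothetical names used only to show where each exception type would be thrown relative to the publish call.

```java
// Illustration only: stand-in types and hypothetical method names, not the real
// MasterService or Coordinator code.
class PublishSketch {
    static class NotMasterException extends RuntimeException {
        NotMasterException(String message) { super(message); }
    }

    static class FailedToCommitClusterStateException extends RuntimeException {
        FailedToCommitClusterStateException(String message, Throwable cause) { super(message, cause); }
    }

    // Hypothetical stand-in for whatever actually sends the state over the wire.
    interface Publisher {
        void publish(long version) throws Exception;
    }

    static void publishIfMaster(boolean isMaster, long version, Publisher publisher) {
        if (isMaster == false) {
            // Nothing has been sent yet, so the update definitely failed.
            throw new NotMasterException(
                "node is no longer the master prior to publication of cluster state version [" + version + "]"
            );
        }
        try {
            publisher.publish(version);
        } catch (Exception e) {
            // The update was sent, so whether it took effect is ambiguous.
            throw new FailedToCommitClusterStateException(
                "publication of cluster state version [" + version + "] failed",
                e
            );
        }
    }
}
```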


Testing

  • Unit and integration tests succeed
  • I ran the MasterService test suite 100 times through

Next Steps

The goal of this work is to fix up all erroneously used FailedToCommitClusterStateException.

Done:

Todo:

  • Replace all subsequent pre-publication FailedToCommitClusterStateExceptions with FailedToPublishClusterStateException, a new exception designed to capture any failure before the cluster state update is sent over the wire. This exception will represent a guaranteed cluster state update failure, while FailedToCommitClusterStateException will be used once the update has been sent over the wire, representing ambiguity.
  • Replace the FailedToCommitClusterStateException inside MasterService.BatchingTaskQueue.submitTask with a FailedToPublishClusterStateException.

Relates to ES-13061

Changes a FailedToCommitClusterStateException incorrectly thrown prior
to cluster state update publication to a NotMasterException
@joshua-adams-1 joshua-adams-1 self-assigned this Sep 26, 2025
@joshua-adams-1 joshua-adams-1 added the >non-issue and :Distributed Coordination/Distributed labels Sep 26, 2025
@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review September 29, 2025 12:50
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination label Sep 29, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@joshua-adams-1
Contributor Author

ISTG elasticsearchbot has a bug.

Hey @justzh , I'm not sure what you mean by this. Is there anything I can do to help?

Comment on lines 428 to 430
logger.debug(
() -> format(
"node is no longer the master prior to publication of cluster state version [%s]: [%s]",
Contributor

Is there any loss in understanding of the log message moving the summary to the end after the colon, without the 'failing' prefix, for the NotMasterException case?

Comment on lines -988 to +1005
- void onPublishFailure(FailedToCommitClusterStateException e) {
+ void onPublishFailure(Exception e) {
Contributor

I see this is only called with FailedToCommitClusterStateException or NotMasterException from above, but is it worth trying to be as specific as possible here instead of Exception? Maybe the common base class ElasticsearchException, and also adding an assertion that it is one of the two expected types?

Contributor Author

Unfortunately, the parameter e is passed in from an onFailure method and is of type Exception, so I can't narrow this down.

But since this method is currently called from within this if statement:

if (exception instanceof FailedToCommitClusterStateException || exception instanceof NotMasterException) {
    ....

I know the type must currently be either FailedToCommitClusterStateException or NotMasterException, so it's a good call to add an assertion in case this method is used elsewhere in future and that no longer holds.
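A minimal sketch of what such an assertion could look like, keeping the parameter typed as Exception. The exception classes here are stand-ins for the real Elasticsearch types, and `isExpectedPublishFailure` is a hypothetical helper added only so the check is easy to read and test; the actual change may look different.

```java
// Illustration only: stand-in exception types and a hypothetical helper method.
class OnPublishFailureSketch {
    static class NotMasterException extends RuntimeException {
        NotMasterException(String message) { super(message); }
    }

    static class FailedToCommitClusterStateException extends RuntimeException {
        FailedToCommitClusterStateException(String message) { super(message); }
    }

    // True only for the two exception types the caller currently passes in.
    static boolean isExpectedPublishFailure(Exception e) {
        return e instanceof FailedToCommitClusterStateException || e instanceof NotMasterException;
    }

    static void onPublishFailure(Exception e) {
        // Only checked when assertions are enabled (java -ea), e.g. in tests.
        assert isExpectedPublishFailure(e) : "unexpected exception type: " + e;
        // ... failure handling would go here ...
    }
}
```

Because Java assertions are disabled by default at runtime, this guards against misuse in test runs without changing production behaviour.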

Contributor

@JeremyDahlgren JeremyDahlgren left a comment

LGTM

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM2

joshua-adams-1 and others added 2 commits October 6, 2025 11:23
@joshua-adams-1 joshua-adams-1 merged commit 1d88dbc into elastic:main Oct 6, 2025
34 checks passed
@joshua-adams-1 joshua-adams-1 deleted the change-failed-to-commit-exception-to-not-master-exception-before-publishing branch October 6, 2025 11:53
Labels
:Distributed Coordination/Distributed, >non-issue, Team:Distributed Coordination, v9.3.0
4 participants