
Conversation

@joshua-adams-1 (Contributor) commented Sep 18, 2025

This is the first of a series of PRs fixing how the FailedToCommitClusterStateException is used in Elasticsearch. As per #135017, FailedToCommitClusterStateException is defined as:

> Thrown when a cluster state publication fails to commit the new cluster state. If publication fails then a new master is elected but the update might or might not take effect, depending on whether the newly-elected master accepted the published state that failed to be committed. This exception should only be used when there is *ambiguity* whether a state update took effect or not.

Currently, FailedToCommitClusterStateException is used as a 'catch-all' exception, thrown in multiple places throughout the Coordinator and MasterService during the publication process. Semantically, however, it doesn't make sense to throw this exception before the cluster state update is actually sent over the wire, since at that point we know for certain that the cluster state update failed. FailedToCommitClusterStateException is intended to convey ambiguity.

This work is a pre-requisite to #134213.


Changes

This PR modifies the Batch.onRejection method to accept a NotMasterException rather than a FailedToCommitClusterStateException (see the sketch after this list). We should throw a NotMasterException here because:

  1. This code is actually called when draining the queue after the threadpool has shut down, not during a cluster state update! FailedToCommitClusterStateException was only used here because it is retryable within TransportMasterNodeAction, making it a quick way to guarantee retries.
  2. The master node has closed, so a NotMasterException is semantically clearer, while still being retryable and therefore preserving existing behaviour.
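
As a rough sketch of the shape of the change (simplified and abbreviated, not the verbatim Elasticsearch source; the real interface is MasterService.Batch and has additional methods):

```java
import org.elasticsearch.cluster.NotMasterException;

// Simplified sketch only: the real callback lives on MasterService.Batch.
// The substantive change is the parameter type of the rejection callback.
interface Batch {
    // before this PR: void onRejection(FailedToCommitClusterStateException e);
    // after this PR: the rejection is reported as a NotMasterException, which
    // TransportMasterNodeAction still treats as retryable on the next master.
    void onRejection(NotMasterException e);
}
```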

Testing

  • Unit and integration tests pass
  • I have run MasterServiceTests 100 times

Next Steps

The goal of this work is to fix up all erroneous uses of FailedToCommitClusterStateException.

Done:

Todo:

  • Change a FailedToCommitClusterStateException to NotMasterException during the pre-publication process: #135548
  • Replace all subsequent pre-publication FailedToCommitClusterStateExceptions with FailedToPublishClusterStateException, a new exception designed to capture failures that occur before the cluster state update is sent over the wire. This new exception will represent guaranteed cluster state update failure, with FailedToCommitClusterStateException reserved for after the update has been sent over the wire, when the outcome is ambiguous (a hypothetical sketch follows this list).
  • Update any remaining uses of FailedToCommitClusterStateException that still need to change.
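
As a purely hypothetical sketch of what the planned FailedToPublishClusterStateException might look like (this class does not exist yet; the name comes from the Todo item above and the shape below is illustrative only, not a committed design):

```java
import org.elasticsearch.ElasticsearchException;

// Hypothetical: a pre-publication failure is definite, so unlike
// FailedToCommitClusterStateException this type would carry no ambiguity about
// whether the state update took effect; whether it should be retryable is a
// separate decision for the follow-up PR.
public class FailedToPublishClusterStateException extends ElasticsearchException {

    public FailedToPublishClusterStateException(String msg, Object... args) {
        super(msg, args);
    }

    public FailedToPublishClusterStateException(String msg, Throwable cause, Object... args) {
        super(msg, cause, args);
    }
}
```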

Relates to ES-13061

Modifies the `Batch.onRejection` method to accept a `NotMasterException` rather than a `FailedToCommitClusterStateException`. The `NotMasterException` is thrown because the master node has closed. At this point we haven't even tried to commit the cluster state, so we know for a fact that the update has failed; a `FailedToCommitClusterStateException`, which implies ambiguity, is therefore wrong here.
@joshua-adams-1 self-assigned this Sep 18, 2025
@joshua-adams-1 added the :Distributed Coordination/Distributed label Sep 18, 2025
@joshua-adams-1 (Contributor, Author) commented Sep 18, 2025

I have assumed that these exceptions are eventually rolled into a 4XX error and therefore not exposed to the user, so it shouldn't matter that I'm changing them and no changelog update is required. Are there any disagreements with this?

@joshua-adams-1 marked this pull request as ready for review September 19, 2025 13:19
@elasticsearchmachine added the Team:Distributed Coordination label Sep 19, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@JeremyDahlgren (Contributor) left a comment:

These changes look like they are in line with the discussions you and David have had in Slack. The exact exception used here seems like an internal implementation detail, so >non-issue seems fine, or even >refactoring. It might be good to have @DiannaHohensee or @henningandersen weigh in, or wait for David to comment when he is back, if you are not blocked by this.

@DiannaHohensee (Contributor) left a comment:

Left a couple of questions.

// single-threaded: started when totalQueueSize transitions from 0 to 1 and keeps calling itself until the queue is drained.
if (lifecycle.started() == false) {
-   drainQueueOnRejection(new FailedToCommitClusterStateException("node closed", getRejectionException()));
+   drainQueueOnRejection(new NotMasterException("node closed", getRejectionException()));
Contributor comment:

forkQueueProcessor can be called on success or failure of a Runnable. Is that a problem in terms of knowing whether or not the cluster update was successful, or is it not a problem for some reason? I haven't worked with the MasterService before, so I might be missing something.

@joshua-adams-1 (Contributor, Author) replied:

AFAIU, your question is: "During a cluster state update, where is this code running? If this exception is thrown after the cluster state update is pushed to the wire, then according to the FailedToCommitClusterStateException definition here, it is the correct exception to be throwing."

I'm still trying to understand the code myself, so excuse me if I make a mistake, but as I understand it, this code isn't even running during a cluster state update. forkQueueProcessor is called in two places: here in the Runnable and here in a PerPriorityQueue.

PerPriorityQueue

In this case we're draining the queue because the threadpool shut down. The workflow is:

  1. We batch tasks on the master node to be executed together. This is handled by a BatchingTaskQueue, here
  2. When submitting a task to this queue, we invoke a PerPriorityQueue, here
  3. The execute() function then invokes forkQueueProcessor here
  4. A FailedToCommitClusterStateException is passed into drainQueueOnRejection here
  5. drainQueueOnRejection here uses the exception to reject a batch of tasks.
  6. In this case, we are not attempting to publish a cluster state update, so a FailedToCommitClusterStateException is wrong. The cluster state update publication workflow lives here inside the Coordinator. The reason a FailedToCommitClusterStateException is thrown is explained by a comment inside the Batch interface, here, which I shall copy below:
/**
 * Called when the batch is rejected due to the master service shutting down.
 *
 * @param e is a {@link FailedToCommitClusterStateException} to cause things like {@link TransportMasterNodeAction} to retry after
 *          submitting a task to a master which shut down. {@code e.getCause()} is the rejection exception, which should be a
 *          {@link EsRejectedExecutionException} with {@link EsRejectedExecutionException#isExecutorShutdown()} true.
 */
// Should really be a NodeClosedException instead, but this exception type doesn't trigger retries today.

As shown, the FailedToCommitClusterStateException here was acknowledged to be wrong from the beginning.

Runnable

The Runnable is invoked in only one place, inside forkQueueProcessor, here. The success or failure you mention above does not refer to cluster state update success or failure, but rather to success or failure in executing a batch of tasks from the queue.

Therefore, in this case, I believe the exception to be incorrect and used only as a quick way to guarantee retries. As explained in another comment, both FailedToCommitClusterStateException and NotMasterException are retried inside TransportMasterNodeAction, so replacing one with the other should have no adverse effect while semantically improving our exception handling.
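
As a rough illustration of that retry equivalence (not part of this change; package locations assumed from the usual Elasticsearch layout), both exception types satisfy the same publish-failure check that TransportMasterNodeAction uses to schedule a retry:

```java
import org.elasticsearch.cluster.NotMasterException;
import org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException;
import org.elasticsearch.cluster.service.MasterService;

// Illustrative only: isPublishFailureException() (quoted in full later in this thread)
// returns true for both exception types, so TransportMasterNodeAction retries either way.
public class RetryEquivalenceSketch {
    public static void main(String[] args) {
        System.out.println(MasterService.isPublishFailureException(new NotMasterException("master closed")));            // true
        System.out.println(MasterService.isPublishFailureException(new FailedToCommitClusterStateException("failed")));  // true
    }
}
```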

* submitting a task to a master which shut down. {@code e.getCause()} is the rejection exception, which should be a
* {@link EsRejectedExecutionException} with {@link EsRejectedExecutionException#isExecutorShutdown()} true.
*/
// Should really be a NodeClosedException instead, but this exception type doesn't trigger retries today.
Contributor comment:

Did you look into whether NotMasterException will trigger some sort of retry?

@joshua-adams-1 (Contributor, Author) replied Sep 30, 2025:

A NotMasterException is thrown when a non-master node is attempting to perform an action reserved for a master node. In this instance, it could occur when a master node loses an election midway through a cluster state update and is then not the master anymore. The action is therefore not to be retried on the same node (since it would hit the same exception), but is expected to be retried on the next master.

In the code, this can be seen here inside TransportMasterNodeAction:

// TransportMasterNodeAction
ActionListener<Response> delegate = listener.delegateResponse((delegatedListener, t) -> {
    if (MasterService.isPublishFailureException(t)) {
        logger.debug(
            () -> format(
                "master could not publish cluster state or "
                    + "stepped down before publishing action [%s], scheduling a retry",
                actionName
            ),
            t
        );
        retryOnNextState(currentStateVersion, t);
    } else {
        logger.debug("unexpected exception during publication", t);
        delegatedListener.onFailure(t);
    }
});

// MasterService
public static boolean isPublishFailureException(Exception e) {
    return e instanceof NotMasterException || e instanceof FailedToCommitClusterStateException;
}

where if a NotMasterException is thrown, we retry on the next cluster state update, which should come from the new master. Since the exception currently thrown is a FailedToCommitClusterStateException, which is retried in the same way, this change preserves existing behaviour.

@joshua-adams-1 merged commit bed02e2 into elastic:main Oct 2, 2025
34 checks passed
@joshua-adams-1 deleted the fork-queue-processor-not-master-exception branch October 2, 2025 11:02
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 3, 2025
... and reformat Javadoc
@DaveCTurner (Contributor) left a comment:

LGTM2 apart from one nit

* submitting a task to a master which shut down. {@code e.getCause()} is the rejection exception, which should be a
* {@link EsRejectedExecutionException} with {@link EsRejectedExecutionException#isExecutorShutdown()} true.
*/
// Should really be a NodeClosedException instead, but this exception type doesn't trigger retries today.
Contributor comment:

I think we should have kept this comment - I opened #135902 to reinstate it.

Labels: :Distributed Coordination/Distributed, >non-issue, Team:Distributed Coordination, v9.3.0