
Conversation

@joshua-adams-1 (Contributor) commented Sep 18, 2025

This is the first of a series of PRs fixing how the FailedToCommitClusterStateException is used in Elasticsearch. As per #135017, FailedToCommitClusterStateException is defined as:

> Thrown when a cluster state publication fails to commit the new cluster state. If publication fails then a new master is elected but the update might or might not take effect, depending on whether the newly-elected master accepted the published state that failed to be committed. This exception should only be used when there is *ambiguity* whether a state update took effect or not.

Currently, FailedToCommitClusterStateException is used as a 'catch-all' exception, thrown in multiple places throughout the Coordinator and MasterService during the publication process. Semantically, however, it doesn't make sense to throw this exception before the cluster state update is actually sent over the wire, since at that point we know for certain that the cluster state update failed. FailedToCommitClusterStateException is intended to convey ambiguity.

This work is a pre-requisite to #134213.


Changes

This PR modifies the Batch.onRejection method to accept a NotMasterException rather than a FailedToCommitClusterStateException (see the sketch after this list). We should throw a NotMasterException here because:

  1. This code is actually called when draining the queue after the threadpool has shut down, not during a cluster state update! FailedToCommitClusterStateException was only used here because it is retryable within TransportMasterNodeAction, making it a quick way to guarantee retries.
  2. The master node has closed, so a NotMasterException is semantically clearer, while still being retryable and therefore preserving existing behaviour.
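
As a rough sketch of the shape of the change (simplified and abbreviated, not the verbatim Elasticsearch source; the real interface is MasterService.Batch and has additional methods):

```java
import org.elasticsearch.cluster.NotMasterException;

// Simplified sketch only: the real callback lives on MasterService.Batch.
// The substantive change is the parameter type of the rejection callback.
interface Batch {
    // before this PR: void onRejection(FailedToCommitClusterStateException e);
    // after this PR: the rejection is reported as a NotMasterException, which
    // TransportMasterNodeAction still treats as retryable on the next master.
    void onRejection(NotMasterException e);
}
```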

Testing

  • Unit and integration tests pass
  • I have run MasterServiceTests 100 times

Next Steps

The goal of this work is to fix up all erroneous uses of FailedToCommitClusterStateException.

Done:

Todo:

  • Change a FailedToCommitClusterStateException to NotMasterException during the pre-publication process: #135548
  • Replace all subsequent pre-publication FailedToCommitClusterStateExceptions with FailedToPublishClusterStateException, a new exception designed to capture failures that occur before the cluster state update is sent over the wire. This new exception will represent guaranteed cluster state update failure, with FailedToCommitClusterStateException reserved for after the update has been sent over the wire, when the outcome is ambiguous (a hypothetical sketch follows this list).
  • Update any remaining uses of FailedToCommitClusterStateException that still need to change.
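
As a purely hypothetical sketch of what the planned FailedToPublishClusterStateException might look like (this class does not exist yet; the name comes from the Todo item above and the shape below is illustrative only, not a committed design):

```java
import org.elasticsearch.ElasticsearchException;

// Hypothetical: a pre-publication failure is definite, so unlike
// FailedToCommitClusterStateException this type would carry no ambiguity about
// whether the state update took effect; whether it should be retryable is a
// separate decision for the follow-up PR.
public class FailedToPublishClusterStateException extends ElasticsearchException {

    public FailedToPublishClusterStateException(String msg, Object... args) {
        super(msg, args);
    }

    public FailedToPublishClusterStateException(String msg, Throwable cause, Object... args) {
        super(msg, cause, args);
    }
}
```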

Relates to ES-13061

Modifies the `Batch.onRejection` method to accept a `NotMasterException` rather than a `FailedToCommitClusterStateException`. The `NotMasterException` is thrown because the master node has closed. At this point we haven't even tried to commit the cluster state, so we know for a fact that the update has failed; a `FailedToCommitClusterStateException`, which implies ambiguity, is therefore wrong here.
@joshua-adams-1 self-assigned this Sep 18, 2025
@joshua-adams-1 added the :Distributed Coordination/Distributed label Sep 18, 2025
@joshua-adams-1 (Contributor, Author) commented Sep 18, 2025

I have assumed that these exceptions are eventually rolled into a 4XX error and therefore not exposed to the user, so it shouldn't matter that I'm changing them and no changelog update is required. Are there any disagreements with this?

@joshua-adams-1 marked this pull request as ready for review September 19, 2025 13:19
@elasticsearchmachine added the Team:Distributed Coordination label Sep 19, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@JeremyDahlgren (Contributor) left a comment:

These changes look like they are in line with the discussions you and David have had in Slack. The exact exception used here seems like an internal implementation detail, so >non-issue seems fine, or even >refactoring. It might be good to have @DiannaHohensee or @henningandersen weigh in, or wait for David to comment when he is back, if you are not blocked by this.

@DiannaHohensee (Contributor) left a comment:

Left a couple of questions.

// single-threaded: started when totalQueueSize transitions from 0 to 1 and keeps calling itself until the queue is drained.
if (lifecycle.started() == false) {
-   drainQueueOnRejection(new FailedToCommitClusterStateException("node closed", getRejectionException()));
+   drainQueueOnRejection(new NotMasterException("node closed", getRejectionException()));
Contributor comment:

forkQueueProcessor can be called on success or failure of a Runnable. Is that a problem in terms of knowing whether or not the cluster update was successful, or is it not a problem for some reason? I haven't worked with the MasterService before, so I might be missing something.

@joshua-adams-1 (Contributor, Author) replied:

AFAIU, your question is: "During a cluster state update, where is this code running? If this exception is thrown after the cluster state update is pushed to the wire, then according to the FailedToCommitClusterStateException definition here, it is the correct exception to be throwing."

I'm still trying to understand the code myself, so excuse me if I make a mistake, but as I understand it, this code isn't even running during a cluster state update. forkQueueProcessor is called in two places: here in the Runnable and here in a PerPriorityQueue.

PerPriorityQueue

In this case we're draining the queue because the threadpool shut down. The workflow is:

  1. We batch tasks on the master node to be executed together. This is handled by a BatchingTaskQueue, here
  2. When submitting a task to this queue, we invoke a PerPriorityQueue, here
  3. The execute() function then invokes forkQueueProcessor here
  4. A FailedToCommitClusterStateException is passed into drainQueueOnRejection here
  5. drainQueueOnRejection here uses the exception to reject a batch of tasks.
  6. In this case, we are not attempting to publish a cluster state update, so a FailedToCommitClusterStateException is wrong. The cluster state update publication workflow lives here inside the Coordinator. The reason a FailedToCommitClusterStateException is thrown is explained by a comment inside the Batch interface, here, which I shall copy below:
/**
 * Called when the batch is rejected due to the master service shutting down.
 *
 * @param e is a {@link FailedToCommitClusterStateException} to cause things like {@link TransportMasterNodeAction} to retry after
 *          submitting a task to a master which shut down. {@code e.getCause()} is the rejection exception, which should be a
 *          {@link EsRejectedExecutionException} with {@link EsRejectedExecutionException#isExecutorShutdown()} true.
 */
// Should really be a NodeClosedException instead, but this exception type doesn't trigger retries today.

As shown, the FailedToCommitClusterStateException here was acknowledged to be wrong from the beginning.

Runnable

The Runnable is invoked in only one place, inside forkQueueProcessor, here. The success or failure you mention above does not refer to cluster state update success or failure, but rather to success or failure in executing a batch of tasks from the queue.

Therefore, in this case, I believe the exception to be incorrect and used only as a quick way to guarantee retries. As explained in another comment, both FailedToCommitClusterStateException and NotMasterException are retried inside TransportMasterNodeAction, so replacing one with the other should have no adverse effect while semantically improving our exception handling.
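
As a rough illustration of that retry equivalence (not part of this change; package locations assumed from the usual Elasticsearch layout), both exception types satisfy the same publish-failure check that TransportMasterNodeAction uses to schedule a retry:

```java
import org.elasticsearch.cluster.NotMasterException;
import org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException;
import org.elasticsearch.cluster.service.MasterService;

// Illustrative only: isPublishFailureException() (quoted in full later in this thread)
// returns true for both exception types, so TransportMasterNodeAction retries either way.
public class RetryEquivalenceSketch {
    public static void main(String[] args) {
        System.out.println(MasterService.isPublishFailureException(new NotMasterException("master closed")));            // true
        System.out.println(MasterService.isPublishFailureException(new FailedToCommitClusterStateException("failed")));  // true
    }
}
```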

* submitting a task to a master which shut down. {@code e.getCause()} is the rejection exception, which should be a
* {@link EsRejectedExecutionException} with {@link EsRejectedExecutionException#isExecutorShutdown()} true.
*/
// Should really be a NodeClosedException instead, but this exception type doesn't trigger retries today.
Contributor comment:

Did you look into whether NotMasterException will trigger some sort of retry?

@joshua-adams-1 (Contributor, Author) replied Sep 30, 2025:

A NotMasterException is thrown when a non-master node is attempting to perform an action reserved for a master node. In this instance, it could occur when a master node loses an election midway through a cluster state update and is then not the master anymore. The action is therefore not to be retried on the same node (since it would hit the same exception), but is expected to be retried on the next master.

In the code, this can be seen here inside TransportMasterNodeAction:

// TransportMasterNodeAction
ActionListener<Response> delegate = listener.delegateResponse((delegatedListener, t) -> {
    if (MasterService.isPublishFailureException(t)) {
        logger.debug(
            () -> format(
                "master could not publish cluster state or "
                    + "stepped down before publishing action [%s], scheduling a retry",
                actionName
            ),
            t
        );
        retryOnNextState(currentStateVersion, t);
    } else {
        logger.debug("unexpected exception during publication", t);
        delegatedListener.onFailure(t);
    }
});

// MasterService
public static boolean isPublishFailureException(Exception e) {
    return e instanceof NotMasterException || e instanceof FailedToCommitClusterStateException;
}

where if a NotMasterException is thrown, we retry on the next cluster state update, which should come from the new master. Since the exception currently thrown is a FailedToCommitClusterStateException, which is retried in the same way, this change preserves existing behaviour.

@joshua-adams-1 merged commit bed02e2 into elastic:main Oct 2, 2025
34 checks passed
@joshua-adams-1 deleted the fork-queue-processor-not-master-exception branch October 2, 2025 11:02
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 3, 2025
... and reformat Javadoc
@DaveCTurner (Contributor) left a comment:

LGTM2 apart from one nit

* submitting a task to a master which shut down. {@code e.getCause()} is the rejection exception, which should be a
* {@link EsRejectedExecutionException} with {@link EsRejectedExecutionException#isExecutorShutdown()} true.
*/
// Should really be a NodeClosedException instead, but this exception type doesn't trigger retries today.
Contributor comment:

I think we should have kept this comment - I opened #135902 to reinstate it.

Labels: :Distributed Coordination/Distributed, >non-issue, Team:Distributed Coordination, v9.3.0