-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Change FailedToCommitClusterStateException to NotMasterException #135008
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1283,7 +1283,7 @@ public String toString() { | |
if (lifecycle.started()) { | ||
nextBatch.run(batchCompletionListener); | ||
} else { | ||
nextBatch.onRejection(new FailedToCommitClusterStateException("node closed", getRejectionException())); | ||
nextBatch.onRejection(new NotMasterException("node closed", getRejectionException())); | ||
batchCompletionListener.onResponse(null); | ||
} | ||
}); | ||
|
@@ -1309,7 +1309,7 @@ private void onCompletion() { | |
@Override | ||
public void onRejection(Exception e) { | ||
assert e instanceof EsRejectedExecutionException esre && esre.isExecutorShutdown() : e; | ||
drainQueueOnRejection(new FailedToCommitClusterStateException("node closed", e)); | ||
drainQueueOnRejection(new NotMasterException("node closed", e)); | ||
} | ||
|
||
@Override | ||
|
@@ -1336,7 +1336,7 @@ private Batch takeNextBatch() { | |
private void forkQueueProcessor() { | ||
// single-threaded: started when totalQueueSize transitions from 0 to 1 and keeps calling itself until the queue is drained. | ||
if (lifecycle.started() == false) { | ||
drainQueueOnRejection(new FailedToCommitClusterStateException("node closed", getRejectionException())); | ||
drainQueueOnRejection(new NotMasterException("node closed", getRejectionException())); | ||
return; | ||
} | ||
|
||
|
@@ -1353,7 +1353,7 @@ private EsRejectedExecutionException getRejectionException() { | |
return new EsRejectedExecutionException("master service is in state [" + lifecycleState() + "]", true); | ||
} | ||
|
||
private void drainQueueOnRejection(FailedToCommitClusterStateException e) { | ||
private void drainQueueOnRejection(NotMasterException e) { | ||
assert totalQueueSize.get() > 0; | ||
do { | ||
assert currentlyExecutingBatch == null; | ||
|
@@ -1407,12 +1407,11 @@ private interface Batch { | |
/** | ||
* Called when the batch is rejected due to the master service shutting down. | ||
* | ||
* @param e is a {@link FailedToCommitClusterStateException} to cause things like {@link TransportMasterNodeAction} to retry after | ||
* @param e is a {@link NotMasterException} to cause things like {@link TransportMasterNodeAction} to retry after | ||
* submitting a task to a master which shut down. {@code e.getCause()} is the rejection exception, which should be a | ||
* {@link EsRejectedExecutionException} with {@link EsRejectedExecutionException#isExecutorShutdown()} true. | ||
*/ | ||
// Should really be a NodeClosedException instead, but this exception type doesn't trigger retries today. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did you look into this, whether NotMasterException will trigger some sort of retry? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A In the code, this can be seen here inside
where if an There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we should have kept this comment - I opened #135902 to reinstate it. |
||
void onRejection(FailedToCommitClusterStateException e); | ||
void onRejection(NotMasterException e); | ||
|
||
/** | ||
* @return number of tasks in this batch if the batch is pending, or {@code 0} if the batch is not pending. | ||
|
@@ -1634,7 +1633,7 @@ T acquireForExecution() { | |
return task; | ||
} | ||
|
||
void onRejection(FailedToCommitClusterStateException e) { | ||
void onRejection(NotMasterException e) { | ||
final var task = acquireForExecution(); | ||
if (task != null) { | ||
try (var ignored = storedContextSupplier.get()) { | ||
|
@@ -1654,7 +1653,7 @@ boolean isPending() { | |
|
||
private class Processor implements Batch { | ||
@Override | ||
public void onRejection(FailedToCommitClusterStateException e) { | ||
public void onRejection(NotMasterException e) { | ||
final var items = queueSize.getAndSet(0); | ||
for (int i = 0; i < items; i++) { | ||
final var entry = queue.poll(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forkQueueProcessor
can be called on success or failure of a Runnable. Is that a problem in terms of knowing whether or not the cluster update was successful, or is it not a problem for some reason? I haven't worked with the MasterService before, so I might be missing something.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU, your question is: "During a cluster state update, where is this code running, since if this exception is thrown after the cluster state update is pushed to the wire, then according to the
FailedToCommitClusterStateException
exception definition here, this is the correct exception to be throwing".I'm still trying to understand the code myself so excuse me if I make a mistake, but as I understand it, this code isn't even running during a cluster state update.
forkQueueProcessor
is called twice in the code, here in the Runnable and here in aPerPriorityQueue
.PerPriorityQueue
In this case we're draining the queue because the threadpool shut down. The workflow is:
BatchingTaskQueue
, herePerPriorityQueue
, hereexecute()
function called invokesforkQueueProcessor
hereFailedToCommitClusterStateException
is passed intodrainQueueOnRejection
heredrainQueueOnRejection
here uses the exception to reject a batch of tasks.FailedToCommitClusterStateException
is wrong. The cluster state update publication workflow is here inside theCoordinator
. The reason aFailedToCommitClusterStateException
is thrown is explained by a comment inside theBatch
interface, here, but I shall copy below:^^ As shown, the
FailedToCommitClusterStateException
was always acknowledged to be wrong from thebeginning.
Runnable
The runnable is invoked in only one place, inside
forkQueueProcessor
, here. The success or failure you mention above is not referring to cluster state update success or failure, but rather success or failure executing a batch of tasks from the queue.Therefore, in this case, I believe the exception to be incorrect, and used as a quick way to guarantee retries. As explained in another comment, both
FailedToCommitClusterStateException
andNotMasterException
retry insideTransportMasterNodeAction
, so replacing one with the other should have no adverse effect, while semantically improving our exception handling