-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Don't run async actions when ILM is stopped #133683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Today ILM will run `AsyncActionStep`s even if ILM is stopped. These actions are started either as callbacks after previous actions complete or when the move-to-step API is used. By checking the ILM operation mode before running the action in `IndexLifecycleRunner#maybeRunAsyncAction`, we prevent these actions from being executed while ILM is stopped. `AsyncActionStep`s are currently only automatically started as callbacks after previous actions complete or after a master failover. To ensure that these steps will be executed when ILM is restarted after a stop, we loop over all the managed indices and start all async action steps. Fixes elastic#81234 Fixes elastic#85097 Fixes elastic#99859
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR prevents ILM (Index Lifecycle Management) from executing AsyncActionSteps when ILM is stopped. Previously, these async actions could still run even when ILM was in a stopped state, either through callbacks or the move-to-step API. The fix checks the ILM operation mode before executing async actions and ensures proper recovery when ILM is restarted.
- Added operation mode check in
maybeRunAsyncAction
to prevent execution when ILM is not running - Refactored master node handling to kick off pending async actions when ILM transitions from stopped to running
- Added comprehensive test coverage for the async action blocking behavior when using the move-to-step API
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
IndexLifecycleService.java | Refactored onMaster method to handle async action execution and detect ILM mode transitions |
IndexLifecycleRunner.java | Added operation mode check to prevent async actions when ILM is stopped |
TimeseriesMoveToStepIT.java | Added integration test to verify async actions don't execute when ILM is stopped |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
String format = format( | ||
"async action execution failed during master election trigger for index [%s] with policy [%s] in step [%s]", | ||
idxMeta.getIndex().getName(), | ||
policyName, | ||
stepKey | ||
); | ||
if (logger.isTraceEnabled()) { | ||
format += format(", lifecycle state: [%s]", lifecycleState.asMap()); | ||
} | ||
logger.warn(format, e); |
Copilot
AI
Aug 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using format
as both a variable name and function call creates ambiguous code. The variable format
is being concatenated with the result of calling format()
function, which could be confusing and error-prone.
Copilot uses AI. Check for mistakes.
// Since ILM is stopped, the async action should not execute and the index should remain in the readonly step. | ||
// This is the tricky part of the test, as we can't really verify that the async action will never happen. | ||
assertEquals(new StepKey("hot", "readonly", "readonly"), getStepKeyForIndex(client(), originalIndex)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestions for improved assertions are welcome. Adding a Thread.sleep
here would increase our confidence, but still wouldn't make any guarantee.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we write a test that wraps the Client
passed to IndexLifecycleService
with a wrapper that asserts that we never perform some particular ILM action?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test to IndexLifecycleRunnerTests
in 370d966 that verifies the action isn't executed when ILM is stopped. Let me know if that's what you had in mind. I chose for testing IndexLifecycleRunner#maybeRunAsyncAction
instead of IndexLifecycleService#maybeRunAsyncAction
, as the latter is only used by APIs and the former is also used internally by ILM.
} | ||
|
||
boolean safeToStop = true; // true until proven false by a run policy | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in this method aren't stricly necessary, I just took this opportunity to clean this method up a bit, as it was becoming hard to read. If there are concerns with this refactor, I can revert these stylistic changes and stick to the bug fix.
Pinging @elastic/es-data-management (Team:Data Management) |
Hi @nielsbauman, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look at this Niels. I took a look at this and left some suggestions for testing and the refactoring, let me know what you think
// Since ILM is stopped, the async action should not execute and the index should remain in the readonly step. | ||
// This is the tricky part of the test, as we can't really verify that the async action will never happen. | ||
assertEquals(new StepKey("hot", "readonly", "readonly"), getStepKeyForIndex(client(), originalIndex)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we write a test that wraps the Client
passed to IndexLifecycleService
with a wrapper that asserts that we never perform some particular ILM action?
if (currentMode == OperationMode.RUNNING) { | ||
lifecycleRunner.maybeRunAsyncAction(state, idxMeta, policyName, stepKey); | ||
continue; | ||
} | ||
if (stepKey != null && IGNORE_STEPS_MAINTENANCE_REQUESTED.contains(stepKey.name())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this part is harder to read, because someone has to keep in mind that the second if
statement executes only if currentMode != OperationMode.RUNNING
.
What about factoring it into separate methods, so that it can look something like:
switch (currentMode) {
case OperationMode.RUNNING:
lifecycleRunner.maybeRunAsyncAction(state, idxMeta, policyName, stepKey);
break;
case OperationMode.STOPPING:
case OperationMode.STOPPED:
runOrCheckIfSafeToIgnore(…);
default:
throw new IllegalArgumentException("you need to handle the case for " + currentMode);
}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because someone has to keep in mind that ...
I'm not sure I get what you mean. The two if
statements are only a few lines apart, right? If they would be far apart, I could get that it's harder to understand/remember.
I do agree that it's easier to miss that we already do the check
if (OperationMode.STOPPED.equals(currentMode)) {
return;
}
earlier in the method, which means that the second if
statement you were referring to only executes if currentMode == STOPPING
. Therefore, the switch
you suggested looks a bit overkill to me, as there would only be two branches. A switch
with two values is essentially just an if-else
, which is basically what I have now.
I noticed another if
statement at the start of this method that I could invert, and I added a comment between these two if
statements, in the hope that that will make it easier to read, in 26b3dc4. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the further refactoring. The method is still longer than I'd like (not at all due to you or this PR, just in general), but that's something we can work on at a later time.
Today ILM will run
AsyncActionStep
s even if ILM is stopped. These actions are started either as callbacks after previous actions complete or when the move-to-step API is used. By checking the ILM operation mode before running the action inIndexLifecycleRunner#maybeRunAsyncAction
, we prevent these actions from being executed while ILM is stopped.AsyncActionStep
s are currently only automatically started as callbacks after previous actions complete or after a master failover. To ensure that these steps will be executed when ILM is restarted after a stop, we loop over all the managed indices and start all async action steps.Fixes #81234
Fixes #85097
Fixes #99859