nielsbauman
Contributor

Today ILM will run AsyncActionSteps even if ILM is stopped. These actions are started either as callbacks after previous actions complete or when the move-to-step API is used. By checking the ILM operation mode before running the action in IndexLifecycleRunner#maybeRunAsyncAction, we prevent these actions from being executed while ILM is stopped.

AsyncActionSteps are currently only automatically started as callbacks after previous actions complete or after a master failover. To ensure that these steps will be executed when ILM is restarted after a stop, we loop over all the managed indices and start all async action steps.

Fixes #81234
Fixes #85097
Fixes #99859
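
As a rough illustration of the two changes described above, here is a minimal, self-contained sketch. It is not the Elasticsearch code: `SketchOperationMode`, `IlmStopSketch`, and all of their members are hypothetical names, every managed index is assumed to sit on an async action step, and the guard is assumed to skip only in the `STOPPED` mode.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ILM's operation mode; not the real Elasticsearch enum.
enum SketchOperationMode { RUNNING, STOPPING, STOPPED }

// Simplified model: every managed index is assumed to sit on an async action step.
class IlmStopSketch {
    private SketchOperationMode mode = SketchOperationMode.RUNNING;
    private final List<String> managedIndices = new ArrayList<>();
    // Names of indices whose async action was started, kept for inspection.
    final List<String> started = new ArrayList<>();

    void manage(String index) {
        managedIndices.add(index);
    }

    // Guard: callbacks and move-to-step both end up here, so one check covers both paths.
    void maybeRunAsyncAction(String index) {
        if (mode == SketchOperationMode.STOPPED) {
            return; // ILM is stopped, so the action is not started (assumed condition)
        }
        started.add(index);
    }

    // On a STOPPED -> RUNNING transition, start the async action of every managed index,
    // because no callback or master failover would otherwise trigger it.
    void setMode(SketchOperationMode newMode) {
        SketchOperationMode previous = mode;
        mode = newMode;
        if (previous == SketchOperationMode.STOPPED && newMode == SketchOperationMode.RUNNING) {
            for (String index : managedIndices) {
                maybeRunAsyncAction(index);
            }
        }
    }
}
```

In this model, a move-to-step request that arrives while the mode is `STOPPED` leaves `started` untouched, and the later switch back to `RUNNING` is what starts the pending action.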

@nielsbauman nielsbauman requested a review from Copilot August 27, 2025 19:48
@elasticsearchmachine elasticsearchmachine added v9.2.0 needs:triage Requires assignment of a team area label labels Aug 27, 2025
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR prevents ILM (Index Lifecycle Management) from executing AsyncActionSteps when ILM is stopped. Previously, these async actions could still run even when ILM was in a stopped state, either through callbacks or the move-to-step API. The fix checks the ILM operation mode before executing async actions and ensures proper recovery when ILM is restarted.

  • Added operation mode check in maybeRunAsyncAction to prevent execution when ILM is not running
  • Refactored master node handling to kick off pending async actions when ILM transitions from stopped to running
  • Added comprehensive test coverage for the async action blocking behavior when using the move-to-step API

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| IndexLifecycleService.java | Refactored onMaster method to handle async action execution and detect ILM mode transitions |
| IndexLifecycleRunner.java | Added operation mode check to prevent async actions when ILM is stopped |
| TimeseriesMoveToStepIT.java | Added integration test to verify async actions don't execute when ILM is stopped |

Comment on lines 236 to 245
String format = format(
"async action execution failed during master election trigger for index [%s] with policy [%s] in step [%s]",
idxMeta.getIndex().getName(),
policyName,
stepKey
);
if (logger.isTraceEnabled()) {
format += format(", lifecycle state: [%s]", lifecycleState.asMap());
}
logger.warn(format, e);
Copilot AI Aug 27, 2025

Using format as both a variable name and function call creates ambiguous code. The variable format is being concatenated with the result of calling format() function, which could be confusing and error-prone.
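
For reference, Java keeps local variables and methods in separate namespaces, so the snippet compiles as written; the concern is readability. One possible way to address it is to rename the local variable, sketched below. `LogMessageSketch`, `buildMessage`, and the local `format` helper are hypothetical stand-ins for the statically imported helper used in the snippet, not the code from this PR.

```java
import java.util.Locale;

class LogMessageSketch {
    // Hypothetical stand-in for the statically imported format(...) helper.
    static String format(String pattern, Object... args) {
        return String.format(Locale.ROOT, pattern, args);
    }

    // Builds the warning text; the local is named "message" so it no longer
    // shares a name with the format(...) method it is built from.
    static String buildMessage(String index, String policy, String step, boolean trace, Object lifecycleState) {
        String message = format(
            "async action execution failed during master election trigger for index [%s] with policy [%s] in step [%s]",
            index,
            policy,
            step
        );
        if (trace) {
            message += format(", lifecycle state: [%s]", lifecycleState);
        }
        return message;
    }
}
```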

Comment on lines +303 to +305
// Since ILM is stopped, the async action should not execute and the index should remain in the readonly step.
// This is the tricky part of the test, as we can't really verify that the async action will never happen.
assertEquals(new StepKey("hot", "readonly", "readonly"), getStepKeyForIndex(client(), originalIndex));
Contributor Author

Any suggestions for improved assertions are welcome. Adding a Thread.sleep here would increase our confidence, but still wouldn't make any guarantee.

Member

Could we write a test that wraps the Client passed to IndexLifecycleService with a wrapper that asserts that we never perform some particular ILM action?
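
A minimal sketch of what such a wrapper could look like, under heavy simplification: `SketchClient` is a hypothetical one-method interface rather than the real Elasticsearch `Client`, and the action names in the usage are placeholders. The wrapper fails loudly if a forbidden action is attempted and delegates everything else.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, heavily simplified client interface (not the Elasticsearch Client API).
interface SketchClient {
    void execute(String actionName);
}

// Delegate used in the example: records every action it is asked to execute.
class RecordingSketchClient implements SketchClient {
    final List<String> executed = new ArrayList<>();

    @Override
    public void execute(String actionName) {
        executed.add(actionName);
    }
}

// Wrapper that throws as soon as a forbidden action is attempted.
class AssertingSketchClient implements SketchClient {
    private final SketchClient delegate;
    private final Set<String> forbiddenActions;

    AssertingSketchClient(SketchClient delegate, Set<String> forbiddenActions) {
        this.delegate = delegate;
        this.forbiddenActions = forbiddenActions;
    }

    @Override
    public void execute(String actionName) {
        if (forbiddenActions.contains(actionName)) {
            throw new AssertionError("action [" + actionName + "] must not run while ILM is stopped");
        }
        delegate.execute(actionName);
    }
}
```

A test would pass `new AssertingSketchClient(realClient, Set.of("example:forbidden-action"))` to the component under test, so any attempt to run the forbidden action fails the test immediately instead of relying on a timing-based assertion.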

Contributor Author

I added a test to IndexLifecycleRunnerTests in 370d966 that verifies the action isn't executed when ILM is stopped. Let me know if that's what you had in mind. I chose to test IndexLifecycleRunner#maybeRunAsyncAction instead of IndexLifecycleService#maybeRunAsyncAction, as the latter is only used by APIs and the former is also used internally by ILM.

}

boolean safeToStop = true; // true until proven false by a run policy

Contributor Author

The changes in this method aren't strictly necessary; I just took this opportunity to clean the method up a bit, as it was becoming hard to read. If there are concerns with this refactor, I can revert these stylistic changes and stick to the bug fix.

@nielsbauman nielsbauman added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management and removed needs:triage Requires assignment of a team area label labels Aug 27, 2025
@nielsbauman nielsbauman requested a review from dakrone August 27, 2025 19:51
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Aug 27, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @nielsbauman, I've created a changelog YAML for you.

Member

@dakrone dakrone left a comment

Thanks for taking a look at this, Niels. I left some suggestions for the testing and the refactoring; let me know what you think.

Comment on lines 213 to 217
if (currentMode == OperationMode.RUNNING) {
lifecycleRunner.maybeRunAsyncAction(state, idxMeta, policyName, stepKey);
continue;
}
if (stepKey != null && IGNORE_STEPS_MAINTENANCE_REQUESTED.contains(stepKey.name())) {
Member

I think this part is harder to read, because someone has to keep in mind that the second if statement executes only if currentMode != OperationMode.RUNNING.

What about factoring it into separate methods, so that it can look something like:

switch (currentMode) {
    case OperationMode.RUNNING:
        lifecycleRunner.maybeRunAsyncAction(state, idxMeta, policyName, stepKey);
        break;
    case OperationMode.STOPPING:
    case OperationMode.STOPPED:
        runOrCheckIfSafeToIgnore(…);
        break; // without this break, execution would fall through to the default branch and throw
    default:
        throw new IllegalArgumentException("you need to handle the case for " + currentMode);
}

?

Contributor Author

"because someone has to keep in mind that ..."

I'm not sure I get what you mean. The two if statements are only a few lines apart, right? If they were far apart, I could see that it would be harder to understand/remember.

I do agree that it's easier to miss that we already do the check

        if (OperationMode.STOPPED.equals(currentMode)) {
            return;
        }

earlier in the method, which means that the second if statement you were referring to only executes if currentMode == STOPPING. Therefore, the switch you suggested looks a bit overkill to me, as there would only be two branches. A switch with two values is essentially just an if-else, which is basically what I have now.
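
To make the comparison concrete, here is a small sketch of the two shapes being discussed. It is illustrative only: `SketchMode`, `ModeDispatchSketch`, and the returned labels are hypothetical, and the real method does considerably more than return a string.

```java
// Hypothetical stand-in for the operation mode enum.
enum SketchMode { RUNNING, STOPPING, STOPPED }

class ModeDispatchSketch {
    // Shape of the current code: an early return for STOPPED, then an if/else
    // in which the fall-through branch can only be reached for STOPPING.
    static String dispatchWithIf(SketchMode mode) {
        if (mode == SketchMode.STOPPED) {
            return "skip";
        }
        if (mode == SketchMode.RUNNING) {
            return "run-async-action";
        }
        return "run-or-check-safe-to-stop"; // only STOPPING reaches this line
    }

    // Shape of the suggested alternative: every mode is named explicitly.
    static String dispatchWithSwitch(SketchMode mode) {
        switch (mode) {
            case RUNNING:
                return "run-async-action";
            case STOPPING:
                return "run-or-check-safe-to-stop";
            case STOPPED:
                return "skip";
            default:
                throw new IllegalArgumentException("you need to handle the case for " + mode);
        }
    }
}
```

Both methods return the same label for every mode; the difference is only whether the STOPPING branch is implied by the earlier checks or spelled out.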

I noticed another if statement at the start of this method that I could invert, and I added a comment between these two if statements, in the hope that that will make it easier to read, in 26b3dc4. Let me know what you think.

@nielsbauman nielsbauman requested a review from dakrone August 29, 2025 06:59
Member

@dakrone dakrone left a comment

LGTM, thanks for the further refactoring. The method is still longer than I'd like (not at all due to you or this PR, just in general), but that's something we can work on at a later time.

@nielsbauman nielsbauman enabled auto-merge (squash) August 30, 2025 07:03
@nielsbauman nielsbauman merged commit 2b73a6c into elastic:main Aug 30, 2025
33 checks passed
@nielsbauman nielsbauman deleted the fix-ilm-stop branch August 30, 2025 09:59
Labels

>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team v9.2.0

3 participants