Don't run async actions when ILM is stopped #133683

nielsbauman · 2025-08-27T19:48:30Z

Today ILM will run AsyncActionSteps even if ILM is stopped. These actions are started either as callbacks after previous actions complete or when the move-to-step API is used. By checking the ILM operation mode before running the action in IndexLifecycleRunner#maybeRunAsyncAction, we prevent these actions from being executed while ILM is stopped.

AsyncActionSteps are currently only automatically started as callbacks after previous actions complete or after a master failover. To ensure that these steps will be executed when ILM is restarted after a stop, we loop over all the managed indices and start all async action steps.

Fixes #81234
Fixes #85097
Fixes #99859

Today ILM will run `AsyncActionStep`s even if ILM is stopped. These actions are started either as callbacks after previous actions complete or when the move-to-step API is used. By checking the ILM operation mode before running the action in `IndexLifecycleRunner#maybeRunAsyncAction`, we prevent these actions from being executed while ILM is stopped. `AsyncActionStep`s are currently only automatically started as callbacks after previous actions complete or after a master failover. To ensure that these steps will be executed when ILM is restarted after a stop, we loop over all the managed indices and start all async action steps. Fixes elastic#81234 Fixes elastic#85097 Fixes elastic#99859

Copilot

Pull Request Overview

This PR prevents ILM (Index Lifecycle Management) from executing AsyncActionSteps when ILM is stopped. Previously, these async actions could still run even when ILM was in a stopped state, either through callbacks or the move-to-step API. The fix checks the ILM operation mode before executing async actions and ensures proper recovery when ILM is restarted.

Added operation mode check in maybeRunAsyncAction to prevent execution when ILM is not running
Refactored master node handling to kick off pending async actions when ILM transitions from stopped to running
Added comprehensive test coverage for the async action blocking behavior when using the move-to-step API

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
IndexLifecycleService.java	Refactored `onMaster` method to handle async action execution and detect ILM mode transitions
IndexLifecycleRunner.java	Added operation mode check to prevent async actions when ILM is stopped
TimeseriesMoveToStepIT.java	Added integration test to verify async actions don't execute when ILM is stopped

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-08-27T19:48:58Z

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleService.java

+                    String format = format(
+                        "async action execution failed during master election trigger for index [%s] with policy [%s] in step [%s]",
+                        idxMeta.getIndex().getName(),
+                        policyName,
+                        stepKey
+                    );
+                    if (logger.isTraceEnabled()) {
+                        format += format(", lifecycle state: [%s]", lifecycleState.asMap());
                    }
+                    logger.warn(format, e);


Using format as both a variable name and function call creates ambiguous code. The variable format is being concatenated with the result of calling format() function, which could be confusing and error-prone.

nielsbauman · 2025-08-27T19:49:19Z

.../qa/multi-node/src/javaRestTest/java/org/elasticsearch/xpack/ilm/TimeseriesMoveToStepIT.java

+        // Since ILM is stopped, the async action should not execute and the index should remain in the readonly step.
+        // This is the tricky part of the test, as we can't really verify that the async action will never happen.
+        assertEquals(new StepKey("hot", "readonly", "readonly"), getStepKeyForIndex(client(), originalIndex));


Any suggestions for improved assertions are welcome. Adding a Thread.sleep here would increase our confidence, but still wouldn't make any guarantee.

Could we write a test that wraps the Client passed to IndexLifecycleService with a wrapper that asserts that we never perform some particular ILM action?

I added a test to IndexLifecycleRunnerTests in 370d966 that verifies the action isn't executed when ILM is stopped. Let me know if that's what you had in mind. I chose for testing IndexLifecycleRunner#maybeRunAsyncAction instead of IndexLifecycleService#maybeRunAsyncAction, as the latter is only used by APIs and the former is also used internally by ILM.

nielsbauman · 2025-08-27T19:50:49Z

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleService.java

            }

            boolean safeToStop = true; // true until proven false by a run policy
-


The changes in this method aren't stricly necessary, I just took this opportunity to clean this method up a bit, as it was becoming hard to read. If there are concerns with this refactor, I can revert these stylistic changes and stick to the bug fix.

elasticsearchmachine · 2025-08-27T19:52:04Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2025-08-27T19:52:05Z

Hi @nielsbauman, I've created a changelog YAML for you.

dakrone

Thanks for taking a look at this Niels. I took a look at this and left some suggestions for testing and the refactoring, let me know what you think

docs/changelog/133683.yaml

dakrone · 2025-08-27T22:55:47Z

.../qa/multi-node/src/javaRestTest/java/org/elasticsearch/xpack/ilm/TimeseriesMoveToStepIT.java

+        // Since ILM is stopped, the async action should not execute and the index should remain in the readonly step.
+        // This is the tricky part of the test, as we can't really verify that the async action will never happen.
+        assertEquals(new StepKey("hot", "readonly", "readonly"), getStepKeyForIndex(client(), originalIndex));


Could we write a test that wraps the Client passed to IndexLifecycleService with a wrapper that asserts that we never perform some particular ILM action?

dakrone · 2025-08-27T23:04:32Z

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/IndexLifecycleService.java

+                    if (currentMode == OperationMode.RUNNING) {
+                        lifecycleRunner.maybeRunAsyncAction(state, idxMeta, policyName, stepKey);
+                        continue;
+                    }
+                    if (stepKey != null && IGNORE_STEPS_MAINTENANCE_REQUESTED.contains(stepKey.name())) {


I think this part is harder to read, because someone has to keep in mind that the second if statement executes only if currentMode != OperationMode.RUNNING.

What about factoring it into separate methods, so that it can look something like:

switch (currentMode) { case OperationMode.RUNNING: lifecycleRunner.maybeRunAsyncAction(state, idxMeta, policyName, stepKey); break; case OperationMode.STOPPING: case OperationMode.STOPPED: runOrCheckIfSafeToIgnore(…); default: throw new IllegalArgumentException("you need to handle the case for " + currentMode); }

?

because someone has to keep in mind that ...

I'm not sure I get what you mean. The two if statements are only a few lines apart, right? If they would be far apart, I could get that it's harder to understand/remember.

I do agree that it's easier to miss that we already do the check

if (OperationMode.STOPPED.equals(currentMode)) { return; }

earlier in the method, which means that the second if statement you were referring to only executes if currentMode == STOPPING. Therefore, the switch you suggested looks a bit overkill to me, as there would only be two branches. A switch with two values is essentially just an if-else, which is basically what I have now.

I noticed another if statement at the start of this method that I could invert, and I added a comment between these two if statements, in the hope that that will make it easier to read, in 26b3dc4. Let me know what you think.

dakrone

LGTM, thanks for the further refactoring. The method is still longer than I'd like (not at all due to you or this PR, just in general), but that's something we can work on at a later time.

nielsbauman requested a review from Copilot August 27, 2025 19:48

elasticsearchmachine added v9.2.0 needs:triage Requires assignment of a team area label labels Aug 27, 2025

Copilot AI reviewed Aug 27, 2025

View reviewed changes

nielsbauman commented Aug 27, 2025

View reviewed changes

Rename variable

930155f

nielsbauman added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management and removed needs:triage Requires assignment of a team area label labels Aug 27, 2025

nielsbauman requested a review from dakrone August 27, 2025 19:51

elasticsearchmachine added the Team:Data Management Meta label for data/management team label Aug 27, 2025

Update docs/changelog/133683.yaml

6aadbb3

dakrone reviewed Aug 27, 2025

View reviewed changes

nielsbauman and others added 5 commits August 28, 2025 09:36

Refactor method some more

26b3dc4

Add unit test for IndexLifecycleRunner

370d966

Merge branch 'main' into fix-ilm-stop

0ed022c

Rephrase changelog summary

ebe7f23

Merge branch 'main' into fix-ilm-stop

80f5e81

nielsbauman requested a review from dakrone August 29, 2025 06:59

dakrone approved these changes Aug 29, 2025

View reviewed changes

Merge branch 'main' into fix-ilm-stop

02e978b

nielsbauman enabled auto-merge (squash) August 30, 2025 07:03

nielsbauman merged commit 2b73a6c into elastic:main Aug 30, 2025
33 checks passed

nielsbauman deleted the fix-ilm-stop branch August 30, 2025 09:59

		}

		boolean safeToStop = true; // true until proven false by a run policy

Don't run async actions when ILM is stopped #133683

Don't run async actions when ILM is stopped #133683

Uh oh!

Conversation

nielsbauman commented Aug 27, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Aug 27, 2025

Choose a reason for hiding this comment

Uh oh!

nielsbauman Aug 27, 2025

Choose a reason for hiding this comment

Uh oh!

dakrone Aug 27, 2025

Choose a reason for hiding this comment

Uh oh!

nielsbauman Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

nielsbauman Aug 27, 2025

Choose a reason for hiding this comment

Uh oh!

elasticsearchmachine commented Aug 27, 2025

Uh oh!

elasticsearchmachine commented Aug 27, 2025

Uh oh!

dakrone left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dakrone Aug 27, 2025

Choose a reason for hiding this comment

Uh oh!

dakrone Aug 27, 2025

Choose a reason for hiding this comment

Uh oh!

nielsbauman Aug 28, 2025

Choose a reason for hiding this comment

Uh oh!

dakrone left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants