-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Don't run async actions when ILM is stopped #133683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
3b77520
Don't run async actions when ILM is stopped
nielsbauman 930155f
Rename variable
nielsbauman 6aadbb3
Update docs/changelog/133683.yaml
nielsbauman 26b3dc4
Refactor method some more
nielsbauman 370d966
Add unit test for `IndexLifecycleRunner`
nielsbauman 0ed022c
Merge branch 'main' into fix-ilm-stop
nielsbauman ebe7f23
Rephrase changelog summary
nielsbauman 80f5e81
Merge branch 'main' into fix-ilm-stop
nielsbauman 02e978b
Merge branch 'main' into fix-ilm-stop
nielsbauman File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
pr: 133683 | ||
summary: Avoid running asynchronous ILM actions while ILM is stopped | ||
area: ILM+SLM | ||
type: bug | ||
issues: | ||
- 99859 | ||
- 81234 | ||
- 85097 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -184,88 +184,78 @@ void onMaster(ClusterState clusterState) { | |
maybeScheduleJob(); | ||
|
||
for (var projectId : clusterState.metadata().projects().keySet()) { | ||
onMaster(clusterState.projectState(projectId)); | ||
maybeRunAsyncActions(clusterState.projectState(projectId)); | ||
} | ||
} | ||
|
||
void onMaster(ProjectState state) { | ||
/** | ||
* Kicks off any async actions that may not have been run due to either master failover or ILM being manually stopped. | ||
*/ | ||
private void maybeRunAsyncActions(ProjectState state) { | ||
final ProjectMetadata projectMetadata = state.metadata(); | ||
final IndexLifecycleMetadata currentMetadata = projectMetadata.custom(IndexLifecycleMetadata.TYPE); | ||
if (currentMetadata != null) { | ||
OperationMode currentMode = currentILMMode(projectMetadata); | ||
if (OperationMode.STOPPED.equals(currentMode)) { | ||
return; | ||
} | ||
|
||
boolean safeToStop = true; // true until proven false by a run policy | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The changes in this method aren't stricly necessary, I just took this opportunity to clean this method up a bit, as it was becoming hard to read. If there are concerns with this refactor, I can revert these stylistic changes and stick to the bug fix. |
||
// If we just became master, we need to kick off any async actions that | ||
// may have not been run due to master rollover | ||
for (IndexMetadata idxMeta : projectMetadata.indices().values()) { | ||
if (projectMetadata.isIndexManagedByILM(idxMeta)) { | ||
String policyName = idxMeta.getLifecyclePolicyName(); | ||
final LifecycleExecutionState lifecycleState = idxMeta.getLifecycleExecutionState(); | ||
StepKey stepKey = Step.getCurrentStepKey(lifecycleState); | ||
|
||
try { | ||
if (OperationMode.STOPPING == currentMode) { | ||
if (stepKey != null && IGNORE_STEPS_MAINTENANCE_REQUESTED.contains(stepKey.name())) { | ||
logger.info( | ||
"waiting to stop ILM because index [{}] with policy [{}] is currently in step [{}]", | ||
idxMeta.getIndex().getName(), | ||
policyName, | ||
stepKey.name() | ||
); | ||
lifecycleRunner.maybeRunAsyncAction(state, idxMeta, policyName, stepKey); | ||
// ILM is trying to stop, but this index is in a Shrink step (or other dangerous step) so we can't stop | ||
safeToStop = false; | ||
} else { | ||
logger.info( | ||
"skipping policy execution of step [{}] for index [{}] with policy [{}]" + " because ILM is stopping", | ||
stepKey == null ? "n/a" : stepKey.name(), | ||
idxMeta.getIndex().getName(), | ||
policyName | ||
); | ||
} | ||
} else { | ||
lifecycleRunner.maybeRunAsyncAction(state, idxMeta, policyName, stepKey); | ||
} | ||
} catch (Exception e) { | ||
if (logger.isTraceEnabled()) { | ||
logger.warn( | ||
() -> format( | ||
"async action execution failed during master election trigger" | ||
+ " for index [%s] with policy [%s] in step [%s], lifecycle state: [%s]", | ||
idxMeta.getIndex().getName(), | ||
policyName, | ||
stepKey, | ||
lifecycleState.asMap() | ||
), | ||
e | ||
); | ||
} else { | ||
logger.warn( | ||
() -> format( | ||
"async action execution failed during master election trigger" | ||
+ " for index [%s] with policy [%s] in step [%s]", | ||
idxMeta.getIndex().getName(), | ||
policyName, | ||
stepKey | ||
), | ||
e | ||
); | ||
if (currentMetadata == null) { | ||
return; | ||
} | ||
OperationMode currentMode = currentILMMode(projectMetadata); | ||
if (OperationMode.STOPPED.equals(currentMode)) { | ||
return; | ||
} | ||
|
||
} | ||
// Don't rethrow the exception, we don't want a failure for one index to be | ||
// called to cause actions not to be triggered for further indices | ||
} | ||
} | ||
boolean safeToStop = true; // true until proven false by a run policy | ||
for (IndexMetadata idxMeta : projectMetadata.indices().values()) { | ||
if (projectMetadata.isIndexManagedByILM(idxMeta) == false) { | ||
continue; | ||
} | ||
String policyName = idxMeta.getLifecyclePolicyName(); | ||
final LifecycleExecutionState lifecycleState = idxMeta.getLifecycleExecutionState(); | ||
StepKey stepKey = Step.getCurrentStepKey(lifecycleState); | ||
|
||
try { | ||
if (currentMode == OperationMode.RUNNING) { | ||
lifecycleRunner.maybeRunAsyncAction(state, idxMeta, policyName, stepKey); | ||
continue; | ||
} | ||
// We only get here if ILM is in STOPPING mode. In that case, we need to check if there is any index that is in a step | ||
// that we can't stop ILM in. If there is, we don't stop ILM yet. | ||
if (stepKey != null && IGNORE_STEPS_MAINTENANCE_REQUESTED.contains(stepKey.name())) { | ||
logger.info( | ||
"waiting to stop ILM because index [{}] with policy [{}] is currently in step [{}]", | ||
idxMeta.getIndex().getName(), | ||
policyName, | ||
stepKey.name() | ||
); | ||
lifecycleRunner.maybeRunAsyncAction(state, idxMeta, policyName, stepKey); | ||
// ILM is trying to stop, but this index is in a Shrink step (or other dangerous step) so we can't stop | ||
safeToStop = false; | ||
} else { | ||
logger.info( | ||
"skipping policy execution of step [{}] for index [{}] with policy [{}]" + " because ILM is stopping", | ||
stepKey == null ? "n/a" : stepKey.name(), | ||
idxMeta.getIndex().getName(), | ||
policyName | ||
); | ||
} | ||
} catch (Exception e) { | ||
String logMessage = format( | ||
"async action execution failed during master election trigger for index [%s] with policy [%s] in step [%s]", | ||
idxMeta.getIndex().getName(), | ||
policyName, | ||
stepKey | ||
); | ||
if (logger.isTraceEnabled()) { | ||
logMessage += format(", lifecycle state: [%s]", lifecycleState.asMap()); | ||
} | ||
logger.warn(logMessage, e); | ||
|
||
if (safeToStop && OperationMode.STOPPING == currentMode) { | ||
stopILM(state.projectId()); | ||
// Don't rethrow the exception, we don't want a failure for one index to be | ||
// called to cause actions not to be triggered for further indices | ||
} | ||
} | ||
|
||
if (safeToStop && OperationMode.STOPPING == currentMode) { | ||
stopILM(state.projectId()); | ||
} | ||
} | ||
|
||
private void stopILM(ProjectId projectId) { | ||
|
@@ -333,6 +323,20 @@ public void clusterChanged(ClusterChangedEvent event) { | |
cancelJob(); | ||
policyRegistry.clear(); | ||
} | ||
} else if (this.isMaster) { | ||
// If we are the master and we were before, check if any projects changed their ILM mode from non-RUNNING to RUNNING. | ||
// If so, kick off any async actions that may not have run while not in RUNNING mode. | ||
for (ProjectMetadata project : event.state().metadata().projects().values()) { | ||
final var previousProject = event.previousState().metadata().projects().get(project.id()); | ||
if (previousProject == null || project == previousProject) { | ||
continue; | ||
} | ||
final OperationMode currentMode = currentILMMode(project); | ||
final OperationMode previousMode = currentILMMode(previousProject); | ||
if (currentMode == OperationMode.RUNNING && previousMode != OperationMode.RUNNING) { | ||
maybeRunAsyncActions(event.state().projectState(project.id())); | ||
} | ||
} | ||
} | ||
|
||
// if we're the master, then process deleted indices and trigger policies | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestions for improved assertions are welcome. Adding a
Thread.sleep
here would increase our confidence, but still wouldn't make any guarantee.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we write a test that wraps the
Client
passed toIndexLifecycleService
with a wrapper that asserts that we never perform some particular ILM action?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test to
IndexLifecycleRunnerTests
in 370d966 that verifies the action isn't executed when ILM is stopped. Let me know if that's what you had in mind. I chose for testingIndexLifecycleRunner#maybeRunAsyncAction
instead ofIndexLifecycleService#maybeRunAsyncAction
, as the latter is only used by APIs and the former is also used internally by ILM.