Description
Howdy, team!
Indices can end up looping in ILM Explain with `is_auto_retryable_error: true` and an ever-incrementing `failed_step_retry_count`, without the UI/API (consistently) reporting that there is a problem requiring manual user intervention.
E.g. if you bootstrap ILM poorly and `hot/rollover/check-rollover-ready` errors indefinitely with the known issue "does not point to index" in the Elasticsearch logs, the UI/API oscillates but usually does not report any issues to resolve:
```
> GET filebeat-8.10.3-electrum_log-2023.10.13/_ilm/explain?only_errors=true
{ "indices": {} }
```

```
> GET filebeat-8.10.3-electrum_log-2023.10.13/_ilm/explain
{
  "indices": {
    "filebeat-8.10.3-electrum_log-2023.10.13": {
      "index": "filebeat-8.10.3-electrum_log-2023.10.13",
      "managed": true,
      "policy": "365-days-default",
      "index_creation_date_millis": 1697155202751,
      "time_since_index_creation": "8.66d",
      "lifecycle_date_millis": 1697155202751,
      "age": "8.66d",
      "phase": "hot",
      "phase_time_millis": 1697903938553,
      "action": "rollover",
      "action_time_millis": 1697155203899,
      "step": "check-rollover-ready",
      "step_time_millis": 1697903938553,
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 624,
      "phase_execution": {
        "policy": "365-days-default",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "1d",
              "max_primary_shard_size": "50gb"
            }
          }
        },
        "version": 4,
        "modified_date_in_millis": 1695888703972
      }
    }
  }
}
```
while in reality the index is fully blocked from progressing as shown in Elasticsearch logs ...
```
[instance-0000000013] policy [365-days-default] for index [filebeat-8.10.3-electrum_metrics-2023.10.17] failed on step [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]. Moving to ERROR step
java.lang.IllegalArgumentException: index.lifecycle.rollover_alias [alias-electrum_metrics] does not point to index [filebeat-8.10.3-electrum_metrics-2023.10.17] ...
```
... which will require user intervention to resolve.
A small portion of the time, I can get the example index to show `hot/rollover/ERROR` via the API...
```
> GET filebeat-8.10.3-electrum_log-2023.10.13/_ilm/explain?only_errors=true
{
  "indices": {
    "filebeat-8.10.3-electrum_log-2023.10.13": {
      "index": "filebeat-8.10.3-electrum_log-2023.10.13",
      "managed": true,
      "policy": "365-days-default",
      "index_creation_date_millis": 1697155202751,
      "time_since_index_creation": "8.67d",
      "lifecycle_date_millis": 1697155202751,
      "age": "8.67d",
      "phase": "hot",
      "phase_time_millis": 1697903938553,
      "action": "rollover",
      "action_time_millis": 1697155203899,
      "step": "ERROR",
      "step_time_millis": 1697904538636,
      "failed_step": "check-rollover-ready",
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 624,
      "step_info": {
        "type": "illegal_argument_exception",
        "reason": "setting [index.lifecycle.rollover_alias] for index [filebeat-8.10.3-electrum_log-2023.10.13] is empty or not defined"
      },
      "phase_execution": {
        "policy": "365-days-default",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "1d",
              "max_primary_shard_size": "50gb"
            }
          }
        },
        "version": 4,
        "modified_date_in_millis": 1695888703972
      }
    }
  }
}
```
... which I suspect is related to A) the ILM polling interval and B) the index not reporting an error while ILM thinks it's executing. However, across a 30-minute check it returned error information for less than 5 minutes cumulatively, so the vast majority of the time I didn't realize manual intervention was required.
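As a stopgap, the retry counter in the full (non-`only_errors`) explain output seems to be the one field that stays visible while the index loops. Below is a minimal sketch of filtering on it client-side. The function name `find_retry_loops` and the `min_retries` threshold are my own assumptions, not anything ILM provides; it only reads the `is_auto_retryable_error` and `failed_step_retry_count` fields shown in the responses above.

```python
from typing import Any


def find_retry_loops(explain: dict[str, Any], min_retries: int = 10) -> dict[str, int]:
    """Return {index_name: retry_count} for indices that look stuck retrying.

    `explain` is the parsed JSON body of GET <index>/_ilm/explain
    (without only_errors=true). `min_retries` is an arbitrary,
    hypothetical threshold; tune it to your polling interval.
    """
    stuck: dict[str, int] = {}
    for name, info in explain.get("indices", {}).items():
        retries = info.get("failed_step_retry_count", 0)
        # An index can be mid-retry (no ERROR step, no step_info) and
        # still carry a large retry count, so check the counter itself.
        if info.get("is_auto_retryable_error") and retries >= min_retries:
            stuck[name] = retries
    return stuck


# Trimmed copy of the non-error response from this issue.
sample = {
    "indices": {
        "filebeat-8.10.3-electrum_log-2023.10.13": {
            "step": "check-rollover-ready",
            "is_auto_retryable_error": True,
            "failed_step_retry_count": 624,
        }
    }
}

print(find_retry_loops(sample))
```

This does not say why the step fails (that still requires catching `step_info` or reading the logs), only that something has been retrying for a long time.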
Will you kindly consider:
- Having certain `IllegalArgumentException` errors override `IsAutoRetryable` so that intervention-required errors consistently surface?
- Always reporting `step_info` for `is_auto_retryable_error` so that users/Support can tell there is manual intervention required. (Alternatively, consider `last_step_info` to signal that the previous attempt failed but the next might not.)
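To illustrate what I mean by `last_step_info`, here is a rough client-side approximation: remember the most recent `step_info` seen for each index across polls, so it is still available when the next poll lands mid-retry. `StepInfoTracker` and its methods are hypothetical names for this sketch; nothing like it exists in the API today.

```python
from typing import Any, Optional


class StepInfoTracker:
    """Remembers the latest step_info seen per index across explain polls."""

    def __init__(self) -> None:
        self._last: dict[str, dict[str, Any]] = {}

    def observe(self, explain: dict[str, Any]) -> None:
        # Record step_info whenever a poll happens to catch the ERROR step.
        for name, info in explain.get("indices", {}).items():
            if "step_info" in info:
                self._last[name] = info["step_info"]

    def last_step_info(self, name: str) -> Optional[dict[str, Any]]:
        # Returns None if no poll has ever seen step_info for this index.
        return self._last.get(name)


index = "filebeat-8.10.3-electrum_log-2023.10.13"
tracker = StepInfoTracker()

# Poll 1: caught the ERROR step (trimmed from the response in this issue).
tracker.observe({"indices": {index: {
    "step": "ERROR",
    "step_info": {"type": "illegal_argument_exception"},
}}})

# Poll 2: index is mid-retry, so no step_info is reported.
tracker.observe({"indices": {index: {"step": "check-rollover-ready"}}})

print(tracker.last_step_info(index))
```

Having the server keep and return this would avoid every user needing to poll often enough to catch the short ERROR window.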