
[ILM] Still show step_info for is_auto_retryable_error #101193

@stefnestor

Description

👋 howdy, team!

Indices can get stuck looping in ILM Explain with is_auto_retryable_error: true and an ever-increasing failed_step_retry_count, without the UI/API (consistently) indicating that there is a problem requiring manual user intervention.

E.g. if you bootstrap ILM poorly and hot/rollover/check-rollover-ready errors indefinitely with the known issue "does not point to index" in the Elasticsearch logs, the UI/API oscillates but usually does not report any issue to resolve:

> GET filebeat-8.10.3-electrum_log-2023.10.13/_ilm/explain?only_errors=true
{ "indices": {} }

> GET filebeat-8.10.3-electrum_log-2023.10.13/_ilm/explain
{
  "indices": {
    "filebeat-8.10.3-electrum_log-2023.10.13": {
      "index": "filebeat-8.10.3-electrum_log-2023.10.13",
      "managed": true,
      "policy": "365-days-default",
      "index_creation_date_millis": 1697155202751,
      "time_since_index_creation": "8.66d",
      "lifecycle_date_millis": 1697155202751,
      "age": "8.66d",
      "phase": "hot",
      "phase_time_millis": 1697903938553,
      "action": "rollover",
      "action_time_millis": 1697155203899,
      "step": "check-rollover-ready",
      "step_time_millis": 1697903938553,
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 624,
      "phase_execution": {
        "policy": "365-days-default",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "1d",
              "max_primary_shard_size": "50gb"
            }
          }
        },
        "version": 4,
        "modified_date_in_millis": 1695888703972
      }
    }
  }
}

while in reality the index is fully blocked from progressing as shown in Elasticsearch logs ...

[instance-0000000013] policy [365-days-default] for index [filebeat-8.10.3-electrum_metrics-2023.10.17] failed on step [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]. Moving to ERROR step
java.lang.IllegalArgumentException: index.lifecycle.rollover_alias [alias-electrum_metrics] does not point to index [filebeat-8.10.3-electrum_metrics-2023.10.17] ...

... which will require user intervention to resolve.

A small portion of the time, I can get the example index to show hot/rollover/ERROR via API...

> GET filebeat-8.10.3-electrum_log-2023.10.13/_ilm/explain?only_errors=true
{
  "indices": {
    "filebeat-8.10.3-electrum_log-2023.10.13": {
      "index": "filebeat-8.10.3-electrum_log-2023.10.13",
      "managed": true,
      "policy": "365-days-default",
      "index_creation_date_millis": 1697155202751,
      "time_since_index_creation": "8.67d",
      "lifecycle_date_millis": 1697155202751,
      "age": "8.67d",
      "phase": "hot",
      "phase_time_millis": 1697903938553,
      "action": "rollover",
      "action_time_millis": 1697155203899,
      "step": "ERROR",
      "step_time_millis": 1697904538636,
      "failed_step": "check-rollover-ready",
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 624,
      "step_info": {
        "type": "illegal_argument_exception",
        "reason": "setting [index.lifecycle.rollover_alias] for index [filebeat-8.10.3-electrum_log-2023.10.13] is empty or not defined"
      },
      "phase_execution": {
        "policy": "365-days-default",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "1d",
              "max_primary_shard_size": "50gb"
            }
          }
        },
        "version": 4,
        "modified_date_in_millis": 1695888703972
      }
    }
  }
}

... which I suspect is related to A) the ILM polling interval and B) the index not reporting the error while ILM thinks it's executing. However, across a 30-minute check it returned error information for less than 5 minutes cumulatively, so the vast majority of the time I didn't realize manual intervention was required.
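In the meantime, one client-side workaround is to key off the retry counters instead of waiting for the ERROR step to be visible. A minimal sketch in Python, assuming the explain response has already been fetched and parsed into a dict; the function name `find_retry_looping` and the `min_retries` threshold are hypothetical and not part of any Elasticsearch client:

```python
def find_retry_looping(explain_response, min_retries=10):
    """Return {index_name: retry_count} for indices that keep auto-retrying.

    Operates on the parsed body of GET <index>/_ilm/explain. It does not
    rely on "step" being "ERROR" or on "step_info" being present, since
    both were only visible for a small part of the time in the report above.
    min_retries is an arbitrary, hypothetical threshold.
    """
    stuck = {}
    for name, info in explain_response.get("indices", {}).items():
        retries = info.get("failed_step_retry_count", 0)
        if info.get("is_auto_retryable_error") and retries >= min_retries:
            stuck[name] = retries
    return stuck


# Trimmed copy of the first explain response above (no step_info present).
sample = {
    "indices": {
        "filebeat-8.10.3-electrum_log-2023.10.13": {
            "step": "check-rollover-ready",
            "is_auto_retryable_error": True,
            "failed_step_retry_count": 624,
        }
    }
}
print(find_retry_looping(sample))
```

This only flags candidates for a closer look. It cannot say why the step fails, which is exactly the information step_info would carry.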

🙏🏼 Will you kindly consider

  1. Having certain IllegalArgumentException errors override IsAutoRetryable, so that intervention-required errors consistently surface?
  2. Always reporting step_info when is_auto_retryable_error is true, so that users/Support can tell that manual intervention is required? (Alternatively, consider a last_step_info field to signal that the previous attempt failed but the next one might not.)
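To illustrate what the last_step_info idea could look like from a caller's point of view, here is a rough client-side approximation in Python. It is only a sketch: the `LastStepInfoCache` class is hypothetical, `last_step_info` is the field name suggested above and does not exist in the API, and the input is assumed to be the parsed body of repeated `_ilm/explain` calls.

```python
class LastStepInfoCache:
    """Remember the most recent step_info seen per index across polls."""

    def __init__(self):
        self._last = {}

    def update(self, explain_response):
        """Return the per-index entries with a last_step_info key added.

        The cached value is dropped once an index stops reporting
        is_auto_retryable_error, so stale errors are not shown forever.
        """
        merged = {}
        for name, info in explain_response.get("indices", {}).items():
            if "step_info" in info:
                self._last[name] = info["step_info"]
            elif not info.get("is_auto_retryable_error"):
                self._last.pop(name, None)
            entry = dict(info)
            if name in self._last:
                entry["last_step_info"] = self._last[name]
            merged[name] = entry
        return merged


# Two consecutive polls: the first catches the ERROR step, the second does not.
cache = LastStepInfoCache()
error_poll = {"indices": {"my-index": {
    "step": "ERROR",
    "is_auto_retryable_error": True,
    "step_info": {"type": "illegal_argument_exception"},
}}}
retry_poll = {"indices": {"my-index": {
    "step": "check-rollover-ready",
    "is_auto_retryable_error": True,
}}}
cache.update(error_poll)
second = cache.update(retry_poll)
print(second["my-index"]["last_step_info"])
```

A caller still has to be polling at the moment the ERROR step is visible at least once, which is why reporting this server-side would be more reliable.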

Metadata

Labels

:Data Management/ILM+SLM: Index and Snapshot lifecycle management
>enhancement
Supportability: Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better.
Team:Data Management: Meta label for data/management team
