Skip to content

Conversation

lukewhiting
Copy link
Contributor

@lukewhiting lukewhiting commented Sep 19, 2024

This PR does a few things:

  • Adds a new field to the Explain ILM lifecycle endpoint (<target>/_ilm/explain) called previous_step_info that contains the previous step runs step_info field contents
  • Logic in the Lifecycle step transition logic to copy the step info of a previous run into the new field
  • Docs for the new field

Bonus content includes:

  • Refactored and simplified test object mutation logic
  • Refactored test's to use recommended Clock interface for time dependent tests rather than custom supplier Reverted this as it's too risky / unrelated to PR

Commits are atomic so feel free to review commit by commit if you prefer reviewing smaller chunks

Copy link
Contributor

Documentation preview:

@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team v9.0.0 labels Sep 19, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@lukewhiting lukewhiting force-pushed the 101193-step-info-missing-exception-info branch from dca5328 to 940cee7 Compare September 19, 2024 14:30
@elasticsearchmachine
Copy link
Collaborator

Hi @lukewhiting, I've created a changelog YAML for you.

@lukewhiting lukewhiting force-pushed the 101193-step-info-missing-exception-info branch 2 times, most recently from 940cee7 to 14e3a7a Compare September 19, 2024 14:34
@elasticsearchmachine
Copy link
Collaborator

Hi @lukewhiting, I've updated the changelog YAML for you.

@lukewhiting lukewhiting linked an issue Sep 19, 2024 that may be closed by this pull request
@lukewhiting lukewhiting added v8.16.0 auto-backport Automatically create backport pull requests when merged labels Sep 20, 2024
@lukewhiting
Copy link
Contributor Author

lukewhiting commented Sep 24, 2024

Local functional testing complete.

Method:

  1. Index template created with policy that has a rollover step
  2. Index (ilm-error-test-000001) created to match template and dummy doc inserted
  3. Wait for ILM explain on index to confirm in check-rollover-ready state
  4. Create new index (ilm-error-test-000000) that also matches the template
  5. Manually create alias (ilm-error-test) for second index that matches rollover_alias of policy
  6. ILM explain for the first index enters Error state with step info index.lifecycle.rollover_alias [ilm-error-test] does not point to index [ilm-error-test-000001]
  7. Wait for auto retry
  8. ILM explain for ilm-error-test-000001 shows following output that matches expected with failed_step_retry_count incremented by 1 and previous_step_info populated with previous attempts step info.

Test Output:

{
  "indices": {
    "ilm-error-test-000001": {
      "index": "ilm-error-test-000001",
      "managed": true,
      "policy": "365-days-default",
      "index_creation_date_millis": 1727179631766,
      "time_since_index_creation": "19.92m",
      "lifecycle_date_millis": 1727179631766,
      "age": "19.92m",
      "phase": "hot",
      "phase_time_millis": 1727180790334,
      "action": "rollover",
      "action_time_millis": 1727179631820,
      "step": "check-rollover-ready",
      "step_time_millis": 1727180790334,
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 1,
      "previous_step_info": {
        "type": "illegal_argument_exception",
        "reason": "index.lifecycle.rollover_alias [ilm-error-test] does not point to index [ilm-error-test-000001]"
      },
      "phase_execution": {
        "policy": "365-days-default",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "max_primary_shard_docs": 200000000,
              "min_docs": 1,
              "max_primary_shard_size": "50gb"
            }
          }
        },
        "version": 1,
        "modified_date_in_millis": 1727179591769
      }
    }
  }
}

@nielsbauman
Copy link
Contributor

@lukewhiting That functional test looks great! Well done! I think it would also be worthwhile to ensure the previous_step_info isn't cleared after the second retry, but we can do that in the integration test that we'll be writing.

@gmarouli
Copy link
Contributor

@lukewhiting nice work coming up with the test scenario!

Local functional testing complete.

Method:

  1. Index template created with policy that has a rollover step
  2. Index (ilm-error-test-000001) created to match template and dummy doc inserted
  3. Wait for ILM explain on index to confirm in check-rollover-ready state
  4. Create new index (ilm-error-test-000000) that also matches the template
  5. Manually create alias (ilm-error-test) for second index that matches rollover_alias of policy
  6. ILM explain for the first index enters Error state with step info index.lifecycle.rollover_alias [ilm-error-test] does not point to index [ilm-error-test-000001]
  7. Wait for auto retry
  8. ILM explain for ilm-error-test-000001 shows following output that matches expected with failed_step_retry_count incremented by 1 and previous_step_info populated with previous attempts step info.

Test Output:

{
  "indices": {
    "ilm-error-test-000001": {
      "index": "ilm-error-test-000001",
      "managed": true,
      "policy": "365-days-default",
      "index_creation_date_millis": 1727179631766,
      "time_since_index_creation": "19.92m",
      "lifecycle_date_millis": 1727179631766,
      "age": "19.92m",
      "phase": "hot",
      "phase_time_millis": 1727180790334,
      "action": "rollover",
      "action_time_millis": 1727179631820,
      "step": "check-rollover-ready",
      "step_time_millis": 1727180790334,
      "is_auto_retryable_error": true,
      "failed_step_retry_count": 1,
      "previous_step_info": {
        "type": "illegal_argument_exception",
        "reason": "index.lifecycle.rollover_alias [ilm-error-test] does not point to index [ilm-error-test-000001]"
      },
      "phase_execution": {
        "policy": "365-days-default",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "max_primary_shard_docs": 200000000,
              "min_docs": 1,
              "max_primary_shard_size": "50gb"
            }
          }
        },
        "version": 1,
        "modified_date_in_millis": 1727179591769
      }
    }
  }
}

I think this could be simplified in an java rest test for example:

  1. Decrease ILM polling frequency to reduce the test duration
  2. Index template created without the rollover alias but with a regular alias and with policy that has a rollover step with max_doc: 1
  3. Index (ilm-error-test-000001) created to match template and dummy doc inserted
  4. Wait for ILM explain on index to confirm in check-rollover-ready state
  5. ILM explain will throw the error setting [index.lifecycle.rollover_alias] for index
  6. Wait for auto retry
  7. ILM explain for ilm-error-test-000001 shows following output that matches expected with failed_step_retry_count incremented by 1 and previous_step_info populated with previous attempts step info.

It's very similar with what you have just a few less steps. Let me know if you want to work on it together.

Nit: I think the test-update-serverless can be removed because ILM is not used in serverless. That should speed up ci as well.

@lukewhiting
Copy link
Contributor Author

@nielsbauman @gmarouli Integration test for new field added based on the above manual testing: 91dac7e

Copy link
Contributor

@gmarouli gmarouli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a minor comment but LGTM :)

@lukewhiting lukewhiting added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 30, 2024
@lukewhiting lukewhiting merged commit b1b249d into elastic:main Sep 30, 2024
15 of 16 checks passed
@lukewhiting lukewhiting deleted the 101193-step-info-missing-exception-info branch September 30, 2024 10:44
@lukewhiting lukewhiting restored the 101193-step-info-missing-exception-info branch September 30, 2024 10:45
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

@lukewhiting lukewhiting deleted the 101193-step-info-missing-exception-info branch September 30, 2024 10:45
lukewhiting added a commit to lukewhiting/elasticsearch that referenced this pull request Sep 30, 2024
…3187)

* Add new Previous Step Info field to LifecycleExecutionState

* Add new field to IndexLifecycleExplainResponse

* Add new field to TransportExplainLifecycleAction

* Add logic to IndexLifecycleTransition to keep previous setp info

* Switch tests to use Java standard Clock class

for any time based testing, this is the recommended method

* Fix tests for new field

Also refactor tests to newer style

* Add test to ensure step info is preserved

Across auto retries

* Add docs for new field

* Changelog Entry

* Update docs/changelog/113187.yaml

* Revert "Switch tests to use Java standard Clock class"

This reverts commit 241074c.

* PR Changes

* PR Changes - Improve docs wording

Co-authored-by: Mary Gouseti <[email protected]>

* Integration test for new ILM explain field

* Use ROOT locale instead of default toLowerCase

* PR Changes - Switch to block strings

* Remove forbidden API usage

---------

Co-authored-by: Mary Gouseti <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Sep 30, 2024
* Add new Previous Step Info field to LifecycleExecutionState

* Add new field to IndexLifecycleExplainResponse

* Add new field to TransportExplainLifecycleAction

* Add logic to IndexLifecycleTransition to keep previous setp info

* Switch tests to use Java standard Clock class

for any time based testing, this is the recommended method

* Fix tests for new field

Also refactor tests to newer style

* Add test to ensure step info is preserved

Across auto retries

* Add docs for new field

* Changelog Entry

* Update docs/changelog/113187.yaml

* Revert "Switch tests to use Java standard Clock class"

This reverts commit 241074c.

* PR Changes

* PR Changes - Improve docs wording



* Integration test for new ILM explain field

* Use ROOT locale instead of default toLowerCase

* PR Changes - Switch to block strings

* Remove forbidden API usage

---------

Co-authored-by: Mary Gouseti <[email protected]>
matthewabbott pushed a commit to matthewabbott/elasticsearch that referenced this pull request Oct 4, 2024
…3187)

* Add new Previous Step Info field to LifecycleExecutionState

* Add new field to IndexLifecycleExplainResponse

* Add new field to TransportExplainLifecycleAction

* Add logic to IndexLifecycleTransition to keep previous setp info

* Switch tests to use Java standard Clock class

for any time based testing, this is the recommended method

* Fix tests for new field

Also refactor tests to newer style

* Add test to ensure step info is preserved

Across auto retries

* Add docs for new field

* Changelog Entry

* Update docs/changelog/113187.yaml

* Revert "Switch tests to use Java standard Clock class"

This reverts commit 241074c.

* PR Changes

* PR Changes - Improve docs wording

Co-authored-by: Mary Gouseti <[email protected]>

* Integration test for new ILM explain field

* Use ROOT locale instead of default toLowerCase

* PR Changes - Switch to block strings

* Remove forbidden API usage

---------

Co-authored-by: Mary Gouseti <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement Team:Data Management Meta label for data/management team v8.16.0 v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ILM] Still show step_info for is_auto_retryable_error

4 participants