[Failure store] Introduce dedicated failure store lifecycle configuration #127314

gmarouli · 2025-04-24T10:58:56Z

The failure store is a set of data stream indices used to store certain types of ingestion failures. Until now, they shared the configuration of the backing indices. We understand that the two data sets have different lifecycle needs.

We believe that failures will typically need to be retained for much less time than the data. Given this, we believe the lifecycle needs of the failures are also more limited, and that they are a better fit for the simplicity of the data stream lifecycle feature.

This allows the user to set only the desired retention, and we perform the rollover and other maintenance tasks without the user having to think about them. Furthermore, having only one lifecycle management feature allows us to ensure that this data is managed by default.

This PR introduces the following:

Configuration

We extend the failure store configuration to allow lifecycle configuration too. The data stream options API reflects only the user's configuration, as shown below:

PUT _data_stream/*/options
{
  "failure_store": {
     "lifecycle": {
       "data_retention": "5d"
     }
  }
}

GET _data_stream/*/options

{
  "data_streams": [
    {
      "name": "my-ds",
      "options": {
        "failure_store": {
          "lifecycle": {
            "data_retention": "5d"
          }
        }
      }
    }
  ]
}

To retrieve the effective configuration, use the GET data streams API; see #126668.
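
The difference between the user's configuration and the effective configuration can be illustrated with a minimal sketch. It assumes that the effective retention is the configured `data_retention` when one is set, otherwise the global default, and that a defined global max caps the result. The class `EffectiveRetentionSketch` and its `resolve` method are hypothetical names used for illustration only; they are not Elasticsearch classes, and the real resolution logic may differ.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Elasticsearch. */
final class EffectiveRetentionSketch {
    private EffectiveRetentionSketch() {}

    /**
     * Assumed rule: use the configured retention if present, otherwise the
     * global default; if a global max is defined, never exceed it.
     * An empty result means the data is retained indefinitely.
     */
    static Optional<Duration> resolve(
        Optional<Duration> configured,
        Optional<Duration> globalDefault,
        Optional<Duration> globalMax
    ) {
        Optional<Duration> candidate = configured.isPresent() ? configured : globalDefault;
        if (globalMax.isEmpty()) {
            return candidate;
        }
        // With a max defined, both "no retention" and "too long" are capped.
        if (candidate.isEmpty() || candidate.get().compareTo(globalMax.get()) > 0) {
            return globalMax;
        }
        return candidate;
    }
}
```

Under these assumptions, a stream configured with `5d` keeps `5d` as long as no smaller max is defined, while a stream without any configuration falls back to the global default.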

Functionality

The data stream lifecycle (DLM) manages the failure indices regardless of whether the failure store is enabled. This ensures that if the failure store gets disabled, we will not have stagnant data.
The data stream options APIs reflect only the user's configuration.
The GET data stream API should be used to check the current state of the effective failure store configuration.

Telemetry

We extend the data stream failure store telemetry to also include the lifecycle telemetry.

{
  "data_streams": {
    "available": true,
    "enabled": true,
    "data_streams": 10,
    "indices_count": 50,
    "failure_store": {
      "explicitly_enabled_count": 1,
      "effectively_enabled_count": 15,
      "failure_indices_count": 30,
      "lifecycle": {
        "explicitly_enabled_count": 5,
        "effectively_enabled_count": 20,
        "data_retention": {
          "configured_data_streams": 5,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "effective_retention": {
          "retained_data_streams": 20,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "global_retention": {
          "max": {
            "defined": false
          },
          "default": {
            "defined": true,  <------ this is the default value applicable for the failure store
            "millis": X
          }
        }
      }
    }
  }
}

Implementation details

We ensure that a partially reset failure store still composes into a valid failure store configuration.
We ensure that when a node communicates with a node on a previous version, it does not send an invalid failure store configuration with enabled: null.


elasticsearchmachine · 2025-04-24T10:59:20Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2025-04-24T11:00:01Z

Hi @gmarouli, I've created a changelog YAML for you.

jbaiera

I have left a couple of non-binding questions, but otherwise this LGTM!

jbaiera · 2025-04-29T05:58:45Z

server/src/main/java/org/elasticsearch/cluster/metadata/MetadataDataStreamsService.java

+            // We don't issue any warnings if all data streams are internal data streams
+            dataStreamOptions.failureStore()
+                .lifecycle()
+                .addWarningHeaderIfDataRetentionNotEffective(globalRetentionSettings.get(), onlyInternalDataStreams);


This doesn't have the default failure store retention setting yet, does it?

Not yet, it defaults to the global default right now. In the follow-up PR it will use only the failures default. Do you think we should handle it differently?

Nah, follow up is perfectly fine!

jbaiera · 2025-04-29T20:54:53Z

...treams/src/main/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleService.java

        for (DataStream dataStream : project.dataStreams().values()) {
            clearErrorStoreForUnmanagedIndices(project, dataStream);
-            if (dataStream.getDataLifecycle() == null) {
+            var dataLifecycleEnabled = dataStream.getDataLifecycle() != null && dataStream.getDataLifecycle().enabled();


Previously we weren't checking whether the lifecycle was enabled. I'm assuming there isn't any important logic following this that needs to execute if the data lifecycle configuration is present but disabled? Also, is present-but-disabled an invalid configuration state for the normal lifecycle?

Good questions.

A lifecycle can be present but disabled, either for maintenance or because a user wants it disabled for some reason.

If a lifecycle is not present or is disabled, the lifecycle service should not perform any operations for this data stream, so I think it makes sense to skip it if neither the data lifecycle nor the failures lifecycle is enabled.

I will double check the code to ensure that the data lifecycle being null is handled properly, because this is something that could not happen before.

We could of course keep just the null check here, since it doesn't cost much, but that creates doubt about whether null and disabled behave differently. That's why I would prefer to include the enabled check here.

Confirmed: all actions check DataStream::isIndexManagedByDataStreamLifecycle() before executing any operation, and this method handles a null lifecycle.
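
The skip condition discussed in this thread can be sketched in isolation. The snippet below is a simplified model, not the code from `DataStreamLifecycleService`: the `Lifecycle` and `Stream` records and the `streamsToProcess` method are hypothetical, and it assumes each stream carries its already-resolved data and failures lifecycles, where `null` means no lifecycle applies. It only shows that null and disabled are treated the same way, and that a stream is skipped when neither lifecycle is enabled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model types for illustration; not Elasticsearch classes. */
final class LifecycleSkipSketch {
    private LifecycleSkipSketch() {}

    /** A lifecycle may be absent (null) or present but disabled. */
    record Lifecycle(boolean enabled) {}

    record Stream(String name, Lifecycle dataLifecycle, Lifecycle failuresLifecycle) {}

    /** Null and disabled behave identically: nothing to manage. */
    static boolean isEnabled(Lifecycle lifecycle) {
        return lifecycle != null && lifecycle.enabled();
    }

    /** Returns the names of the streams the service would act on. */
    static List<String> streamsToProcess(List<Stream> streams) {
        List<String> result = new ArrayList<>();
        for (Stream stream : streams) {
            boolean dataLifecycleEnabled = isEnabled(stream.dataLifecycle());
            boolean failuresLifecycleEnabled = isEnabled(stream.failuresLifecycle());
            if (dataLifecycleEnabled == false && failuresLifecycleEnabled == false) {
                continue; // neither side is managed, skip this data stream
            }
            result.add(stream.name());
        }
        return result;
    }
}
```

Folding the null check and the enabled check into one helper is the design point raised above: a reader cannot mistake null and disabled for two different behaviours.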

jbaiera · 2025-04-30T03:33:14Z

...ugin/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/action/EsqlActionIT.java

-                                                ResettableValue.create(new DataStreamFailureStore.Template(ResettableValue.create(true)))
-                                            )
-                                        )
+                                        new DataStreamOptions.Template(DataStreamFailureStore.builder().enabled(true).buildTemplate())


I'm so happy 😌

jbaiera · 2025-04-30T03:34:51Z

server/src/main/java/org/elasticsearch/action/datastreams/PutDataStreamOptionsAction.java

 */

-package org.elasticsearch.datastreams.options.action;
+package org.elasticsearch.action.datastreams;


I was looking for where this is now used such that it needed to be moved, and I think I missed it. Where is this used now?

It's for serverless.

elasticsearchmachine · 2025-04-30T15:23:17Z

💔 Backport failed

Status	Branch	Result
❌	8.19	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 127314

gmarouli · 2025-04-30T16:24:24Z

💚 All backports created successfully

Status	Branch	Result
✅	8.19

Questions ?

Please refer to the Backport tool documentation

…tion (elastic#127314)

(cherry picked from commit 03d7781)

# Conflicts:
# modules/data-streams/src/main/java/org/elasticsearch/datastreams/DataStreamsPlugin.java
# modules/data-streams/src/main/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleService.java
# modules/data-streams/src/test/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleServiceTests.java
# server/src/main/java/org/elasticsearch/TransportVersions.java
# server/src/main/java/org/elasticsearch/cluster/metadata/MetadataDataStreamsService.java
# server/src/test/java/org/elasticsearch/cluster/metadata/DataStreamTests.java

In this PR we claim the 8.19 backport transport version for the introduction of the failure store lifecycle (#127314).

In elastic#127623 we backported elastic#127299 and added a backport transport version for it: `ESQL_AGGREGATE_METRIC_DOUBLE_BLOCK_8_19`, aka `8_841_0_24`. This brings that version forward to `main` and adds support for parsing streams with that version.

In elastic#127639 we backported elastic#126401 and added a backport transport version for it: `PINNED_RETRIEVER_8_19`, aka `8_841_0_23`. This brings that version forward to `main` and adds support for parsing streams with that version.

In elastic#127633 we claimed a backport transport version to backport elastic#127314: `INTRODUCE_FAILURES_LIFECYCLE_BACKPORT_8_19`, aka `8_841_0_23`. That's the same version as `PINNED_RETRIEVER_8_19`; the difference is that this one is in `main` and `PINNED_RETRIEVER_8_19` is in `8.19`. To allow me to bring `PINNED_RETRIEVER_8_19` forward, I've had to revert elastic#127633.

Closes elastic#127667
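
The collision described here, two differently named transport versions claiming the same id on different branches, can be detected mechanically. The following is a rough sketch of such a check, assuming the constants are available as a name-to-id map. The class `TransportVersionCollisionSketch` and its `findCollisions` method are hypothetical and are not part of Elasticsearch's `TransportVersions`; only the constant names and ids are taken from the message above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical checker for illustration; not part of Elasticsearch. */
final class TransportVersionCollisionSketch {
    private TransportVersionCollisionSketch() {}

    /** Returns every id that is claimed by more than one constant name. */
    static Map<Integer, List<String>> findCollisions(Map<String, Integer> idsByName) {
        Map<Integer, List<String>> namesById = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : idsByName.entrySet()) {
            namesById.computeIfAbsent(entry.getValue(), id -> new ArrayList<>()).add(entry.getKey());
        }
        // Keep only ids with at least two claimants, with names in stable order.
        namesById.values().removeIf(names -> names.size() < 2);
        namesById.values().forEach(Collections::sort);
        return namesById;
    }
}
```

Fed the three constants named above, the check would report `8_841_0_23` as claimed by both `INTRODUCE_FAILURES_LIFECYCLE_BACKPORT_8_19` and `PINNED_RETRIEVER_8_19`, and nothing for `8_841_0_24`.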

gmarouli added 10 commits April 24, 2025 09:19

Merge getBackingIndicesPastRetention & getFailureIndicesPastRetention

f171f9d

Add configuration for failure store lifecycle

f1bb80f

Use the failure store lifecycle config in DataStreamLifecycleService

c1366c2

Expose the failure store lifecycle in info APIs

3a6f151

Add telemetry for the failure store lifecycle

b9179a6

Ensure backwards compatibility when it comes to failure store config …

073df93

…with null enabled.

Ensure fully reset failure store composes to valid template

e22ce67

Failure store should not inherit the ILM policy from the data

99664a0

Warn the user when the data retention of the failure exceeds the max

217218a

Small test fixes

0e0bd34

gmarouli added >enhancement :Data Management/Data streams Data streams and their lifecycles labels Apr 24, 2025

elasticsearchmachine added Team:Data Management Meta label for data/management team v9.1.0 labels Apr 24, 2025

gmarouli added auto-backport Automatically create backport pull requests when merged v8.19.0 labels Apr 24, 2025

Update docs/changelog/127314.yaml

197e323

Merge branch 'main' into failures-lifecycle-config

09c83a8

elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Apr 24, 2025

Fix test

f9474c1

gmarouli requested a review from jbaiera April 24, 2025 12:42

gmarouli added 7 commits April 27, 2025 17:42

Merge branch 'main' into failures-lifecycle-config

c8de4bd

Merge branch 'main' into failures-lifecycle-config

7ba3698

Fix test

75b6666

Merge branch 'main' into failures-lifecycle-config

45b5e6d

fix test

7bab15e

Merge branch 'main' into failures-lifecycle-config

f5bd96b

Merge branch 'main' into failures-lifecycle-config

18a029f

Do not use literal string when adding the ::failures suffix

7db24cb

jbaiera approved these changes Apr 30, 2025

gmarouli added 2 commits April 30, 2025 09:58

Merge branch 'main' into failures-lifecycle-config

550af43

Skip bwc test that doesn't apply anymore

cb0c0d1

gmarouli merged commit 03d7781 into elastic:main Apr 30, 2025
17 checks passed

gmarouli deleted the failures-lifecycle-config branch April 30, 2025 15:22

elasticsearchmachine added the backport pending label Apr 30, 2025

gmarouli mentioned this pull request Apr 30, 2025

[8.19] [Failure store] Introduce dedicated failure store lifecycle configuration (#127314) #127577

Merged

gmarouli mentioned this pull request May 2, 2025

Failure store lifecycle - claim backport transport version #127633

Merged

elasticsearchmachine pushed a commit that referenced this pull request May 2, 2025

Failure store lifecycle - claim backport transport version (#127633)

218c252

In this PR we claim the 8.19 backport transport version for the introduction of the failure store lifecycle (#127314).

nik9000 mentioned this pull request May 2, 2025

ESQL: Fix transport versions #127668

Merged

gmarouli removed the backport pending label May 5, 2025
