@gmarouli gmarouli commented Apr 24, 2025

The failure store is a set of data stream indices used to store certain types of ingestion failures. Until now, these indices shared the configuration of the backing indices. However, the two data sets have different lifecycle needs.

We believe that failures typically need to be retained for much less time than the data. Given this, the lifecycle needs of the failures are also more limited, and they are a better fit for the simplicity of the data stream lifecycle feature.

This lets the user set only the desired retention, while we perform the rollover and other maintenance tasks without the user having to think about them. Furthermore, having only one lifecycle management feature allows us to ensure that this data is managed by default.

This PR introduces the following:

Configuration

We extend the failure store configuration to allow lifecycle configuration too. The data stream options APIs reflect only the user's configuration, as shown below:

PUT _data_stream/*/options
{
  "failure_store": {
     "lifecycle": {
       "data_retention": "5d"
     }
  }
}

GET _data_stream/*/options

{
  "data_streams": [
    {
      "name": "my-ds",
      "options": {
        "failure_store": {
          "lifecycle": {
            "data_retention": "5d"
          }
        }
      }
    }
  ]
}

To retrieve the effective configuration, use the GET data streams API; see #126668.
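
To illustrate the difference between the configured and the effective retention, here is a minimal sketch of how an effective value could be resolved. It is not the code from this PR: the class and method names are hypothetical, and the fallback order (configured value, then global default, then global max, with a defined max acting as a cap) is an assumption.

```java
import java.time.Duration;

/** Illustrative only: hypothetical names, not Elasticsearch classes. */
final class FailuresRetentionSketch {

    /**
     * @param configured    the user's data_retention, or null if not set
     * @param globalDefault the global default retention, or null if undefined
     * @param globalMax     the global max retention, or null if undefined
     * @return the retention that would apply, or null for "no retention limit"
     */
    static Duration effectiveRetention(Duration configured, Duration globalDefault, Duration globalMax) {
        Duration candidate = configured;
        if (candidate == null) {
            // Assumption: an unset retention falls back to the default, then to the max.
            candidate = globalDefault != null ? globalDefault : globalMax;
        }
        // Assumption: a defined max caps whatever was chosen above.
        if (candidate != null && globalMax != null && candidate.compareTo(globalMax) > 0) {
            return globalMax;
        }
        return candidate;
    }

    public static void main(String[] args) {
        Duration thirtyDays = Duration.ofDays(30);
        System.out.println(effectiveRetention(Duration.ofDays(5), thirtyDays, null));
        System.out.println(effectiveRetention(null, thirtyDays, null));
        System.out.println(effectiveRetention(Duration.ofDays(90), thirtyDays, Duration.ofDays(60)));
    }
}
```

Under these assumptions, a data stream that sets a retention of 5d keeps that value as long as it does not exceed a defined max, while a data stream without a configured retention picks up the global default.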

Functionality

  • The data stream lifecycle (DLM) will manage the failure indices regardless of whether the failure store is enabled. This ensures that if the failure store gets disabled, we will not be left with stagnant data.
  • The data stream options APIs reflect only the user's configuration.
  • The GET data stream API should be used to check the current state of the effective failure store configuration.
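
The first bullet can be pictured with a small sketch. This is a simplified model, not the lifecycle service itself: the class, method, and parameter names are hypothetical, and the index names in the example are only illustrative. The point is that the failure store's enabled flag is not consulted when deciding whether failure indices are managed.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical names, not Elasticsearch classes. */
final class ManagedIndicesSketch {

    /**
     * Returns the indices the lifecycle would manage for one data stream.
     * failureStoreEnabled is accepted only to show that it is ignored.
     */
    static List<String> indicesToManage(
        List<String> backingIndices,
        List<String> failureIndices,
        boolean dataLifecycleEnabled,
        boolean failuresLifecycleEnabled,
        boolean failureStoreEnabled
    ) {
        List<String> managed = new ArrayList<>();
        if (dataLifecycleEnabled) {
            managed.addAll(backingIndices);
        }
        // Deliberately no check of failureStoreEnabled: existing failure indices
        // are still managed after the failure store has been disabled.
        if (failuresLifecycleEnabled) {
            managed.addAll(failureIndices);
        }
        return managed;
    }

    public static void main(String[] args) {
        List<String> managed = indicesToManage(
            List.of(".ds-my-ds-000001"),
            List.of(".fs-my-ds-000001"),
            true,
            true,
            false // failure store disabled
        );
        System.out.println(managed);
    }
}
```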

Telemetry

We extend the data stream failure store telemetry to also include the lifecycle telemetry.

{
  "data_streams": {
    "available": true,
    "enabled": true,
    "data_streams": 10,
    "indices_count": 50,
    "failure_store": {
      "explicitly_enabled_count": 1,
      "effectively_enabled_count": 15,
      "failure_indices_count": 30,
      "lifecycle": {
        "explicitly_enabled_count": 5,
        "effectively_enabled_count": 20,
        "data_retention": {
          "configured_data_streams": 5,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "effective_retention": {
          "retained_data_streams": 20,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "global_retention": {
          "max": {
            "defined": false
          },
          "default": {
            "defined": true,  <------ this is the default value applicable for the failure store
            "millis": X
          }
        }
      }
    }
  }
}
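
As a rough illustration of how the count, `minimum_millis`, `maximum_millis` and `average_millis` fields above could be derived, the sketch below aggregates a list of retention values with the JDK's `LongSummaryStatistics`. The class and method names are hypothetical, and whether the actual telemetry code rounds the average or handles an empty set differently is not something this PR description states.

```java
import java.util.List;
import java.util.LongSummaryStatistics;

/** Illustrative only: hypothetical names, not Elasticsearch classes. */
final class RetentionStatsSketch {

    /** Aggregates one retention value (in milliseconds) per data stream. */
    static LongSummaryStatistics summarize(List<Long> retentionMillis) {
        // Note: for an empty list, getMin() is Long.MAX_VALUE and getMax() is Long.MIN_VALUE,
        // so a real implementation would need to special-case "no data streams".
        return retentionMillis.stream().mapToLong(Long::longValue).summaryStatistics();
    }

    public static void main(String[] args) {
        // Two data streams with 1 day and 5 days of retention.
        LongSummaryStatistics stats = summarize(List.of(86_400_000L, 432_000_000L));
        System.out.println("configured_data_streams=" + stats.getCount());
        System.out.println("minimum_millis=" + stats.getMin());
        System.out.println("maximum_millis=" + stats.getMax());
        System.out.println("average_millis=" + stats.getAverage());
    }
}
```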

Implementation details

  • We ensure that a partially reset failure store still produces a valid failure store configuration.
  • We ensure that when a node communicates with a node on a previous version, it does not send the invalid failure store configuration enabled: null.
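
The second bullet amounts to a version guard on the wire. The sketch below shows only the shape of such a check. Everything in it is hypothetical: the class name, the integer version scheme, and the threshold constant are assumptions, and the description above does not say what is sent instead when the check fails.

```java
/** Illustrative only: hypothetical names and version numbers, not Elasticsearch classes. */
final class FailureStoreWireSketch {

    /** Hypothetical first wire version that accepts a failure store config without "enabled". */
    static final int FIRST_VERSION_ACCEPTING_NULL_ENABLED = 2;

    /**
     * @param enabled     the failure store's enabled flag, or null if only a lifecycle is configured
     * @param peerVersion the wire version of the receiving node
     * @return true if the configuration can be sent to the peer as-is
     */
    static boolean isSafeToSend(Boolean enabled, int peerVersion) {
        // Older peers treat "enabled: null" as invalid, so it must never reach them.
        return enabled != null || peerVersion >= FIRST_VERSION_ACCEPTING_NULL_ENABLED;
    }

    public static void main(String[] args) {
        System.out.println(isSafeToSend(null, 1));         // lifecycle-only config, old peer
        System.out.println(isSafeToSend(null, 2));         // lifecycle-only config, new peer
        System.out.println(isSafeToSend(Boolean.TRUE, 1)); // explicit flag, old peer
    }
}
```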

@gmarouli gmarouli added >enhancement :Data Management/Data streams Data streams and their lifecycles labels Apr 24, 2025
@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team v9.1.0 labels Apr 24, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@gmarouli gmarouli added auto-backport Automatically create backport pull requests when merged v8.19.0 labels Apr 24, 2025
@elasticsearchmachine
Collaborator

Hi @gmarouli, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Apr 24, 2025
@gmarouli gmarouli requested a review from jbaiera April 24, 2025 12:42
Member

@jbaiera jbaiera left a comment

I have left a couple of non-binding questions, but otherwise this LGTM!

// We don't issue any warnings if all data streams are internal data streams
dataStreamOptions.failureStore()
.lifecycle()
.addWarningHeaderIfDataRetentionNotEffective(globalRetentionSettings.get(), onlyInternalDataStreams);
Member

This doesn't have the default failure store retention setting yet, does it?

Contributor Author

Not yet; it defaults to the global default right now. In the follow-up PR it will only use the failures default. Do you think we should handle it differently?

Member

Nah, follow up is perfectly fine!

for (DataStream dataStream : project.dataStreams().values()) {
clearErrorStoreForUnmanagedIndices(project, dataStream);
- if (dataStream.getDataLifecycle() == null) {
+ var dataLifecycleEnabled = dataStream.getDataLifecycle() != null && dataStream.getDataLifecycle().enabled();
Member

Previously we weren't checking whether the lifecycle was enabled. I'm assuming no important logic follows this that needs to execute when the data lifecycle configuration is present but disabled? Also, is present-but-disabled an invalid configuration state for the normal lifecycle?

Contributor Author

Good questions.

  • A lifecycle can be present but disabled, either for maintenance or because a user wants it disabled for some reason.
  • If a lifecycle is not present or if it's disabled, then the lifecycle service should not perform any operations for this data stream, so I think it makes sense to skip it if both the data and failures lifecycles are not enabled.
  • I will double-check the code to ensure the data lifecycle being null is handled properly, because this is something that could not happen before.
  • We could of course keep just the null check here, since it doesn't cost much, but that would suggest that null and disabled behave differently, which is why I would prefer to include the enabled check here.

Contributor Author

Confirmed: all actions check DataStream::isIndexManagedByDataStreamLifecycle() before executing any operation, and this method handles a null lifecycle.
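
The skip condition discussed in this thread can be sketched as follows. This is a simplified model with hypothetical names, not the service code: each lifecycle is represented only by its enabled flag, with null standing for "no lifecycle present", and the inputs are assumed to be the lifecycles that actually apply to the data stream.

```java
/** Illustrative only: hypothetical names, not Elasticsearch classes. */
final class LifecycleSkipSketch {

    /** @param enabledFlag the lifecycle's enabled flag, or null when no lifecycle is present */
    static boolean isEffectivelyEnabled(Boolean enabledFlag) {
        // Absent and disabled are treated identically, which is the point of the combined check.
        return enabledFlag != null && enabledFlag;
    }

    /** A data stream is skipped only when neither lifecycle is enabled. */
    static boolean shouldSkip(Boolean dataLifecycleEnabled, Boolean failuresLifecycleEnabled) {
        return !isEffectivelyEnabled(dataLifecycleEnabled) && !isEffectivelyEnabled(failuresLifecycleEnabled);
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip(null, null));          // nothing configured
        System.out.println(shouldSkip(Boolean.FALSE, null)); // present but disabled
        System.out.println(shouldSkip(null, Boolean.TRUE));  // only the failures lifecycle is enabled
    }
}
```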

- ResettableValue.create(new DataStreamFailureStore.Template(ResettableValue.create(true)))
- )
- )
+ new DataStreamOptions.Template(DataStreamFailureStore.builder().enabled(true).buildTemplate())
Member

I'm so happy 😌

*/

- package org.elasticsearch.datastreams.options.action;
+ package org.elasticsearch.action.datastreams;
Member

I was looking for where this is now used such that it needed to be moved, and I think I missed it. Where is this used now?

Contributor Author

It's for serverless.

@gmarouli gmarouli merged commit 03d7781 into elastic:main Apr 30, 2025
17 checks passed
@gmarouli gmarouli deleted the failures-lifecycle-config branch April 30, 2025 15:22
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 127314

@gmarouli
Contributor Author

💚 All backports created successfully

Status Branch Result
8.19

Questions ?

Please refer to the Backport tool documentation

gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Apr 30, 2025
…tion (elastic#127314)

(cherry picked from commit 03d7781)

# Conflicts:
#	modules/data-streams/src/main/java/org/elasticsearch/datastreams/DataStreamsPlugin.java
#	modules/data-streams/src/main/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleService.java
#	modules/data-streams/src/test/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleServiceTests.java
#	server/src/main/java/org/elasticsearch/TransportVersions.java
#	server/src/main/java/org/elasticsearch/cluster/metadata/MetadataDataStreamsService.java
#	server/src/test/java/org/elasticsearch/cluster/metadata/DataStreamTests.java
elasticsearchmachine pushed a commit that referenced this pull request May 2, 2025
In this PR we claim the backport to 8.19 version for the introduction of
the failure store lifecycle
(#127314).
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 2, 2025
In elastic#127623 we backported elastic#127299 and added a backport transport version
for it - `ESQL_AGGREGATE_METRIC_DOUBLE_BLOCK_8_19` aka `8_841_0_24`.
This brings that version forwards to `main` and adds support for parsing
streams with that version.

In elastic#127639 we backported elastic#126401 and added a backport transport version
for it - `PINNED_RETRIEVER_8_19` aka `8_841_0_23`. This brings that
version forwards to `main` and adds support for parsing streams with
that version.

In elastic#127633 we claimed a backport transport version to backport elastic#127314 -
`INTRODUCE_FAILURES_LIFECYCLE_BACKPORT_8_19` aka `8_841_0_23`. That's
the same version as `PINNED_RETRIEVER_8_19`. It's just that this one is
in `main` and `PINNED_RETRIEVER_8_19` is in `8.19`. To allow me to bring
`PINNED_RETRIEVER_8_19` forwards I've had to revert elastic#127633.

Closes elastic#127667
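
The collision described above, two named transport versions sharing one id, is the kind of problem a simple uniqueness check can surface. The following is only a sketch with a hypothetical class and method; it does not reproduce how `TransportVersions` is actually validated, and it reuses the constant names and ids quoted in this commit message purely as sample data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: hypothetical names, not Elasticsearch classes. */
final class TransportVersionCollisionSketch {

    /** Groups version names by id and keeps only the ids claimed by more than one name. */
    static Map<Integer, List<String>> findCollisions(Map<String, Integer> idsByName) {
        Map<Integer, List<String>> namesById = new TreeMap<>();
        idsByName.forEach((name, id) -> namesById.computeIfAbsent(id, k -> new ArrayList<>()).add(name));
        namesById.values().removeIf(names -> names.size() < 2);
        return namesById;
    }

    public static void main(String[] args) {
        // Ids as quoted in the commit message; Java allows underscores in int literals.
        Map<String, Integer> versions = new LinkedHashMap<>();
        versions.put("ESQL_AGGREGATE_METRIC_DOUBLE_BLOCK_8_19", 8_841_0_24);
        versions.put("PINNED_RETRIEVER_8_19", 8_841_0_23);
        versions.put("INTRODUCE_FAILURES_LIFECYCLE_BACKPORT_8_19", 8_841_0_23);
        System.out.println(findCollisions(versions));
    }
}
```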
elasticsearchmachine pushed a commit that referenced this pull request May 3, 2025
elasticsearchmachine pushed a commit that referenced this pull request May 3, 2025
…nfiguration (#127314) (#127577)

* [Failure store] Introduce dedicated failure store lifecycle configuration (#127314)

(cherry picked from commit 03d7781)

# Conflicts:
#	modules/data-streams/src/main/java/org/elasticsearch/datastreams/DataStreamsPlugin.java
#	modules/data-streams/src/main/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleService.java
#	modules/data-streams/src/test/java/org/elasticsearch/datastreams/lifecycle/DataStreamLifecycleServiceTests.java
#	server/src/main/java/org/elasticsearch/TransportVersions.java
#	server/src/main/java/org/elasticsearch/cluster/metadata/MetadataDataStreamsService.java
#	server/src/test/java/org/elasticsearch/cluster/metadata/DataStreamTests.java

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request May 12, 2025
…27633)

In this PR we claim the backport to 8.19 version for the introduction of
the failure store lifecycle
(elastic#127314).
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request May 12, 2025
Labels

auto-backport Automatically create backport pull requests when merged :Data Management/Data streams Data streams and their lifecycles >enhancement serverless-linked Added by automation, don't add manually Team:Data Management Meta label for data/management team v8.19.0 v9.1.0