
Conversation

@jbaiera (Member) commented Apr 16, 2025

Adds a node feature that is conditionally added to the cluster state if the failure store feature flag is enabled. Requires all nodes in the cluster to have the node feature present in order to redirect failed documents to the failure store from the ingest node or from shard level bulk failures.
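As a rough illustration of "conditionally added" (a hedged sketch with stand-in types, an assumed feature id, and an assumed flag name — not the PR's actual code):

```java
import java.util.Set;

// Illustrative stand-ins for the real NodeFeature / FeatureSpecification types.
record NodeFeature(String id) {}

interface FeatureSpecification {
    Set<NodeFeature> getFeatures();
}

class DataStreamFeatures implements FeatureSpecification {
    // Hypothetical feature id; the real constant lives on the DataStream class.
    static final NodeFeature FAILURE_STORE = new NodeFeature("data_stream.failure_store");

    @Override
    public Set<NodeFeature> getFeatures() {
        // Publish the feature into the cluster state only when the flag is on,
        // so gating logic can require every node in the cluster to advertise it.
        return isFailureStoreFeatureFlagEnabled() ? Set.of(FAILURE_STORE) : Set.of();
    }

    // Stand-in for the real feature-flag check (the property name is an assumption).
    static boolean isFailureStoreFeatureFlagEnabled() {
        return Boolean.getBoolean("es.failure_store_feature_flag_enabled");
    }
}
```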

Additionally, this renames some of the capabilities returned from the APIs (see the sketch after this list):

  • failure_store_in_template becomes data_stream_options.failure_store
  • index_expression_selectors added to the search API
  • lazy-rollover-failure-store replaced by index_expression_selectors
  • index-expression-selectors replaced by index_expression_selectors to match the search API
  • data_stream_failure_store_cluster_setting replaced by the new node feature
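These capability strings are what clients probe before relying on the new behavior. A toy sketch of how a handler might advertise them (the supportedCapabilities() hook shown here is an assumption, not quoted from the PR):

```java
import java.util.Set;

// Toy handler advertising capability strings to the capabilities API.
class SearchCapabilities {
    Set<String> supportedCapabilities() {
        // The renamed capability ids from the list above.
        return Set.of("data_stream_options.failure_store", "index_expression_selectors");
    }
}
```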

@jbaiera requested a review from gmarouli April 16, 2025 01:55
@elasticsearchmachine added the needs:triage and v9.1.0 labels Apr 16, 2025
@elasticsearchmachine added the Team:Data Management label and removed the needs:triage label Apr 16, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-data-management (Team:Data Management)

@jbaiera requested a review from a team as a code owner April 16, 2025 19:17
@jbaiera (Member Author) commented Apr 16, 2025

Moved the FeatureService creation point earlier in NodeConstruction so that IngestService can use the service. We'd like to use the FeatureService to ensure that downstream ingestion logic is present and consistent on all nodes before applying failure store logic in the IngestService.

cc @elastic/es-core-infra as code owners
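For readers outside core-infra, a toy sketch of the ordering change described above; the class shapes and constructors here are simplified assumptions, not the actual NodeConstruction code:

```java
import java.util.List;

// Simplified stand-ins; the real classes take many more dependencies.
class FeatureService {
    FeatureService(List<String> specs) {}
    boolean clusterHasFeature(String featureId) { return true; }
}

class IngestService {
    private final FeatureService featureService;

    // IngestService now receives the FeatureService, so construction order matters.
    IngestService(FeatureService featureService) {
        this.featureService = featureService;
    }
}

class NodeConstruction {
    void construct() {
        // Moved earlier: created before IngestService instead of after it.
        FeatureService featureService = new FeatureService(List.of());
        IngestService ingestService = new IngestService(featureService);
    }
}
```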

Code under review (excerpt):

    AtomicArray<BulkItemResponse> responses
) {
    // Determine if we have the feature enabled once for the entire bulk operation
    final boolean clusterSupportsFailureStore = featureService.clusterHasFeature(
Member commented:
I assume even for a very large cluster, this check is cheap enough relatively to a bulk request that it's fine to run on every bulk? I think it just loops through all the nodes in the cluster state.

jbaiera (Member Author) replied:

It is indeed a linear check over the list of nodes in the cluster. Unfortunately, there's not a great way to hoist this any higher than a request-by-request basis without introducing some kind of timer element or observer interface. We need to be responsive to changes in the cluster in a timely manner, and refactoring things to build off a cluster state listener seems like more complexity than it's worth at the moment. I'd be hard pressed to optimize this further without some indication that it's in a bad place.
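For context, the linear check discussed here amounts to a scan like the following (a simplified sketch with assumed types and names, not the actual FeatureService source):

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;

class FeatureCheck {
    /**
     * Sketch of the cluster-wide feature check: the cluster "has" a feature only
     * when every node in the cluster state advertises it, hence the O(nodes) scan.
     *
     * @param nodeIds      ids of all nodes currently in the cluster state
     * @param nodeFeatures node id -> feature ids that node has published
     */
    static boolean clusterHasFeature(
        Collection<String> nodeIds,
        Map<String, Set<String>> nodeFeatures,
        String featureId
    ) {
        for (String nodeId : nodeIds) {
            Set<String> published = nodeFeatures.get(nodeId);
            if (published == null || published.contains(featureId) == false) {
                return false; // one node lacking the feature disqualifies the cluster
            }
        }
        return true;
    }
}
```

Computing the result once per bulk request caps the cost at one scan of the node list per request, regardless of how many documents the bulk contains.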

Comment on lines +2522 to +2526
new FeatureService(List.of()) {
    @Override
    public boolean clusterHasFeature(ClusterState state, NodeFeature feature) {
        return DataStream.DATA_STREAM_FAILURE_STORE_FEATURE.equals(feature);
    }
Member commented:
It stands out as odd that this block of code is repeated over and over again. I don't have a great suggestion though -- maybe a static test helper FeatureService that just says yes to everything? That would avoid someone having to change this in 13 places if they add a new feature that they want used in all the tests.

jbaiera (Member Author) replied:
Yeah, FeatureService really feels like it wants a test implementation. I thought we had one, but it seems to be unrelated. These were all added just to keep the tests happy; only one or two of them have any differentiating logic.
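A minimal sketch of the helper suggested above, reusing the anonymous-subclass shape from the quoted snippet (the class and constant names are invented, and the FeatureService, ClusterState, and NodeFeature imports from the surrounding test file are assumed):

```java
import java.util.List;

// Test-only FeatureService that "says yes to everything", so adding a new
// feature doesn't require touching every test that stubs the service.
final class TestFeatureServices {
    private TestFeatureServices() {}

    static final FeatureService ALL_FEATURES = new FeatureService(List.of()) {
        @Override
        public boolean clusterHasFeature(ClusterState state, NodeFeature feature) {
            return true; // unconditionally report every feature as present
        }
    };
}
```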

@masseyke (Member) left a comment:

I left a few minor questions, but LGTM.

@rjernst (Member) left a comment:

NodeConstruction changes look fine.

@jbaiera added the auto-backport label Apr 17, 2025
@jbaiera (Member Author) commented Apr 17, 2025

@elasticmachine update branch

@jbaiera (Member Author) commented Apr 18, 2025

@elasticmachine update branch

@jbaiera (Member Author) commented Apr 18, 2025

@elasticmachine update branch

@jbaiera merged commit d928d1a into elastic:main Apr 18, 2025
17 checks passed
@jbaiera deleted the failure-store-feature-update branch April 18, 2025 17:42
@elasticsearchmachine (Collaborator) commented:

💔 Backport failed

Branch: 8.x
Result: Commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 126885

jbaiera added a commit to jbaiera/elasticsearch that referenced this pull request Apr 19, 2025
Add node feature for failure store, refactor capability names (elastic#126885)

Adds a node feature that is conditionally added to the cluster state if the failure store
feature flag is enabled. Requires all nodes in the cluster to have the node feature
present in order to redirect failed documents to the failure store from the ingest node
or from shard level bulk failures.
elasticsearchmachine pushed a commit that referenced this pull request Apr 20, 2025
Add node feature for failure store, refactor capability names (#126885) (#127091)

* Add node feature for failure store, refactor capability names (#126885)

Adds a node feature that is conditionally added to the cluster state if the failure store
feature flag is enabled. Requires all nodes in the cluster to have the node feature
present in order to redirect failed documents to the failure store from the ingest node
or from shard level bulk failures.

* Fix backporting issues
carlosdelest pushed a commit that referenced this pull request Apr 22, 2025
Adds a node feature that is conditionally added to the cluster state if the failure store
feature flag is enabled. Requires all nodes in the cluster to have the node feature
present in order to redirect failed documents to the failure store from the ingest node
or from shard level bulk failures.

(cherry picked from commit d928d1a)

Labels

  • auto-backport (Automatically create backport pull requests when merged)
  • :Data Management/Data streams (Data streams and their lifecycles)
  • >non-issue
  • Team:Data Management (Meta label for data/management team)
  • v8.19.0
  • v9.1.0


5 participants