
Conversation

@jbaiera (Member) commented Apr 16, 2025

Adds a node feature that is conditionally added to the cluster state if the failure store feature flag is enabled. Requires all nodes in the cluster to have the node feature present in order to redirect failed documents to the failure store from the ingest node or from shard level bulk failures.
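As a rough illustration of "conditionally added" (a hedged sketch with stand-in types, an assumed feature id, and an assumed flag name — not the PR's actual code):

```java
import java.util.Set;

// Illustrative stand-ins for the real NodeFeature / FeatureSpecification types.
record NodeFeature(String id) {}

interface FeatureSpecification {
    Set<NodeFeature> getFeatures();
}

class DataStreamFeatures implements FeatureSpecification {
    // Hypothetical feature id; the real constant lives on the DataStream class.
    static final NodeFeature FAILURE_STORE = new NodeFeature("data_stream.failure_store");

    @Override
    public Set<NodeFeature> getFeatures() {
        // Publish the feature into the cluster state only when the flag is on,
        // so gating logic can require every node in the cluster to advertise it.
        return isFailureStoreFeatureFlagEnabled() ? Set.of(FAILURE_STORE) : Set.of();
    }

    // Stand-in for the real feature-flag check (the property name is an assumption).
    static boolean isFailureStoreFeatureFlagEnabled() {
        return Boolean.getBoolean("es.failure_store_feature_flag_enabled");
    }
}
```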

Additionally, this renames some of the capabilities returned from the APIs (see the sketch after this list):

  • failure_store_in_template becomes data_stream_options.failure_store
  • index_expression_selectors added to the search API
  • lazy-rollover-failure-store replaced by index_expression_selectors
  • index-expression-selectors replaced by index_expression_selectors to match the search API
  • data_stream_failure_store_cluster_setting replaced by the new node feature
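These capability strings are what clients probe before relying on the new behavior. A toy sketch of how a handler might advertise them (the supportedCapabilities() hook shown here is an assumption, not quoted from the PR):

```java
import java.util.Set;

// Toy handler advertising capability strings to the capabilities API.
class SearchCapabilities {
    Set<String> supportedCapabilities() {
        // The renamed capability ids from the list above.
        return Set.of("data_stream_options.failure_store", "index_expression_selectors");
    }
}
```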

@jbaiera requested a review from gmarouli April 16, 2025 01:55
@elasticsearchmachine added the needs:triage and v9.1.0 labels Apr 16, 2025
@elasticsearchmachine added the Team:Data Management label and removed the needs:triage label Apr 16, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-data-management (Team:Data Management)

@jbaiera requested a review from a team as a code owner April 16, 2025 19:17
@jbaiera (Member Author) commented Apr 16, 2025

Moved the FeatureService creation point earlier in NodeConstruction so that IngestService can use the service. We'd like to use the FeatureService to ensure that downstream ingestion logic is present and consistent on all nodes before applying failure store logic in the IngestService.

cc @elastic/es-core-infra as code owners
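For readers outside core-infra, a toy sketch of the ordering change described above; the class shapes and constructors here are simplified assumptions, not the actual NodeConstruction code:

```java
import java.util.List;

// Simplified stand-ins; the real classes take many more dependencies.
class FeatureService {
    FeatureService(List<String> specs) {}
    boolean clusterHasFeature(String featureId) { return true; }
}

class IngestService {
    private final FeatureService featureService;

    // IngestService now receives the FeatureService, so construction order matters.
    IngestService(FeatureService featureService) {
        this.featureService = featureService;
    }
}

class NodeConstruction {
    void construct() {
        // Moved earlier: created before IngestService instead of after it.
        FeatureService featureService = new FeatureService(List.of());
        IngestService ingestService = new IngestService(featureService);
    }
}
```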

Code under review (excerpt):

    AtomicArray<BulkItemResponse> responses
) {
    // Determine if we have the feature enabled once for the entire bulk operation
    final boolean clusterSupportsFailureStore = featureService.clusterHasFeature(
Member commented:
I assume even for a very large cluster, this check is cheap enough relatively to a bulk request that it's fine to run on every bulk? I think it just loops through all the nodes in the cluster state.

jbaiera (Member Author) replied:

It is indeed a linear check over the list of nodes in the cluster. Unfortunately, there's not a great way to hoist this any higher than a request-by-request basis without introducing some kind of timer element or observer interface. We need to be responsive to changes in the cluster in a timely manner, and refactoring things to build off a cluster state listener seems like more complexity than it's worth at the moment. I'd be hard pressed to optimize this further without some indication that it's in a bad place.
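For context, the linear check discussed here amounts to a scan like the following (a simplified sketch with assumed types and names, not the actual FeatureService source):

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;

class FeatureCheck {
    /**
     * Sketch of the cluster-wide feature check: the cluster "has" a feature only
     * when every node in the cluster state advertises it, hence the O(nodes) scan.
     *
     * @param nodeIds      ids of all nodes currently in the cluster state
     * @param nodeFeatures node id -> feature ids that node has published
     */
    static boolean clusterHasFeature(
        Collection<String> nodeIds,
        Map<String, Set<String>> nodeFeatures,
        String featureId
    ) {
        for (String nodeId : nodeIds) {
            Set<String> published = nodeFeatures.get(nodeId);
            if (published == null || published.contains(featureId) == false) {
                return false; // one node lacking the feature disqualifies the cluster
            }
        }
        return true;
    }
}
```

Computing the result once per bulk request caps the cost at one scan of the node list per request, regardless of how many documents the bulk contains.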

Comment on lines +2522 to +2526
new FeatureService(List.of()) {
    @Override
    public boolean clusterHasFeature(ClusterState state, NodeFeature feature) {
        return DataStream.DATA_STREAM_FAILURE_STORE_FEATURE.equals(feature);
    }
Member commented:
It stands out as odd that this block of code is repeated over and over again. I don't have a great suggestion though -- maybe a static test helper FeatureService that just says yes to everything? That would avoid someone having to change this in 13 places if they add a new feature that they want used in all the tests.

jbaiera (Member Author) replied:
Yeah, FeatureService really feels like it wants a test implementation. I thought we had one, but it seems to be unrelated. These were all added just to keep the tests happy; only one or two of them have any differentiating logic.
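A minimal sketch of the helper suggested above, reusing the anonymous-subclass shape from the quoted snippet (the class and constant names are invented, and the FeatureService, ClusterState, and NodeFeature imports from the surrounding test file are assumed):

```java
import java.util.List;

// Test-only FeatureService that "says yes to everything", so adding a new
// feature doesn't require touching every test that stubs the service.
final class TestFeatureServices {
    private TestFeatureServices() {}

    static final FeatureService ALL_FEATURES = new FeatureService(List.of()) {
        @Override
        public boolean clusterHasFeature(ClusterState state, NodeFeature feature) {
            return true; // unconditionally report every feature as present
        }
    };
}
```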

@masseyke (Member) left a comment:

I left a few minor questions, but LGTM.

@rjernst (Member) left a comment:

NodeConstruction changes look fine.

@jbaiera added the auto-backport label Apr 17, 2025
@jbaiera (Member Author) commented Apr 17, 2025

@elasticmachine update branch

@jbaiera (Member Author) commented Apr 18, 2025

@elasticmachine update branch

@jbaiera (Member Author) commented Apr 18, 2025

@elasticmachine update branch

@jbaiera merged commit d928d1a into elastic:main Apr 18, 2025
17 checks passed
@jbaiera deleted the failure-store-feature-update branch April 18, 2025 17:42
@elasticsearchmachine (Collaborator) commented:

💔 Backport failed

Branch: 8.x
Result: Commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 126885

jbaiera added a commit to jbaiera/elasticsearch that referenced this pull request Apr 19, 2025
Add node feature for failure store, refactor capability names (elastic#126885)

Adds a node feature that is conditionally added to the cluster state if the failure store
feature flag is enabled. Requires all nodes in the cluster to have the node feature
present in order to redirect failed documents to the failure store from the ingest node
or from shard level bulk failures.
elasticsearchmachine pushed a commit that referenced this pull request Apr 20, 2025
Add node feature for failure store, refactor capability names (#126885) (#127091)

* Add node feature for failure store, refactor capability names (#126885)

Adds a node feature that is conditionally added to the cluster state if the failure store
feature flag is enabled. Requires all nodes in the cluster to have the node feature
present in order to redirect failed documents to the failure store from the ingest node
or from shard level bulk failures.

* Fix backporting issues
carlosdelest pushed a commit that referenced this pull request Apr 22, 2025
Adds a node feature that is conditionally added to the cluster state if the failure store
feature flag is enabled. Requires all nodes in the cluster to have the node feature
present in order to redirect failed documents to the failure store from the ingest node
or from shard level bulk failures.

(cherry picked from commit d928d1a)

Labels

  • auto-backport (Automatically create backport pull requests when merged)
  • :Data Management/Data streams (Data streams and their lifecycles)
  • >non-issue
  • Team:Data Management (Meta label for data/management team)
  • v8.19.0
  • v9.1.0


5 participants