Skip to content

Conversation

joegallo
Copy link
Contributor

@joegallo joegallo commented Aug 27, 2025

There's a (likely very rare) race condition in the way we update the enrich alias that can result in outdated entries being stored in the cache. As a workaround, if you execute the policy a second time (without having changed the data in the source index), the cache will certainly contain the correct information.

We update the cache key based on the concrete index name associated with the enrich alias based on the cluster state that we see in a cluster state applier. However, we execute searches (and cache the found values) based on the concrete index name associated with the enrich alias based on the current cluster state (after listeners and appliers have run -- I'm a little squishy on exactly this detail). Suffice it to say, there's a race where if there's a slow enough cluster state applier, then it's possible that we run the search against the old enrich index, but we cache the value as if it was for the new enrich index. That is, it results in an outdated entry being stored in the cache (😱).

The fix is to stop searching through the alias. Instead, we use the same dereferencing logic to ensure that the index that we record the cache entry for is also the concrete index that we searched against.

@joegallo joegallo added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team auto-backport Automatically create backport pull requests when merged v9.2.0 v9.1.4 v9.0.7 v8.18.7 v8.19.4 labels Aug 27, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Collaborator

Hi @joegallo, I've created a changelog YAML for you.

@jbaiera jbaiera self-requested a review August 28, 2025 03:12
Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly a question on how the search runner updates after a policy execution

private SearchRunner createSearchRunner(final ProjectMetadata project, final String indexAlias) {
final Client originClient = new OriginSettingClient(client, ENRICH_ORIGIN);
return (value, maxMatches, reqSupplier, handler) -> {
final String concreteEnrichIndex = getEnrichIndexKey(project, indexAlias);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We capture this enrich index name here when we create the processor, but how will this get updated if we execute an enrich policy and a new index is created?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's all lunatic higher order functions (I say with the highest compliments) -- we capture the client at createSearchRunner invocation time, that is, when a processor is created. But the search runner itself doesn't run until it's invoked for some document in AbstractEnrichProcessor#execute and that's when the concreteEnrichIndex is finally set (and of course the handler is another functional argument so this is all clear as mud when one reads the code).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This search runner captures the project metadata to extract that concrete enrich index name, but it only does this at processor creation time. Is there a possibility that when we execute an enrich policy, thus updating the concrete index to use, but this processor isn't recreated when that happens? It looks like we only create a processor if it's configuration changes after a cluster state update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeahhhhhhhhhhhh... to be clear, that problem existed before this, though, right? I mean, it seems like it does work... but I must admit I'm not sure I see how it does work just yet.

Copy link
Contributor Author

@joegallo joegallo Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my! #133752 (edit: merged!)

@joegallo joegallo changed the title Fix enrich cache containing outdated value after policy execution Fix enrich caches outdated value after policy run Aug 28, 2025
@joegallo joegallo requested a review from jbaiera August 28, 2025 21:14
protected Settings nodeSettings() {
return Settings.builder()
// TODO Change this to run with security enabled
// https://github.com/elastic/elasticsearch/issues/75940
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've already added this new test to #75940.

Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@joegallo joegallo merged commit dc57b3e into elastic:main Aug 30, 2025
33 checks passed
@joegallo joegallo deleted the enrich-policy-cache-race-condition branch August 30, 2025 15:52
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
9.1
9.0 Commit could not be cherrypicked due to conflicts
8.18 Commit could not be cherrypicked due to conflicts
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 133680

@joegallo
Copy link
Contributor Author

Backports are up, so I'm dropping the backport pending label.

joegallo added a commit to joegallo/elasticsearch that referenced this pull request Aug 30, 2025
elasticsearchmachine pushed a commit that referenced this pull request Aug 30, 2025
…33879)

* Fix enrich caches outdated value after policy run (#133680)

* Trying to get a new build
sarog pushed a commit to portsbuild/elasticsearch that referenced this pull request Sep 11, 2025
sarog pushed a commit to portsbuild/elasticsearch that referenced this pull request Sep 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team v8.18.7 v8.19.4 v9.0.7 v9.1.4 v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants