-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Fix enrich caches outdated value after policy run #133680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix enrich caches outdated value after policy run #133680
Conversation
Pinging @elastic/es-data-management (Team:Data Management) |
Hi @joegallo, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly a question on how the search runner updates after a policy execution
private SearchRunner createSearchRunner(final ProjectMetadata project, final String indexAlias) { | ||
final Client originClient = new OriginSettingClient(client, ENRICH_ORIGIN); | ||
return (value, maxMatches, reqSupplier, handler) -> { | ||
final String concreteEnrichIndex = getEnrichIndexKey(project, indexAlias); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We capture this enrich index name here when we create the processor, but how will this get updated if we execute an enrich policy and a new index is created?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's all lunatic higher order functions (I say with the highest compliments) -- we capture the client
at createSearchRunner
invocation time, that is, when a processor is created. But the search runner itself doesn't run until it's invoked for some document in AbstractEnrichProcessor#execute
and that's when the concreteEnrichIndex
is finally set (and of course the handler
is another functional argument so this is all clear as mud when one reads the code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This search runner captures the project metadata to extract that concrete enrich index name, but it only does this at processor creation time. Is there a possibility that when we execute an enrich policy, thus updating the concrete index to use, but this processor isn't recreated when that happens? It looks like we only create a processor if it's configuration changes after a cluster state update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeahhhhhhhhhhhh... to be clear, that problem existed before this, though, right? I mean, it seems like it does work... but I must admit I'm not sure I see how it does work just yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh my! #133752 (edit: merged!)
protected Settings nodeSettings() { | ||
return Settings.builder() | ||
// TODO Change this to run with security enabled | ||
// https://github.com/elastic/elasticsearch/issues/75940 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've already added this new test to #75940.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
💔 Backport failed
You can use sqren/backport to manually backport by running |
Backports are up, so I'm dropping the backport pending label. |
There's a (likely very rare) race condition in the way we update the enrich alias that can result in outdated entries being stored in the cache. As a workaround, if you execute the policy a second time (without having changed the data in the source index), the cache will certainly contain the correct information.
We update the cache key based on the concrete index name associated with the enrich alias based on the cluster state that we see in a cluster state applier. However, we execute searches (and cache the found values) based on the concrete index name associated with the enrich alias based on the current cluster state (after listeners and appliers have run -- I'm a little squishy on exactly this detail). Suffice it to say, there's a race where if there's a slow enough cluster state applier, then it's possible that we run the search against the old enrich index, but we cache the value as if it was for the new enrich index. That is, it results in an outdated entry being stored in the cache (😱).
The fix is to stop searching through the alias. Instead, we use the same dereferencing logic to ensure that the index that we record the cache entry for is also the concrete index that we searched against.