Improve PIT context relocation #135231
Conversation
Hi @cbuescher, I've created a changelog YAML for you.
Pinging @elastic/es-search-foundations (Team:Search Foundations)
This change improves point-in-time (PIT) context relocations in cases where we lose a PIT due to e.g. node shutdowns or restarts. We are already able to retry shards whose open PIT contexts have been lost by re-creating identical reader contexts on either the same or a different node for read-only and frozen indices (via FrozenEngine / ReadOnlyEngine). However, we currently re-create these contexts with every subsequent search because we don't store the recreated context in the same way as when we open a new PIT. This PR avoids this extra work.
Hi @dnhatn, can I ask you for a review on this since you are probably still the one most familiar with the PIT id rewrite hook and the general current retry logic for frozen and read-only shards?
@cbuescher Sure, I'll take a look today or tomorrow.
I think we already handle this case with sessionId, which should be different each time a node is restarted or between different nodes.
Thanks, @cbuescher. I've left some questions and comments, but I think re-encoding the PIT when the target nodes have changed is a good idea.
.setPointInTime(new PointInTimeBuilder(pitId).setKeepAlive(TimeValue.timeValueMinutes(2))),
resp -> {
    assertHitCount(resp, docCount);
    updatedPit.set(resp.pointInTimeId());
Can we verify that we have updated the PIT?
Good point. However, it's not impossible that the PIT id stays the same. The test restarts all nodes; if all shards in the original PIT get allocated to their original nodes again, we will reopen the context in the same location, leaving everything in that particular SearchContextIdForNode unchanged. When running the test I definitely saw both scenarios.
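For illustration, a minimal sketch of how such a check could be made conditional on whether any shard entry actually moved; the types and helper here are hypothetical simplifications, not what the test actually uses:

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical simplified stand-in for a decoded per-shard PIT entry.
record ShardEntry(String nodeId, long contextId) {}

class PitChangeCheck {
    /**
     * Expect the encoded PIT id to change only when at least one shard entry
     * points at a different node or context after the restart.
     */
    static boolean expectUpdatedPitId(Map<String, ShardEntry> before, Map<String, ShardEntry> after) {
        return before.entrySet().stream()
            .anyMatch(e -> !Objects.equals(e.getValue(), after.get(e.getKey())));
    }
}
```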
}
}

static <Result extends SearchPhaseResult> BytesReference maybeReEncodeNodeIds(
Question: I haven't looked into how we handle partial results with re-encoding. Have you considered this?
I assume you are referring to the functionality added with #111516? If I understand that PR correctly, we will have SearchContextIdForNode entries in the PIT with a "null" node entry. I think in that case we won't add that shard to the shard iterators of any subsequent search, so we won't get a Result for that shard. That is one reason why I added copying PIT id entries for everything that has no Result from the old to the new encoded ID without change.
Is that what you mean?
I looked again at how we handle partial results. The way I see it, when opening the PIT we can tolerate partial results, i.e. failures from certain shards when they are not available. The result will be an entry in the PIT id that has a ShardId but a SearchContextIdForNode with an empty node/contextId.
The way I see it, we should not change any of these entries, which is what is already happening in this method.
In addition, the re-encoding step here doesn't change any entries for shards that failed in the last search: they shouldn't be included in the "results" list and therefore don't update the related part of the updated PIT id. In cases where these failures are temporary, subsequent searches with the updated ID will try to hit the "old" shard context locations, and if any of these can be retried we will update that part of the PIT in a later call.
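A small, self-contained sketch of this copy-through behaviour, using hypothetical simplified types rather than the actual SearchContextId / SearchContextIdForNode classes:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins: a per-shard target (nodeId may be null for entries
// created by a partial PIT open) and the result a shard reported for this search.
record ContextTarget(String nodeId, Long contextId) {}
record ShardResult(String shardId, ContextTarget servedBy) {}

class ReEncodeSketch {
    /**
     * Entries for shards that produced a result are updated to the node/context
     * that actually served them; everything else (failed shards, null-node
     * entries from a partial open) is copied over unchanged.
     */
    static Map<String, ContextTarget> reEncode(Map<String, ContextTarget> original, Iterable<ShardResult> results) {
        Map<String, ContextTarget> updated = new HashMap<>(original); // copy everything through by default
        for (ShardResult result : results) {
            updated.put(result.shardId(), result.servedBy());         // only touch shards we heard back from
        }
        return updated;
    }
}
```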
        updatedShardMap.put(shardId, original.shards().get(shardId));
    }
}
return SearchContextId.encode(updatedShardMap, original.aliasFilter(), mintransportVersion, failures);
Another question is how to clean up search contexts from old PITs. For example, when a shard-level request fails, we try to execute it on another copy. In these cases, we re-encode the PIT. If users close the new PIT, the old search context won't be closed. This is not an issue with stateful, but it can be a problem with serverless. Should we close the old search contexts here?
You mean shard-level requests that are not related to the search context missing? I didn't consider this yet, good point I think. I was assuming that if we get a result from another node than the original one, that would always mean the old context is gone, but you are right, we might retry elsewhere for other reasons.
Would you suggest something like the fire-and-forget approach we use e.g. here in TransportSearchAction from this location? I assume a close request is relatively cheap even if the old context doesn't exist any longer, and we can treat this as a best-effort attempt. If this fails at some point, the context reaper process should clean up anything that's over the keepalive limit, no?
I just saw that TransportSearchAction goes to all shards; here it would probably be something more selective that targets only the shards whose location changed and uses ClearScrollController#closeContexts directly.
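Roughly, the selective clean-up could look like the sketch below (hypothetical simplified types; in the real code the free-context requests would go through the search transport layer, e.g. the ClearScrollController#closeContexts path mentioned above):

```java
import java.util.Map;
import java.util.Objects;
import java.util.function.Consumer;

// Hypothetical stand-in for a per-shard context location (node + context id).
record ContextLocation(String nodeId, long contextId) {}

class StaleContextCleanupSketch {
    /**
     * Fire-and-forget close requests, but only for shards whose context location
     * changed between the old and the re-encoded PIT id. Failures are ignored;
     * the context reaper eventually drops anything past its keep-alive.
     */
    static void closeRelocatedContexts(
        Map<String, ContextLocation> oldPit,
        Map<String, ContextLocation> newPit,
        Consumer<ContextLocation> sendFreeContextRequest   // best effort, no response handling
    ) {
        for (Map.Entry<String, ContextLocation> entry : oldPit.entrySet()) {
            ContextLocation oldLocation = entry.getValue();
            ContextLocation newLocation = newPit.get(entry.getKey());
            if (oldLocation.nodeId() != null && !Objects.equals(oldLocation, newLocation)) {
                sendFreeContextRequest.accept(oldLocation);  // only shards that actually moved
            }
        }
    }
}
```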
@dnhatn thanks a lot for the review. I addressed some of your comments and left questions otherwise. I added closing logic in the re-encoding step and adapted the tests accordingly; however, I'm not sure that is exactly what you suggested. Let me know what you think. About your comment above:
This was partially motivated by other planned changes that will enable re-locating PITs between nodes even without retries triggered by subsequent search requests. I will need to think this over a bit more and might get back to you with more questions to deepen my understanding of how sessionIds are currently used.
@cbuescher Please let me know when it’s ready for another review. Thank you for working on it.
@dnhatn thanks for the feedback, I added some changes around ignoring failures when re-encoding the PIT and left a small comment regarding partial failures in #135231 (comment).
Only store retried contexts and rewrite PIT when the feature is enabled.
This change improves point-in-time (PIT) context relocations in cases where we
lose a PIT due to e.g. node shutdowns or restarts. We are already able to retry
shards whose open PIT contexts have been lost by re-creating identical reader
contexts on either the same or a different node for read-only and frozen indices
(via FrozenEngine / ReadOnlyEngine).
However, we currently re-create these contexts with every subsequent search
because we don't store the recreated context in the same way as when we open
a new PIT. This PR avoids this extra work by making the following changes:
- map ReaderContexts held in SearchService by more than just an auto-incremented Long id that can clash between different nodes; instead, this PR changes the lookup to use the full ShardSearchContextId (see the sketch after this list)
- re-write the PIT id we return with every search request in cases where the results we receive show that the shards involved in answering the PIT request have been served by a different node than the one originally encoded in the PIT id
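As a rough, self-contained sketch of the first change (not the actual SearchService code; ContextKey is a hypothetical simplification of ShardSearchContextId):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical simplification of ShardSearchContextId: the session id makes the
// key unique across nodes and restarts, unlike the auto-incremented long alone.
record ContextKey(String sessionId, long id) {}

class ActiveReaderContextsSketch {
    private final Map<ContextKey, Object> activeReaders = new ConcurrentHashMap<>();

    void put(ContextKey key, Object readerContext) {
        activeReaders.put(key, readerContext); // keyed by the full context id, not only the long
    }

    Object get(ContextKey key) {
        return activeReaders.get(key);         // ids re-created on a different node cannot clash
    }
}
```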
Re-writing the PIT id is possible because the original PIT design already provided
hooks for changing the PIT id between search requests, and we already document
that open point-in-time ids can change between requests.
Currently, without re-writing the PIT ids, we can e.g. get into situations where
we lose a shard PIT context due to a node restart but are able to re-create it on
another node. If the shards in the group got reassigned to nodes different from
the original node, subsequent requests can get routed for retries to alternating
nodes. This doesn't matter in the current situation where we throw away each
re-created PIT context after use, but with this improvement, which tracks these
contexts on the new nodes again, we want to make sure that subsequent searches go
to the same node.
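And a similar sketch of the second change, deciding when the PIT id needs to be re-encoded after a search; again, these are hypothetical simplified types rather than the real SearchContextId encoding machinery:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical simplified stand-in for the per-shard entries of a decoded PIT id.
record PitEntry(String nodeId, Long contextId) {}

class PitRewriteSketch {
    /**
     * Returns a new shard map if any shard was served from a different node/context
     * than the one currently encoded in the PIT id, or null if nothing changed and
     * the existing PIT id can be returned as-is.
     */
    static Map<String, PitEntry> maybeRewrite(Map<String, PitEntry> encoded, Map<String, PitEntry> servedBy) {
        boolean changed = false;
        Map<String, PitEntry> updated = new HashMap<>(encoded);       // start from the current entries
        for (Map.Entry<String, PitEntry> result : servedBy.entrySet()) {
            PitEntry current = encoded.get(result.getKey());
            if (!Objects.equals(current, result.getValue())) {        // shard answered from a new location
                updated.put(result.getKey(), result.getValue());
                changed = true;
            }
        }
        return changed ? updated : null;
    }
}
```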