Improve PIT context relocation #135231
Conversation
Hi @cbuescher, I've created a changelog YAML for you.

Pinging @elastic/es-search-foundations (Team:Search Foundations)
Force-pushed eb20fe5 to 3b233da
This change improves point-in-time (PIT) context relocations in cases where we lose a PIT due to e.g. node shutdowns or restarts. We are already able to retry shards whose open PIT contexts have been lost by re-creating identical reader contexts on either the same or a different node for read-only and frozen indices (via FrozenEngine / ReadOnlyEngine). However, currently we re-create these contexts with every subsequent search because we don't store the recreated context in a similar way as when we open a new PIT. This PR avoids this extra work.
Force-pushed 3b233da to 730efd4
Hi @dnhatn, can I ask you for a review on this since you are probably still the one most familiar with the PIT id rewrite hook and the general current retry logic for frozen and read-only shards?

@cbuescher Sure, I'll take a look today or tomorrow.
Force-pushed beb3f48 to fb526d4
I think we already handle this case with sessionId, which should be different each time a node is restarted or between different nodes.
dnhatn left a comment
Thanks, @cbuescher. I've left some questions and comments, but I think re-encoding the PIT when the target nodes have changed is a good idea.
.setPointInTime(new PointInTimeBuilder(pitId).setKeepAlive(TimeValue.timeValueMinutes(2))),
resp -> {
    assertHitCount(resp, docCount);
    updatedPit.set(resp.pointInTimeId());
Can we verify that we have updated the PIT?
Good point. However, it's not impossible that the PIT id stays the same. The test restarts all nodes. If all shards in the original PIT get allocated to their original nodes again, we will reopen the context in the same location, leaving everything in that particular SearchContextIdForNode the same. When running tests I definitely saw both scenarios.
    }
}

static <Result extends SearchPhaseResult> BytesReference maybeReEncodeNodeIds(
Question: I haven't looked into how we handle partial results with re-encoding. Have you considered this?
I assume you are referring to the functionality added with #111516? If I understand that PR correctly, we will have SearchContextIdForNode entries in the PIT with a "null" node entry. I think in that case we won't add that shard to the shard iterators of any subsequent search, so we won't get a Result for that shard. That is one reason why I copy PIT id entries for every shard that has no Result from the old encoded ID to the new one without change.
Is that what you mean?
I looked again at how we handle partial results. The way I see it, when opening the PIT we can tolerate partial results, i.e. failures from certain shards when they are not available. The result will be an entry in the PIT id that has a ShardId but a SearchContextIdForNode with an empty node/contextId.
The way I see it, we should not change any of these entries, which is what is already happening in this method.
In addition, the re-encoding step here doesn't change any entries for shards that failed in the last search. They shouldn't be included in the "results" list, so they won't update the related part of the updated PIT id. In cases where these failures are temporary, subsequent searches with the updated ID will still try to hit the "old" shard context locations, and if any of those can be retried we will update that part of the PIT in a later call.
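For illustration, here is a minimal, self-contained sketch of that rule, assuming simplified stand-in types (not the real SearchContextId / SearchContextIdForNode classes): entries with a null node and shards without a fresh result keep their original entry in the re-encoded PIT.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in: shardId -> nodeId, where a null node means the shard
// failed when the PIT was opened (partial results).
class PartialResultRuleSketch {

    static Map<String, String> rewrite(Map<String, String> pitEntries, Map<String, String> servedBy) {
        Map<String, String> updated = new HashMap<>();
        pitEntries.forEach((shard, node) -> {
            String freshNode = servedBy.get(shard);
            if (node == null || freshNode == null) {
                // failed at open time, or no result in this search: copy unchanged
                updated.put(shard, node);
            } else {
                // otherwise record the node that actually served the shard
                updated.put(shard, freshNode);
            }
        });
        return updated;
    }
}
```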
            updatedShardMap.put(shardId, original.shards().get(shardId));
        }
    }
    return SearchContextId.encode(updatedShardMap, original.aliasFilter(), mintransportVersion, failures);
Another question is how to clean up search contexts from old PITs. For example, when a shard-level request fails, we try to execute it on another copy. In these cases, we re-encode the PIT. If users close the new PIT, the old search context won't be closed. This is not an issue with stateful, but it can be a problem with serverless. Should we close the old search contexts here?
You mean shard-level requests that are not related to the search context being missing? I didn't consider this yet, good point I think. I was assuming that if we get a result from a node other than the original one, that would always mean the old context is gone, but you are right, we might retry elsewhere for other reasons.
Would you suggest something like the fire-and-forget approach we use e.g. here in TransportSearchAction from this location? I assume a close request is relatively cheap even if the old context doesn't exist any longer, and we can treat this as a best-effort attempt. If this fails at some point, the context reaper process should clean up anything that's over the keep-alive limit, no?
I just saw that TransportSearchAction goes to all shards; it would probably have to be something more selective that only targets the shards whose location changed and uses ClearScrollController#closeContexts directly.
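A rough, hedged sketch of that best-effort cleanup (simplified stand-in types; closeContext below is a hypothetical placeholder for the transport-level close that the real change could route through ClearScrollController#closeContexts):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

class OldContextCleanupSketch {

    // Stand-in for SearchContextIdForNode: which node currently holds the context.
    record Location(String nodeId, long contextId) {}

    // Close only the contexts whose location changed during re-encoding, fire-and-forget.
    static void closeRelocatedContexts(Map<String, Location> before, Map<String, Location> after) {
        List<Location> stale = before.entrySet().stream()
            .filter(e -> after.containsKey(e.getKey()) && after.get(e.getKey()).equals(e.getValue()) == false)
            .map(Map.Entry::getValue)
            .toList();
        for (Location old : stale) {
            // Best effort: failures are ignored, the periodic context reaper removes
            // anything that outlives its keep-alive anyway.
            CompletableFuture.runAsync(() -> closeContext(old.nodeId(), old.contextId()))
                .exceptionally(e -> null);
        }
    }

    // Hypothetical placeholder for the per-node close request.
    static void closeContext(String nodeId, long contextId) {
        // no-op in this sketch
    }
}
```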
@dnhatn thanks a lot for the review. I addressed some of your comments and left questions otherwise. I added closing logic in the re-encoding step and adapted the tests accordingly, however I'm not sure that is exactly what you suggested. Let me know what you think. About your comment above:
This was partially motivated by other planned changes that will enable relocating PITs between nodes even without retries triggered by subsequent search requests. I will need to think this over a bit more and might get back to you with more questions to deepen my understanding of how sessionIds are currently used.

@cbuescher Please let me know when it's ready for another review. Thank you for working on it.

@dnhatn thanks for the feedback, I added some changes around ignoring failures when re-encoding the PIT and left a small comment regarding partial failures in #135231 (comment).
Only store retried contexts and rewrite PIT when the feature is enabled.
    this.idGenerator = idGenerator;
}

ReaderContext put(ShardSearchContextId contextId, ReaderContext context) {
Can we pass only ReaderContext context and use its id (context.id()) as the key? Another potential issue is that if we put two readers with the same ShardSearchContextId, we might leak one of them.
Can we pass only ReaderContext context and use its id

Sure, good suggestion. Will add that change.

potential issue is that if we put two readers with the same ShardSearchContextId

We already kind of guard against that case in SearchService#putReaderContext by asserting that there wasn't an existing context already. That should be the only production user of the new ActiveReaders class. I will move that assertion down into the ActiveReaders class just to make sure we always run it, or would you prefer throwing an exception in those cases for production uses? That would be an increase in strictness though.
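A minimal sketch of the suggested shape, assuming simplified stand-in types rather than the actual ActiveReaders / ReaderContext classes: the map is keyed by the reader's own id, and a duplicate put trips an assertion instead of silently overwriting (and potentially leaking) an existing reader.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ActiveReadersSketch {

    // Stand-in for ShardSearchContextId: session id plus per-node counter.
    record ContextId(String sessionId, long id) {}

    // Stand-in for ReaderContext: it knows its own id.
    interface Reader {
        ContextId id();
    }

    private final Map<ContextId, Reader> readers = new ConcurrentHashMap<>();

    // Key derived from the reader itself; refuse to replace an existing entry.
    Reader put(Reader reader) {
        Reader previous = readers.putIfAbsent(reader.id(), reader);
        assert previous == null : "reader context already registered for " + reader.id();
        return previous == null ? reader : previous;
    }
}
```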
@cbuescher We can keep the closing part as it is for now, since the new feature is guarded by a feature flag.

@dnhatn thanks for the review, I pushed some more changes.

Hi @dnhatn, thanks for the feedback on Friday. After the discussion we had I now went back to the simpler closing logic (always try to close the context on the remote node when node ID rewriting indicates a change). I also added some more checks so that in case two threads concurrently retry adding a new context, we disregard the second one and close the context it's trying to add.
dnhatn left a comment
Thanks @cbuescher. I've left two comments: one is optional for your awareness, and the other should be addressed before merging.
    readerContext = null;
    return finalReaderContext;
} else {
    // we already have a mapping for this context, dont add a new one and use the existing instead
We might encounter a race condition between two retries: one tries to put the relocation key first, while the second detects that the relocation key already exists and tries to get the relocation reader. However, the first retry may not have added the reader to the map yet - only added the relocation key. This isn't a blocker, but it's something to be aware of.
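A small self-contained illustration of that window, with illustrative names rather than the real SearchService fields: the relocation key and the reader live in two maps that are updated one after the other, so a concurrent lookup can observe the key before the reader exists and has to handle a null result.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RelocationRaceSketch {

    final Map<String, Long> relocationKeys = new ConcurrentHashMap<>(); // old PIT entry -> new context id
    final Map<Long, Object> activeReaders = new ConcurrentHashMap<>();  // new context id -> reader

    Object lookupRelocated(String oldId) {
        Long newId = relocationKeys.get(oldId);
        if (newId == null) {
            return null; // no relocation recorded yet
        }
        // The thread that registered the relocation key may not have added the
        // reader yet, so this can still be null and the caller must fall back.
        return activeReaders.get(newId);
    }
}
```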
return createAndPutReaderContext(request, indexService, shard, searcherSupplier, defaultKeepAlive);
ReaderContext readerContext = null;
if (PIT_RELOCATION_FEATURE_FLAG.isEnabled()) {
    readerContext = createAndPutRelocatedPitContext(
We can re-create a new PIT for an expired PIT, which should be fine. However, if both sessionIds are the same, we might leak the relocation key when removing the reader context. Can you check this and add a test for that case? Alternatively, we could avoid re-creating a new PIT if the sessionIds are the same.
This is an interesting aspect of our current retry abilities for e.g. searchable snapshots that I wasn't aware of. It seems we can already bring closed or expired PITs back to life if the underlying engine supports it (i.e. frozen/read-only).
Let me try to rephrase your case to see if I understand the point correctly:
- we open a PIT shard context in session A which gets stored in the regular "activeReaders" map (i.e. id1 = sessionID=A;id=10;readerId=abcde)
- that context expires or gets closed for some reason, but the node stays up so the session is still A
- when we now retry with the "old" PIT id, we don't find the context and enter the relocation code path, which adds a relocation mapping e.g. "id1 -> 123" and adds the context under that key to the activeReaders
With the current logic we won't be able to remove the relocation mapping because removal with the current sessionId only goes to "activeReaders".
Is that reading correct so far?
I opened a test showing our current ability to retry even on closed/expired PITs for searchable snapshots at #137371; it was surprising to me at first that this works, but we probably need to keep it that way.
I considered several ideas to mitigate the problem, but for the sake of simplicity I would probably opt for excluding these kinds of retries (same sessionId as the current one) from the new relocation logic and improve on that later. I'll add a change along those lines in a bit.
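A hedged sketch of that exclusion, with illustrative names only (not the actual SearchService logic): if the requested context was created in the node's current session, the node never restarted, so this is treated as an expired/closed context rather than a relocated one and the relocation path is skipped.

```java
class SameSessionGuardSketch {

    // Stand-in for ShardSearchContextId.
    record ContextId(String sessionId, long id) {}

    static boolean shouldUseRelocationPath(ContextId requested, String currentSessionId) {
        // Same session id: the context simply expired or was closed locally,
        // so fall back to the regular "missing context" handling instead.
        return currentSessionId.equals(requested.sessionId()) == false;
    }
}
```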
This adds a test case to RetrySearchIntegTests that shows the current ability to retry expired or closed PITs on read-only or frozen engines, i.e. for searchable snapshots.
Thanks @dnhatn for pointing out the last extra cases. I think I addressed one and am still trying to work through the other, but since you mention it's not a blocker, we can probably do that work as a follow-up?
dnhatn left a comment
I've left some nits and smaller comments, but LGTM. Thanks @cbuescher!
@dnhatn thanks a lot for your diligence and patience with me on this one, really appreciate it. I will merge this after I get another clean CI run.

@cbuescher Thank YOU for all the iterations.
This change improves point-in-time (PIT) context relocations in cases where we
lose a PIT due to e.g. node shutdowns or restarts. We are already able to retry
shards whose open PIT contexts have been lost by re-creating identical reader
contexts on either the same or a different node for read-only and frozen indices
(via FrozenEngine / ReadOnlyEngine).
However, currently we re-create these contexts with every subsequent search
because we don't store the recreated context in a similar way as when we open
a new PIT. This PR avoids this extra work by making the following changes:
- map ReaderContexts held in SearchService by more than just an auto-incremented Long
  id that can clash between different nodes. Instead, this PR changes the lookup to use
  the full ShardSearchContextId
- re-write the PIT id we return with every search request in cases where the results
  we receive show that the shards involved in answering the PIT request have been
  served by a different node than the one originally encoded in the PIT id
The latter is possible because the original PIT design already provided hooks for
changing the PIT id between search requests, and we already document that open
point-in-time ids can change between requests.
Currently, without re-writing the PIT ids, we can e.g. get into situations where
we lose a shard PIT context due to a node restart but are able to re-create it on
another node. If the shards in the group got reassigned to nodes different from
the original node, subsequent requests can get routed for retries to alternating
nodes. This doesn't matter in the current situation where we throw away each
re-created PIT context after use, but with this improvement, which tracks these
contexts on the new nodes again, we want to make sure that subsequent searches go
to the same node.
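For the gist of the re-encoding step, here is a minimal, self-contained sketch under simplified assumptions; the types below stand in for SearchContextId / SearchContextIdForNode and are not the actual Elasticsearch classes.

```java
import java.util.HashMap;
import java.util.Map;

class PitReEncodeSketch {

    // Stand-in for SearchContextIdForNode: which node holds the context for a shard.
    record ContextForNode(String nodeId, long contextId) {}

    // Stand-in for a per-shard result that reports the node that actually served it.
    record ShardResult(String shardId, String servedByNodeId, long contextId) {}

    /**
     * Entries whose serving node differs from the one recorded in the original PIT
     * are rewritten; shards without a result (failed or unavailable) and entries
     * with a null node keep their original entry. Only return a new map when
     * something actually changed.
     */
    static Map<String, ContextForNode> maybeRewrite(Map<String, ContextForNode> original, Iterable<ShardResult> results) {
        Map<String, ContextForNode> updated = new HashMap<>(original);
        boolean changed = false;
        for (ShardResult result : results) {
            ContextForNode before = original.get(result.shardId());
            if (before != null && before.nodeId() != null && before.nodeId().equals(result.servedByNodeId()) == false) {
                updated.put(result.shardId(), new ContextForNode(result.servedByNodeId(), result.contextId()));
                changed = true;
            }
        }
        return changed ? updated : original;
    }
}
```

In the actual PR the rewritten map would then be encoded back into a PIT id (cf. the SearchContextId.encode call in the diff above) and returned with the search response.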