Conversation

@cbuescher
Member

@cbuescher cbuescher commented Sep 22, 2025

This change improves point-in-time (PIT) context relocations in cases where we
lose a PIT due to e.g. node shutdowns or restarts. We are already able to retry
shards where open PIT contexts have been lost by re-creating identical reader
contexts on either the same or a different node for read-only and frozen indices
(via FrozenEngine / ReadOnlyEngine).
However, we currently re-create these contexts with every subsequent search
because we don't store the recreated context the way we do when opening
a new PIT. This PR avoids this extra work by making the following changes:

  • map ReaderContexts held in SearchService by more than just an auto-incremented Long
    id, which can clash between different nodes. Instead, this PR changes the lookup to use the full
    ShardSearchContextId

  • re-write the PIT id we return with every search request in cases where the results
    we receive show that the shards involved in answering the PIT request were
    served by a different node than the one originally encoded in the PIT id

The latter is possible because the original PIT design already provided
hooks for changing the PIT id between search requests, and we already document
that open point-in-time ids can change between requests.
Currently, without re-writing the PIT ids, we can e.g. get into situations where
we lose a shard PIT context due to a node restart but are able to re-create it on
another node. If the shards in the group got reassigned to nodes different from
the original node, subsequent requests can get routed for retries to alternating
nodes. This doesn't matter in the current situation where we throw away each
re-created PIT context after use, but with this improvement, which tracks these
contexts on the new nodes again, we want to make sure that subsequent searches go
to the same node.
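The first bullet above can be sketched in isolation. The following is a minimal, hypothetical illustration (the class, record, and method names are stand-ins, not the actual Elasticsearch API) of why keying active readers by a composite id avoids clashes: two nodes can independently hand out the same auto-incremented counter, but the per-node session id keeps the full keys distinct.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: keying active reader contexts by a composite id
// (session id + counter) instead of a bare long.
final class ActiveReadersSketch {

    // Simplified stand-in for ShardSearchContextId: a per-node session id plus counter.
    record ContextId(String sessionId, long id) {}

    // Stand-in for ReaderContext.
    record Reader(ContextId id, String description) {}

    private final Map<ContextId, Reader> active = new ConcurrentHashMap<>();

    Reader put(Reader reader) {
        Reader previous = active.putIfAbsent(reader.id(), reader);
        // Mirror the "no duplicate context" invariant: callers must not register
        // two readers under the same full context id.
        assert previous == null : "context [" + reader.id() + "] already registered";
        return previous == null ? reader : previous;
    }

    Reader get(ContextId id) {
        return active.get(id);
    }

    public static void main(String[] args) {
        ActiveReadersSketch readers = new ActiveReadersSketch();
        // The same counter value (42) issued by two different node sessions no
        // longer clashes, because the session id is part of the key.
        ContextId a = new ContextId("node-A-session", 42);
        ContextId b = new ContextId("node-B-session", 42);
        readers.put(new Reader(a, "reader on node A"));
        readers.put(new Reader(b, "reader on node B"));
        System.out.println(readers.get(a).description());
        System.out.println(readers.get(b).description());
    }
}
```

With a bare `Long` key, the second `put` would silently shadow the first entry; the composite key keeps both readers addressable.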

@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Sep 22, 2025
@elasticsearchmachine
Collaborator

Hi @cbuescher, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@cbuescher cbuescher force-pushed the improve-pit-relocation branch from 3b233da to 730efd4 Compare September 23, 2025 18:02
@cbuescher
Member Author

Hi @dnhatn, can I ask you for a review on this since you probably are still the one most familiar with the PIT id rewrite hook and the general current retry logic for frozen and read-only shards?

@cbuescher cbuescher requested a review from dnhatn September 23, 2025 18:04
@dnhatn
Member

dnhatn commented Sep 24, 2025

@cbuescher Sure, I'll take a look today or tomorrow.

@cbuescher cbuescher force-pushed the improve-pit-relocation branch from beb3f48 to fb526d4 Compare September 25, 2025 13:20
@dnhatn
Member

dnhatn commented Sep 26, 2025

map ReaderContexts held in SearchService by more than just an auto-incremented Long
id that can clash between different nodes. Instead this PR changes the lookup to use the full
ShardSearchContextId

I think we already handle this case with sessionId, which should be different each time a node is restarted or between different nodes.

Member

@dnhatn dnhatn left a comment

Thanks, @cbuescher. I've left some questions and comments, but I think re-encoding the PIT when the target nodes have changed is a good idea.

.setPointInTime(new PointInTimeBuilder(pitId).setKeepAlive(TimeValue.timeValueMinutes(2))),
resp -> {
assertHitCount(resp, docCount);
updatedPit.set(resp.pointInTimeId());
Member

Can we verify that we have updated the PIT?

Member Author

Good point. However, it's not impossible that the PIT id doesn't change. The test restarts all nodes. If all shards in the original PIT get allocated to their original nodes again, we will reopen the context in the same location, leaving everything in that particular SearchContextIdForNode the same. When running tests I definitely saw both scenarios.

}
}

static <Result extends SearchPhaseResult> BytesReference maybeReEncodeNodeIds(
Member

Question: I haven't looked into how we handle partial results with re-encoding. Have you considered this?

Member Author

I assume you are referring to the functionality added with #111516? If I understand that PR correctly we will have SearchContextIdForNode entries in the PIT with a "null" node entry. I think in that case we won't add that shard to the shard iterators of any subsequent search, so we won't get a Result for that shard. That is one reason why I added copying PIT id entries for everything that has no Result from the old to the new encoded ID without change.
Is that what you mean?

Member Author

I looked again at how we handle partial results. The way I see it, when opening the PIT we can tolerate partial results, i.e. failures from certain shards when they are not available. The result will be an entry in the PIT id that has a ShardId but a SearchContextIdForNode with an empty node/contextId.
The way I see it, we should not change any of these entries, which is what is already happening in this method.
In addition, the re-encoding step here doesn't change any entries for shards that failed in the last search. They shouldn't be included in the "results" list and for that reason don't update the related part in the updated PIT id. In cases where these failures are temporary, subsequent searches with the updated id will try to hit the "old" shard context locations; if any of these can be retried, we will update that part of the PIT in a later call.
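The copy-unchanged rule described above can be sketched as follows. This is an illustrative, hypothetical reduction (the types and method names are stand-ins, not the real `maybeReEncodeNodeIds` signature): shards that produced a result get their node entry refreshed, while shards with no result (failed this round, or opened with an empty node entry due to partial results at PIT-open time) keep their original entry so later retries still target the old location.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the re-encoding rule for partial results.
final class PitReEncodeSketch {

    record NodeEntry(String node, long contextId) {}

    static Map<String, NodeEntry> reEncode(
        Map<String, NodeEntry> original,       // shardId -> entry from the old PIT id
        Map<String, NodeEntry> resultsByShard  // shards that answered the last search
    ) {
        Map<String, NodeEntry> updated = new HashMap<>();
        for (Map.Entry<String, NodeEntry> e : original.entrySet()) {
            NodeEntry fromResult = resultsByShard.get(e.getKey());
            // No result for this shard (failure, or empty-node entry from open
            // time): copy the old entry unchanged so retries keep their target.
            updated.put(e.getKey(), fromResult != null ? fromResult : e.getValue());
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, NodeEntry> original = Map.of(
            "shard-0", new NodeEntry("node-1", 10),
            "shard-1", new NodeEntry("node-2", 11),
            "shard-2", new NodeEntry(null, -1)   // partial result at open time
        );
        // shard-0 was retried and re-created on node-3; shard-1 failed this round.
        Map<String, NodeEntry> results = Map.of("shard-0", new NodeEntry("node-3", 20));
        Map<String, NodeEntry> updated = reEncode(original, results);
        System.out.println(updated.get("shard-0").node()); // relocated entry
        System.out.println(updated.get("shard-1").node()); // copied unchanged
    }
}
```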

updatedShardMap.put(shardId, original.shards().get(shardId));
}
}
return SearchContextId.encode(updatedShardMap, original.aliasFilter(), mintransportVersion, failures);
Member

Another question is how to clean up search contexts from old PITs. For example, when a shard-level request fails, we try to execute it on another copy. In these cases, we re-encode the PIT. If users close the new PIT, the old search context won't be closed. This is not an issue with stateful, but it can be a problem with serverless. Should we close the old search contexts here?

Member Author

You mean shard-level requests that are not related to the search context missing? I didn't consider this yet; good point, I think. I was assuming that if we get a result from a node other than the original one, that would always mean the old context is gone, but you are right that we might retry elsewhere for other reasons.
Would you suggest something like the fire-and-forget approach we use e.g. here in TransportSearchAction from this location? I assume a close request is relatively cheap even if the old context doesn't exist any longer, and we can treat this as a best-effort attempt. If this fails at some point, the context reaper process should clean up anything that's over the keepalive limit, no?

Member Author

I just saw that TransportSearchAction goes to all shards; it would probably be something more selective, going only to the shards that changed location, and using ClearScrollController#closeContexts directly.
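The selective cleanup idea discussed here can be sketched as a diff between the old and re-encoded entries. This is a hypothetical, best-effort sketch (names are illustrative, not the actual TransportSearchAction / ClearScrollController code): fire a close only for shards whose context moved, and tolerate failures since the reaper handles anything past the keep-alive.

```java
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative sketch: best-effort close of relocated PIT shard contexts only.
final class SelectiveCloseSketch {

    record NodeEntry(String node, long contextId) {}

    static void closeRelocated(
        Map<String, NodeEntry> oldEntries,
        Map<String, NodeEntry> newEntries,
        BiConsumer<String, Long> fireAndForgetClose  // (node, contextId) -> async close
    ) {
        for (Map.Entry<String, NodeEntry> e : oldEntries.entrySet()) {
            NodeEntry old = e.getValue();
            NodeEntry updated = newEntries.get(e.getKey());
            // Only close contexts whose location changed. Closing a context that
            // no longer exists is assumed cheap, so this stays best-effort.
            if (updated != null && old.node() != null && !old.node().equals(updated.node())) {
                fireAndForgetClose.accept(old.node(), old.contextId());
            }
        }
    }

    public static void main(String[] args) {
        var oldEntries = Map.of(
            "shard-0", new NodeEntry("node-1", 10),
            "shard-1", new NodeEntry("node-2", 11)
        );
        var newEntries = Map.of(
            "shard-0", new NodeEntry("node-3", 20),  // relocated
            "shard-1", new NodeEntry("node-2", 11)   // unchanged
        );
        closeRelocated(oldEntries, newEntries,
            (node, id) -> System.out.println("closing context " + id + " on " + node));
    }
}
```

Only the relocated shard's old context is closed; the unchanged shard is left alone.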

@cbuescher
Member Author

@dnhatn thanks a lot for the review. I addressed some of your comments and left questions otherwise. I added closing logic in the re-encoding step and adapted the tests accordingly, however I'm not sure that is exactly what you suggested. Let me know what you think.

About your comment above:

map ReaderContexts held in SearchService by more than just an auto-incremented Long
id that can clash between different nodes. Instead this PR changes the lookup to use the full
ShardSearchContextId

I think we already handle this case with sessionId, which should be different each time a node is restarted or between different nodes.

This was partially motivated by other planned changes that will enable re-locating PITs between nodes even without retries triggered by subsequent search requests. I will need to think this over a bit more and might get back to you with more questions to deepen my understanding of how sessionIds are currently used.

@dnhatn dnhatn self-requested a review September 29, 2025 17:56
@dnhatn
Member

dnhatn commented Sep 29, 2025

@cbuescher Please let me know when it’s ready for another review. Thank you for working on it.

@cbuescher
Member Author

@dnhatn thanks for the feedback, I added some changes around ignoring failures when re-encoding the PIT and left a small comment regarding partial failures in #135231 (comment).
Before I think this is ready for another review, I want to gate some of the new behavior behind a feature flag and will ping you once that's done.

elasticsearchmachine and others added 3 commits October 8, 2025 21:33
this.idGenerator = idGenerator;
}

ReaderContext put(ShardSearchContextId contextId, ReaderContext context) {
Member

Can we pass only ReaderContext context and use its id (context.id()) as the key? Another potential issue is that if we put two readers with the same ShardSearchContextId, we might leak one of them.

Member Author

Can we pass only ReaderContext context and use its id

Sure, good suggestion. Will add that change

potential issue is that if we put two readers with the same ShardSearchContextId

We already kind of guard against that case in SearchService#putReaderContext by asserting that there wasn't an existing context already. That should be the only production user of that new ActiveReaders class. I will move that assertion down into the ActiveReaders class just to make sure we always run it, or would you prefer throwing an exception in those cases for production uses? That would be an increase in strictness, though.

@dnhatn
Member

dnhatn commented Oct 23, 2025

@cbuescher We can keep the closing part as it is for now, since the new feature is guarded by a feature flag.

@cbuescher
Member Author

@dnhatn thanks for the review, I pushed some more changes.

@cbuescher
Member Author

cbuescher commented Oct 27, 2025

Hi @dnhatn, thanks for the feedback on Friday. After the discussion we had, I now went back to the simpler closing logic (always try to close the context on the remote node when node id rewriting indicates a change). I also added some more checks so that in case two threads concurrently retry adding a new context, we disregard the second one and close the context it's trying to add.
Let me know if this is what you had in mind and if this PR is good for a final cleanup round on my end. I want to reduce / remove some of the logging I added for tests but would wait for your okay on the functional parts.

@cbuescher
Member Author

Hi @dnhatn, thanks for your direct feedback yesterday again, I tried addressing your concerns by further separating the code paths for putting regular ReaderContexts and relocated contexts with 3dca6e0. Every change related to relocating contexts should be guarded by the feature flag now.

Member

@dnhatn dnhatn left a comment

Thanks @cbuescher. I've left two comments: one is optional for your awareness, and the other should be addressed before merging.

readerContext = null;
return finalReaderContext;
} else {
// we already have a mapping for this context, dont add a new one and use the existing instead
Member

We might encounter a race condition between two retries: one tries to put the relocation key first, while the second detects that the relocation key already exists and tries to get the relocation reader. However, the first retry may not have added the reader to the map yet - only added the relocation key. This isn't a blocker, but it's something to be aware of.

return createAndPutReaderContext(request, indexService, shard, searcherSupplier, defaultKeepAlive);
ReaderContext readerContext = null;
if (PIT_RELOCATION_FEATURE_FLAG.isEnabled()) {
readerContext = createAndPutRelocatedPitContext(
Member

We can re-create a new PIT for an expired PIT, which should be fine. However, if both sessionIds are the same, we might leak the relocation key when removing the reader context. Can you check this and add a test for that case? Alternatively, we could avoid re-creating a new PIT if the sessionIds are the same.

Member Author

This is an interesting aspect of our current retry abilities for e.g. searchable snapshots that I wasn't aware of. It seems we can already bring closed or expired PITs back to life if the underlying engine supports it (i.e. frozen/read-only).
Let me try to rephrase your case to see if I understand the point correctly:

  • we open a PIT shard context in session A which gets stored in the regular "activeReaders" map (i.e. id1 = sessionID=A;id=10;readerId=abcde)
  • that context expires or gets closed for some reason, but the node stays up so the session is still A
  • when we now retry with an "old" PIT id, we don't find the context and enter the relocation code path, which will add a relocation mapping e.g. "id1 -> 123" and add the context at that position to the activeReaders

With the current logic we won't be able to remove the relocation mapping because removal with the current sessionId only goes to "activeReaders".
Is that reading correct so far?

I opened a test showing our current ability to retry even closed/expired PITs for searchable snapshots at #137371; it was surprising to me at first that this works, but we probably need to keep it that way.
I considered several ideas to mitigate the problem, but for the sake of simplicity would probably opt for excluding these kinds of retries (same sessionId as the current one) from the new relocation logic and improve on that later. I'll add a change along those lines in a bit.
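The mitigation settled on above reduces to a small guard. This is a minimal, hypothetical sketch (the method name and parameters are assumptions, not the actual code): when the retried context id carries the same session id as the current node's session, the original context lived and died on this node in the current session, so the retry falls back to the old path without creating a relocation mapping.

```java
// Illustrative sketch of the same-session guard for the relocation code path.
final class SessionGuardSketch {

    static boolean eligibleForRelocationTracking(String contextSessionId, String nodeSessionId) {
        // Same session id: the context was opened (and closed/expired) on this
        // node in the current session; skip relocation bookkeeping to avoid
        // leaking a relocation key that removal could never clean up.
        return !contextSessionId.equals(nodeSessionId);
    }

    public static void main(String[] args) {
        // Expired context retried on the same node and session: no tracking.
        System.out.println(eligibleForRelocationTracking("session-A", "session-A"));
        // Context re-created under a different session (restart or other node): track it.
        System.out.println(eligibleForRelocationTracking("session-A", "session-B"));
    }
}
```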

Member Author

For simplicity, I opted for excluding contexts with the same session id from the relocation logic for now in 1a302ac and added the tests from #137371 to check that we can still retry PITs in those cases, only without the improvements added here.

@dnhatn dnhatn self-requested a review October 30, 2025 00:56
This adds a test case to RetrySearchIntegTests that shows the current
ability to retry expired or closed PITs on read-only or frozen engines,
i.e. for searchable snapshots.
@cbuescher
Member Author

Thanks @dnhatn for pointing out the last extra cases. I think I addressed one and am still trying to work through the other, but since you mention it's not a blocker, we can probably do the work there as a follow-up?

Member

@dnhatn dnhatn left a comment

I've left some nits and smaller comments, but LGTM. Thanks @cbuescher!

@cbuescher
Member Author

@dnhatn thanks a lot for your diligence and patience with me on this one, really appreciate it. I will merge this after I get another clean CI run.

@dnhatn
Member

dnhatn commented Nov 1, 2025

@cbuescher Thank YOU for all the iterations.

@cbuescher cbuescher merged commit 58fe751 into elastic:main Nov 2, 2025
34 checks passed
cbuescher added a commit to cbuescher/elasticsearch that referenced this pull request Nov 3, 2025

Labels

>enhancement :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.3.0
