Conversation

@cbuescher
Member

@cbuescher cbuescher commented Sep 22, 2025

This change improves point-in-time (PIT) context relocations in cases where we
lose a PIT due to e.g. node shutdowns or restarts. We are already able to retry
shards where open PIT contexts have been lost by re-creating identical reader
contexts on either the same or a different node for read-only and frozen indices
(via FrozenEngine / ReadOnlyEngine).
However, we currently re-create these contexts with every subsequent search
because we don't store the recreated context the way we do when opening
a new PIT. This PR avoids this extra work by making the following changes:

  • map ReaderContexts held in SearchService by more than just an auto-incremented Long
    id, which can clash between different nodes. Instead, this PR changes the lookup to use the full
    ShardSearchContextId

  • re-write the PIT id we return with every search request in cases where the results
    we receive show that the shards involved in answering the PIT request were
    served by a different node than the one originally encoded in the PIT id

The latter is possible because the original PIT design already provided
hooks for changing the PIT id between search requests, and we already document
that open point-in-time ids can change between requests.
Currently, without re-writing the PIT ids, we can e.g. get into situations where
we lose a shard PIT context due to a node restart but are able to re-create it on
another node. If the shards in the group got reassigned to nodes different from
the original node, subsequent requests can get routed for retries to alternating
nodes. This doesn't matter in the current situation where we throw away each
re-created PIT context after use, but with this improvement, which tracks these
contexts on the new nodes again, we want to make sure that subsequent searches go
to the same node.
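The first bullet above can be sketched in isolation. The following is a minimal, hypothetical illustration (the class, record, and method names are stand-ins, not the actual Elasticsearch API) of why keying active readers by a composite id avoids clashes: two nodes can independently hand out the same auto-incremented counter, but the per-node session id keeps the full keys distinct.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: keying active reader contexts by a composite id
// (session id + counter) instead of a bare long.
final class ActiveReadersSketch {

    // Simplified stand-in for ShardSearchContextId: a per-node session id plus counter.
    record ContextId(String sessionId, long id) {}

    // Stand-in for ReaderContext.
    record Reader(ContextId id, String description) {}

    private final Map<ContextId, Reader> active = new ConcurrentHashMap<>();

    Reader put(Reader reader) {
        Reader previous = active.putIfAbsent(reader.id(), reader);
        // Mirror the "no duplicate context" invariant: callers must not register
        // two readers under the same full context id.
        assert previous == null : "context [" + reader.id() + "] already registered";
        return previous == null ? reader : previous;
    }

    Reader get(ContextId id) {
        return active.get(id);
    }

    public static void main(String[] args) {
        ActiveReadersSketch readers = new ActiveReadersSketch();
        // The same counter value (42) issued by two different node sessions no
        // longer clashes, because the session id is part of the key.
        ContextId a = new ContextId("node-A-session", 42);
        ContextId b = new ContextId("node-B-session", 42);
        readers.put(new Reader(a, "reader on node A"));
        readers.put(new Reader(b, "reader on node B"));
        System.out.println(readers.get(a).description());
        System.out.println(readers.get(b).description());
    }
}
```

With a bare `Long` key, the second `put` would silently shadow the first entry; the composite key keeps both readers addressable.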

@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Sep 22, 2025
@elasticsearchmachine
Collaborator

Hi @cbuescher, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@cbuescher cbuescher force-pushed the improve-pit-relocation branch from 3b233da to 730efd4 Compare September 23, 2025 18:02
@cbuescher
Member Author

Hi @dnhatn, can I ask you for a review on this since you probably are still the one most familiar with the PIT id rewrite hook and the general current retry logic for frozen and read-only shards?

@cbuescher cbuescher requested a review from dnhatn September 23, 2025 18:04
@dnhatn
Member

dnhatn commented Sep 24, 2025

@cbuescher Sure, I'll take a look today or tomorrow.

@cbuescher cbuescher force-pushed the improve-pit-relocation branch from beb3f48 to fb526d4 Compare September 25, 2025 13:20
@dnhatn
Member

dnhatn commented Sep 26, 2025

map ReaderContexts held in SearchService by more than just an auto-incremented Long
id that can clash between different nodes. Instead this PR changes the lookup to use the full
ShardSearchContextId

I think we already handle this case with sessionId, which should be different each time a node is restarted or between different nodes.

Member

@dnhatn dnhatn left a comment

Thanks, @cbuescher. I've left some questions and comments, but I think re-encoding the PIT when the target nodes have changed is a good idea.

.setPointInTime(new PointInTimeBuilder(pitId).setKeepAlive(TimeValue.timeValueMinutes(2))),
resp -> {
assertHitCount(resp, docCount);
updatedPit.set(resp.pointInTimeId());
Member

Can we verify that we have updated the PIT?

Member Author

Good point. However, it's not impossible that the PIT id doesn't change. The test restarts all nodes. If all shards in the original PIT get allocated to their original nodes again, we will reopen the context in the same location, leaving everything in that particular SearchContextIdForNode the same. When running tests I definitely saw both scenarios.

}
}

static <Result extends SearchPhaseResult> BytesReference maybeReEncodeNodeIds(
Member

Question: I haven't looked into how we handle partial results with re-encoding. Have you considered this?

Member Author

I assume you are referring to the functionality added with #111516? If I understand that PR correctly we will have SearchContextIdForNode entries in the PIT with a "null" node entry. I think in that case we won't add that shard to the shard iterators of any subsequent search, so we won't get a Result for that shard. That is one reason why I added copying PIT id entries for everything that has no Result from the old to the new encoded ID without change.
Is that what you mean?

Member Author

I looked again at how we handle partial results. The way I see it, when opening the PIT we can tolerate partial results, i.e. failures from certain shards when they are not available. The result will be an entry in the PIT id that has a ShardId but a SearchContextIdForNode with an empty node/contextId.
The way I see it, we should not change any of these entries, which is what is already happening in this method.
In addition, the re-encoding step here doesn't change any entries for shards that failed in the last search. They shouldn't be included in the "results" list and for that reason don't update the related part in the updated PIT id. In cases where these failures are temporary, subsequent searches with the updated id will try to hit the "old" shard context locations; if any of these can be retried, we will update that part of the PIT in a later call.
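The copy-unchanged rule described above can be sketched as follows. This is an illustrative, hypothetical reduction (the types and method names are stand-ins, not the real `maybeReEncodeNodeIds` signature): shards that produced a result get their node entry refreshed, while shards with no result (failed this round, or opened with an empty node entry due to partial results at PIT-open time) keep their original entry so later retries still target the old location.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the re-encoding rule for partial results.
final class PitReEncodeSketch {

    record NodeEntry(String node, long contextId) {}

    static Map<String, NodeEntry> reEncode(
        Map<String, NodeEntry> original,       // shardId -> entry from the old PIT id
        Map<String, NodeEntry> resultsByShard  // shards that answered the last search
    ) {
        Map<String, NodeEntry> updated = new HashMap<>();
        for (Map.Entry<String, NodeEntry> e : original.entrySet()) {
            NodeEntry fromResult = resultsByShard.get(e.getKey());
            // No result for this shard (failure, or empty-node entry from open
            // time): copy the old entry unchanged so retries keep their target.
            updated.put(e.getKey(), fromResult != null ? fromResult : e.getValue());
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, NodeEntry> original = Map.of(
            "shard-0", new NodeEntry("node-1", 10),
            "shard-1", new NodeEntry("node-2", 11),
            "shard-2", new NodeEntry(null, -1)   // partial result at open time
        );
        // shard-0 was retried and re-created on node-3; shard-1 failed this round.
        Map<String, NodeEntry> results = Map.of("shard-0", new NodeEntry("node-3", 20));
        Map<String, NodeEntry> updated = reEncode(original, results);
        System.out.println(updated.get("shard-0").node()); // relocated entry
        System.out.println(updated.get("shard-1").node()); // copied unchanged
    }
}
```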

updatedShardMap.put(shardId, original.shards().get(shardId));
}
}
return SearchContextId.encode(updatedShardMap, original.aliasFilter(), mintransportVersion, failures);
Member

Another question is how to clean up search contexts from old PITs. For example, when a shard-level request fails, we try to execute it on another copy. In these cases, we re-encode the PIT. If users close the new PIT, the old search context won't be closed. This is not an issue with stateful, but it can be a problem with serverless. Should we close the old search contexts here?

Member Author

You mean shard-level requests that are not related to the search context missing? I didn't consider this yet; good point, I think. I was assuming that if we get a result from a node other than the original one, that would always mean the old context is gone, but you are right that we might retry elsewhere for other reasons.
Would you suggest something like the fire-and-forget approach we use e.g. here in TransportSearchAction from this location? I assume a close request is relatively cheap even if the old context doesn't exist any longer, and we can treat this as a best-effort attempt. If this fails at some point, the context reaper process should clean up anything that's over the keepalive limit, no?

Member Author

I just saw that TransportSearchAction goes to all shards; it would probably be something more selective, going only to the shards that changed location, and using ClearScrollController#closeContexts directly.
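The selective cleanup idea discussed here can be sketched as a diff between the old and re-encoded entries. This is a hypothetical, best-effort sketch (names are illustrative, not the actual TransportSearchAction / ClearScrollController code): fire a close only for shards whose context moved, and tolerate failures since the reaper handles anything past the keep-alive.

```java
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative sketch: best-effort close of relocated PIT shard contexts only.
final class SelectiveCloseSketch {

    record NodeEntry(String node, long contextId) {}

    static void closeRelocated(
        Map<String, NodeEntry> oldEntries,
        Map<String, NodeEntry> newEntries,
        BiConsumer<String, Long> fireAndForgetClose  // (node, contextId) -> async close
    ) {
        for (Map.Entry<String, NodeEntry> e : oldEntries.entrySet()) {
            NodeEntry old = e.getValue();
            NodeEntry updated = newEntries.get(e.getKey());
            // Only close contexts whose location changed. Closing a context that
            // no longer exists is assumed cheap, so this stays best-effort.
            if (updated != null && old.node() != null && !old.node().equals(updated.node())) {
                fireAndForgetClose.accept(old.node(), old.contextId());
            }
        }
    }

    public static void main(String[] args) {
        var oldEntries = Map.of(
            "shard-0", new NodeEntry("node-1", 10),
            "shard-1", new NodeEntry("node-2", 11)
        );
        var newEntries = Map.of(
            "shard-0", new NodeEntry("node-3", 20),  // relocated
            "shard-1", new NodeEntry("node-2", 11)   // unchanged
        );
        closeRelocated(oldEntries, newEntries,
            (node, id) -> System.out.println("closing context " + id + " on " + node));
    }
}
```

Only the relocated shard's old context is closed; the unchanged shard is left alone.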

@cbuescher
Member Author

@dnhatn thanks a lot for the review. I addressed some of your comments and left questions otherwise. I added closing logic in the re-encoding step and adapted the tests accordingly, however I'm not sure that is exactly what you suggested. Let me know what you think.

About your comment above:

map ReaderContexts held in SearchService by more than just an auto-incremented Long
id that can clash between different nodes. Instead this PR changes the lookup to use the full
ShardSearchContextId

I think we already handle this case with sessionId, which should be different each time a node is restarted or between different nodes.

This was partially motivated by other planned changes that will enable re-locating PITs between nodes even without retries triggered by subsequent search requests. I will need to think this over a bit more and might get back to you with more questions to deepen my understanding of how sessionIds are currently used.

@dnhatn dnhatn self-requested a review September 29, 2025 17:56
@dnhatn
Member

dnhatn commented Sep 29, 2025

@cbuescher Please let me know when it’s ready for another review. Thank you for working on it.

@cbuescher
Member Author

@dnhatn thanks for the feedback, I added some changes around ignoring failures when re-encoding the PIT and left a small comment regarding partial failures in #135231 (comment).
Before I think this is ready for another review, I want to gate some of the new behavior behind a feature flag and will ping you once that's done.

elasticsearchmachine and others added 3 commits October 8, 2025 21:33
this.idGenerator = idGenerator;
}

ReaderContext put(ShardSearchContextId contextId, ReaderContext context) {
Member

Can we pass only ReaderContext context and use its id (context.id()) as the key? Another potential issue is that if we put two readers with the same ShardSearchContextId, we might leak one of them.

Member Author

Can we pass only ReaderContext context and use its id

Sure, good suggestion. Will add that change

potential issue is that if we put two readers with the same ShardSearchContextId

We already kind of guard against that case in SearchService#putReaderContext by asserting that there wasn't an existing context already. That should be the only production user of that new ActiveReaders class. I will move that assertion down into the ActiveReaders class just to make sure we always run it, or would you prefer throwing an exception in those cases for production uses? That would be an increase in strictness, though.

@dnhatn
Member

dnhatn commented Oct 23, 2025

@cbuescher We can keep the closing part as it is for now, since the new feature is guarded by a feature flag.

@cbuescher
Member Author

@dnhatn thanks for the review, I pushed some more changes.

@cbuescher
Member Author

cbuescher commented Oct 27, 2025

Hi @dnhatn, thanks for the feedback on Friday. After the discussion we had, I now went back to the simpler closing logic (always try to close the context on the remote node when node id rewriting indicates a change). I also added some more checks so that in case two threads concurrently retry adding a new context, we disregard the second one and close the context it's trying to add.
Let me know if this is what you had in mind and if this PR is good for a final cleanup round on my end. I want to reduce / remove some of the logging I added for tests but would wait for your okay on the functional parts.

@cbuescher
Member Author

Hi @dnhatn, thanks for your direct feedback yesterday again, I tried addressing your concerns by further separating the code paths for putting regular ReaderContexts and relocated contexts with 3dca6e0. Every change related to relocating contexts should be guarded by the feature flag now.

Member

@dnhatn dnhatn left a comment

Thanks @cbuescher. I've left two comments: one is optional for your awareness, and the other should be addressed before merging.

readerContext = null;
return finalReaderContext;
} else {
// we already have a mapping for this context, dont add a new one and use the existing instead
Member

We might encounter a race condition between two retries: one tries to put the relocation key first, while the second detects that the relocation key already exists and tries to get the relocation reader. However, the first retry may not have added the reader to the map yet - only added the relocation key. This isn't a blocker, but it's something to be aware of.

return createAndPutReaderContext(request, indexService, shard, searcherSupplier, defaultKeepAlive);
ReaderContext readerContext = null;
if (PIT_RELOCATION_FEATURE_FLAG.isEnabled()) {
readerContext = createAndPutRelocatedPitContext(
Member

We can re-create a new PIT for an expired PIT, which should be fine. However, if both sessionIds are the same, we might leak the relocation key when removing the reader context. Can you check this and add a test for that case? Alternatively, we could avoid re-creating a new PIT if the sessionIds are the same.

Member Author

This is an interesting aspect of our current retry abilities for e.g. searchable snapshots that I wasn't aware of. It seems we can already bring closed or expired PITs back to life if the underlying engine supports it (i.e. frozen/read-only).
Let me try to rephrase your case to see if I understand the point correctly:

  • we open a PIT shard context in session A which gets stored in the regular "activeReaders" map (i.e. id1 = sessionID=A;id=10;readerId=abcde)
  • that context expires or gets closed for some reason, but the node stays up so the session is still A
  • when we now retry with an "old" PIT id, we don't find the context and enter the relocation code path, which will add a relocation mapping e.g. "id1 -> 123" and add the context at that position to the activeReaders

With the current logic we won't be able to remove the relocation mapping because removal with the current sessionId only goes to "activeReaders".
Is that reading correct so far?

I opened a test showing our current ability to retry even closed/expired PITs for searchable snapshots at #137371; it was surprising to me at first that this works, but we probably need to keep it that way.
I considered several ideas to mitigate the problem, but for the sake of simplicity would probably opt for excluding these kinds of retries (same sessionId as the current one) from the new relocation logic and improve on that later. I'll add a change along those lines in a bit.
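The mitigation settled on above reduces to a small guard. This is a minimal, hypothetical sketch (the method name and parameters are assumptions, not the actual code): when the retried context id carries the same session id as the current node's session, the original context lived and died on this node in the current session, so the retry falls back to the old path without creating a relocation mapping.

```java
// Illustrative sketch of the same-session guard for the relocation code path.
final class SessionGuardSketch {

    static boolean eligibleForRelocationTracking(String contextSessionId, String nodeSessionId) {
        // Same session id: the context was opened (and closed/expired) on this
        // node in the current session; skip relocation bookkeeping to avoid
        // leaking a relocation key that removal could never clean up.
        return !contextSessionId.equals(nodeSessionId);
    }

    public static void main(String[] args) {
        // Expired context retried on the same node and session: no tracking.
        System.out.println(eligibleForRelocationTracking("session-A", "session-A"));
        // Context re-created under a different session (restart or other node): track it.
        System.out.println(eligibleForRelocationTracking("session-A", "session-B"));
    }
}
```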

Member Author

For simplicity, I opted for excluding contexts with the same session id from the relocation logic for now in 1a302ac and added the tests from #137371 to check that we can still retry PITs in those cases, only without the improvements added here.

@dnhatn dnhatn self-requested a review October 30, 2025 00:56
This adds a test case to RetrySearchIntegTests that shows the current
ability to retry expired or closed PITs on read-only or frozen engines,
i.e. for searchable snapshots.
@cbuescher
Member Author

Thanks @dnhatn for pointing out the last extra cases. I think I addressed one and am still trying to work through the other, but since you mention it's not a blocker, we can probably do the work there as a follow-up?

Member

@dnhatn dnhatn left a comment

I've left some nits and smaller comments, but LGTM. Thanks @cbuescher!

@cbuescher
Member Author

@dnhatn thanks a lot for your diligence and patience with me on this one, really appreciate it. I will merge this after I get another clean CI run.

@dnhatn
Member

dnhatn commented Nov 1, 2025

@cbuescher Thank YOU for all the iterations.

@cbuescher cbuescher merged commit 58fe751 into elastic:main Nov 2, 2025
34 checks passed
cbuescher added a commit to cbuescher/elasticsearch that referenced this pull request Nov 3, 2025

Labels

>enhancement :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v9.3.0
