Reduce ClusterState retention in retry closures #20858
HarishNarasimhanK wants to merge 7 commits into opensearch-project:main
Conversation
Persistent review updated to latest commit 64a4b05
❌ Gradle check result for 64a4b05: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Codecov Report ❌ Patch coverage is

```diff
@@             Coverage Diff              @@
##               main   #20858      +/-  ##
============================================
- Coverage     73.28%   73.28%    -0.01%
- Complexity    72490    72537       +47
============================================
  Files          5819     5819
  Lines        331398   331421       +23
  Branches      47887    47890        +3
============================================
+ Hits         242875   242878        +3
- Misses        68984    69056       +72
+ Partials      19539    19487       -52
```

☔ View full report in Codecov by Sentry.
❕ Gradle check result for ebe17a4: UNSTABLE. Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Signed-off-by: Harish Narasimhan K <163456392+HarishNarasimhanK@users.noreply.github.com>
Description
1. Goal
In OpenSearch, snapshot deletion is a cluster-manager-routed operation. When a delete request is received, the cluster manager creates internal callback objects (listeners) to track the operation and notify the caller once it completes. These listeners inadvertently hold a reference to a large in-memory object called `ClusterState`, which contains the entire cluster's metadata, routing information, and index definitions.

When a snapshot deletion gets stuck or takes a long time to complete, users or automated systems may retry the delete request multiple times. As listeners accumulate from repeated retries, multiple `ClusterState` objects get pinned on the heap, causing the cluster manager node's memory usage to grow until it runs out of memory.

This change fixes the issue by ensuring that the listeners only hold the small pieces of information they actually need (a version number and a node identifier) instead of the entire `ClusterState` object. This allows the large `ClusterState` objects to be garbage collected immediately, preventing the memory buildup.

2. Current Workflow
This section traces the lifecycle of a snapshot delete request from the REST API to the point where the listener is stored in `SnapshotsService`.

1. A client sends `DELETE /_snapshot/{repository}/{snapshot}`.
2. The REST layer (`RestDeleteSnapshotAction`) constructs a `DeleteSnapshotRequest` and passes it to `NodeClient`.
3. `NodeClient` dispatches the request to `TransportDeleteSnapshotAction`, which extends `TransportClusterManagerNodeAction`.
4. The base class creates an `AsyncSingleAction` instance to manage the request lifecycle. `AsyncSingleAction` fetches the current `ClusterState` and calls `doStart(clusterState)`.
5. If the local node is the cluster manager, `doStart()` wraps the original listener using `getDelegateForLocalExecute(clusterState)`. This wrapper contains a lambda for retry logic that references the `clusterState` parameter. Due to Java lambda capture semantics, the entire `ClusterState` object is implicitly retained by this lambda for as long as the listener exists.
6. The wrapped listener is passed into `TransportDeleteSnapshotAction.clusterManagerOperation()`, which calls `snapshotsService.deleteSnapshots(request, listener)`. The listener still carries the captured `ClusterState` reference inside its retry lambda.
7. Inside `SnapshotsService`, the deletion is submitted as a cluster state update. Once the cluster state is updated to record the deletion, the listener is stored in the `snapshotDeletionListeners` map (keyed by the deletion UUID) in order to notify the client when the deletion completes.

3. Issue with Current Workflow

The listener stored in `snapshotDeletionListeners` sits in the map until the repository-level deletion reaches a terminal state. If the deletion is stuck (due to slow I/O, stuck segment uploads, large repository cleanup, or any other reason), the listener remains in `snapshotDeletionListeners` indefinitely, and the captured `ClusterState` cannot be garbage collected.

For each subsequent delete request, `SnapshotsService` adds another listener to `snapshotDeletionListeners` through the same path. As delete requests accumulate, these listeners pile up, each pinning a `ClusterState` object on the heap. The cluster manager node's heap usage grows monotonically with each repeated delete, eventually leading to `OutOfMemoryError`.

4. Requirements

- Reduce the size of the data retained by retry closures. Instead of capturing the full `ClusterState` object, closures should only hold the minimal primitives required for retry decisions.
- Preserve existing retry behavior and backward compatibility.
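The capture problem and the primitives-only requirement can be reproduced in a few lines. The following is a toy sketch with hypothetical stand-in names (`FakeClusterState` is not the real class), assuming typical HotSpot weak-reference behavior after `System.gc()`:

```java
import java.lang.ref.WeakReference;

// Illustrative only: shows how a lambda that touches a large object pins the
// whole object, while a lambda over a pre-extracted primitive does not.
public class RetryCaptureSketch {

    /** Hypothetical stand-in for ClusterState: one useful long plus heap ballast. */
    static class FakeClusterState {
        final long version;
        final byte[] ballast = new byte[4 * 1024 * 1024]; // simulate a heavy state
        FakeClusterState(long version) { this.version = version; }
    }

    static Runnable capturingClosure;  // retry lambda that captured the whole state
    static Runnable primitiveClosure;  // retry lambda that captured only a long
    static WeakReference<FakeClusterState> ref;

    static void buildClosures() {
        FakeClusterState state = new FakeClusterState(7);
        ref = new WeakReference<>(state);

        // Problem: touching state.version inside the lambda captures `state`,
        // so the entire object stays reachable as long as the closure lives.
        capturingClosure = () -> System.out.println("retry at version " + state.version);

        // Fix: extract the primitive first; the closure then holds only a long.
        long version = state.version;
        primitiveClosure = () -> System.out.println("retry at version " + version);
    }

    public static void main(String[] args) {
        buildClosures();
        System.gc();
        System.out.println("state pinned while capturing closure alive: " + (ref.get() != null));

        capturingClosure = null; // drop the heavy closure; the primitive one remains
        for (int i = 0; i < 100 && ref.get() != null; i++) System.gc();
        System.out.println("state collectable afterwards: " + (ref.get() == null));

        primitiveClosure.run(); // still retries correctly with just the version
    }
}
```

Note that because a captured local must be effectively final, the retained object cannot simply be nulled out after closure creation; extracting the primitive before the lambda is created is the reliable way to avoid the capture.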
5. Approach: Extract Primitives Before Closure Creation
The retry closures in `TransportClusterManagerNodeAction` only need two pieces of information from the `ClusterState` to make retry decisions: the cluster state version (a `long`) and the cluster manager node's ephemeral ID (a `String`). By extracting these values before creating any lambda or anonymous class, the closures capture only these lightweight primitives. The full `ClusterState` object is no longer referenced by any closure and becomes eligible for garbage collection immediately.

Sequence Diagram
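The extraction-before-closure idea can be condensed into a sketch. Class and method names echo the description above, but the signatures and the exact retry-predicate semantics shown here are illustrative assumptions, not the actual OpenSearch code:

```java
// Sketch of the overload pattern: a predicate built from primitives so the
// returned closure never captures a full cluster state.
public class PredicateOverloadSketch {

    /** Hypothetical stand-in for ClusterState, reduced to the two fields retries need. */
    static class FakeClusterState {
        final long version;
        final String clusterManagerEphemeralId;
        FakeClusterState(long version, String ephemeralId) {
            this.version = version;
            this.clusterManagerEphemeralId = ephemeralId;
        }
    }

    /** The retry predicate inspects only primitives of a candidate new state. */
    interface StatePredicate {
        boolean test(long newVersion, String newClusterManagerEphemeralId);
    }

    // New overload: the returned closure captures one long and one String.
    static StatePredicate build(long version, String ephemeralId) {
        return (newVersion, newId) -> newVersion > version || !newId.equals(ephemeralId);
    }

    // Old signature kept as a bridge: extract primitives, then delegate, so
    // the state object itself is never captured by the closure.
    static StatePredicate build(FakeClusterState state) {
        return build(state.version, state.clusterManagerEphemeralId);
    }

    public static void main(String[] args) {
        StatePredicate retryWorthwhile = build(new FakeClusterState(10, "node-a"));
        System.out.println(retryWorthwhile.test(10, "node-a")); // unchanged state
        System.out.println(retryWorthwhile.test(11, "node-a")); // version advanced
        System.out.println(retryWorthwhile.test(10, "node-b")); // manager changed
    }
}
```

The bridge overload keeps existing call sites compiling unchanged, which is how the PR preserves backward compatibility while making the primitives-based path the one that closures actually use.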
Implementation Steps
- In `getDelegateForLocalExecute` inside `AsyncSingleAction`, the cluster state version and the cluster manager node are extracted before the lambda is created. The lambda now references only these extracted values, so the full `ClusterState` is no longer retained.
- A new overloaded `retryOnMasterChange` method is added that accepts the version and cluster manager node directly. It extracts the ephemeral ID from the cluster manager node and passes it to the predicate builder.
- The original `retryOnMasterChange` method that accepts a full `ClusterState` is kept as a convenience bridge. It extracts the version and cluster manager node and delegates to the new overload.
- The `retry` method signature is updated to accept the version and cluster manager node instead of the full `ClusterState`. It uses the persistent node ID and version to construct the cluster state observer via a new primitives-based constructor.
- A new overloaded `build` method is added to `ClusterManagerNodeChangePredicate` that accepts the version and ephemeral ID directly. The existing `build` method that accepts a full `ClusterState` is refactored to extract these values and delegate to the new overload.
- Two new constructors are added to `ClusterStateObserver` that accept the cluster manager node ID and version as primitives, instead of requiring a full `ClusterState` object.
- The `StoredState` inner class inside `ClusterStateObserver` is refactored to support construction from primitives. The existing constructor that accepts a `ClusterState` now delegates to the new primitives-based constructor.

6. Validation
The fix was validated by reproducing the memory retention issue on a local cluster and comparing heap dumps before and after the change.
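The effect measured in the heap dumps can be approximated in isolation with weak references. This toy analogue (hypothetical names, not the real OpenSearch classes, and not a substitute for the actual validation) counts how many simulated cluster states stay reachable while listeners sit in a collection:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

// Toy analogue of the heap-dump comparison: with capturing listeners every
// state stays pinned; with primitives-only listeners the states are collected.
public class RetentionComparison {

    /** Hypothetical stand-in for ClusterState: heap ballast per instance. */
    static class FakeClusterState {
        final long version;
        final byte[] ballast = new byte[256 * 1024];
        FakeClusterState(long version) { this.version = version; }
    }

    /** One listener per simulated delete request; returns weak refs to the states. */
    static List<WeakReference<FakeClusterState>> buildListeners(List<Runnable> sink, boolean extractPrimitives) {
        List<WeakReference<FakeClusterState>> refs = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            FakeClusterState state = new FakeClusterState(i);
            refs.add(new WeakReference<>(state));
            if (extractPrimitives) {
                long version = state.version;                    // after the fix
                sink.add(() -> System.out.print(version));
            } else {
                sink.add(() -> System.out.print(state.version)); // before the fix
            }
        }
        return refs;
    }

    /** Counts states still reachable after coaxing the collector. */
    static long alive(List<WeakReference<FakeClusterState>> refs) {
        for (int i = 0; i < 100; i++) System.gc();
        return refs.stream().filter(r -> r.get() != null).count();
    }

    public static void main(String[] args) {
        List<Runnable> listeners = new ArrayList<>();
        System.out.println("pinned before fix: " + alive(buildListeners(listeners, false)));
        listeners.clear();
        System.out.println("pinned after fix: " + alive(buildListeners(listeners, true)));
    }
}
```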
Reproduction Setup
- Added a `Thread.sleep()` call in `BlobStoreRepository.doDeleteShardSnapshots()` to simulate a long-running deletion that stays stuck.
- Created 500 indices with heavy mappings (50+ fields each), multiple aliases per index, and a small number of documents per index to inflate the `ClusterState` size.
- Created one snapshot per index (500 snapshots total) using a filesystem-based snapshot repository.
- Spammed delete requests for all snapshots repeatedly, so that listeners accumulate in `snapshotDeletionListeners` while the deletion is stuck.
- Captured heap dumps from the cluster manager node and compared the retained size of listener chains.
Results
Before (without fix)
After (with fix)
Related Issues
Resolves #15065
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.