
Don't consider retention leases held by replicas for remote store back… #20874

Merged
shwetathareja merged 2 commits into opensearch-project:main from gbbafna:rl-leak-fix on Mar 16, 2026

Conversation

@gbbafna
Contributor

@gbbafna gbbafna commented Mar 15, 2026

…ed indices

Description

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Issue details

  1. The primary was on node 5d4; the replica was on fd2c.
  2. The primary relocated from 5d4 to 7ed8f at 17:42:13,292.
  3. A replica came up on 5d4 at 17:42:22.
  4. The retention lease for 5d4 was never removed on the new primary, and its retaining seq_no never advanced.
"seq_no":{"max_seq_no":15249281,"local_checkpoint":-1,"global_checkpoint":-1},
"retention_leases":{"primary_term":1,"version":344724,"leases":[{"id":"peer_recovery/nodeA","retaining_seq_no":0,"timestamp":1773411457159,"source":"peer recovery"},{"id":"peer_recovery/nodeB","retaining_seq_no":15250120,"timestamp":1773411457159,"source":"peer recovery"}]}

This seq_no=0 retention lease caused the shard size to grow enormously, because deletes were never expunged.
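To see why a single stale lease is enough to balloon the shard, note that the engine must keep every operation at or above the minimum retaining_seq_no across all retention leases. A minimal, self-contained sketch (plain Java, hypothetical method name; not the actual OpenSearch API) of that minimum:

```java
import java.util.List;

public class RetainedSeqNoSketch {
    // Simplified model: the lowest retaining_seq_no across all leases decides
    // which soft-deleted operations must stay on disk.
    static long minRetainedSeqNo(List<Long> retainingSeqNos) {
        return retainingSeqNos.stream().mapToLong(Long::longValue).min().orElse(Long.MAX_VALUE);
    }

    public static void main(String[] args) {
        // The stale lease from the diagnostics above retains from seq_no 0,
        // so all ~15.2M operations (including deletes) are kept.
        System.out.println(minRetainedSeqNo(List.of(0L, 15_250_120L))); // 0
        // With the stale lease gone, only recent history is retained.
        System.out.println(minRetainedSeqNo(List.of(15_250_120L)));     // 15250120
    }
}
```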

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ed indices

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
@github-actions
Contributor

github-actions bot commented Mar 15, 2026

PR Reviewer Guide 🔍

(Review updated until commit 13b5af9)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Fix retention lease expiration for remote store backed indices

Relevant files:

  • server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java

Sub-PR theme: Add test for retention lease after primary relocation in remote store

Relevant files:

  • server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteIndexPrimaryRelocationIT.java

⚡ Recommended focus areas for review

Incomplete Fix

For remote store enabled indices, only the primary shard's retention lease ID is kept. This means all replica peer-recovery retention leases are excluded from leaseIdsForCurrentPeers, which causes them to be expired/removed. While this is the intended fix for the described bug, it could be problematic if a replica is in the middle of peer recovery and needs its retention lease to remain valid. Consider whether replicas that are currently recovering should also be included.

final Set<String> leaseIdsForCurrentPeers;
if (indexSettings.isRemoteStoreEnabled()) {
    leaseIdsForCurrentPeers = Collections.singleton(getPeerRecoveryRetentionLeaseId(routingTable.primaryShard().currentNodeId()));
} else {
    leaseIdsForCurrentPeers = routingTable.assignedShards()
        .stream()
        .map(ReplicationTracker::getPeerRecoveryRetentionLeaseId)
        .collect(Collectors.toSet());
}
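The effect of this branch can be exercised in isolation. The sketch below (plain java.util; the helper names are hypothetical stand-ins, not the real ReplicationTracker API) shows how restricting leaseIdsForCurrentPeers to the primary's lease makes replica peer-recovery leases expirable when remote store is enabled:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class LeaseExpirySketch {
    static String leaseIdFor(String nodeId) { return "peer_recovery/" + nodeId; }

    // Hypothetical stand-in for the expiry pass: any peer-recovery lease whose
    // id is not in leaseIdsForCurrentPeers is treated as expirable.
    static Set<String> expirableLeases(Set<String> currentLeaseIds, boolean remoteStoreEnabled,
                                       String primaryNodeId, Set<String> assignedNodeIds) {
        final Set<String> leaseIdsForCurrentPeers;
        if (remoteStoreEnabled) {
            // remote store: only the primary's lease is kept alive
            leaseIdsForCurrentPeers = Collections.singleton(leaseIdFor(primaryNodeId));
        } else {
            leaseIdsForCurrentPeers = assignedNodeIds.stream()
                .map(LeaseExpirySketch::leaseIdFor)
                .collect(Collectors.toSet());
        }
        Set<String> expirable = new HashSet<>(currentLeaseIds);
        expirable.removeAll(leaseIdsForCurrentPeers);
        return expirable;
    }

    public static void main(String[] args) {
        Set<String> leases = Set.of("peer_recovery/nodeA", "peer_recovery/nodeB");
        // Remote store enabled, primary on nodeB: nodeA's stale lease becomes expirable.
        System.out.println(expirableLeases(leases, true, "nodeB", Set.of("nodeA", "nodeB")));
        // Document replication: leases for all assigned shards are kept.
        System.out.println(expirableLeases(leases, false, "nodeB", Set.of("nodeA", "nodeB")));
    }
}
```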
Missing assertBusy Timeout

The assertBusy call does not specify a custom timeout. The default timeout may be insufficient for the retention lease sync interval of 5 seconds plus propagation time. Consider specifying an explicit timeout (e.g., 30 seconds) to avoid flaky test failures.

assertBusy(() -> {
    final IndicesStatsResponse statsResponse = client().admin().indices().prepareStats("test").get();
    for (ShardStats shardStats : statsResponse.getShards()) {
        if (shardStats.getShardRouting().primary() == false) {
            continue;
        }
        final RetentionLeases retentionLeases = shardStats.getRetentionLeaseStats().retentionLeases();
        final List<RetentionLease> peerRecoveryLeases = retentionLeases.leases()
            .stream()
            .filter(l -> ReplicationTracker.PEER_RECOVERY_RETENTION_LEASE_SOURCE.equals(l.source()))
            .collect(Collectors.toList());
        assertEquals(
            "expected exactly one peer recovery retention lease but got " + peerRecoveryLeases,
            1,
            peerRecoveryLeases.size()
        );

        // the single lease should be for the current primary node
        assertEquals(
            ReplicationTracker.getPeerRecoveryRetentionLeaseId(shardStats.getShardRouting().currentNodeId()),
            peerRecoveryLeases.get(0).id()
        );
    }
});
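For context on the timeout recommendation, here is a minimal, self-contained stand-in for the test framework's assertBusy(Runnable, long, TimeUnit) pattern (this is an illustrative sketch, not the OpenSearchTestCase implementation): the assertion is retried until it passes or an explicit deadline elapses.

```java
import java.util.concurrent.TimeUnit;

public class AssertBusySketch {
    // Retry the assertion until it passes or the timeout elapses,
    // rethrowing the last AssertionError on timeout.
    static void assertBusy(Runnable assertion, long timeout, TimeUnit unit) throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        AssertionError last;
        while (true) {
            try {
                assertion.run();
                return;
            } catch (AssertionError e) {
                last = e;
            }
            if (System.nanoTime() >= deadline) throw last;
            Thread.sleep(100); // back off between attempts
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // A condition that becomes true after ~300 ms passes easily
        // within an explicit 30-second budget.
        assertBusy(() -> {
            if (System.currentTimeMillis() - start < 300) throw new AssertionError("not yet");
        }, 30, TimeUnit.SECONDS);
        System.out.println("passed");
    }
}
```

An explicit budget like 30 seconds comfortably covers the 5-second retention lease sync interval plus propagation time.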
Unused Variable

The variable dataNodeB is assigned but never used in the test. The target node is determined dynamically from the cluster state instead. This variable can be removed to avoid confusion.

final String dataNodeB = internalCluster().startDataOnlyNode();

@github-actions
Contributor

github-actions bot commented Mar 15, 2026

PR Code Suggestions ✨

Latest suggestions up to 13b5af9
Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Guard against null primary shard reference

routingTable.primaryShard() can return null if no primary shard is assigned (e.g.,
during recovery or relocation). Calling currentNodeId() on a null shard will throw a
NullPointerException. Add a null check before accessing the primary shard.

server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java [290]

-leaseIdsForCurrentPeers = Collections.singleton(getPeerRecoveryRetentionLeaseId(routingTable.primaryShard().currentNodeId()));
+ShardRouting primaryShard = routingTable.primaryShard();
+leaseIdsForCurrentPeers = (primaryShard != null && primaryShard.assignedToNode())
+    ? Collections.singleton(getPeerRecoveryRetentionLeaseId(primaryShard.currentNodeId()))
+    : Collections.emptySet();
Suggestion importance[1-10]: 6


Why: The suggestion raises a valid concern about potential NullPointerException if routingTable.primaryShard() returns null. However, this method is called in primaryMode (asserted on line 284), which typically guarantees a primary shard exists, making this a lower-probability edge case rather than a critical bug.

Low
General
Ensure relocation targets the correct node

The filter selects any data node that is not the source node, but dataNodeB was just
started and is the intended target. If there are more than two data nodes, the
target could be an unintended node. Use dataNodeB's name directly to resolve the
DiscoveryNode to ensure the move command targets the correct node.

server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteIndexPrimaryRelocationIT.java [97-99]

-.filter(n -> n.getId().equals(sourceNodeId) == false)
+.filter(n -> n.getName().equals(dataNodeB))
 .findFirst()
 .orElseThrow();
Suggestion importance[1-10]: 5


Why: The suggestion is valid - using dataNodeB directly would be more precise and deterministic. However, since only two data nodes exist at this point in the test (one started before and dataNodeB), the filter n.getId().equals(sourceNodeId) == false effectively selects dataNodeB anyway, making this a minor improvement rather than a critical fix. Note that the improved code uses getName() while dataNodeB is a node name, which should work correctly.

Low

Previous suggestions

Suggestions up to commit 51f4569
Category | Suggestion | Impact
General
Ensure correct target node is selected

The filter selects any data node that is not the source node, but dataNodeB was just
started and is the intended target. If there are more than two data nodes, the test
may pick an unintended node as the relocation target, making the test
non-deterministic. Filter explicitly for dataNodeB's name or ID to ensure the
correct target node is selected.

server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteIndexPrimaryRelocationIT.java [97-99]

-.filter(n -> n.getId().equals(sourceNodeId) == false)
+.filter(n -> n.getName().equals(dataNodeB))
 .findFirst()
 .orElseThrow();
Suggestion importance[1-10]: 6


Why: The suggestion is valid - filtering by dataNodeB name instead of "not the source node" makes the test deterministic and ensures the intended target is selected. However, DiscoveryNode uses getId() not getName(), so the improved code using getName() may not work correctly with the DiscoveryNode API.

Low
Possible issue
Guard against null primary shard reference

routingTable.primaryShard() can return null if there is no primary shard assigned
(e.g., during certain cluster states). Calling currentNodeId() on a null shard would
cause a NullPointerException. Add a null check or use a safe accessor before calling
currentNodeId().

server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java [290]

-leaseIdsForCurrentPeers = Collections.singleton(getPeerRecoveryRetentionLeaseId(routingTable.primaryShard().currentNodeId()));
+ShardRouting primaryShard = routingTable.primaryShard();
+if (primaryShard == null || primaryShard.currentNodeId() == null) {
+    leaseIdsForCurrentPeers = Collections.emptySet();
+} else {
+    leaseIdsForCurrentPeers = Collections.singleton(getPeerRecoveryRetentionLeaseId(primaryShard.currentNodeId()));
+}
Suggestion importance[1-10]: 5


Why: While the null check is a valid defensive programming concern, in practice getRetentionLeases is only called when primaryMode is asserted (line 284), meaning a primary shard should always be assigned. The suggestion is valid but addresses an edge case that may not occur in practice given the existing assertion.

Low

@github-actions
Contributor

❌ Gradle check result for 51f4569: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

…l the time

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
@github-actions
Contributor

Persistent review updated to latest commit 13b5af9

@github-actions
Contributor

✅ Gradle check result for 13b5af9: SUCCESS

@shwetathareja shwetathareja merged commit 9bd4634 into opensearch-project:main Mar 16, 2026
56 of 57 checks passed
@codecov

codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.41%. Comparing base (4d6ccf1) to head (13b5af9).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...org/opensearch/index/seqno/ReplicationTracker.java 66.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20874      +/-   ##
============================================
+ Coverage     73.30%   73.41%   +0.10%     
- Complexity    72280    72348      +68     
============================================
  Files          5796     5797       +1     
  Lines        330263   330304      +41     
  Branches      47663    47674      +11     
============================================
+ Hits         242102   242489     +387     
+ Misses        68754    68420     -334     
+ Partials      19407    19395      -12     

☔ View full report in Codecov by Sentry.