Add leak detection to Store #121482

nicktindall · 2025-02-01T08:18:30Z

When working on ES-10641, I noticed that one category of slow shutdown is caused by the only possible relocation failing repeatedly, which is in turn caused by the shard lock being held on the target node.
The message on the shard lock is "closing shard", which indicates the lock was once held by a Store that never released it.

We recently made a change to dump hot threads when shard creation is prevented by a lingering shard lock. In instances of the above error, you can now see that there is no active thread doing anything related to closing the shard, so I suspect a store reference is leaking somewhere, preventing the final release of the shard lock.

This PR adds leak tracking to the Store's refCounter in the hope that we might observe the leak in CI.
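To illustrate the idea, here is a minimal, self-contained sketch of a reference counter that records a stack trace for every interaction, so that an unreleased reference can be traced back to whoever acquired it. `TrackedRefCounter` and its methods are hypothetical names for this example, not the Store or LeakTracker code in this PR, and real leak trackers usually report when the tracked object is garbage collected rather than on an explicit check.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a ref-counted resource; this is not Elasticsearch's Store. */
final class TrackedRefCounter {
    private final Runnable onFullyReleased;
    private final List<Throwable> history = new ArrayList<>();
    private int refCount = 1; // the creator holds the first reference

    TrackedRefCounter(Runnable onFullyReleased) {
        this.onFullyReleased = onFullyReleased;
        record("created");
    }

    synchronized void incRef() {
        if (refCount <= 0) {
            throw new IllegalStateException("already fully released");
        }
        refCount++;
        record("incRef");
    }

    /** Returns true if this call released the last reference. */
    synchronized boolean decRef() {
        if (refCount <= 0) {
            throw new IllegalStateException("already fully released");
        }
        record("decRef");
        if (--refCount == 0) {
            onFullyReleased.run(); // in the Store's case, this is where the shard lock would be released
            return true;
        }
        return false;
    }

    /** True while at least one reference has not been released. */
    synchronized boolean hasOutstandingRefs() {
        return refCount > 0;
    }

    /** Stack traces of every interaction so far, oldest first. */
    synchronized List<Throwable> history() {
        return new ArrayList<>(history);
    }

    private void record(String label) {
        // The Throwable is never thrown; it only captures the caller's stack.
        history.add(new Throwable(label));
    }
}
```

If `hasOutstandingRefs()` is still true when the owner expects the resource to be gone, printing `history()` shows which `incRef` has no matching `decRef`.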

elasticsearchmachine · 2025-02-03T23:46:11Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

nicktindall · 2025-02-03T23:47:12Z

server/src/main/java/org/elasticsearch/index/store/Store.java

+        ShardLock shardLock,
+        OnClose onClose,
+        boolean hasIndexSort,
+        boolean detectLeaks


I don't like the need to be able to turn leak detection off, but there were a few places where we don't get access to the store to manage its lifecycle correctly.

I think we should try to avoid this flag. Browsing through the changes, it is used for

ShardSnapshotTaskRunnerTests

SourceOnlySnapshotRepository

You suggested a solution for the first usage. Is it possible to manage the lifecycle properly for the second use case? It seems to me that we do have full access to the tmpStore? I haven't dug into it deeply, so I am likely missing something.

Yeah agreed, I think it is a smell. I can look a bit harder for alternatives.

I think with tmpStore we can probably use ActionListener.runBefore(context, tempStore::close) or similar. I'll do some digging to make sure that works.

Or add it to toClose which already seems to track that lifecycle.
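As a rough sketch of the "close before completing the listener" option, the snippet below wraps a listener so that a cleanup action always runs before the delegate is notified, on both the success and the failure path. The `Listener` interface and its `runBefore` helper are simplified, hypothetical stand-ins written for this example; they only mirror the shape of the idea and are not the Elasticsearch `ActionListener` API.

```java
/** Hypothetical, simplified listener type used only for this sketch. */
interface Listener<T> {
    void onResponse(T result);

    void onFailure(Exception e);

    /** Wraps {@code delegate} so that {@code cleanup} runs before either callback is delivered. */
    static <T> Listener<T> runBefore(Listener<T> delegate, Runnable cleanup) {
        return new Listener<T>() {
            @Override
            public void onResponse(T result) {
                try {
                    cleanup.run();
                } catch (RuntimeException e) {
                    // Cleanup failed: report that instead of the successful result.
                    delegate.onFailure(e);
                    return;
                }
                delegate.onResponse(result);
            }

            @Override
            public void onFailure(Exception e) {
                try {
                    cleanup.run();
                } catch (RuntimeException cleanupFailure) {
                    // Keep the original failure and attach the cleanup problem to it.
                    e.addSuppressed(cleanupFailure);
                }
                delegate.onFailure(e);
            }
        };
    }
}
```

With something like this, the temporary store would be closed exactly once per completion, regardless of which callback fires. Registering the store with an existing collection of things to close achieves the same result if that collection already follows the right lifecycle.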

server/src/test/java/org/elasticsearch/repositories/blobstore/ShardSnapshotTaskRunnerTests.java

nicktindall · 2025-02-03T23:52:17Z

...java/org/elasticsearch/xpack/searchablesnapshots/store/SearchableSnapshotDirectoryTests.java

            final Store store = new Store(shardId, indexSettings, directory, new DummyShardLock(shardId));
-            store.incRef();
-            releasables.add(store::decRef);
+            releasables.add(store::close);


I'm not sure why the lifecycle was being managed like above, and whether I've fundamentally changed something here. It seems like we should have a single reference from creation, so we don't need to incRef?
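To make the question concrete, here is a small sketch of the two lifecycles, assuming (as the comment above does) that a store starts with a single reference held from creation and that closing it releases that reference. `SketchStore` is a hypothetical class for illustration only and does not reproduce the real Store.

```java
/** Hypothetical ref-counted store used only to compare the two lifecycles. */
final class SketchStore implements AutoCloseable {
    private int refCount = 1; // assumption: one reference is held from creation
    private boolean closed = false;

    synchronized void incRef() {
        if (refCount <= 0) {
            throw new IllegalStateException("already fully released");
        }
        refCount++;
    }

    synchronized void decRef() {
        if (refCount <= 0) {
            throw new IllegalStateException("already fully released");
        }
        refCount--;
    }

    /** Releases the creation-time reference; repeated calls are no-ops. */
    @Override
    public synchronized void close() {
        if (closed == false) {
            closed = true;
            decRef();
        }
    }

    synchronized int refCount() {
        return refCount;
    }
}
```

Under these assumptions, the old pattern (`incRef()` at setup, `decRef()` at teardown) leaves the count at 1, so the creation-time reference is never released and a leak tracker would flag it. Calling `close()` at teardown brings the count to 0.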

ywangd · 2025-02-10T05:18:29Z

server/src/test/java/org/elasticsearch/repositories/blobstore/ShardSnapshotTaskRunnerTests.java

+        // Don't trigger leak detection
+        dummyStores.forEach(Store::close);
+        dummyStores.clear();


I think this probably does not work, since the dummyContext helper method is also used by a different test class, BlobStoreRepositoryTests. One possible solution is to let the caller manage it. The helper method returns a SnapshotShardContext, which has a store() method to access the store and close it?

nicktindall · 2025-02-13T05:02:09Z

I'm actually wondering if this is a good idea or provides value.

It may potentially add lots of noise to every integration test failure: when we fail with an assertion error, we may cause a lot of spurious "leaks" because action listeners are not triggered or shutdowns do not complete.

An example is https://buildkite.com/elastic/elasticsearch-pull-request/builds/55921#0194eeb5-c568-464b-951b-1ed8fed475b0

I don't think the failure had anything to do with store leaks, but it proceeded to spew out pages of detected "leaks".

Another issue is that Stores are probably long-lived, with many interactions over their lifetime, so the chances of getting a full picture of a leak within the 25 retained interactions are probably slim.
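The retention concern can be sketched with a bounded history that keeps only the most recent N records. `BoundedHistory` is a hypothetical class, and the keep-the-last-N policy is an assumption made for this example; the actual tracker's retention policy may differ.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded record of interactions: only the newest entries survive. */
final class BoundedHistory {
    private final int capacity;
    private final Deque<String> events = new ArrayDeque<>();

    BoundedHistory(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    synchronized void record(String event) {
        if (events.size() == capacity) {
            events.removeFirst(); // drop the oldest record to make room
        }
        events.addLast(event);
    }

    /** Retained events, oldest first. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(events);
    }
}
```

With a capacity of 25, a leaked reference that was acquired early in the life of a long-lived object has most likely been evicted from the history by the time the leak is reported.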

ywangd · 2025-02-13T07:34:14Z

Ah OK, that's unfortunate. I don't have any good suggestions. Ideally, it would be nice to trigger the leak report when we log the hot threads. But I guess that is not feasible with the current LeakTracker, and I am not sure how much work is needed to make it possible.

nicktindall · 2025-02-21T02:59:49Z

I don't think this will be useful. I believe we need a different approach for tracking leaks in long-lived objects like a Store.

Add leak detection to Store

0010760

elasticsearchmachine added the v9.1.0 label Feb 1, 2025

nicktindall added 8 commits February 1, 2025 20:27

Disable leak detection for tempStore

0600c53

Change lifecycle for test store

e03a213

Prevent test leak

0f70cff

Disable leak detection for unit test

18b9e00

Clean up stores in test

98f8f0e

Merge remote-tracking branch 'origin/main' into ES-10641_add_leak_det…

11c1ad6

…ection_to_store

Don't keep wrapped ref counter in scope

1443d34

Merge remote-tracking branch 'origin/main' into ES-10641_add_leak_det…

faced46

…ection_to_store

nicktindall added the >test and :Distributed Coordination/Allocation labels Feb 3, 2025

nicktindall marked this pull request as ready for review February 3, 2025 23:45

Merge branch 'main' into ES-10641_add_leak_detection_to_store

9521d7e

elasticsearchmachine added the Team:Distributed Coordination label Feb 3, 2025

nicktindall commented Feb 3, 2025

server/src/test/java/org/elasticsearch/repositories/blobstore/ShardSnapshotTaskRunnerTests.java

nicktindall commented Feb 3, 2025

nicktindall requested a review from ywangd February 3, 2025 23:52

nicktindall changed the title ~~WIP Add leak detection to Store~~ Add leak detection to Store Feb 3, 2025

nicktindall added 4 commits February 4, 2025 17:15

Merge branch 'main' into ES-10641_add_leak_detection_to_store

94613cc

Merge branch 'main' into ES-10641_add_leak_detection_to_store

20cc251

Remove detectLeaks constructor arg

e1d9e4e

Merge branch 'main' into ES-10641_add_leak_detection_to_store

f39924e

ywangd reviewed Feb 10, 2025

nicktindall added 3 commits February 10, 2025 18:09

Make ShardSnapshotContext lifecycle logic reusable

d27ee2d

Merge branch 'main' into ES-10641_add_leak_detection_to_store

f8cb23a

Merge branch 'main' into ES-10641_add_leak_detection_to_store

9b7fe8c

nicktindall closed this Feb 21, 2025

nicktindall deleted the ES-10641_add_leak_detection_to_store branch April 14, 2025 05:52
