Skip to content

Conversation

nicktindall
Copy link
Contributor

@nicktindall nicktindall commented Feb 1, 2025

When working on ES-10641, I noticed one category of slow shut down is due to the only possible relocation failing repeatedly, which is in turn caused by the the shard lock being held on the target node.
The message on the shard lock is closing shard which indicates the lock was once held by a Store which never released the shard lock.

We recently made a change to dump hot threads when shard creation was prevented by a lingering shard lock, and in instances of the above error occurring you can now see there is no active thread doing anything related to closing the shard, so I suspect there may be a store reference leaking somewhere, preventing the final release of the shard lock.

This PR adds leak tracking to the Store's refCounter in the hope that we might observe the leak in CI.

@nicktindall nicktindall added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Feb 3, 2025
@nicktindall nicktindall marked this pull request as ready for review February 3, 2025 23:45
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Feb 3, 2025
ShardLock shardLock,
OnClose onClose,
boolean hasIndexSort,
boolean detectLeaks
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the need to be able to turn leak detection off, but there were a few places where we don't get access to the store to manage its lifecycle correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try avoid this flag. Browing through the changes, it is used for

  1. ShardSnapshotTaskRunnerTests
  2. SourceOnlySnapshotRepository

You suggested a solution for the first usage. Is it possible to manage the lifecycle properly for the 2nd use case? It seems to me that we do have full access to the tmpStore? I haven't dug into it deepy so likely miss something.

Copy link
Contributor Author

@nicktindall nicktindall Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agreed, I think it is a smell. I can look a bit harder for alternatives.

I think with tmpStore we can probably use ActionListener.runBefore(context, tempStore::close) or similar. I'll do some digging to make sure that works.

Or add it to toClose which already seems to track that lifecycle.

final Store store = new Store(shardId, indexSettings, directory, new DummyShardLock(shardId));
store.incRef();
releasables.add(store::decRef);
releasables.add(store::close);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why the lifecycle was being managed like above, and whether I've fundamentally changed something here. It seems like we should have a single reference from creation, so we don't need to incRef?

@nicktindall nicktindall requested a review from ywangd February 3, 2025 23:52
@nicktindall nicktindall changed the title WIP Add leak detection to Store Add leak detection to Store Feb 3, 2025
Comment on lines 63 to 65
// Don't trigger leak detection
dummyStores.forEach(Store::close);
dummyStores.clear();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this probably does not work since the dummyContext helper method is also used by a different test class BlobStoreRepositoryTests. One possible solution is to let the caller manages it. The helper method returns a SnapshotShardContext which has store() method to access the store and close it?

@nicktindall
Copy link
Contributor Author

I'm actually wondering if this is a good idea or provides value.

It may potentially add lots of noise to every integrated test failure, it may be that when we fail with an assertion error we cause a lot of spurious "leaks" because of action listeners not being triggered or shutdowns not completing .

An example is https://buildkite.com/elastic/elasticsearch-pull-request/builds/55921#0194eeb5-c568-464b-951b-1ed8fed475b0

I don't think the failure had anything to do with store leaks, but it proceeded to spew out pages of detected "leaks".

Another thing is Stores are probably long-lived with many interactions over their lifetime, so the chances of getting a full picture of a leak within the 25 retained interactions is probably slim.

@ywangd
Copy link
Member

ywangd commented Feb 13, 2025

Ah OK that's unfortunate. I don't have any good suggestion. Ideally it would be nice to trigger the leak report when we log the hot-threads. But I guess that is not feasible with the current LeakTracker and I am not sure how much work is needed to make it possible.

@nicktindall
Copy link
Contributor Author

I don't think this will be useful, we need a different approach for tracking leaks in long-lived objects like a store I believe.

@nicktindall nicktindall deleted the ES-10641_add_leak_detection_to_store branch April 14, 2025 05:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team >test Issues or PRs that are addressing/adding tests v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants