Add leak detection to Store #121482
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
ShardLock shardLock,
OnClose onClose,
boolean hasIndexSort,
boolean detectLeaks
I don't like the need to be able to turn leak detection off, but there were a few places where we don't get access to the store to manage its lifecycle correctly.
I think we should try to avoid this flag. Browsing through the changes, it is used for
- ShardSnapshotTaskRunnerTests
- SourceOnlySnapshotRepository
You suggested a solution for the first usage. Is it possible to manage the lifecycle properly for the second use case? It seems to me that we do have full access to the tmpStore. I haven't dug into it deeply, so I have likely missed something.
Yeah, agreed, I think it is a smell. I can look a bit harder for alternatives.
I think with tmpStore we can probably use ActionListener.runBefore(context, tempStore::close) or similar. I'll do some digging to make sure that works.
Or add it to toClose, which already seems to track that lifecycle.
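To illustrate the runBefore idea mentioned above, here is a minimal, self-contained sketch. The Listener interface and this runBefore helper are hypothetical stand-ins written for this example, not the Elasticsearch ActionListener API, and the "store closed" event merely simulates closing the temporary store. The point is only that the cleanup runs on both the success and the failure path before the wrapped listener is notified.

```java
import java.util.ArrayList;
import java.util.List;

public class RunBeforeSketch {
    /** Hypothetical stand-in for a listener type; NOT the Elasticsearch interface. */
    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    /** Wraps a delegate so that cleanup runs before either callback is forwarded. */
    static <T> Listener<T> runBefore(Listener<T> delegate, Runnable cleanup) {
        return new Listener<T>() {
            @Override
            public void onResponse(T result) {
                cleanup.run();
                delegate.onResponse(result);
            }

            @Override
            public void onFailure(Exception e) {
                cleanup.run();
                delegate.onFailure(e);
            }
        };
    }

    /** Exercises one success path and one failure path; returns the event order. */
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        Listener<String> context = new Listener<String>() {
            @Override
            public void onResponse(String r) { events.add("response:" + r); }

            @Override
            public void onFailure(Exception e) { events.add("failure:" + e.getMessage()); }
        };
        // "store closed" simulates closing the temporary store (assumption).
        runBefore(context, () -> events.add("store closed")).onResponse("done");
        runBefore(context, () -> events.add("store closed")).onFailure(new RuntimeException("boom"));
        return events;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Whether the real context object can be wrapped this way in SourceOnlySnapshotRepository would still need to be checked against the actual code.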
server/src/test/java/org/elasticsearch/repositories/blobstore/ShardSnapshotTaskRunnerTests.java
final Store store = new Store(shardId, indexSettings, directory, new DummyShardLock(shardId));
store.incRef();
releasables.add(store::decRef);
releasables.add(store::close);
I'm not sure why the lifecycle was being managed as above, or whether I've fundamentally changed something here. It seems like we should have a single reference from creation, so we don't need to incRef?
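The question above can be made concrete with a small model. The sketch below assumes a resource that starts with one reference at creation and whose close() releases exactly that initial reference; CountedResource is a hypothetical, simplified stand-in and not the real Store. Under that assumption, close() alone fully releases the resource, while an extra incRef() without a matching decRef() keeps it alive.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountSketch {
    /** Hypothetical ref-counted resource; a simplified stand-in, NOT the real Store. */
    static final class CountedResource {
        // Assumption: creation itself holds one reference.
        private final AtomicInteger refCount = new AtomicInteger(1);
        private final AtomicBoolean closed = new AtomicBoolean();
        private final Runnable onFullyReleased;

        CountedResource(Runnable onFullyReleased) {
            this.onFullyReleased = onFullyReleased;
        }

        /** Takes an additional reference; fails once the count has reached zero. */
        void incRef() {
            int current;
            do {
                current = refCount.get();
                if (current <= 0) {
                    throw new IllegalStateException("already released");
                }
            } while (!refCount.compareAndSet(current, current + 1));
        }

        /** Drops one reference and runs the release hook when the count hits zero. */
        void decRef() {
            int remaining = refCount.decrementAndGet();
            if (remaining == 0) {
                onFullyReleased.run();
            } else if (remaining < 0) {
                throw new IllegalStateException("released too many times");
            }
        }

        /** Drops the initial reference exactly once, however often it is called. */
        void close() {
            if (closed.compareAndSet(false, true)) {
                decRef();
            }
        }
    }

    /** close() alone releases the resource when nobody took an extra reference. */
    static boolean releasedAfterCloseOnly() {
        AtomicBoolean released = new AtomicBoolean();
        CountedResource resource = new CountedResource(() -> released.set(true));
        resource.close();
        return released.get();
    }

    /** An unmatched incRef() keeps the resource alive even after close(). */
    static boolean releasedAfterUnmatchedIncRef() {
        AtomicBoolean released = new AtomicBoolean();
        CountedResource resource = new CountedResource(() -> released.set(true));
        resource.incRef();
        resource.close();
        return released.get();
    }
}
```

In this model the original incRef/decRef pair is balanced and therefore harmless but redundant; what matters is that close() is eventually called.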
// Don't trigger leak detection
dummyStores.forEach(Store::close);
dummyStores.clear();
I think this probably does not work, since the dummyContext helper method is also used by a different test class, BlobStoreRepositoryTests. One possible solution is to let the caller manage it. The helper method returns a SnapshotShardContext, which has a store() method to access the store and close it?
I'm actually wondering whether this is a good idea or provides value. It could add a lot of noise to every integration test failure: when we fail with an assertion error, we may cause many spurious "leaks" because action listeners are not triggered or shutdowns do not complete. An example is https://buildkite.com/elastic/elasticsearch-pull-request/builds/55921#0194eeb5-c568-464b-951b-1ed8fed475b0. I don't think that failure had anything to do with store leaks, but it proceeded to spew out pages of detected "leaks". Another thing is that Stores are probably long-lived with many interactions over their lifetime, so the chance of getting a full picture of a leak within the 25 retained interactions is probably slim.
Ah OK, that's unfortunate. I don't have any good suggestions. Ideally it would be nice to trigger the leak report when we log the hot threads. But I guess that is not feasible with the current LeakTracker, and I am not sure how much work is needed to make it possible.
I don't think this will be useful. I believe we need a different approach for tracking leaks in long-lived objects like a store.
When working on ES-10641, I noticed that one category of slow shutdown is due to the only possible relocation failing repeatedly, which is in turn caused by the shard lock being held on the target node.
The message on the shard lock is "closing shard", which indicates the lock was once held by a Store that never released the shard lock.
We recently made a change to dump hot threads when shard creation was prevented by a lingering shard lock. In instances of the above error you can now see that there is no active thread doing anything related to closing the shard, so I suspect there may be a store reference leaking somewhere, preventing the final release of the shard lock.
This PR adds leak tracking to the Store's refCounter in the hope that we might observe the leak in CI.
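As a rough illustration of what leak tracking on a reference counter involves, the sketch below records each incRef/decRef in a bounded history and reports the outstanding count plus that history if the counter is not back at zero. TrackedRefCount is hypothetical and deliberately simplified: it uses an explicit check rather than the real LeakTracker, and the limit of 25 retained records is taken from the review discussion, not from the implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class LeakTrackingSketch {
    // Assumption: mirrors the "25 retained interactions" mentioned in the review.
    static final int MAX_RECORDS = 25;

    /** Hypothetical tracked counter; NOT the real LeakTracker or Store refCounter. */
    static final class TrackedRefCount {
        private final String name;
        private final Deque<String> records = new ArrayDeque<>();
        private int refs = 1; // creation holds one reference

        TrackedRefCount(String name) {
            this.name = name;
        }

        synchronized void incRef(String who) {
            refs++;
            record("incRef by " + who);
        }

        synchronized void decRef(String who) {
            refs--;
            record("decRef by " + who);
        }

        /** Keeps only the most recent MAX_RECORDS interactions. */
        private void record(String event) {
            if (records.size() == MAX_RECORDS) {
                records.removeFirst();
            }
            records.addLast(event);
        }

        /** Returns an empty list when balanced, else a summary plus retained history. */
        synchronized List<String> leakReport() {
            if (refs == 0) {
                return List.of();
            }
            List<String> report = new ArrayList<>();
            report.add(name + " leaked with " + refs + " outstanding reference(s)");
            report.addAll(records);
            return report;
        }
    }
}
```

Because only the most recent records are retained, a long-lived object with many balanced interactions can push the unmatched incRef out of the history, which is the limitation raised in the discussion above.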