Checkpoint hanging when object store is enabled#1647
Checkpoint hanging when object store is enabled#1647KiKoS0 wants to merge 2 commits intomicrosoft:mainfrom
Conversation
TrackLastVersion is called once per store during IN_PROGRESS. Each call creates a new semaphore and overwrites lastVersionTransactionsDone, orphaning the previous one in the waitingList. DecrementActiveTransactions only releases the last one. ProcessWaitingListAsync blocks forever on the orphaned semaphore. Since both stores share the same transaction counter, we only need one semaphore per version. If TrackLastVersion has already been called for a given version, subsequent calls return immediately. Includes a regression test that fails without the fix.
9be6a9a to
879bc71
Compare
There was a problem hiding this comment.
Pull request overview
Fixes a checkpoint deadlock in Tsavorite’s state machine when two-store (main + object store) checkpoints call TrackLastVersion for the same version, which could orphan a semaphore in waitingList and hang ProcessWaitingListAsync (seen in Garnet with object store + in-flight transactions).
Changes:
- Prevent
TrackLastVersionfrom creating/enqueuing more than one semaphore per version. - Add a regression test that exercises calling
TrackLastVersiontwice for the same transaction version and verifies no orphaned waiters remain.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs | Adds a guard in TrackLastVersion to avoid enqueueing duplicate semaphores for the same version (prevents deadlock). |
| libs/storage/Tsavorite/cs/test/StateMachineDriverTests.cs | Adds a regression test to validate the fix and prevent recurrence. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| [Test] | ||
| public async Task TrackLastVersionCalledTwiceDoesNotDeadlock() | ||
| { | ||
| var epoch = new LightEpoch(); |
There was a problem hiding this comment.
LightEpoch implements IDisposable and tracks active instances globally; this test never disposes it. Over many tests this can leak instance IDs and eventually fail with "Exceeded maximum number of active LightEpoch instances". Use using var epoch = new LightEpoch(); (or a try/finally) to ensure disposal.
| var epoch = new LightEpoch(); | |
| using var epoch = new LightEpoch(); |
22e8a9f to
6f2dabf
Compare
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
6f2dabf to
96eb445
Compare
We noticed a primary Garnet hanging on a checkpoint forever so I took a closer look at a heap dump, noticed this hanging stack and traced it to a semaphore deadlock that can happen when the Object store is enabled and there are in-flight transactions that need to be awaited before a checkpoint can proceed.
It was easily reproducible locally, I just hammered transactional commands and
BGSAVEaggressively (i can share it if needed)This has also broken the replication link but i'm not yet sure if that's a different issue or a cascading effect of this issue yet.
TrackLastVersionis called once per store duringIN_PROGRESS. Each call creates a new semaphore and overwriteslastVersionTransactionsDone, orphaning the previous one in thewaitingList.DecrementActiveTransactionsonly releases the last one.ProcessWaitingListAsyncblocks forever on the orphaned semaphore.Since both stores share the same transaction counter, we only need one semaphore per version. If
TrackLastVersionhas already been called for a given version, subsequent calls return immediately.Includes a regression test that fails without the fix.