Checkpoint hanging when object store is enabled #1647

Open
KiKoS0 wants to merge 2 commits into microsoft:main from KiKoS0:riadh/fix-deadlock


@KiKoS0 KiKoS0 commented Mar 27, 2026

We noticed a primary Garnet instance hanging on a checkpoint indefinitely. I took a closer look at a heap dump, found the hanging stack below, and traced it to a semaphore deadlock that can occur when the object store is enabled and in-flight transactions must be awaited before a checkpoint can proceed.

It was easy to reproduce locally by hammering transactional commands and BGSAVE aggressively (I can share the repro if needed).

This also broke the replication link, but I'm not yet sure whether that is a separate issue or a cascading effect of this one.

STACK 11
00007e9d56365678 00007f9539410b90 ( ) System.Threading.SemaphoreSlim+TaskNode
  00007e9d563656d0 00007f953a9be440 (1) Tsavorite.core.StateMachineDriver+<ProcessWaitingListAsync>d__34
    00007e9d56365758 00007f953a9be7f8 (0) Tsavorite.core.StateMachineDriver+<RunStateMachine>d__35
      00007e9d563657d0 00007f953a9bebb0 (0) Tsavorite.core.StateMachineDriver+<RunAsync>d__28
        00007e9d56365850 00007f953a9bef58 (0) Garnet.server.DatabaseManagerBase+<InitiateCheckpointAsync>d__70
          00007e9d563658f0 00007f953a9bf358 (0) Garnet.server.DatabaseManagerBase+<TakeCheckpointAsync>d__55
            00007e9d56365990 00007f953a9bfa78 (0) Garnet.server.SingleDatabaseManager+<TaskCheckpointBasedOnAofSizeLimitAsync>d__16
              00007e9ad6c00420 00007f953a1fc740 (1) Garnet.server.StoreWrapper+<AutoCheckpointBasedOnAofSizeLimit>d__77

TrackLastVersion is called once per store during IN_PROGRESS. Each call creates a new semaphore and overwrites lastVersionTransactionsDone, orphaning the previous one in the waitingList. DecrementActiveTransactions only releases the last one. ProcessWaitingListAsync blocks forever on the orphaned semaphore.

Since both stores share the same transaction counter, we only need one semaphore per version. If TrackLastVersion has already been called for a given version, subsequent calls return immediately.
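
To make the failure mode concrete, here is a minimal sketch of the behavior described above. It is a simplified, hypothetical model and not the actual StateMachineDriver code: the class `DriverModel`, the `guardEnabled` flag, the `lastVersion` field, and the timeout parameter are assumptions made for illustration. Only the names TrackLastVersion, DecrementActiveTransactions, ProcessWaitingListAsync, lastVersionTransactionsDone, and waitingList come from the description.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical, simplified model of the driver; not the real Tsavorite implementation.
public sealed class DriverModel
{
    private readonly bool guardEnabled;                       // assumption: toggles the fix on/off
    private readonly List<SemaphoreSlim> waitingList = new();
    private SemaphoreSlim? lastVersionTransactionsDone;
    private long lastVersion = -1;                            // assumption: remembers the tracked version

    public DriverModel(bool guardEnabled) => this.guardEnabled = guardEnabled;

    public int WaiterCount => waitingList.Count;

    // Called once per store while the checkpoint is IN_PROGRESS.
    public void TrackLastVersion(long version)
    {
        // The fix: one semaphore per version, so a repeat call returns immediately.
        if (guardEnabled && lastVersion == version)
            return;

        lastVersion = version;
        // Without the guard, a second call overwrites this field and the
        // previous semaphore stays in waitingList with nobody left to release it.
        lastVersionTransactionsDone = new SemaphoreSlim(0);
        waitingList.Add(lastVersionTransactionsDone);
    }

    // Models the last in-flight transaction of the old version completing.
    public void DecrementActiveTransactions() => lastVersionTransactionsDone?.Release();

    // Returns true if every queued semaphore was released within the timeout.
    // The real method has no timeout, which is why the orphan hangs forever.
    public async Task<bool> ProcessWaitingListAsync(TimeSpan timeout)
    {
        foreach (var waiter in waitingList)
        {
            if (!await waiter.WaitAsync(timeout))
                return false; // orphaned semaphore: never released
        }
        waitingList.Clear();
        return true;
    }
}
```

In this model, calling `TrackLastVersion(1)` twice with `guardEnabled: false`, then `DecrementActiveTransactions()`, makes `ProcessWaitingListAsync` time out on the first (orphaned) semaphore. With `guardEnabled: true` the same sequence enqueues a single semaphore and the wait completes.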

Includes a regression test that fails without the fix.

Copilot AI review requested due to automatic review settings March 27, 2026 02:31

Copilot AI left a comment


Pull request overview

Fixes a checkpoint deadlock in Tsavorite’s state machine when two-store (main + object store) checkpoints call TrackLastVersion for the same version, which could orphan a semaphore in waitingList and hang ProcessWaitingListAsync (seen in Garnet with object store + in-flight transactions).

Changes:

  • Prevent TrackLastVersion from creating/enqueuing more than one semaphore per version.
  • Add a regression test that exercises calling TrackLastVersion twice for the same transaction version and verifies no orphaned waiters remain.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| libs/storage/Tsavorite/cs/src/core/Index/Checkpointing/StateMachineDriver.cs | Adds a guard in TrackLastVersion to avoid enqueueing duplicate semaphores for the same version (prevents deadlock). |
| libs/storage/Tsavorite/cs/test/StateMachineDriverTests.cs | Adds a regression test to validate the fix and prevent recurrence. |


[Test]
public async Task TrackLastVersionCalledTwiceDoesNotDeadlock()
{
var epoch = new LightEpoch();

Copilot AI Mar 27, 2026


LightEpoch implements IDisposable and tracks active instances globally; this test never disposes it. Over many tests this can leak instance IDs and eventually fail with "Exceeded maximum number of active LightEpoch instances". Use using var epoch = new LightEpoch(); (or a try/finally) to ensure disposal.

Suggested change
- var epoch = new LightEpoch();
+ using var epoch = new LightEpoch();
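
As a rough illustration of why the `using` declaration matters for a type that tracks a bounded number of live instances, the sketch below uses a hypothetical stand-in class. `SlotResource`, `MaxInstances`, and `DisposalDemo` are invented for this example and say nothing about LightEpoch's actual limit or internals. The model is single-threaded for brevity.

```csharp
using System;

// Hypothetical stand-in for a disposable type with a global cap on live instances.
public sealed class SlotResource : IDisposable
{
    public const int MaxInstances = 4;   // assumption: arbitrary small cap for the demo
    private static int active;           // live instance count (not thread-safe in this sketch)
    private bool disposed;

    public static int Active => active;

    public SlotResource()
    {
        if (active >= MaxInstances)
            throw new InvalidOperationException("Exceeded maximum number of active instances");
        active++;
    }

    public void Dispose()
    {
        if (disposed) return;            // make Dispose idempotent
        disposed = true;
        active--;                        // return the slot
    }
}

public static class DisposalDemo
{
    // Each iteration disposes its instance when the loop body's scope ends,
    // so the live count never grows.
    public static int RunWithUsing(int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            using var resource = new SlotResource();
        }
        return SlotResource.Active;
    }

    // Without disposal every instance keeps its slot; the constructor
    // throws once MaxInstances are live.
    public static bool LeakUntilFailure(int iterations)
    {
        try
        {
            for (int i = 0; i < iterations; i++)
                _ = new SlotResource();
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }
}
```

Under these assumptions, `RunWithUsing` can loop any number of times and leave no live instances, while `LeakUntilFailure` hits the cap as soon as more than `MaxInstances` objects have been created without disposal.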

@KiKoS0 KiKoS0 force-pushed the riadh/fix-deadlock branch from 22e8a9f to 6f2dabf Compare March 27, 2026 02:59
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@KiKoS0 KiKoS0 force-pushed the riadh/fix-deadlock branch from 6f2dabf to 96eb445 Compare March 27, 2026 15:17