Skip to content

Conversation

@badrishc
Copy link
Collaborator

@badrishc badrishc commented Mar 8, 2025

Changes:

  • Share Epoch and StateMachineDriver (SMD) across main and object stores
  • Garnet checkpointing uses this shared driver to checkpoint both stores with a single state machine
  • Transactions use SMD to register transactions, so that a single transaction version is maintained across the two stores.
  • Transactions no longer barrier during the (v) -> (v+1) switchover. We ensure correctness by verifying the transaction version after all locks are acquired.
  • Move NumActiveLockingSessions (renamed to NumActiveTransactions) to SMD and ensure correct (v) -> (v+1) switchover with active transactions.
  • Removed targetVersion concept so that checkpoint versions always increment in steps of 1.
  • Refactored CheckpointVersionShift to a pair of CheckpointVersionShiftStart and CheckpointVersionShiftEnd to help demarcate start and end of fuzzy region in the AOF ((region that contains mix of v and v+1 log entries).
  • Fix AOF replay to correctly process single-key operations and transactions in fuzzy region. Specifically,
    • checkpoint_start marker: Track ReplicationCheckpointStartOffset in AOF
    • Process all (v) ops and txns in the fuzzy region of AOF, store away the (v+1) entries in a fuzzy buffer
    • checkpoint_end marker: Take a checkpoint, potentially truncating until ReplicationCheckpointStartOffset, replay all the stored (v+1) ops and txns from the fuzzy buffer.

Notes:

  • Recovery logic is left unmodified, it assumes checkpoints are separate across main and object stores. This allows the PR to remain backwards compatible.

@TalZaccai TalZaccai requested a review from TedHartMS March 11, 2025 18:25
Copy link
Contributor

@TedHartMS TedHartMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved with one comment

@badrishc badrishc requested review from Copilot and vazois March 13, 2025 00:17
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a single state machine shared between the main and object stores. Key changes include unifying checkpoint markers and processing (with fuzzy buffering for (v+1) entries), refactoring transaction version registration across stores, and updating state machine driver initialization and usage throughout the codebase.

Reviewed Changes

Copilot reviewed 63 out of 63 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
libs/server/AOF/AofProcessor.cs Refactored AOF processing methods to use unified checkpoint markers and include fuzzy region buffering.
libs/cluster/Server/Replication/ReplicationManager.cs Updated checkpoint version shift functions to use separate start/end callbacks and removed deprecated properties.
libs/server/Servers/GarnetServerOptions.cs Added Epoch and StateMachineDriver fields with corresponding initialization methods.
libs/server/StoreWrapper.cs Unified the TakeCheckpoint APIs by removing store type parameters and adjusted checkpoint task logic.
libs/cluster/Server/ClusterProvider.cs Modified checkpoint initiation logic for replicas to use the checkpoint start marker offset.
libs/server/Lua/LuaRunner.cs Updated transaction handling with acquisition, verification, and proper ending of transaction versions.
libs/server/Transaction/TransactionManager.cs Integrated stateMachineDriver for transaction version management and updated lock acquisition/release functions.
libs/server/AOF/AofEntryType.cs Renamed and deprecated old checkpoint markers to support the unified checkpointing mechanism.
libs/server/AOF/AofHeader.cs Increased the AOF header version from 1 to 2 to reflect changes in format.
libs/cluster/Server/Replication/ReplicationLogCheckpointManager.cs Adjusted callbacks to support separate start and end checkpoint shift notifications.
libs/cluster/Server/Replication/ReplicaOps/ReplicaReplayTask.cs Revised replica replay logic to update replication checkpoint offsets on encountering the start marker.
libs/storage/Tsavorite/cs/src/core/Allocator/IStreamingSnapshotIteratorFunctions.cs Renamed parameter from targetVersion to nextVersion for clarity.
libs/storage/Tsavorite/cs/src/core/ClientSession/ClientSession.cs Simplified lock acquisition/release functions and streamlined transaction version updates.
libs/cluster/Server/Replication/PrimaryOps/DisklessReplication/ReplicationSnapshotIterator.cs Removed redundant commit marker enqueue in stop-processing logic.
libs/cluster/Server/ClusterConfig.cs Added a logger parameter to the HandleConfigEpochCollision method for additional diagnostic output.
libs/host/GarnetServer.cs Called initialization on server options and updated object store settings invocation accordingly.
libs/server/Resp/RespServerSession.cs Passed the state machine driver into the TransactionManager constructor for consistent transaction version handling.
libs/server/Storage/Session/StorageSession.cs Exposed the StateMachineDriver from server options to the storage session.
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs Removed tracking of active locking sessions as part of the refactoring for transaction version handling.
libs/server/Resp/AdminCommands.cs Updated checkpoint command calls to match the new TakeCheckpoint API signature.
Comments suppressed due to low confidence (2)

libs/server/AOF/AofHeader.cs:15

  • Increasing the AofHeaderVersion to 2 may impact backward compatibility. Ensure that old AOF formats are either upgraded or handled appropriately during replay.
const byte AofHeaderVersion = 2;

libs/server/StoreWrapper.cs:852

  • [nitpick] Consider using the logical AND operator (&&) instead of the bitwise AND (&) for clarity if short-circuit behavior is desired.
tryIncremental &= objectStore.CanTakeIncrementalCheckpoint(checkpointType, out var guid2) & checkpointResult.token == guid2;

@vazois vazois self-assigned this Mar 13, 2025
@badrishc badrishc merged commit 25f10dc into main Mar 14, 2025
33 checks passed
@badrishc badrishc deleted the badrishc/two-store-checkpoint branch March 14, 2025 23:45
@github-actions github-actions bot locked and limited conversation to collaborators May 14, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants