Lightweight shard fingerprint consistency validation #849
Closed
mattisonchao
started this conversation in
Proposal
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Proposal: Lightweight shard fingerprint consistency validation
TL;DR
This proposal introduces a Dual-Fingerprint Mechanism to validate strict data consistency across shard replicas. This mechanism consists of two components:
Fingerprint
WAL CRC (Log Integrity): Since the Write-Ahead Log (WAL) is the source of truth for the Replicated State Machine, we utilize the existing chained CRC to verify that the replication stream is identical across replicas.
DB CRC (Execution Integrity): To ensure the database state matches the log, we introduce a new cumulative CRC based on mutable operations (write batches). This validates that the application logic is deterministic and no logical inconsistencies occur during execution.
Validation
Instead of relying on heavy active RPC consistency checks, we adopt a passive observability approach. Each DataServer independently reports its fingerprint status via metrics. An external collector (e.g., Prometheus) aggregates these metrics to detect divergences and trigger alerts.
Motivation
While the RSM model guarantees consistency in theory, in practice, replicas can diverge due to software or hardware faults. We have observed issues in production, such as non-deterministic apply logic. The logic bugs in the Apply() phase (where the log entry updates the DB) rely on volatile inputs—such as in-memory counters not backed by the WAL, system time, or map iteration order. This causes the Leader's in-memory state to differ from that of the Followers.
Currently, Oxia lacks a tool to "prove" that Replica A and Replica B contain the exact same data at a specific log index.
Proposed Design
Fingerprint - WAL (Log Integrity)
The Write-Ahead Log serves as the state machine's input stream. Every LogEntry already contains a CRC checksum of its payload. We expose the cumulative (chained) CRC of the WAL stream at the committed offset to verify that the replication layer has successfully replicated the exact same byte stream to all followers.
Fingerprint - DB (Execution Integrity)
This is a new mechanism to verify the deterministic execution of the state machine. We introduced a cumulative CRC32 of the Write Batch Representation (Repr) instead of scanning the physical database.
The cumulative checksum chains each batch incrementally:
To survive restarts and snapshots without re-computation, the calculated checksum is inserted back into the same batch as a hidden system key (
__oxia/checksum). This ensures the data and its proof are committed atomically.Validation (Passive Observability)
We decouple the "Proof Generation" from the "Verification Logic."
Feature Negotiation & Rolling Upgrade
To support safe rolling upgrades where new and old nodes coexist:
BecomeLeaderRequestis updated to include afeatures_supportedfield.GetInfoRPC reporting its supported features.GetInfo. Thenegotiate()algorithm requires all members to support a feature before enabling it. If not supported (mixed versions), the feature remains disabled to prevent false positives.BecomeLeaderRequest. The leader then replicatesFeatureEnableRequestcontrol messages through the WAL, ensuring all replicas enable the feature deterministically.GetInfoare treated as supporting no features.Observability
To address the Time Skew problem (where replicas are at different offsets at scrape time), we cannot directly compare the real-time CRC. Instead, we use a Milestone Reporting Strategy.
A configurable checksum scheduler (default: 5 minutes) periodically proposes
RecordChecksumRequestcontrol messages through the WAL. Since these are replicated log entries, all replicas record their checksum at the same logical offset, enabling direct cross-replica comparison regardless of scrape timing.oxia_dataserver_db_checksumshard,oxia_namespace,commit_offsetcommit_offset.oxia_dataserver_wal_checksumshard,oxia_namespace,commit_offsetcommit_offset.Configuration:
scheduler.checksum.interval(default:5m, set to0to disable).Implementation PRs
Beta Was this translation helpful? Give feedback.
All reactions