
Commit cb48491

craig[bot], arulajmani, jeffswenson, pav-kv, and yuzefovich committed
155540: kvserver: add split trigger support to TestReplicaLifecycleDataDriven r=pav-kv a=arulajmani

This patch adds a few directives to construct and evaluate split triggers. We then use these directives to demonstrate that leases are correctly copied over from the LHS to the RHS. The interesting case is around leader leases: if the LHS has a leader lease, the RHS gets an expiration-based lease. For the other two lease types, the RHS's lease type stays the same.

Epic: none
Release note: None

155984: isession: add savepoint support r=jeffswenson a=jeffswenson

This adds savepoint support to the internal session. LDR needs savepoints for performance reasons: they allow it to handle individual LWW losers without aborting the entire replication batch.

Release note: none
Epic: CRDB-48647

156031: kvstorage: check RangeTombstone on loading replicas r=arulajmani a=pav-kv

The last commit adds load-time checking of the invariant that the `RaftReplicaID` cannot survive if there is a `RangeTombstone` with a higher `NextReplicaID`. For context, the relevant invariants are:

- A replica exists iff `RaftReplicaID` exists.
- If a replica exists, then `ReplicaID` >= `RangeTombstone.NextReplicaID`.

NB: it follows that there can be a stale/no-op `RangeTombstone` with a non-zero `NextReplicaID`. With the current state of the code, this is possible. We don't always remove the `RangeTombstone` when it's stale, and we don't use the existence of a `RangeTombstone` as proof of a replica's non-existence. The `RaftReplicaID` key is the source of truth to that extent.

A bunch of preparatory commits refactor the `IterateIDPrefixKeys` helper (used at load time to discover all `RangeID`s on a store) to support visiting more than one type of key at a time. Previously, we were doing two passes over the `RangeID`-local space (and fetching a third kind of key would mean a third pass). Now the keyspace is scanned once. Other benefits of the change:

- Loading `RangeTombstone` might be useful if we want to [migrate](https://cockroachlabs.slack.com/archives/C02KHQMF2US/p1761237444633689) `RaftReplicaID`/`RangeTombstone` into one key in the state machine engine.
- With more integration, we may eliminate one more replica-loading [pass](https://github.com/cockroachdb/cockroach/blob/4183e88d8fdffdd75146d4bdf156d52f819bbb93/pkg/kv/kvserver/store.go#L2329-L2341) in `Store.Start()`, and unify the various "loaded replica" structs and assertions. We just need to read a few more keys while visiting the `RangeID`.
- Reducing the total number of passes will become beneficial with separated storages: we are "winning back" capacity for doing an extra pass in a second engine.

Epic: CRDB-55218

156119: revert "sql: store bundle when TestStreamerTightBudget fails" r=yuzefovich a=yuzefovich

This reverts commit 48f11d2. The test hasn't failed in about a year and a half, and the captured stmt bundle didn't actually give any more insight into why it rarely failed.

Informs: #119675.

Release note: None

156374: sql: sort rulesForRelease versions in descending order r=celiala a=celiala

This PR fixes incorrect ordering in the `rulesForReleases` array:

- The issue was detected by [claude-code-pr-review](https://github.com/cockroachdb/cockroach/actions/runs/18817950946).
- I verified via the [original PR](#97213) that the array should be sorted in descending order.

## Background

The `rulesForReleases` array in `pkg/sql/schemachanger/scplan/plan.go` stores schema changer rule registries for different cluster versions. The array is documented to be in descending order (newest version first), and the `GetRulesRegistryForRelease()` function depends on this ordering to correctly match cluster versions with their corresponding rule sets.

## Bug

The array had V25_2 and V25_3 in ascending order instead of descending.

**Before (incorrect):**

```go
var rulesForReleases = []rulesForRelease{
	{activeVersion: clusterversion.Latest, rulesRegistry: current.GetRegistry()},
	{activeVersion: clusterversion.V25_2, rulesRegistry: release_25_2.GetRegistry()},
	{activeVersion: clusterversion.V25_3, rulesRegistry: release_25_3.GetRegistry()},
}
```

**After (correct):**

```go
var rulesForReleases = []rulesForRelease{
	// NB: sort versions in descending order, i.e. newest supported version first.
	{activeVersion: clusterversion.Latest, rulesRegistry: current.GetRegistry()},
	{activeVersion: clusterversion.V25_3, rulesRegistry: release_25_3.GetRegistry()},
	{activeVersion: clusterversion.V25_2, rulesRegistry: release_25_2.GetRegistry()},
}
```

## Impact

`GetRulesRegistryForRelease()` iterates through the array in order and returns the first entry where `activeVersion.IsActive()` is true. With the incorrect ascending order, a cluster running version 25.3 would incorrectly get the 25.2 rules instead of the 25.3 rules.

## Changes

- Fixed the ordering in `pkg/sql/schemachanger/scplan/plan.go`.
- Updated test expectations in `pkg/cli/testdata/declarative-rules/invalid_version`.
- Regenerated test output in `pkg/cli/testdata/declarative-rules/deprules`.

Epic: None

Release note (bug fix): Fix rulesForReleases ordering to correctly match cluster versions with schema changer rule sets.

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Jeff Swenson <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Celia La <[email protected]>
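As a companion to 156374's impact note, the following standalone Go sketch models why the descending order matters. The types, version values, and registry names are simplified, hypothetical stand-ins, not the actual `clusterversion`/`scplan` code:

```go
package main

import "fmt"

// version is a simplified stand-in for a cluster version key.
type version struct{ major, minor int }

// isActive models activeVersion.IsActive(): the cluster version is at least v.
func (cluster version) isActive(v version) bool {
	return cluster.major > v.major || (cluster.major == v.major && cluster.minor >= v.minor)
}

type rulesForRelease struct {
	activeVersion version
	registry      string // stand-in for a rules registry
}

// NB: sorted in descending order, newest supported version first. The lookup
// below returns the first active entry, so an ascending 25.2/25.3 pair would
// hand a 25.3 cluster the 25.2 registry.
var rulesForReleases = []rulesForRelease{
	{activeVersion: version{26, 1}, registry: "current"},
	{activeVersion: version{25, 3}, registry: "release_25_3"},
	{activeVersion: version{25, 2}, registry: "release_25_2"},
}

func getRulesRegistryForRelease(cluster version) string {
	for _, r := range rulesForReleases {
		if cluster.isActive(r.activeVersion) {
			return r.registry
		}
	}
	return ""
}

func main() {
	fmt.Println(getRulesRegistryForRelease(version{25, 3})) // release_25_3
}
```

With the entries in ascending 25.2/25.3 order, the same lookup would return "release_25_2" for a 25.3 cluster, which is exactly the bug being fixed.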
6 parents 088887e + 22b383e + 4a13d9b + ba2a5d4 + 6bd3146 + f307ebb commit cb48491
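The internal isession savepoint API added in 155984 is not shown on this page. As background, here is a hedged sketch of the SQL savepoint pattern it lets LDR rely on, written against the Postgres wire protocol with a hypothetical `kv` table and connection string:

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // hypothetical driver choice for a CockroachDB cluster
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}
	// Apply a batch of rows inside one transaction; kv is a hypothetical table.
	for _, row := range [][2]string{{"a", "1"}, {"b", "2"}} {
		if _, err := tx.ExecContext(ctx, "SAVEPOINT per_row"); err != nil {
			log.Fatal(err)
		}
		if _, err := tx.ExecContext(
			ctx, "UPSERT INTO kv (k, v) VALUES ($1, $2)", row[0], row[1],
		); err != nil {
			// Discard only this row's work (e.g. an LWW loser); the rest of the
			// batch in the transaction is unaffected.
			if _, err := tx.ExecContext(ctx, "ROLLBACK TO SAVEPOINT per_row"); err != nil {
				log.Fatal(err)
			}
			continue
		}
		if _, err := tx.ExecContext(ctx, "RELEASE SAVEPOINT per_row"); err != nil {
			log.Fatal(err)
		}
	}
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```

Rolling back to the per-row savepoint discards only that row's writes, so a single LWW loser no longer forces the whole replication batch to restart.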

22 files changed: +1052 -551 lines changed


pkg/cli/declarative_print_rules.go

Lines changed: 1 addition & 1 deletion

```diff
@@ -37,7 +37,7 @@ a given corpus file.
 		Version: version,
 	})
 	if rules == nil {
-		fmt.Printf("unsupported version number, the supported versions are: \n")
+		fmt.Printf("unsupported version number, the supported versions are:\n")
 		for _, v := range scplan.GetReleasesForRulesRegistries() {
 			fmt.Printf(" %s\n", v)
 		}
```

pkg/cli/testdata/declarative-rules/deprules

Lines changed: 266 additions & 250 deletions
Large diffs are not rendered by default.

pkg/cli/testdata/declarative-rules/invalid_version

Lines changed: 2 additions & 2 deletions

```diff
@@ -2,7 +2,7 @@ invalid_version
 -----
 ----
 debug declarative-print-rules 1.1 op
-unsupported version number, the supported versions are: 
+unsupported version number, the supported versions are:
 latest
-1000025.2
 1000025.3
+1000025.2
```

pkg/kv/kvserver/batcheval/cmd_end_transaction.go

Lines changed: 3 additions & 0 deletions

```diff
@@ -1407,6 +1407,9 @@ func splitTriggerHelper(
 	// This avoids running the consistency checker on the RHS immediately after
 	// the split.
 	lastTS := hlc.Timestamp{}
+	// TODO(arul): instead of fetching the consistency checker timestamp here
+	// like this, we should instead pass it using the SplitTriggerHelperInput to
+	// make it easier to test.
 	if _, err := storage.MVCCGetProto(ctx, batch,
 		keys.QueueLastProcessedKey(split.LeftDesc.StartKey, "consistencyChecker"),
 		hlc.Timestamp{}, &lastTS, storage.MVCCGetOptions{}); err != nil {
```

pkg/kv/kvserver/kvstorage/BUILD.bazel

Lines changed: 2 additions & 0 deletions

```diff
@@ -42,6 +42,7 @@ go_test(
         "cluster_version_test.go",
         "datadriven_test.go",
         "destroy_test.go",
+        "init_test.go",
         "initial_test.go",
         "stateloader_test.go",
     ],
@@ -66,6 +67,7 @@ go_test(
         "//pkg/util/hlc",
         "//pkg/util/leaktest",
         "//pkg/util/log",
+        "//pkg/util/randutil",
         "//pkg/util/stop",
         "//pkg/util/tracing",
         "//pkg/util/uuid",
```

pkg/kv/kvserver/kvstorage/datadriven_test.go

Lines changed: 15 additions & 0 deletions

```diff
@@ -15,6 +15,7 @@ import (
 
 	"github.com/cockroachdb/cockroach/pkg/clusterversion"
 	"github.com/cockroachdb/cockroach/pkg/keys"
+	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/kvserverpb"
 	"github.com/cockroachdb/cockroach/pkg/raft/raftpb"
 	"github.com/cockroachdb/cockroach/pkg/roachpb"
 	"github.com/cockroachdb/cockroach/pkg/storage"
@@ -94,6 +95,14 @@ func (e *env) handleNewReplica(
 	return desc
 }
 
+func (e *env) handleRangeTombstone(
+	t *testing.T, ctx context.Context, rangeID roachpb.RangeID, next roachpb.ReplicaID,
+) {
+	require.NoError(t, MakeStateLoader(rangeID).SetRangeTombstone(
+		ctx, e.eng, kvserverpb.RangeTombstone{NextReplicaID: next},
+	))
+}
+
 func TestDataDriven(t *testing.T) {
 	defer leaktest.AfterTest(t)()
 
@@ -146,6 +155,12 @@ func TestDataDriven(t *testing.T) {
 		); desc != nil {
 			fmt.Fprintln(&buf, desc)
 		}
+
+	case "range-tombstone":
+		rangeID := dd.ScanArg[roachpb.RangeID](t, d, "range-id")
+		nextID := dd.ScanArg[roachpb.ReplicaID](t, d, "next-replica-id")
+		e.handleRangeTombstone(t, ctx, rangeID, nextID)
+
 	case "load-and-reconcile":
 		replicas, err := LoadAndReconcileReplicas(ctx, e.eng)
 		if err != nil {
```

pkg/kv/kvserver/kvstorage/init.go

Lines changed: 115 additions & 91 deletions

```diff
@@ -9,6 +9,7 @@ import (
 	"bytes"
 	"cmp"
 	"context"
+	"maps"
 	"slices"
 	"time"
 
@@ -144,21 +145,27 @@ func checkCanInitializeEngine(ctx context.Context, eng storage.Engine) error {
 	return err
 }
 
-// IterateIDPrefixKeys helps visit system keys that use RangeID prefixing (such
+// readKeyFn reads the given key, and unmarshals the value into the given proto.
+// Returns false if the key does not exist, or is requested out of order.
+type readKeyFn func(roachpb.Key, protoutil.Message) (bool, error)
+
+// scanRangeIDFn reports the existence of a RangeID, and allows reading
+// RangeID-local keys via the readKeyFn callback.
+type scanRangeIDFn func(roachpb.RangeID, readKeyFn) error
+
+// iterateRangeIDKeys helps visit storage keys that use RangeID prefixing (such
 // as RaftHardStateKey, RangeTombstoneKey, and many others). Such keys could in
-// principle exist at any RangeID, and this helper efficiently discovers all the
-// keys of the desired type (as specified by the supplied `keyFn`) and, for each
-// key-value pair discovered, unmarshals it into `msg` and then invokes `f`.
+// principle exist for any RangeID.
+//
+// The helper visits all RangeIDs that have any keys, and for each range calls
+// the scanRangeID function. The implementation of this function can request any
+// subset of RangeID-local keys via the readKeyFn callback. All keys must be
+// requested in sorted order.
 //
 // Iteration stops on the first error (and will pass through that error).
-func IterateIDPrefixKeys(
-	ctx context.Context,
-	reader storage.Reader,
-	keyFn func(roachpb.RangeID) roachpb.Key,
-	msg protoutil.Message,
-	f func(_ roachpb.RangeID) error,
+func iterateRangeIDKeys(
+	ctx context.Context, reader storage.Reader, scanRangeID scanRangeIDFn,
 ) error {
-	rangeID := roachpb.RangeID(1)
 	// NB: Range-ID local keys have no versions and no intents.
 	iter, err := reader.NewMVCCIterator(ctx, storage.MVCCKeyIterKind, storage.IterOptions{
 		UpperBound: keys.LocalRangeIDPrefix.PrefixEnd().AsRawKey(),
@@ -168,60 +175,62 @@
 	}
 	defer iter.Close()
 
-	for {
-		bumped := false
-		mvccKey := storage.MakeMVCCMetadataKey(keyFn(rangeID))
-		iter.SeekGE(mvccKey)
+	iter.SeekGE(storage.MakeMVCCMetadataKey(keys.LocalRangeIDPrefix.AsRawKey()))
+	iterOK, iterErr := iter.Valid()
+	if !iterOK || iterErr != nil {
+		return iterErr
+	}
 
-		if ok, err := iter.Valid(); !ok {
-			return err
+	getKeyFn := func(key roachpb.Key, msg protoutil.Message) (bool, error) {
+		if !iterOK || iterErr != nil {
+			return iterOK, iterErr
 		}
-
-		unsafeKey := iter.UnsafeKey()
-
-		if !bytes.HasPrefix(unsafeKey.Key, keys.LocalRangeIDPrefix) {
-			// Left the local keyspace, so we're done.
-			return nil
+		unsafeKey := iter.UnsafeKey().Key
+		comp := unsafeKey.Compare(key)
+		if comp < 0 {
+			iter.SeekGE(storage.MakeMVCCMetadataKey(key))
+			if iterOK, iterErr = iter.Valid(); !iterOK || iterErr != nil {
+				return iterOK, iterErr
+			}
+			unsafeKey = iter.UnsafeKey().Key
+			comp = unsafeKey.Compare(key)
+			if comp < 0 {
+				return false, errors.AssertionFailedf("SeekGE undershot key %s", key)
+			}
 		}
-
-		curRangeID, _, _, _, err := keys.DecodeRangeIDKey(unsafeKey.Key)
-		if err != nil {
-			return err
+		if comp > 0 {
+			return false, nil
 		}
-
-		if curRangeID > rangeID {
-			// `bumped` is always `false` here, but let's be explicit.
-			if !bumped {
-				rangeID = curRangeID
-				bumped = true
-			}
-			mvccKey = storage.MakeMVCCMetadataKey(keyFn(rangeID))
+		// Found the key (comp == 0). Parse and report the value.
+		var meta enginepb.MVCCMetadata
+		if err := iter.ValueProto(&meta); err != nil {
+			return false, errors.Errorf("unable to unmarshal %s into MVCCMetadata", unsafeKey)
 		}
-
-		if !unsafeKey.Key.Equal(mvccKey.Key) {
-			if !bumped {
-				// Don't increment the rangeID if it has already been incremented
-				// above, or we could skip past a value we ought to see.
-				rangeID++
-				bumped = true // for completeness' sake; continuing below anyway
-			}
-			continue
+		val := roachpb.Value{RawBytes: meta.RawBytes}
+		if err := val.GetProto(msg); err != nil {
+			return false, errors.Errorf("unable to unmarshal %s into %T", unsafeKey, msg)
 		}
+		return true, nil
+	}
 
-		ok, err := storage.MVCCGetProto(
-			ctx, reader, unsafeKey.Key, hlc.Timestamp{}, msg, storage.MVCCGetOptions{})
+	for iterOK && iterErr == nil {
+		rangeID, _, _, _, err := keys.DecodeRangeIDKey(iter.UnsafeKey().Key)
 		if err != nil {
 			return err
-		}
-		if !ok {
-			return errors.Errorf("unable to unmarshal %s into %T", unsafeKey.Key, msg)
-		}
-
-		if err := f(rangeID); err != nil {
+		} else if err := scanRangeID(rangeID, getKeyFn); err != nil {
 			return iterutil.Map(err)
+		} else if !iterOK || iterErr != nil {
+			return iterErr
+		}
+		newRangeID, _, _, _, err := keys.DecodeRangeIDKey(iter.UnsafeKey().Key)
+		if err != nil {
+			return err
+		} else if newRangeID <= rangeID {
+			iter.SeekGE(storage.MakeMVCCMetadataKey(keys.MakeRangeIDPrefix(rangeID + 1)))
+			iterOK, iterErr = iter.Valid()
 		}
-		rangeID++
 	}
+	return iterErr
 }
 
 // ReadStoreIdent reads the StoreIdent from the store.
@@ -408,6 +417,7 @@ type Replica struct {
 	ReplicaID roachpb.ReplicaID
 	Desc      *roachpb.RangeDescriptor // nil for uninitialized Replica
 
+	tombstone kvserverpb.RangeTombstone
 	hardState raftpb.HardState // internal to kvstorage, see migration in LoadAndReconcileReplicas
 }
 
@@ -454,9 +464,12 @@ func (m replicaMap) getOrMake(rangeID roachpb.RangeID) Replica {
 	return ent
 }
 
-func (m replicaMap) setReplicaID(rangeID roachpb.RangeID, replicaID roachpb.ReplicaID) {
+func (m replicaMap) setReplicaIDAndTombstone(
+	rangeID roachpb.RangeID, replicaID roachpb.ReplicaID, ts kvserverpb.RangeTombstone,
+) {
 	ent := m.getOrMake(rangeID)
 	ent.ReplicaID = replicaID
+	ent.tombstone = ts
 	m[rangeID] = ent
 }
 
@@ -499,52 +512,57 @@ func loadReplicas(ctx context.Context, eng storage.Engine) ([]Replica, error) {
 		}
 	}
 
-	// Load replicas from disk based on their RaftReplicaID and HardState.
+	// Scan all RangeIDs present on the store, and load ReplicaID and HardState of
+	// those that correspond to an existing replica (uninitialized or initialized).
+	//
+	// INVARIANT: a RangeID with no replica can only have a RangeTombstone key.
+	// INVARIANT: all replicas have a persisted ReplicaID.
+	// INVARIANT: ReplicaID >= RangeTombstone.NextReplicaID.
 	//
-	// INVARIANT: all replicas have a persisted full ReplicaID (i.e. a "ReplicaID from disk").
+	// NB: RangeIDs that only have a RangeTombstone are effectively skipped here
+	// as uninteresting. If a non-RangeTombstone key is found for a RangeID, there
+	// must be a replica of this range, so we load it and check invariants later.
 	//
 	// TODO(tbg): tighten up the case where we see a RaftReplicaID but no HardState.
 	// This leads to the general desire to validate the internal consistency of the
 	// entire raft state (i.e. HardState, TruncatedState, Log).
-	{
-		logEvery := log.Every(10 * time.Second)
-		var i int
-		var msg kvserverpb.RaftReplicaID
-		if err := IterateIDPrefixKeys(ctx, eng, func(rangeID roachpb.RangeID) roachpb.Key {
-			return keys.RaftReplicaIDKey(rangeID)
-		}, &msg, func(rangeID roachpb.RangeID) error {
-			if logEvery.ShouldLog() && i > 0 { // only log if slow
-				log.KvExec.Infof(ctx, "loaded replica ID for %d/%d replicas", i, len(s))
-			}
-			i++
-			s.setReplicaID(rangeID, msg.ReplicaID)
-			return nil
-		}); err != nil {
-			return nil, err
+	logEvery := log.Every(10 * time.Second)
+	var i int
+	if err := iterateRangeIDKeys(ctx, eng, func(id roachpb.RangeID, get readKeyFn) error {
+		if logEvery.ShouldLog() && i > 0 { // only log if slow
+			log.KvExec.Infof(ctx, "loaded state for %d/%d replicas", i, len(s))
 		}
-		log.KvExec.Infof(ctx, "loaded replica ID for %d/%d replicas", len(s), len(s))
-
-		logEvery = log.Every(10 * time.Second)
-		i = 0
+		i++
+		// NB: the keys must be requested in sorted order here.
+		buf := keys.MakeRangeIDPrefixBuf(id)
+		var ts kvserverpb.RangeTombstone
+		if ok, err := get(buf.RangeTombstoneKey(), &ts); err != nil {
+			return err
+		} else if !ok {
+			ts = kvserverpb.RangeTombstone{} // just in case it was mutated
+		}
+		// NB: the keys must be requested in sorted order here.
 		var hs raftpb.HardState
-		if err := IterateIDPrefixKeys(ctx, eng, func(rangeID roachpb.RangeID) roachpb.Key {
-			return keys.RaftHardStateKey(rangeID)
-		}, &hs, func(rangeID roachpb.RangeID) error {
-			if logEvery.ShouldLog() && i > 0 { // only log if slow
-				log.KvExec.Infof(ctx, "loaded Raft state for %d/%d replicas", i, len(s))
-			}
-			i++
-			s.setHardState(rangeID, hs)
-			return nil
-		}); err != nil {
-			return nil, err
+		if ok, err := get(buf.RaftHardStateKey(), &hs); err != nil {
+			return err
+		} else if ok {
+			s.setHardState(id, hs)
 		}
-		log.KvExec.Infof(ctx, "loaded Raft state for %d/%d replicas", len(s), len(s))
-	}
-	sl := make([]Replica, 0, len(s))
-	for _, repl := range s {
-		sl = append(sl, repl)
+		// NB: the keys must be requested in sorted order here.
+		var rID kvserverpb.RaftReplicaID
+		if ok, err := get(buf.RaftReplicaIDKey(), &rID); err != nil {
+			return err
+		} else if ok {
+			// NB: the tombstone can be empty.
+			s.setReplicaIDAndTombstone(id, rID.ReplicaID, ts)
+		}
+		return nil
+	}); err != nil {
+		return nil, err
 	}
+	log.KvExec.Infof(ctx, "loaded state for %d/%d replicas", len(s), len(s))
+
+	sl := slices.AppendSeq(make([]Replica, 0, len(s)), maps.Values(s))
 	slices.SortFunc(sl, func(a, b Replica) int {
 		return cmp.Compare(a.RangeID, b.RangeID)
 	})
@@ -583,6 +601,12 @@ func LoadAndReconcileReplicas(ctx context.Context, eng storage.Engine) ([]Replic
 		if repl.ReplicaID == 0 {
 			return nil, errors.AssertionFailedf("no RaftReplicaID for %s", repl.Desc)
 		}
+		// INVARIANT: ReplicaID >= RangeTombstone.NextReplicaID.
+		if repl.ReplicaID < repl.tombstone.NextReplicaID {
+			return nil, errors.AssertionFailedf(
+				"r%d: RaftReplicaID %d survived RangeTombstone %+v",
+				repl.RangeID, repl.ReplicaID, repl.tombstone)
+		}
 
 		if repl.Desc != nil {
 			// INVARIANT: a Replica's RangeDescriptor always contains the local Store,
```
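To see the new single-pass contract in isolation, here is a small self-contained Go model of a scan whose per-RangeID callback may read keys only in forward (sorted) order, mirroring the shape of `scanRangeIDFn`/`readKeyFn` above; the types, key names, and values are illustrative, not the kvstorage ones:

```go
package main

import (
	"fmt"
	"sort"
)

// kv models one RangeID-local key/value pair; suffix plays the role of the
// key suffix (RangeTombstone, HardState, RaftReplicaID, ...).
type kv struct {
	rangeID int
	suffix  string
	value   string
}

// scanRangeIDs visits each distinct rangeID once. The visit callback receives
// a get function that can read any subset of that range's keys, provided it
// asks for them in ascending suffix order (the cursor only moves forward).
func scanRangeIDs(data []kv, visit func(rangeID int, get func(suffix string) (string, bool)) error) error {
	i := 0
	for i < len(data) {
		id := data[i].rangeID
		get := func(suffix string) (string, bool) {
			for i < len(data) && data[i].rangeID == id && data[i].suffix < suffix {
				i++ // skip keys the caller did not ask for
			}
			if i < len(data) && data[i].rangeID == id && data[i].suffix == suffix {
				return data[i].value, true
			}
			return "", false
		}
		if err := visit(id, get); err != nil {
			return err
		}
		for i < len(data) && data[i].rangeID == id {
			i++ // move to the next rangeID
		}
	}
	return nil
}

func main() {
	data := []kv{
		{1, "b-hard-state", "hs1"}, {1, "c-replica-id", "3"},
		{2, "a-tombstone", "next=5"},
		{3, "a-tombstone", "next=2"}, {3, "b-hard-state", "hs3"}, {3, "c-replica-id", "7"},
	}
	sort.Slice(data, func(a, b int) bool {
		if data[a].rangeID != data[b].rangeID {
			return data[a].rangeID < data[b].rangeID
		}
		return data[a].suffix < data[b].suffix
	})
	_ = scanRangeIDs(data, func(id int, get func(string) (string, bool)) error {
		ts, _ := get("a-tombstone")
		hs, _ := get("b-hard-state")
		rid, ok := get("c-replica-id")
		fmt.Println(id, ts, hs, rid, ok) // rangeID 2 has only a tombstone, so ok=false
		return nil
	})
}
```

The point of the design is that one cursor serves all key types: each callback invocation only moves it forward, so loading a third or fourth key per range does not add another pass over the RangeID-local keyspace.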
