Skip to content

Commit 65a4ba7

Browse files
craig[bot]tbgandy-kimball
committed
142653: kvserver, storage: clean up, fix bugs in, and test snapshot receive path r=tbg a=tbg When a snapshot is received, we write the incoming KV pairs into a number of SSTs. With the introduction of excise and shared/external SSTs (disagg storage), this code had become quite crufty and supported - in principle - multiple modes that made it quite brittle and under-tested. In this PR, we "bake in" excises: for the SST covering the MVCC span, we don't lay down a range deletion but instead *always* use excise. We now also always allow splitting of large MVCC SSTs (into reasonably-sized SSTs). This works against the previous proliferation of options, and enabled various code improvements. The code around range keys (MVCC rangedels) was cleaned up too, and missing fragmentation was added. For "regular" rangedels (ClearRange), fragmentation was missing entirely and was added as well. We now have a datadriven unit test for the MultiSSTWriter, which shows precisely what's contained in each produced SST. This doesn't mean the receive path is well-tested - far from it - but at least there's a solid foundation to build on top of. Overall, the code here is much improved and easier to reason about. Epic: CRDB-46488 Release Note: None 144215: vecindex: update vector index key and value encodings r=drewkimball a=andy-kimball First commit: When multiple workers are racing to create the root partition, it's possible to trigger an error when the root partition already exists and is in a state other than Ready (e.g. Splitting). Fix this by creating the root metadata record only if doesn't already exist. If it does exist, just return the current metadata. Second commit: Make two updates to how KV keys and values are encoded for vector indexes: 1. Add Family ID to the end of the metadata KV key encoding. This is required by the KV split code, which checks the family ID to ensure that splits don't split column families for the same row across ranges. 2. Encode the partition state fields in the metadata KV value. This is needed for the new non-transactional background split code. Epic: CRDB-42943 Release note: None Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Andrew Kimball <[email protected]>
3 parents 61fba7b + 06a7238 + 015d87f commit 65a4ba7

19 files changed

+585
-374
lines changed

pkg/kv/kvserver/BUILD.bazel

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -254,6 +254,7 @@ go_library(
254254
"@com_github_cockroachdb_pebble//:pebble",
255255
"@com_github_cockroachdb_pebble//objstorage",
256256
"@com_github_cockroachdb_pebble//objstorage/remote",
257+
"@com_github_cockroachdb_pebble//rangedel",
257258
"@com_github_cockroachdb_pebble//rangekey",
258259
"@com_github_cockroachdb_pebble//sstable/block",
259260
"@com_github_cockroachdb_pebble//vfs",
@@ -564,6 +565,7 @@ go_test(
564565
"@com_github_cockroachdb_errors//oserror",
565566
"@com_github_cockroachdb_logtags//:logtags",
566567
"@com_github_cockroachdb_pebble//:pebble",
568+
"@com_github_cockroachdb_pebble//rangekey",
567569
"@com_github_cockroachdb_pebble//sstable",
568570
"@com_github_cockroachdb_pebble//vfs",
569571
"@com_github_cockroachdb_redact//:redact",

pkg/kv/kvserver/client_merge_test.go

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -3905,11 +3905,15 @@ func TestStoreRangeMergeRaftSnapshot(t *testing.T) {
39053905
}
39063906
keySpans := rditer.MakeReplicatedKeySpans(inSnap.Desc)
39073907
sstFileWriters := map[string]sstFileWriter{}
3908-
for _, span := range keySpans {
3908+
for i, span := range keySpans {
39093909
file := &storage.MemObject{}
39103910
writer := storage.MakeIngestionSSTWriter(ctx, st, file)
3911-
if err := writer.ClearRawRange(span.Key, span.EndKey, true /* pointKeys */, true /* rangeKeys */); err != nil {
3912-
return err
3911+
if i < len(keySpans)-1 {
3912+
// The last span is the MVCC span, and is always cleared via Excise.
3913+
// See multiSSTWriter.
3914+
if err := writer.ClearRawRange(span.Key, span.EndKey, true /* pointKeys */, true /* rangeKeys */); err != nil {
3915+
return err
3916+
}
39133917
}
39143918
sstFileWriters[string(span.Key)] = sstFileWriter{
39153919
span: span,

pkg/kv/kvserver/kv_snapshot_strategy.go

Lines changed: 13 additions & 45 deletions
Original file line numberDiff line numberDiff line change
@@ -115,41 +115,10 @@ func (kvSS *kvBatchSnapshotStrategy) Receive(
115115
return noSnap, errors.AssertionFailedf("last span in multiSSTWriter did not equal the user key span: %s", keyRanges[len(keyRanges)-1].String())
116116
}
117117

118-
// TODO(aaditya): Remove once we support flushableIngests for shared and
119-
// external files in the engine.
120-
121-
// TODO(tbg): skipClearForMVCCSpan should equal true (since we always excise
122-
// the MVCC span), but there appear to be bugs in this code as demonstrated by
123-
// TestRaftSnapshotsWithMVCCRangeKeysEverywhere failing with the proposed
124-
// change:
125-
//
126-
// * panic: failed to put range key in sst: pebble: spans must be added in order: /Local/RangeID/75/r":a"/0,0 > /Local/RangeID/75/r""/0,0
127-
// ^- we added the ":a" key first, now we're trying to add the r"" key.
128-
//
129-
// My basic understanding of this failure mode is:
130-
// - first, a range del covering the range-local replicated key span `/Range/75/{r-s}` is added in (msstw.initSST)
131-
// - the fragmenter's start key is now `/Range/75/r`.
132-
// - when the `/Range/75/r/{:a-:z}` rangedel, is handled, it enters through `msstw.PutRangeKey`
133-
// - since `skipClearForMVCCSpan` is set, this does not update the fragmenter.
134-
// - the range deletion is added to the current SST.
135-
// - when the first key >= `/Range/75/r` is added, `msstw.finalizeSST` flushes
136-
// the fragmenter to the current SST.
137-
// - we hit the above error, since the SST already contains the rangedel
138-
// starting at `/Range/75/r:a` and we're not attempting to add a rangedel
139-
// `/Range/75/r` with smaller start key.
140-
//
141-
// The crucial problem is skipping the fragmenter. This seems to have to do with
142-
// wanting to allow callers to use `skipClearForMVCCSpan`, but this is going to
143-
// be dead code.
144-
//
145-
// Another note: CI passes when this is unconditionally set to `false`,
146-
// indicating that we have poor coverage of the shared and external code paths.
147-
skipClearForMVCCSpan := header.SharedReplicate || header.ExternalReplicate
148-
149118
// The last key range is the user key span.
150119
localRanges := keyRanges[:len(keyRanges)-1]
151120
mvccRange := keyRanges[len(keyRanges)-1]
152-
msstw, err := newMultiSSTWriter(ctx, kvSS.st, kvSS.scratch, localRanges, mvccRange, kvSS.sstChunkSize, skipClearForMVCCSpan, header.RangeKeysInOrder)
121+
msstw, err := newMultiSSTWriter(ctx, kvSS.st, kvSS.scratch, localRanges, mvccRange, kvSS.sstChunkSize, header.RangeKeysInOrder)
153122
if err != nil {
154123
return noSnap, err
155124
}
@@ -294,19 +263,18 @@ func (kvSS *kvBatchSnapshotStrategy) Receive(
294263
}
295264

296265
inSnap := IncomingSnapshot{
297-
SnapUUID: snapUUID,
298-
SSTStorageScratch: kvSS.scratch,
299-
FromReplica: header.RaftMessageRequest.FromReplica,
300-
Desc: header.State.Desc,
301-
DataSize: dataSize,
302-
SSTSize: sstSize,
303-
SharedSize: sharedSize,
304-
raftAppliedIndex: header.State.RaftAppliedIndex,
305-
msgAppRespCh: make(chan raftpb.Message, 1),
306-
sharedSSTs: sharedSSTs,
307-
externalSSTs: externalSSTs,
308-
includesRangeDelForLastSpan: !skipClearForMVCCSpan,
309-
clearedSpans: keyRanges,
266+
SnapUUID: snapUUID,
267+
SSTStorageScratch: kvSS.scratch,
268+
FromReplica: header.RaftMessageRequest.FromReplica,
269+
Desc: header.State.Desc,
270+
DataSize: dataSize,
271+
SSTSize: sstSize,
272+
SharedSize: sharedSize,
273+
raftAppliedIndex: header.State.RaftAppliedIndex,
274+
msgAppRespCh: make(chan raftpb.Message, 1),
275+
sharedSSTs: sharedSSTs,
276+
externalSSTs: externalSSTs,
277+
clearedSpans: keyRanges,
310278
}
311279

312280
timingTag.stop("totalTime")

0 commit comments

Comments
 (0)