
Commit 4cf4f85

colblk: clarify PrefixBytes comments

1 parent 2752abb

2 files changed (+47, -26)

sstable/colblk/column.go

Lines changed: 1 addition & 2 deletions
@@ -72,8 +72,7 @@ type ColumnWriter interface {
 	DataType(col int) DataType
 	// Finish serializes the column at the specified index, writing the column's
 	// data to buf at offset, and returning the offset at which the next column
-	// should be encoded. Finish also returns a column descriptor describing the
-	// encoding of the column, which will be serialized within the block header.
+	// should be encoded.
 	//
 	// The supplied buf must have enough space at the provided offset to fit the
 	// column. The caller may use Size() to calculate the exact size required.
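To make the revised Finish contract concrete, here is a hedged sketch of a caller. The loop is illustrative only: NumColumns and the exact Size/Finish signatures are assumptions based on the doc comment above, not a statement of the package's actual API.

func finishAll(w ColumnWriter, rows int, buf []byte) uint32 {
	var offset uint32
	for col := 0; col < w.NumColumns(); col++ {
		// Assumed: Size(rows, offset) reports the exact end offset the
		// encoded column will occupy, letting the caller verify capacity
		// as the doc comment suggests.
		if end := w.Size(rows, offset); end > uint32(len(buf)) {
			panic("buf too small for column")
		}
		// Finish writes the column's data into buf at offset and returns
		// the offset at which the next column should be encoded.
		offset = w.Finish(col, rows, offset, buf)
	}
	return offset
}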

sstable/colblk/prefix_bytes.go

Lines changed: 46 additions & 24 deletions
@@ -26,9 +26,15 @@ import (
 // distinguished only by different timestamp suffixes. With columnar blocks
 // enabling the timestamp to be placed in a separate column, the multiple
 // version problem becomes one of efficiently handling exact duplicate keys.
-// PrefixBytes builds off of the RawBytes encoding, introducing n/bundleSize+1
-// additional slices for encoding n/bundleSize bundle prefixes and 1 block-level
-// shared prefix for the column.
+// PrefixBytes builds off of the RawBytes encoding, introducing additional
+// slices for encoding (n+bundleSize-1)/bundleSize bundle prefixes and 1
+// block-level shared prefix for the column.
+//
+// Unlike the original prefix compression performed by rowblk (inherited from
+// LevelDB and RocksDB), PrefixBytes does not perform all prefix compression
+// relative to the previous key. Rather it performs prefix compression relative
+// to the first key of a key's bundle. This can result in less compression, but
+// simplifies reverse iteration and allows iteration to be largely stateless.
 //
 // To understand the PrefixBytes layout, we'll work through an example using
 // these 15 keys:
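The switch to ceiling division is not cosmetic: with the example's 15 keys and a bundle size of 4, the old n/bundleSize wording yields 3, while the block actually contains 4 bundle prefixes. A minimal sketch of the corrected count (the function name is illustrative, not the package's):

// bundleCount mirrors the (n+bundleSize-1)/bundleSize expression above:
// the number of bundle prefixes needed to cover n keys, rounded up.
func bundleCount(n, bundleSize int) int {
	return (n + bundleSize - 1) / bundleSize
}

// bundleCount(15, 4) == (15+3)/4 == 4, matching the example that follows.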
@@ -87,17 +93,25 @@ import (
 //	18 | 13 | 36 | ........
 //	19 | 14 | 36 | ........
 //
-// The offset column in the table points to the start and end index within the
-// RawBytes data array for each of the 20 slices defined above (the 15 key
-// suffixes + 4 bundle key prefixes + block key prefix). Offset[0] is the length
-// of the first slice which is always anchored at data[0]. The data columns
-// display the portion of the data array the slice covers. For row slices, an
-// empty suffix column indicates that the slice is identical to the slice at the
-// previous index which is indicated by the slice's offset being equal to the
-// previous slice's offset. Due to the lexicographic sorting, the key at row i
-// can't be a prefix of the key at row i-1 or it would have sorted before the
-// key at row i-1. And if the key differs then only the differing bytes will be
-// part of the suffix and not contained in the bundle prefix.
+// The 'end offset' column in the table encodes the exclusive offset within the
+// string data section where each of the slices end. Each slice starts at the
+// previous slice's end offset. The first slice (the block prefix)'s start
+// offset is implicitly zero. Note that this differs from the plain RawBytes
+// encoding which always stores a zero offset at the beginning of the offsets
+// array to avoid special-casing the first slice. The block prefix already
+// requires special-casing, so materializing the zero start offset is not
+// needed.
+//
+// The table above defines 20 slices: the 1 block key prefix, the 4 bundle key
+// prefixes and the 15 key suffixes. Offset[0] is the length of the first slice
+// which is always anchored at data[0]. The data columns display the portion of
+// the data array the slice covers. For row slices, an empty suffix column
+// indicates that the slice is identical to the slice at the previous index
+// which is indicated by the slice's offset being equal to the previous slice's
+// offset. Due to the lexicographic sorting, the key at row i can't be a prefix
+// of the key at row i-1 or it would have sorted before the key at row i-1. And
+// if the key differs then only the differing bytes will be part of the suffix
+// and not contained in the bundle prefix.
 //
 // The end result of this encoding is that we can store the 119 bytes of the 15
 // keys plus their start and end offsets (which would naively consume 15*4=60
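A short sketch of the lookup rule the new paragraph describes, under the assumption of a plain []uint32 of end offsets (the real column stores offsets in a packed encoding; names here are illustrative):

// sliceBounds returns the [start, end) bounds of slice i within the string
// data section. Each slice starts at the previous slice's end offset; the
// first slice (the block prefix) starts implicitly at zero.
func sliceBounds(endOffsets []uint32, i int) (start, end uint32) {
	if i > 0 {
		start = endOffsets[i-1]
	}
	return start, endOffsets[i]
}

// start == end identifies an empty row suffix, i.e. a duplicate of the
// previous row's key, as described above.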
@@ -117,7 +131,14 @@ import (
 // | RawBytes                                                           |
 // |                                                                    |
 // | A modified RawBytes encoding is used to store the data slices. A   |
-// | PrefixBytes column storing n keys will encode 2+n+n/bundleSize     |
+// | PrefixBytes column storing n keys will encode                      |
+// |                                                                    |
+// |     1 block prefix                                                 |
+// |       +                                                            |
+// |     (n + bundleSize-1)/bundleSize bundle prefixes                  |
+// |       +                                                            |
+// |     n row suffixes                                                 |
+// |                                                                    |
 // | slices. Unlike the RawBytes encoding, the first offset encoded     |
 // | is not guaranteed to be zero. In the PrefixBytes encoding, the     |
 // | first offset encodes the length of the column-wide prefix. The     |
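Worked through for the example column (names illustrative, as before):

// totalSlices counts the slices per the box above: 1 block prefix,
// ceil(n/bundleSize) bundle prefixes, and n row suffixes.
func totalSlices(n, bundleSize int) int {
	return 1 + (n+bundleSize-1)/bundleSize + n
}

// totalSlices(15, 4) == 1 + 4 + 15 == 20, the 20 slices shown in the
// example table.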
@@ -148,17 +169,18 @@ import (
 // # Reads
 //
 // This encoding provides O(1) access to any row by calculating the bundle for
-// the row (5*(row/4)), then the row's index within the bundle (1+(row%4)). If
-// the slice's offset equals the previous slice's offset then we step backward
-// until we find a non-empty slice or the start of the bundle (a variable number
-// of steps, but bounded by the bundle size).
+// the row (see bundleOffsetIndexForRow), then the per-row's suffix (see
+// rowSuffixIndex). If the per-row suffix's end offset equals the previous
+// offset, then the row is a duplicate key and we need to step backward until we
+// find a non-empty slice or the start of the bundle (a variable number of
+// steps, but bounded by the bundle size).
 //
 // Forward iteration can easily reuse the previous row's key with a check on
-// whether the row's slice is empty. Reverse iteration can reuse the next row's
-// key by looking at the next row's offset to determine whether we are in the
-// middle of a run of equal keys or at an edge. When reverse iteration steps
-// over an edge it has to continue backward until a non-empty slice is found
-// (just as in absolute positioning).
+// whether the row's slice is empty. Reverse iteration within a run of equal
+// keys can reuse the next row's key. When reverse iteration steps backward from
+// a non-empty slice onto an empty slice, it must continue backward until a
+// non-empty slice is found (just as in absolute positioning) to discover the
+// row suffix that is duplicated.
 //
 // The Seek{GE,LT} routines first binary search on the first key of each bundle
 // which can be retrieved without data movement because the bundle prefix is
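Finally, a hedged reconstruction of the index arithmetic the reads paragraph references. bundleOffsetIndexForRow and rowSuffixIndex are the names the comment cites, but the bodies below are derived from the example layout (offsets slot 0 holds the block prefix, then each bundle contributes one bundle prefix followed by bundleSize row suffixes) and assume bundleSize is a power of two; the package's actual implementations may differ.

// bundleShift is log2(bundleSize), e.g. 2 for the example's bundleSize of 4.

// bundleOffsetIndexForRow returns the offsets-table index of the bundle
// prefix for row's bundle: skip the block prefix slot, then whole bundles
// of 1 prefix + bundleSize suffixes.
func bundleOffsetIndexForRow(row, bundleShift int) int {
	return 1 + (row>>bundleShift)*(1<<bundleShift+1)
}

// rowSuffixIndex returns the offsets-table index of row's suffix slice.
func rowSuffixIndex(row, bundleShift int) int {
	return bundleOffsetIndexForRow(row, bundleShift) + 1 + row&(1<<bundleShift-1)
}

// rowSuffixBounds implements the duplicate-key backstep described above: if
// a row's suffix is empty (its end offset equals the previous offset), walk
// backward, bounded by the start of the bundle, to the run's shared suffix.
func rowSuffixBounds(endOffsets []uint32, row, bundleShift int) (start, end uint32) {
	i := rowSuffixIndex(row, bundleShift)
	first := bundleOffsetIndexForRow(row, bundleShift) + 1 // first suffix in bundle
	for i > first && endOffsets[i] == endOffsets[i-1] {
		i-- // empty slice: duplicate of the previous row's key
	}
	return endOffsets[i-1], endOffsets[i]
}

With bundleShift = 2, rowSuffixIndex(13, 2) == 18, matching row 13's slot in the example table above.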
