@@ -26,9 +26,15 @@ import (
 // distinguished only by different timestamp suffixes. With columnar blocks
 // enabling the timestamp to be placed in a separate column, the multiple
 // version problem becomes one of efficiently handling exact duplicate keys.
-// PrefixBytes builds off of the RawBytes encoding, introducing n/bundleSize+1
-// additional slices for encoding n/bundleSize bundle prefixes and 1 block-level
-// shared prefix for the column.
+// PrefixBytes builds off of the RawBytes encoding, introducing additional
+// slices for encoding (n+bundleSize-1)/bundleSize bundle prefixes and 1
+// block-level shared prefix for the column.
+//
+// Unlike the original prefix compression performed by rowblk (inherited from
+// LevelDB and RocksDB), PrefixBytes does not perform all prefix compression
+// relative to the previous key. Rather it performs prefix compression relative
+// to the first key of a key's bundle. This can result in less compression, but
+// simplifies reverse iteration and allows iteration to be largely stateless.
 //
 // To understand the PrefixBytes layout, we'll work through an example using
 // these 15 keys:
@@ -87,17 +93,25 @@ import (
 // 18 | 13 | 36 | ........
 // 19 | 14 | 36 | ........
 //
-// The offset column in the table points to the start and end index within the
-// RawBytes data array for each of the 20 slices defined above (the 15 key
-// suffixes + 4 bundle key prefixes + block key prefix). Offset[0] is the length
-// of the first slice which is always anchored at data[0]. The data columns
-// display the portion of the data array the slice covers. For row slices, an
-// empty suffix column indicates that the slice is identical to the slice at the
-// previous index which is indicated by the slice's offset being equal to the
-// previous slice's offset. Due to the lexicographic sorting, the key at row i
-// can't be a prefix of the key at row i-1 or it would have sorted before the
-// key at row i-1. And if the key differs then only the differing bytes will be
-// part of the suffix and not contained in the bundle prefix.
+// The 'end offset' column in the table encodes the exclusive offset within the
+// string data section where each of the slices end. Each slice starts at the
+// previous slice's end offset. The first slice (the block prefix)'s start
+// offset is implicitly zero. Note that this differs from the plain RawBytes
+// encoding which always stores a zero offset at the beginning of the offsets
+// array to avoid special-casing the first slice. The block prefix already
+// requires special-casing, so materializing the zero start offset is not
+// needed.
+//
+// The table above defines 20 slices: the 1 block key prefix, the 4 bundle key
+// prefixes and the 15 key suffixes. Offset[0] is the length of the first slice
+// which is always anchored at data[0]. The data columns display the portion of
+// the data array the slice covers. For row slices, an empty suffix column
+// indicates that the slice is identical to the slice at the previous index
+// which is indicated by the slice's offset being equal to the previous slice's
+// offset. Due to the lexicographic sorting, the key at row i can't be a prefix
+// of the key at row i-1 or it would have sorted before the key at row i-1. And
+// if the key differs then only the differing bytes will be part of the suffix
+// and not contained in the bundle prefix.
 //
 // The end result of this encoding is that we can store the 119 bytes of the 15
 // keys plus their start and end offsets (which would naively consume 15*4=60
@@ -117,7 +131,14 @@ import (
 // | RawBytes |
 // | |
 // | A modified RawBytes encoding is used to store the data slices. A |
-// | PrefixBytes column storing n keys will encode 2+n+n/bundleSize |
+// | PrefixBytes column storing n keys will encode |
+// | |
+// | 1 block prefix |
+// | + |
+// | (n + bundleSize-1)/bundleSize bundle prefixes |
+// | + |
+// | n row suffixes |
+// | |
 // | slices. Unlike the RawBytes encoding, the first offset encoded |
 // | is not guaranteed to be zero. In the PrefixBytes encoding, the |
 // | first offset encodes the length of the column-wide prefix. The |
@@ -148,17 +169,18 @@ import (
 // # Reads
 //
 // This encoding provides O(1) access to any row by calculating the bundle for
-// the row (5*(row/4)), then the row's index within the bundle (1+(row%4)). If
-// the slice's offset equals the previous slice's offset then we step backward
-// until we find a non-empty slice or the start of the bundle (a variable number
-// of steps, but bounded by the bundle size).
+// the row (see bundleOffsetIndexForRow), then the per-row's suffix (see
+// rowSuffixIndex). If the per-row suffix's end offset equals the previous
+// offset, then the row is a duplicate key and we need to step backward until we
+// find a non-empty slice or the start of the bundle (a variable number of
+// steps, but bounded by the bundle size).
 //
 // Forward iteration can easily reuse the previous row's key with a check on
-// whether the row's slice is empty. Reverse iteration can reuse the next row's
-// key by looking at the next row's offset to determine whether we are in the
-// middle of a run of equal keys or at an edge. When reverse iteration steps
-// over an edge it has to continue backward until a non-empty slice is found
-// (just as in absolute positioning).
+// whether the row's slice is empty. Reverse iteration within a run of equal
+// keys can reuse the next row's key. When reverse iteration steps backward from
+// a non-empty slice onto an empty slice, it must continue backward until a
+// non-empty slice is found (just as in absolute positioning) to discover the
+// row suffix that is duplicated.
 //
 // The Seek{GE,LT} routines first binary search on the first key of each bundle
 // which can be retrieved without data movement because the bundle prefix is
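The slice arithmetic described in the revised comment can be checked with a small standalone sketch. It is not the package's implementation: the helper names (`bundleCount`, `sliceCount`, `bundlePrefixSlice`, `rowSuffixSlice`, `resolveSuffixSlice`, `keyAt`) are hypothetical, the indices are slice indices in the 20-slice table (block prefix at 0) rather than whatever `bundleOffsetIndexForRow` and `rowSuffixIndex` actually return, and the five-key data set at the end is invented for illustration.

```go
package main

import "fmt"

// bundleCount returns the number of bundle prefixes for n keys: the ceiling
// of n/bundleSize, written with integer arithmetic as in the revised comment.
func bundleCount(n, bundleSize int) int {
	return (n + bundleSize - 1) / bundleSize
}

// sliceCount returns the total number of slices: 1 block prefix, one prefix
// per bundle, and n row suffixes.
func sliceCount(n, bundleSize int) int {
	return 1 + bundleCount(n, bundleSize) + n
}

// bundlePrefixSlice returns the slice index of the bundle prefix covering row.
// Slice 0 is the block prefix; each full bundle occupies bundleSize+1 slices.
func bundlePrefixSlice(row, bundleSize int) int {
	return 1 + (bundleSize+1)*(row/bundleSize)
}

// rowSuffixSlice returns the slice index of the row's own suffix.
func rowSuffixSlice(row, bundleSize int) int {
	return bundlePrefixSlice(row, bundleSize) + 1 + row%bundleSize
}

// resolveSuffixSlice steps backward over empty slices (duplicate keys) until
// it reaches a non-empty slice or the first row suffix of the bundle.
func resolveSuffixSlice(endOffsets []uint32, row, bundleSize int) int {
	i := rowSuffixSlice(row, bundleSize)
	first := bundlePrefixSlice(row, bundleSize) + 1
	for i > first && endOffsets[i] == endOffsets[i-1] {
		i--
	}
	return i
}

// slice returns the bytes of slice i. Slice 0 implicitly starts at offset 0;
// every other slice starts at the previous slice's end offset.
func slice(data string, endOffsets []uint32, i int) string {
	start := uint32(0)
	if i > 0 {
		start = endOffsets[i-1]
	}
	return data[start:endOffsets[i]]
}

// keyAt reassembles a key as block prefix + bundle prefix + row suffix.
func keyAt(data string, endOffsets []uint32, row, bundleSize int) string {
	return slice(data, endOffsets, 0) +
		slice(data, endOffsets, bundlePrefixSlice(row, bundleSize)) +
		slice(data, endOffsets, resolveSuffixSlice(endOffsets, row, bundleSize))
}

func main() {
	// The 15-key example from the comment, with a bundle size of 4.
	fmt.Println("bundles:", bundleCount(15, 4), "slices:", sliceCount(15, 4))
	fmt.Println("slice index of key 13:", rowSuffixSlice(13, 4))

	// Invented data: keys aba, aba, aba, abc, abd with a bundle size of 4.
	// Slices: "ab" | "" | "a" | "" | "" | "c" | "d" | "".
	data := "abacd"
	endOffsets := []uint32{2, 2, 3, 3, 3, 4, 5, 5}
	for row := 0; row < 5; row++ {
		fmt.Println(row, keyAt(data, endOffsets, row, 4))
	}
}
```

For the 15-key example, the truncating form `n/bundleSize` yields 3 bundles where the table shows 4, which is the discrepancy the rounded-up expression corrects. For n = 16 the old total `2+n+n/bundleSize` gives 22 slices while the itemized sum gives 21.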