The streaming format is designed to be easy and fast to write or parse.
It only uses 64 bit unsigned variable sized ints to store the information.
The encoding of Uvarint is:
- unsigned integers are serialized 7 bits at a time, starting with the least significant bits
- the most significant bit (msb) in each output byte indicates if there is a continuation byte (msb = 1)
All values in this format can fit in a 64 bit unsigned value. See encoding/binary.ReadUvarint for a reference implementation.
This format has data and index split in two files, so the index can be quickly read before any decoding starts.
This index allows to keep track of the last occurence of a block, so it can be deallocated immediately afterwards.
| Content | Type | Values |
|---|---|---|
| Format ID | UvarInt | 0x1 (always) |
| MaxBlockSize | UvarInt | >= 512 |
This is the index parsing algorithm in overall terms.
for {
offset = ReadVarUint()
switch offset {
// NEW BLOCK
case 0:
x = ReadVarUint()
if x > MaxBlockSize { ERROR }
blockSize = MaxBlockSize - x
block = ReadBytesFromDataStream(blockSize)
// END OF STREAM
case 1<<64 - 1:
x = ReadVarUint()
if x > MaxBlockSize { ERROR }
blockSize = MaxBlockSize - x
block = ReadBytesFromDataStream(blockSize)
// Stream terminator
x := ReadVarUint()
if x != 0 { ERROR }
break
// DEDUPLICATED BLOCK
default:
SourceBlockNum = CurrentBlock - offset
if SourceBlockNum < 0 { ERROR }
}
}Block sizes are stored as MaxSize - Size, so fixed block sizes are all stored as size '0'.
The deduplicated offset is backwards from the the current block, so if the current block is the same as the previous, it will be encoded as '1'. If it is two blocks back, 2, etc.
Format 2 has block definitions and data interleaved. It only has a minor difference to Format 1, since it includes a Maximum backreference Length, helping the decoder to deallocate blocks.
| Content | Type | Values |
|---|---|---|
| Format ID | UvarInt | 0x2 (always) |
| MaxBlockSize | UvarInt | >= 512 |
| MaxLength | UvarInt | >= 1 |
In addition to Maximum Block Size, a MaxLength is also added, which indicates the maximum backreference distance
of this stream. This means that any offsets will be less or equal to MaxLength.
This is the decoding loop. MaxLength blocks should be kept in memory while the decoding is taking place.
Data is read from in between block definitions, if offset is 0.
for {
offset = ReadVarUint()
switch offset {
// NEW BLOCK
case 0:
x = ReadVarUint()
if x > MaxBlockSize { ERROR }
blockSize = MaxBlockSize - x
block = ReadBytes(blockSize)
// END OF STREAM
case 1<<64 - 1:
x = ReadVarUint()
if x > MaxBlockSize { ERROR }
blockSize = MaxBlockSize - x
block = ReadBytes(blockSize)
// Stream terminator
x := ReadVarUint()
if x != 0 { ERROR }
break
// DEDUPLICATED BLOCK
default:
if offset > MaxLength { ERROR }
SourceBlockNum = CurrentBlock - offset
if SourceBlockNum < 0 { ERROR }
}
}