Overview
A general B+-tree implementation usually has the following drawbacks:
- A B+-tree stores entire keys not only in its leaf nodes but also in its index (i.e., internal) nodes, including the root node. As keys get longer, the fan-out degree of each node decreases, which makes the tree taller. Since the number of disk accesses per lookup is proportional to the height of the tree (we must traverse it from root to leaf), storing entire keys in index nodes is inefficient in terms of disk I/O. Furthermore, the entire key must be compared at each node on the path from root to leaf. Such an implementation is therefore inefficient for indexing variable-length keys (see the back-of-the-envelope sketch after this list).
- If the file layout is not block-aligned, more I/O operations are required than expected. Although the logical view of a file provided by the OS file system is byte-addressable, actual device I/O is performed on a per-block basis. When a node spans two blocks, the device driver has to read both blocks, regardless of the node size. To avoid this overhead, the node size should be a multiple of the block size, and different types of data should not be interleaved within a single block.
- A B+-tree implementation can rely on the OS file system cache to keep B+-tree nodes in memory. However, the file system cache is a resource shared among all processes, and it is widely known that the OS page cache significantly increases latency and degrades I/O performance, especially on Solid-State Drives (SSDs).
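To make the fan-out argument concrete, the following back-of-the-envelope sketch estimates fan-out and tree height for a few key sizes. The 4 KB node size, 8-byte pointer size, and 100-million-key workload are illustrative assumptions, not ForestDB's actual parameters:

```cpp
#include <cmath>
#include <cstdio>

// Rough fan-out and height estimates for a B+-tree whose index nodes
// store entire keys. All sizes below are illustrative assumptions.
int main() {
    const double node_size = 4096;   // assume one 4 KB block per index node
    const double ptr_size  = 8;      // assume an 8-byte child pointer
    const double num_keys  = 100e6;  // assume 100 million indexed keys

    for (double key_size : {8.0, 64.0, 256.0}) {
        double fanout = node_size / (key_size + ptr_size);
        double height = std::ceil(std::log(num_keys) / std::log(fanout));
        std::printf("key %4.0f B -> fan-out %4.0f, height %2.0f\n",
                    key_size, fanout, height);
    }
    // Longer keys shrink the fan-out, so the tree (and the number of
    // disk accesses per lookup) grows taller.
    return 0;
}
```

Under these assumptions, growing the key from 8 to 256 bytes shrinks the fan-out from 256 to about 15 and adds several levels to the tree, i.e., several extra disk accesses per lookup.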
To address these limitations, we propose ForestDB, a fast key-value storage engine based on a Hierarchical B+-tree-based Trie data structure, called HB+-trie. ForestDB provides efficient indexing and retrieval of a large number of variable-length keys, in terms of block device I/O operations, compared to tree-like structures. In addition, documents and internal index nodes are stored in a block-aligned form so that the number of block I/O accesses is minimized while indexing documents.
ForestDB supports Multi-Version Concurrency Control (MVCC) and persists database changes in an append-only manner. This allows multiple concurrent readers and writers: writers are serialized, but they can interleave with readers, so an arbitrary number of readers can access a given database instance concurrently without being blocked by writers. However, we recommend a single writer and multiple readers per database instance to avoid synchronization overhead among multiple writers.
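As a point of reference, here is a minimal single-writer usage sketch against ForestDB's public C API (declared in `libforestdb/forestdb.h`). Error handling is omitted and the file path is arbitrary; consult the header for the authoritative signatures:

```cpp
// Build sketch (assumed): g++ example.cc -lforestdb
#include <libforestdb/forestdb.h>
#include <cstdio>
#include <cstdlib>

int main() {
    fdb_file_handle *dbfile;
    fdb_kvs_handle *db;
    fdb_config config = fdb_get_default_config();
    fdb_kvs_config kvs_config = fdb_get_default_kvs_config();

    fdb_open(&dbfile, "./example.fdb", &config);
    fdb_kvs_open_default(dbfile, &db, &kvs_config);

    // Each mutation is appended to the end of the file; a commit makes
    // it durable and visible to subsequent readers.
    fdb_set_kv(db, (void*)"doc1", 4, (void*)"value1", 6);
    fdb_commit(dbfile, FDB_COMMIT_NORMAL);

    void *value;
    size_t valuelen;
    if (fdb_get_kv(db, (void*)"doc1", 4, &value, &valuelen) == FDB_RESULT_SUCCESS) {
        std::printf("%.*s\n", (int)valuelen, (char*)value);
        free(value);  // the value returned by fdb_get_kv is heap-allocated
    }

    fdb_kvs_close(db);
    fdb_close(dbfile);
    fdb_shutdown();
    return 0;
}
```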
ForestDB's main index structure is a variant of a trie (i.e., prefix tree) whose nodes are B+-trees. Each B+-tree has its own nodes, and each leaf node of a B+-tree stores a pointer to either another B+-tree (sub-tree) or an actual document. There is a root B+-tree at the top of the HB+-trie, and other sub-trees are created on demand as new nodes are created in the trie.
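The following toy C++ model sketches this layout; all type and field names are hypothetical and only illustrate the pointer structure, not ForestDB's internals:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct BPlusTree;  // each node of the trie is itself a B+-tree

// A leaf entry of a B+-tree points either to a document on disk or to
// the next-level B+-tree (sub-tree).
struct LeafValue {
    enum class Kind { Document, SubTree } kind;
    union {
        uint64_t doc_offset;   // block-aligned file offset of a document
        BPlusTree *subtree;    // next-level B+-tree, keyed by the next chunk
    };
};

struct BPlusTree {
    size_t chunk_no;  // which key chunk this tree uses as its key
    std::map<std::vector<uint8_t>, LeafValue> leaves;  // stand-in for real B+-tree nodes
};
```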

As in the original trie, HB+-trie splits a document key into fixed-length chunks. The chunk size is configurable, with a default of 4 bytes. The chunks are used, in order, as keys for the successive levels of B+-trees. A lookup starts by searching the root B+-tree with the first (leftmost) chunk as the key. If the resulting pointer from a leaf node of the root B+-tree points to the target document, the lookup terminates. Otherwise (i.e., the pointer points to a next-level B+-tree), we continue the traversal in that tree with the next chunk, recursively, until the target document is found. This retrieval process is basically the same as in the original trie.
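Reusing the toy types from the previous sketch, the lookup loop might look like the following (again purely illustrative; `kChunkSize` mirrors the 4-byte default):

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>

constexpr size_t kChunkSize = 4;  // default chunk size

// Consume one chunk per B+-tree level until a document pointer is found.
// Builds on the BPlusTree/LeafValue sketch above.
std::optional<uint64_t> lookup(const BPlusTree *tree, const std::string &key) {
    size_t offset = 0;
    while (tree) {
        if (offset >= key.size()) return std::nullopt;  // ran out of key chunks
        size_t end = std::min(offset + kChunkSize, key.size());
        std::vector<uint8_t> chunk(key.begin() + offset, key.begin() + end);

        auto it = tree->leaves.find(chunk);
        if (it == tree->leaves.end()) return std::nullopt;  // key not indexed
        if (it->second.kind == LeafValue::Kind::Document)
            return it->second.doc_offset;                   // found the document
        tree = it->second.subtree;                          // descend with the next chunk
        offset += kChunkSize;
    }
    return std::nullopt;
}
```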
HB+-trie has several benefits over typical tree-like structures. First, handling long keys is fast and space-efficient. Since we do not need to compare the entire key at each node, a lookup can be faster than in tree-like structures that compare the entire key at every node on the path from root to leaf. Second, HB+-trie does not store entire keys, whereas a typical B+-tree must reserve space (up to key size × fan-out degree) in every node to store them.
Like the original trie, HB+-trie provides a common-prefix compression scheme. This scheme skips any common branch among keys sharing a common prefix, avoiding unnecessary tree traversals. The tree that uses the first differing chunk (i.e., the chunk right after the common prefix) as its key stores the entire common prefix in its header, so that the original key can be reconstructed for any search passing through the tree. The next figure shows an example with a chunk size of one byte. The number in each triangle (i.e., n-th) denotes the chunk number used as a key in that tree, and the string represents the common prefix skipped by the tree.
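The following small sketch (illustrative logic, not ForestDB's code) computes how many whole chunks two colliding keys share, which determines the chunk number the distinguishing sub-tree is keyed on:

```cpp
#include <cstdio>
#include <string>

// Length of the common prefix of two keys, measured in whole chunks.
// The sub-tree separating them is keyed on chunk (n + 1) and stores the
// skipped n-chunk prefix in its header.
size_t common_prefix_chunks(const std::string &a, const std::string &b,
                            size_t chunk_size) {
    size_t n = 0;
    while ((n + 1) * chunk_size <= a.size() && (n + 1) * chunk_size <= b.size() &&
           a.compare(n * chunk_size, chunk_size, b, n * chunk_size, chunk_size) == 0)
        ++n;
    return n;
}

int main() {
    // With 1-byte chunks, "aaaa" and "aaab" share 3 chunks ("aaa"), so a
    // single sub-tree keyed on the 4th chunk separates them.
    std::printf("%zu\n", common_prefix_chunks("aaaa", "aaab", 1));  // prints 3
}
```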

We maintain two separate indexes per ForestDB instance: an ID index and a Seq index. The ID index (i.e., the primary index) uses a document ID as its index key, while the Seq index uses a 64-bit sequence number as its index key. The sequence number, maintained per database instance, is incremented on each mutation (i.e., insert, update, delete) and assigned to the corresponding document, which allows us to look up a document by its sequence number in the Seq index.
The HB+-trie is used as the ID index, while a B+-tree is used as the Seq index. We use the same B+-tree implementation for HB+-trie nodes and the Seq index. Since the key size of this B+-tree is fixed, there is no space overhead for supporting variable-length keys.
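A toy C++ model of the two indexes; the names are hypothetical, and ordinary maps stand in for the HB+-trie and the B+-tree:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative model only: every mutation bumps the per-instance sequence
// counter and records the document's location in both indexes.
struct DualIndex {
    uint64_t seqnum = 0;                       // per-instance sequence counter
    std::map<std::string, uint64_t> id_index;  // doc ID -> doc offset (HB+-trie in ForestDB)
    std::map<uint64_t, uint64_t> seq_index;    // seqnum -> doc offset (B+-tree in ForestDB)

    void mutate(const std::string &id, uint64_t doc_offset) {
        ++seqnum;  // each insert/update/delete gets a fresh sequence number
        id_index[id] = doc_offset;
        seq_index[seqnum] = doc_offset;
    }
};
```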
HB+-trie is most beneficial when keys are sufficiently long (at least 8 bytes) and their distribution is uniformly random (e.g., hash values). The next figure shows an example of random keys with a chunk size of 2 bytes. Since there is no common prefix, the keys can be distinguished by their first chunks alone, and no sub-trees need to be created to compare subsequent chunks. Assuming an n-bit chunk size and an (ideally) uniform random key distribution, up to 2^n keys can be indexed by storing only their first chunks in the root B+-tree. This makes HB+-trie far more efficient, in terms of both the space occupied by the index structure and the number of disk accesses, than a naive B+-tree implementation that stores entire keys in its index nodes.
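A quick computation of that 2^n bound; the chunk sizes iterated over are just examples:

```cpp
#include <cmath>
#include <cstdio>

// With an n-bit chunk and uniformly random keys, the root B+-tree alone
// can distinguish up to 2^n keys by their first chunks.
int main() {
    for (int chunk_bytes : {2, 4, 8}) {
        double capacity = std::pow(2.0, 8.0 * chunk_bytes);
        std::printf("chunk %d bytes -> up to %.3g distinct first chunks\n",
                    chunk_bytes, capacity);
    }
    // e.g. a 2-byte chunk distinguishes up to 65,536 keys, and the default
    // 4-byte chunk up to ~4.29 billion, without creating any sub-trees.
}
```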
