-
Notifications
You must be signed in to change notification settings - Fork 177
Overview
A general B+-tree implementation usually has the following drawbacks:
-
B-tree stores an entire key inside not only leaf nodes but also index (internal) nodes including root node. As the length of key gets longer, the fan-out degree of each node gets smaller, and it makes the height of entire tree longer. Since the number of disk accesses is proportional to the height of tree (because we should traverse the tree from root to leaf), storing the entire key is inefficient for disk I/O. Furthermore, we should compare the entire key in each node with query while traversing from root to leaf. This B-tree implementation is inefficient to index variable-length keys.
-
If a file layout is not block-aligned, it will require more IO operations than expected. Although the logical view of a file provided by the OS file system is byte-addressable, the actual device I/O operation is performed on a per-block basis. When we try to read a node written over two blocks, the device driver has to read the two blocks even though whatever the node size is. To avoid this additional overhead, the size of a node should be fitted into multiple size of a block, and different types of data should not be interleaved in a single block.
-
B+-tree implementation can rely on the OS file system cache to cache B-tree nodes in memory. However, the file system cache is a shared resource among other processes, and it has been widely known that the OS page cache significantly increases the latency and slows down the IO performance on Solid-State Drives (SSDs) especially.
To address these limitations, we propose ForestDB, a fast key-value storage engine that is based on Hierarchical B+-tree Trie data structure, called HB+-Trie. ForestDB provides efficient indexing and retrieval on a large number of variable-length keys over tree-like structures, in terms of block device I/O. In addition, documents and internal index nodes are stored as a block-aligned form so that we can minimize the number of block I/O accesses while indexing the documents.
The main index structure is a variant of trie (i.e., prefix tree) whose nodes are B+-trees. Each B+-tree has its own nodes where all leaf nodes of each B+-tree have pointers to other B+-trees (sub-trees) or actual documents. There is a root B+-tree on top of HB+-trie, and other sub-trees are created on-demand as new nodes are created in trie.

As in the original trie, HB+-trie splits a document key into fixed-length chunks. The size of each chunk is configurable, and we set 4 bytes as default. Each chunk is used as a key for each level of B+-tree sequentially. Searching the index structure starts by retrieving the root B+-tree with the first (leftmost) chunk as a key. If a pointer from a leaf node of the root B+-tree points to a target document, the lookup operation is terminated. Otherwise (i.e., the pointer points to the next-level B+-tree), we continue the index traversal at the next-level tree with the next chunk recursively until the target document is found. This retrieval process is basically same as that of the original trie.
There are several benefits of HB+-trie over typical tree-like structures. First, dealing with a long key can be fast and space-efficient. Since we do not need to compare an entire key for each node, a lookup operation can be faster than tree-like structures which compare an entire key for each node on the path from a root to leaf nodes. Second, we do not store an entire key in HB+-trie, while typical B+-tree should provide appropriate space (up to key size * fan-out degree) for every node to store keys.
HB+-trie also provides a common prefix compression scheme as the original trie. This scheme skips any common branch among keys sharing the common prefix, reducing unnecessary tree traversals. The tree using the first different chunk (i.e. the chunk next to the common prefix) as a key stores the entire common prefix inside its header to reconstruct an original key for all searches passing through the tree. The next figure shows an example when chunk size is one byte. The number in the triangle (i.e. n-th) denotes the chunk number used as a key in the tree, and the string represents the common prefix skipped by the tree.