HashMap Index
vinoth chandar edited this page Mar 27, 2017 · 4 revisions
Use HFile (or Voldemort's read-only file format, or TFile), i.e. something that offers random key lookups, to build a persistent hashmap that stores a mapping from recordKey => fileId, such that:
- The index supports all commit/rollback semantics that Hoodie does (BloomIndex accomplishes this trivially).
- Lookups are global, i.e. even though the HoodieKey provided has both a recordKey and a partitionPath, the lookup/update happens purely on the recordKey.
- Lookups are reasonably fast and can handle billions of keys.
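To make the commit/rollback and global-lookup requirements concrete, here is a minimal in-memory sketch of the contract. This is not Hoodie code; the class and method names are illustrative, and the persistent on-disk form (HFile) is replaced by plain HashMaps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the index contract: a global recordKey -> fileId
// mapping with commit/rollback semantics. Names are illustrative only.
public class PersistentIndexSketch {
    private final Map<String, String> committed = new HashMap<>();
    private final Map<String, String> pending = new HashMap<>();

    // Lookup is purely by recordKey, regardless of partitionPath.
    public String lookup(String recordKey) {
        String fileId = pending.get(recordKey);
        return fileId != null ? fileId : committed.get(recordKey);
    }

    public void put(String recordKey, String fileId) {
        pending.put(recordKey, fileId);
    }

    // Commit makes pending entries durable; rollback discards them,
    // mirroring the commit/rollback semantics required above.
    public void commit()   { committed.putAll(pending); pending.clear(); }
    public void rollback() { pending.clear(); }

    public static void main(String[] args) {
        PersistentIndexSketch idx = new PersistentIndexSketch();
        idx.put("uuid-1", "file-0001");
        idx.commit();
        idx.put("uuid-2", "file-0002");
        idx.rollback(); // the uncommitted entry is discarded
        System.out.println(idx.lookup("uuid-1")); // file-0001
        System.out.println(idx.lookup("uuid-2")); // null
    }
}
```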
Going forward, we will use the term hashmap to denote such a persistent hashmap on the backing filesystem.
HFile should provide such a persistent hashmap on HDFS. We statically over-provision buckets (say, 1000) by hashing the recordKey, and each bucket contains a few hashmaps (one …).
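The static bucketing step can be sketched as follows. NUM_BUCKETS = 1000 mirrors the "say 1000" figure above; the use of String.hashCode as the hash function is an assumption for illustration.

```java
// Sketch of the static bucketing scheme: hash a recordKey into one of a
// fixed number of buckets. Hash function choice is an assumption.
public class BucketAssigner {
    static final int NUM_BUCKETS = 1000;

    static int bucketFor(String recordKey) {
        // Math.floorMod keeps the result non-negative even when
        // hashCode() is negative.
        return Math.floorMod(recordKey.hashCode(), NUM_BUCKETS);
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("2017/03/27:uuid-42"));
    }
}
```

Because the bucket count is fixed up front, a recordKey always maps to the same bucket across commits, which is what lets lookups and updates go to one known set of hashmaps.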
The Spark DAG here looks like:

```
|Input RDD[HoodieRecord]| => |Hash by recordKey into buckets| => |Lookup HFiles in each bucket| => insert or update
```
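The "hash by recordKey into buckets" stage behaves like the plain-Java grouping below; in the actual DAG, Spark would do this with a hash partitioner. The key/payload pairs here are a stand-in for HoodieRecord.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the shuffle stage: group records by bucket so that each
// bucket's HFiles are opened once per task, not once per record.
public class BucketShuffleSketch {
    static int bucketFor(String recordKey, int numBuckets) {
        return Math.floorMod(recordKey.hashCode(), numBuckets);
    }

    static Map<Integer, List<Map.Entry<String, String>>> shuffle(
            List<Map.Entry<String, String>> input, int numBuckets) {
        return input.stream().collect(
                Collectors.groupingBy(r -> bucketFor(r.getKey(), numBuckets)));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = Arrays.asList(
                new AbstractMap.SimpleEntry<>("k1", "insert-a"),
                new AbstractMap.SimpleEntry<>("k2", "insert-b"),
                new AbstractMap.SimpleEntry<>("k1", "update-c"));
        // All occurrences of "k1" land in the same bucket, so the
        // subsequent HFile lookup sees every update for that key.
        System.out.println(shuffle(input, 1000));
    }
}
```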
Open questions:
- Given our key and value are fixed length (100 bytes total), would a simpler implementation work better?
- Do we still need to introduce a notion of caching? How effective is caching at the filesystem level?
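To make the first question concrete, here is one such simpler scheme: a sorted flat file of fixed-width entries searched by binary search, with no per-record index structure at all. The 60/40 byte split between key and value is a hypothetical choice; only the 100-byte total comes from the question above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of a "simpler implementation": fixed-width (recordKey, fileId)
// entries in sorted order, so lookup is a binary search over offsets.
public class FixedWidthIndexSketch {
    static final int KEY_LEN = 60, VAL_LEN = 40, REC_LEN = KEY_LEN + VAL_LEN;

    // Zero-pad key and value into one fixed 100-byte record.
    static byte[] encode(String key, String value) {
        byte[] rec = new byte[REC_LEN];
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(k, 0, rec, 0, k.length);
        System.arraycopy(v, 0, rec, KEY_LEN, v.length);
        return rec;
    }

    // Binary search over a buffer of sorted fixed-width records.
    static String lookup(byte[] data, String key) {
        byte[] target = Arrays.copyOf(key.getBytes(StandardCharsets.UTF_8), KEY_LEN);
        int lo = 0, hi = data.length / REC_LEN - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = Arrays.compare(data, mid * REC_LEN, mid * REC_LEN + KEY_LEN,
                                     target, 0, KEY_LEN);
            if (cmp == 0) {
                return new String(data, mid * REC_LEN + KEY_LEN, VAL_LEN,
                                  StandardCharsets.UTF_8).trim();
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null; // key not present
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(3 * REC_LEN);
        // Records must be written in sorted key order for binary search.
        buf.put(encode("a-key", "file-01"))
           .put(encode("b-key", "file-02"))
           .put(encode("c-key", "file-03"));
        System.out.println(lookup(buf.array(), "b-key")); // file-02
    }
}
```

This avoids HFile's block index and metadata entirely, at the cost of requiring a full rewrite of the file on each commit to keep entries sorted.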