HashMap Index
Use HFile (or Voldemort's RO File Format or TFile), something that offers random key lookups, to build a persistent Hashmap to store a mapping from recordKey => fileId, such that:
- Index should support all commit/rollback semantics that Hoodie does (BloomIndex accomplishes this trivially)
- Global lookup, i.e. even though the `hoodieKey` provided has both a recordKey and a partitionPath, the lookup/update happens purely on recordKey
- Is reasonably fast & can handle billions of keys
Going forward, we will use the term hashmap to denote such a persistent hashmap on the backing filesystem.
1. We hash `recordKey` into buckets (statically over-provisioned at, say, 1000).
2. Each bucket has X hashmaps, which together contain all keys mapped to the bucket.
3. `tagLocation` looks up all hashmaps within each bucket.
4. `updateLocation` generates a new hashmap in the bucket, holding the new keys for that bucket.
5. Periodically, all hashmaps in a bucket are merged back into one, to bound the lookup time in #3.
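The steps above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not the Hoodie implementation: each `dict` stands in for one persistent hashmap file, and the names `BucketedIndex`, `bucket_for` and `merge`, the CRC32 hash, and the newest-first probe order are all hypothetical choices made for the example.

```python
import zlib
from typing import Dict, List, Optional

NUM_BUCKETS = 1000  # statically over-provisioned bucket count (step 1)


def bucket_for(record_key: str) -> int:
    """Map a record key to a bucket with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(record_key.encode("utf-8")) % NUM_BUCKETS


class BucketedIndex:
    """In-memory stand-in: each dict plays the role of one persistent hashmap."""

    def __init__(self) -> None:
        # bucket id -> list of hashmaps (recordKey -> fileId), oldest first
        self.buckets: Dict[int, List[Dict[str, str]]] = {}

    def tag_location(self, record_key: str) -> Optional[str]:
        """Step 3: probe every hashmap in the key's bucket (newest first here)."""
        for hashmap in reversed(self.buckets.get(bucket_for(record_key), [])):
            if record_key in hashmap:
                return hashmap[record_key]
        return None  # key not indexed yet, so the record is an insert

    def update_location(self, new_locations: Dict[str, str]) -> None:
        """Step 4: add one new hashmap to each bucket that received new keys."""
        per_bucket: Dict[int, Dict[str, str]] = {}
        for key, file_id in new_locations.items():
            per_bucket.setdefault(bucket_for(key), {})[key] = file_id
        for bucket_id, hashmap in per_bucket.items():
            self.buckets.setdefault(bucket_id, []).append(hashmap)

    def merge(self) -> None:
        """Step 5: collapse each bucket's hashmaps into one to bound lookups."""
        for bucket_id, hashmaps in self.buckets.items():
            merged: Dict[str, str] = {}
            for hashmap in hashmaps:  # oldest to newest
                merged.update(hashmap)
            self.buckets[bucket_id] = [merged]
```

Without the merge in step 5, the cost of a lookup grows with the number of `updateLocation` calls that touched the bucket, which is why the periodic merge is what bounds step 3.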
The Spark DAG for tagging location looks like the following.
_____________________ ____________________________ __________________________________________
| RDD[HoodieRecord] | => |Hash(recordKey) to buckets| => | Check against all HFiles within bucket | => insert or update
_____________________ ____________________________ __________________________________________
Spark DAG for updating location.
_____________________ _________________________________________________ ________________________________
| RDD[WriteStatus] | => |Filter out updates & hash(recordkey) to buckets| => | Add new hashmap into bucket |
_____________________        _________________________________________________         ________________________________
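As a rough illustration of the middle stage of the update DAG, the sketch below filters out updates and groups the remaining keys by bucket, producing the contents of one new hashmap per bucket. It is plain Python rather than Spark code, and `WrittenRecord`, its fields, and the CRC32 hash are hypothetical stand-ins for whatever the real `WriteStatus` exposes.

```python
import zlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

NUM_BUCKETS = 1000  # same static bucket count used when tagging


@dataclass(frozen=True)
class WrittenRecord:
    """Hypothetical flattened view of one record reported by a write."""
    record_key: str
    file_id: str
    is_update: bool


def new_hashmaps_per_bucket(
    written: Iterable[WrittenRecord],
) -> Dict[int, Dict[str, str]]:
    """Drop updates, then group the newly inserted keys by bucket."""
    new_hashmaps: Dict[int, Dict[str, str]] = defaultdict(dict)
    for rec in written:
        if rec.is_update:
            continue  # key is already indexed, nothing to add
        bucket = zlib.crc32(rec.record_key.encode("utf-8")) % NUM_BUCKETS
        new_hashmaps[bucket][rec.record_key] = rec.file_id
    return dict(new_hashmaps)
```

Each entry of the returned mapping corresponds to one new hashmap that would be written into its bucket in the final stage.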
Q: Given that our keys and values are fixed length (100 bytes total), would a simpler implementation work better? A:
Q: Do we still need to introduce a notion of caching? How effective is caching on the filesystem? A:
Q: Should the hashmap be sorted, and would that help us take advantage of key ranges? A: With UUID-based keys, it does not matter. But if the recordKey is time-based, we can significantly cut down comparisons. So we should do it if possible.
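To make the key-range argument concrete, here is a minimal sketch, assuming each hashmap keeps its keys sorted so that its smallest and largest key are known. `SortedHashmap`, `may_contain` and `lookup` are hypothetical names used only for illustration. With time-based keys, each hashmap tends to cover a narrow range, so most hashmaps in a bucket can be skipped without being probed.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


class SortedHashmap:
    """Stand-in for one sorted, immutable hashmap (recordKey -> fileId)."""

    def __init__(self, entries: List[Tuple[str, str]]) -> None:
        ordered = sorted(entries)
        self.keys = [k for k, _ in ordered]
        self.file_ids = [f for _, f in ordered]

    def may_contain(self, key: str) -> bool:
        """Cheap range check against the smallest and largest key."""
        return bool(self.keys) and self.keys[0] <= key <= self.keys[-1]

    def get(self, key: str) -> Optional[str]:
        """Binary search over the sorted keys."""
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.file_ids[i]
        return None


def lookup(hashmaps: List[SortedHashmap], key: str) -> Tuple[Optional[str], int]:
    """Return (fileId or None, number of hashmaps actually probed)."""
    probed = 0
    for hashmap in hashmaps:
        if not hashmap.may_contain(key):
            continue  # key is outside this hashmap's range, skip it
        probed += 1
        found = hashmap.get(key)
        if found is not None:
            return found, probed
    return None, probed
```

With UUID-based keys, every hashmap's range spans nearly the whole key space, so `may_contain` almost never rules anything out, which matches the answer above.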