HashMap Index

Goal

Use HFile (or Voldemort's RO File Format or TFile), something that offers random key lookups, to build a persistent Hashmap to store a mapping from recordKey => fileId, such that

Index should support all commit/rollback semantics that Hoodie does (BloomIndex accomplishes this trivially)
Global lookup i.e even though the hoodieKey provided has both a recordKey and partitionPath, the lookup/update happens purely on recordKey
Is reasonably fast & can handle billions of keys

Going forward, we will use the term hashmap to denote such a persistent hashmap on the backing filesystem.

Basic Idea

We hash recordKey into buckets (statically over-provision at say 1000).
Each bucket has a X hashmaps, contains all keys mapped to the bucket.
tagLocation looks up all hashmaps within each bucket
updateLocation will generate a new hashmap into bucket, with new keys for the bucket
Periodically, all hashmaps are merged back into 1, bound lookup time in #3

tagLocation

The Spark DAG here looks like below.

_____________________    ____________________________    __________________________________________   
| RDD[HoodieRecord] | => |Hash(recordKey) to buckets| => | Check against all HFiles within bucket | => insert or update 
_____________________    ____________________________    __________________________________________

updateLocation

Spark DAG for updating location.

_____________________   _________________________________________________    ________________________________  
| RDD[WriteStatus] | => |Filter out updates & hash(recordkey) to buckets| => | Add new hashmap into bucket | 
_____________________   _________________________________________________    __________________________________________

rollbackCommit

Open questions

Q: Given our key and value is fixed length (100 bytes total), would a more simpler implementation work better? **A: **

Q: Do we still need to introduce a notion of caching? How effective is the caching on Filesystem? **A: **

Q: Should the hashmap be sorted, would it help us take advantage of key ranges? A: With uuid based keys, it does not matter. But if the recordKey is timebased, we can significantly cut down comparisions. So we should do it if possible.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HashMap Index

Goal

Basic Idea

tagLocation

updateLocation

rollbackCommit

Open questions

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally