
HashMap Index

vinoth chandar edited this page Mar 27, 2017 · 4 revisions

Goal

Use HFile (or Voldemort's RO File Format, or TFile), i.e. something that offers random key lookups, to build a persistent hashmap that stores a mapping from recordKey => fileId, such that:

  • The index should support all the commit/rollback semantics that Hoodie does (BloomIndex accomplishes this trivially)
  • Lookups are global, i.e. even though the hoodieKey provided has both a recordKey and a partitionPath, the lookup/update happens purely on the recordKey
  • It is reasonably fast and can handle billions of keys

Going forward, we will use the term hashmap to denote such a persistent hashmap on the backing filesystem.

Basic Idea

  1. We hash the recordKey into buckets (statically over-provisioned at, say, 1000).
  2. Each bucket holds X hashmaps, which together contain all keys mapped to that bucket.
  3. tagLocation looks up all hashmaps within each bucket.
  4. updateLocation generates a new hashmap in the bucket, holding the new keys for that bucket.
  5. Periodically, all hashmaps in a bucket are merged back into one, to bound the lookup time in step 3.
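As a rough single-process illustration of the five steps above, the sketch below uses plain in-memory `HashMap`s in place of HFiles on the filesystem. All names here (`BucketedIndex`, `NUM_BUCKETS`, the method names) are hypothetical, not actual Hoodie APIs:

```java
import java.util.*;

// Hypothetical stand-in for the bucketed persistent-hashmap layout.
// In the real design each inner Map would be an HFile on the filesystem.
public class BucketedIndex {
    static final int NUM_BUCKETS = 1000; // statically over-provisioned

    // bucket -> list of immutable hashmaps, newest first
    final List<List<Map<String, String>>> buckets = new ArrayList<>();

    public BucketedIndex() {
        for (int i = 0; i < NUM_BUCKETS; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    static int bucketFor(String recordKey) {
        // floorMod keeps the bucket non-negative for any hashCode
        return Math.floorMod(recordKey.hashCode(), NUM_BUCKETS);
    }

    // tagLocation (per key): scan every hashmap in the bucket, newest first
    public String lookup(String recordKey) {
        for (Map<String, String> m : buckets.get(bucketFor(recordKey))) {
            String fileId = m.get(recordKey);
            if (fileId != null) return fileId;
        }
        return null; // not indexed => this record is an insert
    }

    // updateLocation: add a brand-new hashmap per bucket for this commit's keys
    public void addCommit(Map<String, String> newKeyToFileId) {
        Map<Integer, Map<String, String>> perBucket = new HashMap<>();
        newKeyToFileId.forEach((k, v) ->
            perBucket.computeIfAbsent(bucketFor(k), b -> new HashMap<>()).put(k, v));
        perBucket.forEach((b, m) -> buckets.get(b).add(0, m)); // prepend newest
    }

    // Periodic merge (step 5): collapse a bucket's hashmaps into one,
    // bounding the number of maps a lookup has to scan
    public void compactBucket(int b) {
        Map<String, String> merged = new HashMap<>();
        List<Map<String, String>> maps = buckets.get(b);
        for (int i = maps.size() - 1; i >= 0; i--) {
            merged.putAll(maps.get(i)); // iterate oldest->newest so newest wins
        }
        maps.clear();
        maps.add(merged);
    }
}
```

Note that lookups scan a bucket newest-first, so a re-mapped key resolves to its latest fileId even before compaction runs.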

tagLocation

The Spark DAG here looks like the following:

 ___________________       ____________________________       _________________________________________
| RDD[HoodieRecord] | => | Hash(recordKey) to buckets | => | Check against all HFiles within bucket | => insert or update
 ___________________       ____________________________       _________________________________________
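The two stages of this DAG can be mimicked locally with Java streams: group incoming keys by bucket (the shuffle in the real DAG), then probe each hashmap within the bucket. This is a hypothetical sketch; `tagLocation` here takes an in-memory `index` (bucket -> list of hashmaps) standing in for HFiles, and none of the names come from the Hoodie codebase:

```java
import java.util.*;
import java.util.stream.*;

public class TagLocationSketch {
    static final int NUM_BUCKETS = 1000;

    static int bucketFor(String recordKey) {
        return Math.floorMod(recordKey.hashCode(), NUM_BUCKETS);
    }

    // Stage 1: hash(recordKey) -> bucket. Stage 2: within each bucket,
    // check every hashmap, newest first. Keys with no hit are inserts
    // and are simply absent from the returned tagging.
    static Map<String, String> tagLocation(
            Map<Integer, List<Map<String, String>>> index, List<String> keys) {
        Map<Integer, List<String>> byBucket = keys.stream()
            .collect(Collectors.groupingBy(TagLocationSketch::bucketFor));
        Map<String, String> tagged = new HashMap<>();
        byBucket.forEach((bucket, bucketKeys) -> {
            List<Map<String, String>> maps = index.getOrDefault(bucket, List.of());
            for (String k : bucketKeys) {
                maps.stream()
                    .map(m -> m.get(k))
                    .filter(Objects::nonNull)
                    .findFirst()
                    .ifPresent(fileId -> tagged.put(k, fileId));
            }
        });
        return tagged;
    }
}
```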

updateLocation

Spark DAG for updating location.

 __________________       _________________________________________________       _____________________________
| RDD[WriteStatus] | => | Filter out updates & hash(recordKey) to buckets | => | Add new hashmap into bucket |
 __________________       _________________________________________________       _____________________________
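A local sketch of this DAG: drop the updates (their keys are already indexed), hash the freshly inserted keys to buckets, and build one new hashmap per bucket. `Written` is a made-up stand-in for a WriteStatus row, and the method names are illustrative (requires Java 16+ for records):

```java
import java.util.*;
import java.util.stream.*;

public class UpdateLocationSketch {
    static final int NUM_BUCKETS = 1000;

    // Minimal stand-in for a WriteStatus row: the record key, the fileId it
    // was written to, and whether the write was an update vs a fresh insert.
    record Written(String recordKey, String fileId, boolean isUpdate) {}

    static int bucketFor(String recordKey) {
        return Math.floorMod(recordKey.hashCode(), NUM_BUCKETS);
    }

    // Filter out updates (already present in the index) and build one new
    // hashmap per bucket, holding only the freshly inserted keys.
    static Map<Integer, Map<String, String>> newHashmaps(List<Written> writes) {
        return writes.stream()
            .filter(w -> !w.isUpdate())
            .collect(Collectors.groupingBy(
                w -> bucketFor(w.recordKey()),
                Collectors.toMap(Written::recordKey, Written::fileId)));
    }
}
```

Each map in the result would then be flushed as a new file into its bucket's directory, leaving the periodic merge to fold it in later.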

rollbackCommit

Open questions

Q: Given that our key and value are fixed length (100 bytes total), would a simpler implementation work better? A:

Q: Do we still need to introduce a notion of caching? How effective is caching on the filesystem? A:

Q: Should the hashmap be sorted? Would that help us take advantage of key ranges? A: With UUID-based keys, it does not matter. But if the recordKey is time-based, we can significantly cut down comparisons. So we should do it if possible.
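To illustrate that answer: if each hashmap additionally tracks the min/max key it contains, a lookup on a time-prefixed key can skip whole files without probing them. `RangedMap` and the key format below are invented purely for this sketch (Java 16+ for records):

```java
import java.util.*;

public class RangePruneSketch {
    // A hashmap plus the [minKey, maxKey] range of the keys it holds.
    record RangedMap(String minKey, String maxKey, Map<String, String> entries) {}

    static String lookup(List<RangedMap> maps, String key) {
        for (RangedMap m : maps) {
            // Range pruning: skip the file entirely when the key cannot be in it.
            if (key.compareTo(m.minKey()) < 0 || key.compareTo(m.maxKey()) > 0) {
                continue;
            }
            String fileId = m.entries().get(key);
            if (fileId != null) return fileId;
        }
        return null;
    }
}
```

With random UUID keys every file's range spans essentially the whole keyspace, so nothing gets pruned; with time-prefixed keys, most lookups touch only the few files covering that time window.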
