HashMap Index

vinoth chandar edited this page Mar 27, 2017 · 4 revisions

Goal

Use HFile (or Voldemort's read-only store format, or TFile), i.e. something that offers random key lookups, to build a persistent hashmap that stores a mapping from recordKey => fileId, such that:

  • The index should support all commit/rollback semantics that Hoodie does (BloomIndex accomplishes this trivially)
  • Lookup is global, i.e. even though the hoodieKey provided has both a recordKey and a partitionPath, the lookup/update happens purely on the recordKey
  • It is reasonably fast and can handle billions of keys

Going forward, we will use the term hashmap to denote such a persistent hashmap on the backing filesystem.

Basic Idea

HFile should provide a persistent hashmap on HDFS. We statically over-provision buckets (say 1000) by hashing the recordKey, and each bucket contains a few such hashmap files.
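A minimal sketch of the bucketing step (the bucket count of 1000 comes from the text; the hash function and all class/method names here are illustrative assumptions, not the actual implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BucketHashing {
    // Statically over-provisioned bucket count, as described above.
    static final int NUM_BUCKETS = 1000;

    // Map a recordKey to one of the fixed buckets. The lookup is global,
    // so the partitionPath plays no role in choosing the bucket.
    static int bucketFor(String recordKey) {
        byte[] bytes = recordKey.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(bytes), NUM_BUCKETS);
    }

    public static void main(String[] args) {
        String key = "uuid-1234";
        int bucket = bucketFor(key);
        System.out.println(key + " -> bucket " + bucket);
        // The same key must always land in the same bucket.
        assert bucket == bucketFor(key);
        assert bucket >= 0 && bucket < NUM_BUCKETS;
    }
}
```

Because the bucket count is fixed up front, a record's bucket never changes across commits, which is what lets each bucket's hashmap files be looked up and rewritten independently.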

tagLocation

The Spark DAG here looks like the following.


|Input RDD[HoodieRecord] | => |Hash by recordKey into buckets| => | Lookup HFiles in each bucket | => insert or update
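The DAG above can be sketched in plain Java, with in-memory HashMaps standing in for the per-bucket HFiles (the class, method names, and the hash choice are illustrative assumptions):

```java
import java.util.*;

public class TagLocationSketch {

    // Hash a recordKey into one of the buckets (illustrative hash choice).
    static int bucketFor(String recordKey, int numBuckets) {
        return Math.floorMod(recordKey.hashCode(), numBuckets);
    }

    // bucketIndexes.get(b) stands in for bucket b's HFile: recordKey -> fileId.
    // Returns recordKey -> fileId, with null marking a record that is not
    // yet indexed (an insert).
    static Map<String, String> tagLocation(List<String> recordKeys,
                                           List<Map<String, String>> bucketIndexes) {
        // Stage 1: hash incoming records into buckets.
        Map<Integer, List<String>> byBucket = new HashMap<>();
        for (String key : recordKeys) {
            byBucket.computeIfAbsent(bucketFor(key, bucketIndexes.size()),
                                     b -> new ArrayList<>()).add(key);
        }
        // Stage 2: look up each bucket's hashmap once for all of its keys.
        Map<String, String> tagged = new HashMap<>();
        for (Map.Entry<Integer, List<String>> e : byBucket.entrySet()) {
            Map<String, String> index = bucketIndexes.get(e.getKey());
            for (String key : e.getValue()) {
                tagged.put(key, index.get(key)); // null => insert, else update
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        List<Map<String, String>> bucketIndexes = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) bucketIndexes.add(new HashMap<>());
        // Pre-populate the index with one existing record.
        bucketIndexes.get(bucketFor("k1", numBuckets)).put("k1", "file-001");

        Map<String, String> tagged =
            tagLocation(Arrays.asList("k1", "k2"), bucketIndexes);
        System.out.println(tagged); // k1 tagged with file-001; k2 is an insert
    }
}
```

In Spark terms, stage 1 corresponds to partitioning the input RDD by bucket, and stage 2 to a mapPartitions-style pass that opens each bucket's files once, so each HFile is read by exactly one task.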


updateLocation

rollbackCommit

Open questions

  • Given that our key and value are fixed-length (100 bytes total), would a simpler implementation work better?
  • Do we still need to introduce a notion of caching? How effective is caching on the filesystem?
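To make the first question concrete: since records are fixed-width, a "simpler implementation" could be a sorted flat file of 100-byte records searched by binary search, with no block index at all. A minimal in-memory sketch (the 32/68-byte key/value split and every name here are assumptions for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

public class FixedWidthIndex {
    // Assumed split of the 100-byte record into key and value portions.
    static final int KEY_LEN = 32, VAL_LEN = 68, REC_LEN = KEY_LEN + VAL_LEN;

    // Zero-pad a string to a fixed-width byte field.
    static byte[] pad(String s, int len) {
        byte[] out = new byte[len];
        byte[] src = s.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(src, 0, out, 0, Math.min(src.length, len));
        return out;
    }

    // Serialize entries (already sorted by key) into one flat byte array,
    // standing in for the on-disk file.
    static byte[] buildSorted(SortedMap<String, String> entries) {
        byte[] data = new byte[entries.size() * REC_LEN];
        int i = 0;
        for (Map.Entry<String, String> e : entries.entrySet()) {
            System.arraycopy(pad(e.getKey(), KEY_LEN), 0, data, i * REC_LEN, KEY_LEN);
            System.arraycopy(pad(e.getValue(), VAL_LEN), 0, data, i * REC_LEN + KEY_LEN, VAL_LEN);
            i++;
        }
        return data;
    }

    static int compareKeys(byte[] data, int recOffset, byte[] key) {
        for (int i = 0; i < KEY_LEN; i++) {
            int c = Byte.compare(data[recOffset + i], key[i]);
            if (c != 0) return c;
        }
        return 0;
    }

    // Binary search over fixed-width records: O(log n) seeks, no index blocks.
    static String lookup(byte[] data, String recordKey) {
        byte[] key = pad(recordKey, KEY_LEN);
        int lo = 0, hi = data.length / REC_LEN - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareKeys(data, mid * REC_LEN, key);
            if (c == 0) {
                return new String(data, mid * REC_LEN + KEY_LEN, VAL_LEN,
                                  StandardCharsets.UTF_8).trim();
            } else if (c < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null; // key not present
    }

    public static void main(String[] args) {
        TreeMap<String, String> m = new TreeMap<>();
        m.put("k1", "file-001");
        m.put("k2", "file-002");
        m.put("k3", "file-003");
        byte[] data = buildSorted(m);
        System.out.println(lookup(data, "k2")); // found: file-002
        System.out.println(lookup(data, "zz")); // not found: null
    }
}
```

The trade-off versus HFile would be losing block compression and the existing reader ecosystem, in exchange for trivially computable record offsets.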
