HashMap Index

vinoth chandar edited this page Mar 27, 2017 · 4 revisions

Goal

Use HFile (or Voldemort's read-only store format, or TFile), i.e. something that offers random key lookups, to build a persistent hashmap that stores a mapping from recordKey => fileId, such that:

  • The index should support all commit/rollback semantics that Hoodie does (BloomIndex accomplishes this trivially)
  • Lookup is global, i.e. even though the hoodieKey provided has both a recordKey and a partitionPath, the lookup/update happens purely on the recordKey
  • It is reasonably fast and can handle billions of keys

Going forward, we will use the term hashmap to denote such a persistent hashmap on the backing filesystem.

Basic Idea

HFile should provide a persistent hashmap on HDFS. We statically over-provision buckets (say 1000) by hashing the recordKey, and each bucket contains a few such hashmap files.
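A minimal sketch of the bucketing step (the bucket count of 1000 comes from the text; the hash function and all class/method names here are illustrative assumptions, not the actual implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BucketHashing {
    // Statically over-provisioned bucket count, as described above.
    static final int NUM_BUCKETS = 1000;

    // Map a recordKey to one of the fixed buckets. The lookup is global,
    // so the partitionPath plays no role in choosing the bucket.
    static int bucketFor(String recordKey) {
        byte[] bytes = recordKey.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(bytes), NUM_BUCKETS);
    }

    public static void main(String[] args) {
        String key = "uuid-1234";
        int bucket = bucketFor(key);
        System.out.println(key + " -> bucket " + bucket);
        // The same key must always land in the same bucket.
        assert bucket == bucketFor(key);
        assert bucket >= 0 && bucket < NUM_BUCKETS;
    }
}
```

Because the bucket count is fixed up front, a record's bucket never changes across commits, which is what lets each bucket's hashmap files be looked up and rewritten independently.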

tagLocation

The Spark DAG here looks like the following.


|Input RDD[HoodieRecord] | => |Hash by recordKey into buckets| => | Lookup HFiles in each bucket | => insert or update
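The DAG above can be sketched in plain Java, with in-memory HashMaps standing in for the per-bucket HFiles (the class, method names, and the hash choice are illustrative assumptions):

```java
import java.util.*;

public class TagLocationSketch {

    // Hash a recordKey into one of the buckets (illustrative hash choice).
    static int bucketFor(String recordKey, int numBuckets) {
        return Math.floorMod(recordKey.hashCode(), numBuckets);
    }

    // bucketIndexes.get(b) stands in for bucket b's HFile: recordKey -> fileId.
    // Returns recordKey -> fileId, with null marking a record that is not
    // yet indexed (an insert).
    static Map<String, String> tagLocation(List<String> recordKeys,
                                           List<Map<String, String>> bucketIndexes) {
        // Stage 1: hash incoming records into buckets.
        Map<Integer, List<String>> byBucket = new HashMap<>();
        for (String key : recordKeys) {
            byBucket.computeIfAbsent(bucketFor(key, bucketIndexes.size()),
                                     b -> new ArrayList<>()).add(key);
        }
        // Stage 2: look up each bucket's hashmap once for all of its keys.
        Map<String, String> tagged = new HashMap<>();
        for (Map.Entry<Integer, List<String>> e : byBucket.entrySet()) {
            Map<String, String> index = bucketIndexes.get(e.getKey());
            for (String key : e.getValue()) {
                tagged.put(key, index.get(key)); // null => insert, else update
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        List<Map<String, String>> bucketIndexes = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) bucketIndexes.add(new HashMap<>());
        // Pre-populate the index with one existing record.
        bucketIndexes.get(bucketFor("k1", numBuckets)).put("k1", "file-001");

        Map<String, String> tagged =
            tagLocation(Arrays.asList("k1", "k2"), bucketIndexes);
        System.out.println(tagged); // k1 tagged with file-001; k2 is an insert
    }
}
```

In Spark terms, stage 1 corresponds to partitioning the input RDD by bucket, and stage 2 to a mapPartitions-style pass that opens each bucket's files once, so each HFile is read by exactly one task.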


updateLocation

rollbackCommit

Open questions

  • Given that our key and value are fixed-length (100 bytes total), would a simpler implementation work better?
  • Do we still need to introduce a notion of caching? How effective is caching on the filesystem?
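To make the first question concrete: since records are fixed-width, a "simpler implementation" could be a sorted flat file of 100-byte records searched by binary search, with no block index at all. A minimal in-memory sketch (the 32/68-byte key/value split and every name here are assumptions for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

public class FixedWidthIndex {
    // Assumed split of the 100-byte record into key and value portions.
    static final int KEY_LEN = 32, VAL_LEN = 68, REC_LEN = KEY_LEN + VAL_LEN;

    // Zero-pad a string to a fixed-width byte field.
    static byte[] pad(String s, int len) {
        byte[] out = new byte[len];
        byte[] src = s.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(src, 0, out, 0, Math.min(src.length, len));
        return out;
    }

    // Serialize entries (already sorted by key) into one flat byte array,
    // standing in for the on-disk file.
    static byte[] buildSorted(SortedMap<String, String> entries) {
        byte[] data = new byte[entries.size() * REC_LEN];
        int i = 0;
        for (Map.Entry<String, String> e : entries.entrySet()) {
            System.arraycopy(pad(e.getKey(), KEY_LEN), 0, data, i * REC_LEN, KEY_LEN);
            System.arraycopy(pad(e.getValue(), VAL_LEN), 0, data, i * REC_LEN + KEY_LEN, VAL_LEN);
            i++;
        }
        return data;
    }

    static int compareKeys(byte[] data, int recOffset, byte[] key) {
        for (int i = 0; i < KEY_LEN; i++) {
            int c = Byte.compare(data[recOffset + i], key[i]);
            if (c != 0) return c;
        }
        return 0;
    }

    // Binary search over fixed-width records: O(log n) seeks, no index blocks.
    static String lookup(byte[] data, String recordKey) {
        byte[] key = pad(recordKey, KEY_LEN);
        int lo = 0, hi = data.length / REC_LEN - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareKeys(data, mid * REC_LEN, key);
            if (c == 0) {
                return new String(data, mid * REC_LEN + KEY_LEN, VAL_LEN,
                                  StandardCharsets.UTF_8).trim();
            } else if (c < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null; // key not present
    }

    public static void main(String[] args) {
        TreeMap<String, String> m = new TreeMap<>();
        m.put("k1", "file-001");
        m.put("k2", "file-002");
        m.put("k3", "file-003");
        byte[] data = buildSorted(m);
        System.out.println(lookup(data, "k2")); // found: file-002
        System.out.println(lookup(data, "zz")); // not found: null
    }
}
```

The trade-off versus HFile would be losing block compression and the existing reader ecosystem, in exchange for trivially computable record offsets.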
