-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Search before asking
- I searched in the issues and found nothing similar.
Motivation
At present, for scalar indexes we mainly support the Bitmap index. Many thanks to @leaves12138 for the implementation, which has established a solid framework for scalar indexing. However, a conventional Bitmap index has significant limitations and does not work well for scenarios with high data cardinality, such as int, double, and string types.
A global B-Tree index can be built on any comparable data type, providing efficient point lookups and range queries. In addition, it is more amenable to distributed parallel processing during updates and reads. Therefore, this issue proposes implementing a distributed global B-Tree index.
Solution
The basic implementation will be built on the SST FileFormat introduced in #6734, providing point lookup and range query capabilities to cover most common SQL filter predicates.
index construction
Index construction can be efficiently implemented via range shuffle in Flink or Spark: different writer tasks are responsible for writing index files for their assigned key ranges, and a commit task then performs a unified commit of all generated files. As below:
index query
The metadata of B-Tree index files will record information such as the file’s min key, max key, and whether it contains nulls (hasNulls). During index planning, we can use this metadata to prune candidate files, then query the remaining files in parallel and merge the results.
We will bring more details in future PRs.
Anything else?
No response
Are you willing to submit a PR?
- I'm willing to submit a PR!