Description
Search before asking
- I searched in the issues and found nothing similar.
Motivation
In modern analytical and AI-driven workloads, efficient data retrieval is critical, especially for high-cardinality filters and similarity-based queries. Traditional file-level or partition-level metadata (e.g., min/max statistics) often falls short when queries involve selective predicates on non-partition columns or require nearest-neighbor lookups in vector spaces. The result is excessive I/O from scanning irrelevant files and poor end-to-end latency.
To address this, we introduce global indexing in Paimon—a unified, table-wide index structure that spans all data files across partitions and snapshots.
Solution
Proposed Solution
We propose building global indexes in Paimon on top of two features: Row Tracking, which assigns each record a stable, globally unique row ID, and Data Evolution, which ensures that row IDs are consecutive, without gaps.
We will support two index types:
- Bitmap-based inverted indexes for fast scalar filtering (e.g., type = X).
- Vector indexes, built with our in-house vector engine Lumina, for efficient DiskANN-based similarity search.
Index construction and lookup are distributed, so analytical engines such as Spark and StarRocks can skip irrelevant data and fetch only the matching records, which reduces I/O and query latency.
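
To make the bitmap idea concrete, here is a minimal Python sketch of an inverted index keyed by row ID and of how matching row IDs could be mapped back to data files. It is an illustration only, not Paimon's API: the function names, the use of a Python int as the bitmap, and the `file_first_row_ids` layout are all assumptions. It relies on the property stated above, that row IDs are consecutive and gap-free, so each file covers one contiguous row ID range.

```python
from bisect import bisect_right
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to a bitmap (a Python int) in which
    bit i is set when the record with row ID i holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def row_ids(bitmap):
    """Return the row IDs whose bits are set, in ascending order."""
    ids = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            ids.append(row_id)
        bitmap >>= 1
        row_id += 1
    return ids


def files_to_read(matching_ids, file_first_row_ids):
    """Return the indexes of files that contain at least one match.

    file_first_row_ids is the sorted list of each file's first row ID
    (hypothetical layout). Because row IDs have no gaps, a binary
    search finds the file that owns any given row ID.
    """
    return sorted({bisect_right(file_first_row_ids, rid) - 1
                   for rid in matching_ids})


# Hypothetical table: one "type" column, three files of three rows each.
types = ["A", "B", "A", "C", "B", "B", "C", "C", "C"]
index = build_bitmap_index(types)
file_first_row_ids = [0, 3, 6]

# Filter: type = "A". Only the first file needs to be read.
matches = row_ids(index["A"])
needed_files = files_to_read(matches, file_first_row_ids)
```

A real implementation would likely use a compressed bitmap format rather than a plain integer, and would read file boundaries from table metadata; both are left out of this sketch.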
Anything else?
Are you willing to submit a PR?
- I'm willing to submit a PR!