feat: optimize parquet reads with page-level filtering #232
liangjie3138 wants to merge 32 commits into
Thank you for your contribution! This is a highly complex and important feature, and your work on it is greatly appreciated. Given the large scope of this PR, would it be possible to split it into smaller, focused changes? For example, separating the bucket predicate logic from the Parquet point lookup improvements could make each part easier to review and move forward incrementally. Also, could you please fix the CI failures first so we can begin the review process? We truly recognize the effort behind this change and look forward to helping get it merged smoothly.
Pull request overview
Implements multi-level Parquet read optimizations (bucket selection + page/row-group filtering) by leveraging Parquet page indexes (ColumnIndex/OffsetIndex) and adding page-level prefetching to reduce I/O and decode work.
Changes:
- Added page-index-based filtering infrastructure (`ColumnIndexFilter`, `RowRanges`) and a page-filtered row-group reader with page-range prefetch support.
- Integrated page-level filtering/prefetch into `ParquetFileBatchReader`/`FileReaderWrapper`, and enabled writing page indexes via a new writer option.
- Added bucket-id derivation from predicates (`BucketSelectConverter`) and expanded scan bucket filtering to support multiple buckets.
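To make the `RowRanges` idea in the list above concrete, here is a minimal sketch of how per-predicate row ranges might be combined: intersection for `AND`, union for `OR`. The `RowRange` struct and the free functions `IntersectRanges` and `UnionRanges` are hypothetical names chosen for illustration; they are not the PR's actual `RowRanges` API, and the inclusive-interval representation is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical stand-in for the PR's RowRanges: a sorted list of
// non-overlapping, inclusive [from, to] row indexes within one row group.
struct RowRange {
    int64_t from;
    int64_t to;  // inclusive
};
using RowRangeList = std::vector<RowRange>;

// Intersection (AND): two-pointer sweep over both sorted lists.
RowRangeList IntersectRanges(const RowRangeList& a, const RowRangeList& b) {
    RowRangeList out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        const int64_t lo = std::max(a[i].from, b[j].from);
        const int64_t hi = std::min(a[i].to, b[j].to);
        if (lo <= hi) out.push_back({lo, hi});
        // Advance whichever range ends first; the other may still overlap more.
        if (a[i].to < b[j].to) ++i; else ++j;
    }
    return out;
}

// Union (OR): merge by start row, then coalesce overlapping or adjacent ranges.
RowRangeList UnionRanges(const RowRangeList& a, const RowRangeList& b) {
    RowRangeList all;
    all.reserve(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(all),
               [](const RowRange& x, const RowRange& y) { return x.from < y.from; });
    RowRangeList out;
    for (const RowRange& r : all) {
        assert(r.from <= r.to);  // each input range must be well-formed
        if (!out.empty() && r.from <= out.back().to + 1) {
            out.back().to = std::max(out.back().to, r.to);
        } else {
            out.push_back(r);
        }
    }
    return out;
}
```

With `a = {[0, 99], [200, 299]}` and `b = {[50, 249]}`, the intersection keeps only the overlapping rows in each page span, while the union collapses to a single range covering rows 0 through 299.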
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/paimon/format/parquet/row_ranges.h | Introduces RowRanges abstraction for page/row-range selection. |
| src/paimon/format/parquet/row_ranges.cpp | Implements range union/intersection/overlap/add logic used by page filtering. |
| src/paimon/format/parquet/parquet_writer_builder.cpp | Enables Parquet page index writing behind an option. |
| src/paimon/format/parquet/parquet_format_defs.h | Adds new read/write options for page-index functionality. |
| src/paimon/format/parquet/parquet_file_batch_reader.h | Adds page-index filtering API and logging member. |
| src/paimon/format/parquet/parquet_file_batch_reader.cpp | Wires page-level filtering + eager prepare to start prebuffer earlier. |
| src/paimon/format/parquet/page_filtered_row_group_reader.h | Declares page-filtered row group read + page-range computation. |
| src/paimon/format/parquet/page_filtered_row_group_reader.cpp | Implements decode skipping + page-range prefetch logic for filtered reads. |
| src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp | Adds end-to-end tests for page filtering and page-range computation. |
| src/paimon/format/parquet/file_reader_wrapper.h | Extends wrapper to support page-filtered RG reads and page-range prebuffering. |
| src/paimon/format/parquet/file_reader_wrapper.cpp | Implements page-filtered RG scheduling + unified PreBufferRanges prefetch. |
| src/paimon/format/parquet/column_index_filter.h | Adds ColumnIndex-based predicate evaluation for page selection. |
| src/paimon/format/parquet/column_index_filter.cpp | Implements ColumnIndex-based page matching and RowRanges generation. |
| src/paimon/format/parquet/column_index_filter_test.cpp | Adds RowRanges unit tests + ColumnIndexFilter integration tests. |
| src/paimon/format/parquet/CMakeLists.txt | Registers new parquet sources/tests; adds Arrow source include path. |
| src/paimon/core/operation/key_value_file_store_scan.cpp | Derives bucket filter from predicates when not explicitly set. |
| src/paimon/core/operation/file_store_scan.h | Changes bucket filter to optional<set<int32_t>>; adds helpers. |
| src/paimon/core/operation/file_store_scan.cpp | Updates bucket filtering logic to handle multiple buckets. |
| src/paimon/core/operation/bucket_select_converter.h | Declares predicate→bucket-id derivation helper. |
| src/paimon/core/operation/bucket_select_converter.cpp | Implements bucket-id derivation compatible with Java hashing. |
| src/paimon/core/operation/bucket_select_converter_test.cpp | Adds tests for bucket derivation across predicate shapes/types. |
| src/paimon/core/operation/merge_file_split_read.cpp | Refactors loops to index-based iteration. |
| src/paimon/core/operation/abstract_split_read.cpp | Refactors loop to index-based iteration. |
| src/paimon/core/mergetree/compact/sort_merge_reader_with_min_heap.cpp | Refactors loop to index-based iteration. |
| src/paimon/common/utils/arrow/arrow_input_stream_adapter.h | Tracks outstanding async reads for safe destruction. |
| src/paimon/common/utils/arrow/arrow_input_stream_adapter.cpp | Waits for pending futures; prunes finished futures. |
| src/paimon/CMakeLists.txt | Registers new core operation source + test. |
| cmake_modules/arrow.diff | Patches Arrow Parquet reader to add PreBufferRanges/WhenBufferedRanges and cached page-range reads. |
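The bucket-id derivation listed for `bucket_select_converter.cpp` can be pictured roughly as follows: each bucket-key column contributes a set of candidate values (one for `EQUAL`, several for `IN`), every combination in their Cartesian product is hashed, and the hash is reduced modulo the bucket count. This is only a sketch under stated assumptions: `PlaceholderHash` is a made-up hash and is not the Java-compatible hashing the PR implements, keys are simplified to `int64_t`, and `DeriveBuckets` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

// Placeholder only: NOT the Java-compatible row hash used by the PR.
inline int32_t PlaceholderHash(const std::vector<int64_t>& key) {
    uint32_t h = 17;
    for (int64_t v : key) h = h * 31u + static_cast<uint32_t>(v);
    return static_cast<int32_t>(h);
}

// candidates[c] holds the values that predicates allow for bucket-key column c.
// Returns std::nullopt when no bucket set can be derived (no pruning possible).
std::optional<std::set<int32_t>> DeriveBuckets(
    const std::vector<std::vector<int64_t>>& candidates, int32_t num_buckets) {
    if (candidates.empty() || num_buckets <= 0) return std::nullopt;
    for (const auto& column : candidates) {
        if (column.empty()) return std::nullopt;  // unconstrained key column
    }
    std::set<int32_t> buckets;
    std::vector<std::size_t> idx(candidates.size(), 0);
    while (true) {
        // Build one combination of the Cartesian product.
        std::vector<int64_t> key;
        key.reserve(candidates.size());
        for (std::size_t c = 0; c < candidates.size(); ++c) {
            key.push_back(candidates[c][idx[c]]);
        }
        int64_t m = static_cast<int64_t>(PlaceholderHash(key)) % num_buckets;
        if (m < 0) m = -m;  // keep the bucket id non-negative
        buckets.insert(static_cast<int32_t>(m));
        // Odometer-style increment over the per-column indexes.
        std::size_t c = 0;
        while (c < idx.size() && ++idx[c] == candidates[c].size()) {
            idx[c] = 0;
            ++c;
        }
        if (c == idx.size()) break;  // wrapped around: all combinations visited
    }
    return buckets;
}
```

Returning an optional set mirrors the `optional<set<int32_t>>` bucket filter described for `file_store_scan.h`: an absent value means "scan all buckets", while a present set restricts the scan to the derived buckets.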
GetRecordBatchReader already issues PreBuffer internally driven by ArrowReaderProperties::pre_buffer=true, so the wrapper's manual PreBufferRanges call would tear down and rebuild cached_source_, redundantly re-issuing IO on remote filesystems. The manual path is only needed when page-level ranges must be merged with column-chunk ranges in a single PreBuffer; when no RG is page-filtered, the internal PreBuffer covers everything.
Replace pending_filtered_reads_ map (and PageFilteredRowGroupMeta struct) with on-demand reconstruction of the per-RG reader inputs from row_group_row_ranges_ + target_column_indices_ inside Next() lazy-init. Mirrors how the fully-matched path uses Arrow's stateless GetRecordBatchReader, removing the consume-once lifecycle invariant.
Backward seek into a consumed page-filtered RG now rebuilds the reader instead of erroring with "missing pending read metadata". Test updated to assert the success path.
Trade-off: ComputePageRanges runs once per RG entry (microsecond scale) instead of being cached; OffsetIndex Thrift parse repeats on each entry. Net is zero per-session memory accumulation and a simpler stateless model.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
    // from orc, making it impossible to find "safe" IO positions for error recovery testing.
    if (file_format == "parquet") {
        GTEST_SKIP() << "Skipping parquet IOException test - IO patterns differ from orc";
    }
Since other tests are now passing, can we remove the skip on the Parquet write IO exception test?
It’d be good to have it back in the suite if the try-catch cause is fixed.
Purpose
Linked issue: close #137
Implement multi-level filtering optimization for Parquet file reading. By leveraging ColumnIndex statistics, the reader can skip non-matching data at the bucket, row group, and page levels, reducing I/O and decoding overhead.
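As a rough illustration of the page-level step, the sketch below prunes pages for an equality predicate using per-page min/max statistics, then turns the surviving pages into byte ranges suitable for prefetching. `PageInfo`, `ByteRange`, and `SelectPageByteRanges` are hypothetical names; the struct only loosely models what Parquet's ColumnIndex (min/max per page) and OffsetIndex (offset and compressed size per page) provide, and values are simplified to `int64_t`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-page metadata, loosely modeled on ColumnIndex + OffsetIndex.
struct PageInfo {
    int64_t min_value;        // smallest value stored in the page
    int64_t max_value;        // largest value stored in the page
    int64_t offset;           // byte offset of the page in the file
    int64_t compressed_size;  // on-disk size of the page in bytes
};

struct ByteRange {
    int64_t offset;
    int64_t length;
};

// A page can contain `value` only if it lies within the page's [min, max].
inline bool PageMayContain(const PageInfo& page, int64_t value) {
    return page.min_value <= value && value <= page.max_value;
}

// Returns byte ranges for pages that may satisfy `column == value`.
// Pages that are contiguous on disk are coalesced into a single range
// so that a prefetcher would issue fewer, larger reads.
std::vector<ByteRange> SelectPageByteRanges(const std::vector<PageInfo>& pages,
                                            int64_t value) {
    std::vector<ByteRange> ranges;
    for (const PageInfo& page : pages) {
        if (!PageMayContain(page, value)) continue;  // page skipped: no I/O, no decode
        if (!ranges.empty() &&
            ranges.back().offset + ranges.back().length == page.offset) {
            ranges.back().length += page.compressed_size;
        } else {
            ranges.push_back({page.offset, page.compressed_size});
        }
    }
    return ranges;
}
```

Statistics can only prove that a page does not match; a page whose range covers the value still has to be decoded and filtered row by row.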
Main Features
Page-level data filtering
- `EQUAL`, `NOT_EQUAL`, `LESS_THAN`, `GREATER_THAN`, `IN`, `IS_NULL`, and compound predicates with `AND`/`OR`.
- `data_page_filter` callback.
- `SkipRecords`/`ReadRecords`.
Page-level prefetching
- Computes the byte ranges of required pages based on `RowRanges` and `OffsetIndex`, and uses `Arrow PreBuffer` for asynchronous prefetching.
Tests
- `bucket_select_converter_test.cpp`: Covers various predicate combinations, `Timestamp` type, and Cartesian product computation.
- `column_index_filter_test.cpp`: Covers all predicate types (`EQUAL`, `IN`, `LESS_THAN`, `GREATER_THAN`, `IS_NULL`, etc.) and `AND`/`OR` compound predicates.
- `page_filtered_row_group_reader_test.cpp`: Verifies filtering correctness, edge cases, and prefetching behavior.
API and Format
No public API changes. No impact on storage format or protocol.
Documentation
Not applicable.
Generative AI Tooling
Claude Code (Opus 4.6)