feat: support prefetch for orc #77
Pull request overview
Adds ORC support to the prefetch path by introducing ORC-aware read-range generation and byte-range prebuffering, and wiring a read-ahead cache into the existing PrefetchFileBatchReaderImpl.
Changes:
- Add ORC `ReadRangeGenerator` and `OrcReaderWrapper` to support row-range generation and ORC prebuffer range hints.
- Introduce a shared `ReadAheadCache` + `CacheInputStream` to serve prefetched byte ranges to parallel readers.
- Extend prefetch/caching controls via `ReadContext`/builder and expand integration/unit tests to cover ORC prefetch scenarios.
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 17 comments.
| File | Description |
|---|---|
| test/inte/scan_and_read_inte_test.cpp | Adds ORC prefetch=true parameterization for integration coverage. |
| test/inte/read_inte_with_index_test.cpp | Adds ORC prefetch=true parameterization for integration coverage. |
| test/inte/read_inte_test.cpp | Expands ORC test params to include prefetch combinations. |
| src/paimon/format/orc/read_range_generator_test.cpp | New unit tests for ORC row-range generation / thresholds. |
| src/paimon/format/orc/read_range_generator.h | Declares ORC row-range generator API. |
| src/paimon/format/orc/read_range_generator.cpp | Implements ORC row-range generation and prefetch decision logic. |
| src/paimon/format/orc/orc_reader_wrapper_test.cpp | New unit test for wrapper row position tracking. |
| src/paimon/format/orc/orc_reader_wrapper.h | Introduces wrapper to expose GetNextRowToRead and prebuffer hints. |
| src/paimon/format/orc/orc_reader_wrapper.cpp | Implements wrapper seek/schema/next-batch logic. |
| src/paimon/format/orc/orc_input_stream_impl.h | Adds async-read tracking fields. |
| src/paimon/format/orc/orc_input_stream_impl.cpp | Implements ORC async read via underlying InputStream::ReadAsync. |
| src/paimon/format/orc/orc_format_defs.h | Adds ORC prefetch threshold-related constants. |
| src/paimon/format/orc/orc_file_batch_reader_test.cpp | Updates tests for new row reader options signature/column id capture. |
| src/paimon/format/orc/orc_file_batch_reader.h | Switches ORC reader to PrefetchFileBatchReader + prefetch APIs. |
| src/paimon/format/orc/orc_file_batch_reader.cpp | Wires ORC reader wrapper + read-range generation into batch reader. |
| src/paimon/format/orc/CMakeLists.txt | Adds new ORC sources and tests to the build. |
| src/paimon/core/operation/read_context.cpp | Adds prefetch-cache options into read context/builder plumbing. |
| src/paimon/core/operation/internal_read_context.h | Exposes prefetch-cache settings to internal read pipeline. |
| src/paimon/core/operation/abstract_split_read.cpp | Enables prefetch path for ORC and passes cache config into prefetch reader. |
| src/paimon/core/deletionvectors/apply_deletion_vector_batch_reader_test.cpp | Updates prefetch reader construction to include cache params. |
| src/paimon/common/utils/byte_range_combiner.h | Adjusts combiner API/visibility after refactor. |
| src/paimon/common/utils/byte_range_combiner.cpp | Refactors coalescing implementation into CoalesceByteRanges. |
| src/paimon/common/reader/prefetch_file_batch_reader_impl_test.cpp | Updates tests for new prefetch reader ctor and adds ORC format where available. |
| src/paimon/common/reader/prefetch_file_batch_reader_impl.h | Extends Create/ctor to accept cache config + cache instance. |
| src/paimon/common/reader/prefetch_file_batch_reader_impl.cpp | Creates shared read-ahead cache and initializes it using PreBufferRange(). |
| src/paimon/common/fs/cache_input_stream.h | New InputStream wrapper serving reads from ReadAheadCache when possible. |
| src/paimon/common/file_index/bitmap/apply_bitmap_index_batch_reader_test.cpp | Updates prefetch reader construction to include cache params. |
| include/paimon/reader/prefetch_file_batch_reader.h | Adds PreBufferRange() API for byte-range prefetch hints. |
| include/paimon/read_context.h | Adds builder APIs and getters for enabling/configuring prefetch cache. |
| cmake_modules/orc.diff | Patches upstream ORC to expose preBufferRange(...) and reuse it in preBuffer. |
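
For readers unfamiliar with the coalescing step mentioned for `byte_range_combiner.cpp`, the following is a minimal sketch of how nearby byte ranges can be merged before prefetching. It is not the code in this PR: `CoalesceSketch`, the `hole_limit` parameter, and the reading of each pair as (offset, length) are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Assumption: each range is an (offset, length) pair. The PR only shows the
// pair<uint64_t, uint64_t> shape, not what the two fields mean.
using ByteRange = std::pair<uint64_t, uint64_t>;

// Hypothetical helper: merges ranges that overlap or are separated by a gap
// of at most hole_limit bytes, so fewer and larger reads are issued.
std::vector<ByteRange> CoalesceSketch(std::vector<ByteRange> ranges,
                                      uint64_t hole_limit) {
    std::sort(ranges.begin(), ranges.end());
    std::vector<ByteRange> out;
    for (const auto& [offset, length] : ranges) {
        if (length == 0) continue;                // nothing to read
        assert(offset + length >= offset);        // no uint64_t overflow
        if (!out.empty()) {
            ByteRange& last = out.back();
            const uint64_t last_end = last.first + last.second;
            if (offset <= last_end + hole_limit) {
                // Close enough: extend the previous range to cover this one.
                const uint64_t end = std::max(last_end, offset + length);
                last.second = end - last.first;
                continue;
            }
        }
        out.emplace_back(offset, length);
    }
    return out;
}
```

With `hole_limit = 4`, the ranges (0, 10), (12, 5) and (100, 10) collapse to (0, 17) and (100, 10): the 2-byte gap is read through, while the distant range stays separate.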
lxy-9602
left a comment
+1
Purpose
Support prefetch for ORC by wiring a read-ahead cache into the existing PrefetchFileBatchReaderImpl, and introduce a shared ReadAheadCache + CacheInputStream to serve prefetched byte ranges to parallel readers.
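
To illustrate the idea of serving reads from prefetched byte ranges, here is a minimal sketch, not the PR's implementation. `ReadAheadCacheSketch`, `Put`, and `TryRead` are hypothetical names; the real `ReadAheadCache` and `CacheInputStream` interfaces are not shown in this description. The sketch assumes a read is a hit only when it lies entirely inside one prefetched buffer, and that a miss falls back to the underlying stream.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a read-ahead cache shared by several readers.
class ReadAheadCacheSketch {
public:
    // Store a prefetched buffer that starts at file offset `offset`.
    void Put(uint64_t offset, std::string bytes) {
        assert(!bytes.empty());
        std::lock_guard<std::mutex> lock(mu_);
        entries_[offset] = std::move(bytes);
    }

    // Return the requested bytes if [offset, offset + length) lies fully
    // inside one cached buffer; otherwise return nullopt so the caller can
    // fall back to a direct read.
    std::optional<std::string> TryRead(uint64_t offset, uint64_t length) const {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = entries_.upper_bound(offset);
        if (it == entries_.begin()) return std::nullopt;  // nothing starts at or before offset
        --it;  // last buffer starting at or before offset
        const uint64_t start = it->first;
        const std::string& buf = it->second;
        if (offset + length > start + buf.size()) return std::nullopt;
        return buf.substr(offset - start, length);
    }

private:
    mutable std::mutex mu_;  // guards entries_ for concurrent readers
    std::map<uint64_t, std::string> entries_;
};
```

A wrapper stream in the spirit of `CacheInputStream` would call something like `TryRead` first and only touch the underlying file on a miss. Eviction and partially cached reads are left out of this sketch.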
API and Format
- PrefetchFileBatchReader: `virtual Result<std::vector<std::pair<uint64_t, uint64_t>>> PreBufferRange();`
- ReadContext: `bool EnablePrefetchCache() const` and `const CacheConfig& GetCacheConfig() const`
- ReadContextBuilder: `ReadContextBuilder& WithCacheConfig(const CacheConfig& config);` and `ReadContextBuilder& EnablePrefetchCache(bool enabled);`
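
As a rough illustration of how the builder methods listed above might chain, the sketch below uses self-contained stand-in classes rather than the real Paimon headers. Only the four method signatures come from this PR description; the `CacheConfig` field `buffer_size_limit`, the `Build()` method, and the constructors are hypothetical and exist only to make the example compile.

```cpp
#include <cstdint>

// Hypothetical stand-in; the real CacheConfig fields are not shown in the PR.
struct CacheConfig {
    uint64_t buffer_size_limit = 0;  // hypothetical field, for illustration
};

// Stand-in mirroring the two getters named in the PR description.
class ReadContext {
public:
    ReadContext(bool enabled, const CacheConfig& config)
        : enabled_(enabled), config_(config) {}
    bool EnablePrefetchCache() const { return enabled_; }
    const CacheConfig& GetCacheConfig() const { return config_; }

private:
    bool enabled_;
    CacheConfig config_;
};

// Stand-in mirroring the two builder methods named in the PR description.
class ReadContextBuilder {
public:
    ReadContextBuilder& WithCacheConfig(const CacheConfig& config) {
        config_ = config;
        return *this;  // returning *this allows call chaining
    }
    ReadContextBuilder& EnablePrefetchCache(bool enabled) {
        enabled_ = enabled;
        return *this;
    }
    // Hypothetical finisher; the real builder's final call may differ.
    ReadContext Build() const { return ReadContext(enabled_, config_); }

private:
    bool enabled_ = false;
    CacheConfig config_;
};
```

Under these assumptions, a caller would write `ReadContextBuilder().WithCacheConfig(cfg).EnablePrefetchCache(true).Build()` and later query `EnablePrefetchCache()` and `GetCacheConfig()` on the resulting context.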