[feature](vparquet-reader) Implements parquet file page cache. #59307
morningman merged 1 commit into apache:master
run buildall
TPC-H: Total hot run time: 35539 ms
TPC-DS: Total hot run time: 179978 ms
ClickBench: Total hot run time: 27.34 s
e32d9d4 to b25ce38
run buildall
TPC-H: Total hot run time: 35717 ms
TPC-DS: Total hot run time: 179518 ms
ClickBench: Total hot run time: 27.98 s
b25ce38 to 7a15d5f
run buildall
TPC-H: Total hot run time: 36141 ms
TPC-DS: Total hot run time: 179957 ms
ClickBench: Total hot run time: 27.16 s
7a15d5f to eeb4023
run buildall
eeb4023 to 3d89dda
run buildall
3d89dda to 59ebab3
run buildall
TPC-H: Total hot run time: 35311 ms
TPC-DS: Total hot run time: 179267 ms
ClickBench: Total hot run time: 27.13 s
59ebab3 to a8f6d42
ClickBench: Total hot run time: 27.91 s
be/src/io/file_factory.cpp
Outdated
        return unexpected(std::move(reader_res).error());
    }
    auto file_reader = std::move(reader_res).value();
    LOG_INFO("create file reader for path={}, size={}, mtime={}", file_description.path,
Remove this log, or use the DEBUG level.
            _page_reader->file_end_offset(), _page_reader->header_start_offset());
    const std::vector<uint8_t>& header_bytes = _page_reader->header_bytes();
    size_t total = header_bytes.size() + level_bytes.size() + payload.size;
    auto* page = new DataPage(total, true, segment_v2::DATA_PAGE);
Potential memory leak? Would a safer ownership pattern work here?
No leak here: StoragePageCache::insert takes ownership of the DataPage*, and the LRU cache frees it via LRUHandle::free() (which deletes the LRUCacheValueBase) when the entry is evicted or the last handle is released. The allocation in be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp is therefore managed by the cache (see be/src/olap/page_cache.cpp and be/src/olap/lru_cache.h). Even on the LRU-K "first insert not cached" path, releasing the returned handle frees the value.
    void ColumnChunkReader<IN_COLLECTION, OFFSET_INDEX>::_insert_page_into_cache(
            const std::vector<uint8_t>& level_bytes, const Slice& payload) {
        StoragePageCache::CacheKey key(
                fmt::format("{}::{}", _stream_reader->path(), _stream_reader->mtime()),
fmt::format("{}::{}", _stream_reader->path(), _stream_reader->mtime())
This part is the same for every page, so it is better to build it once and reuse it:
    class ParquetPageCacheKeyBuilder {
        std::string _file_key_prefix; // Cached once per column chunk
    public:
        // Build the "path::mtime" prefix once, when the column chunk is opened.
        void init(const std::string& path, int64_t mtime) {
            _file_key_prefix = fmt::format("{}::{}", path, mtime);
        }
        // Per-page keys reuse the cached prefix and only vary the offsets.
        StoragePageCache::CacheKey make_key(uint64_t end_offset, uint64_t offset) const {
            return StoragePageCache::CacheKey(_file_key_prefix, end_offset, offset);
        }
    };
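A self-contained variant of the suggestion above may help show why the prefix is worth caching and how the mtime component separates keys across file versions. This is a sketch under assumptions: ToyCacheKey is a hypothetical stand-in for StoragePageCache::CacheKey, ToyPageCacheKeyBuilder is an illustrative name, and plain string concatenation replaces fmt::format so the example needs only the standard library.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for StoragePageCache::CacheKey (assumption).
struct ToyCacheKey {
    std::string file_key;
    std::uint64_t end_offset;
    std::uint64_t offset;
    bool operator==(const ToyCacheKey& o) const {
        return file_key == o.file_key && end_offset == o.end_offset && offset == o.offset;
    }
};

class ToyPageCacheKeyBuilder {
public:
    // Formats "path::mtime" once, e.g. when a column chunk is opened.
    void init(const std::string& path, std::int64_t mtime) {
        _file_key_prefix = path + "::" + std::to_string(mtime);
    }

    // Per-page keys only copy the cached prefix; nothing is re-formatted.
    ToyCacheKey make_key(std::uint64_t end_offset, std::uint64_t offset) const {
        return ToyCacheKey{_file_key_prefix, end_offset, offset};
    }

    const std::string& prefix() const { return _file_key_prefix; }

private:
    std::string _file_key_prefix;
};
```

Because the modification time is part of the prefix, a rewritten file produces different keys for the same offsets, so stale pages are simply never looked up again.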
    _page_statistics.page_cache_hit_counter += 1;
    // Detect whether the cached payload is compressed or decompressed and record
    bool is_cache_payload_decompressed = true;
    if (_cur_page_header.compressed_page_size > 0) {
It would be better to extract this check into a helper:
    bool should_cache_decompressed(const tparquet::PageHeader* header,
                                   const tparquet::ColumnMetaData& metadata) {
        // Pages without a compressed size, or in an uncompressed chunk, are cached as-is.
        if (header->compressed_page_size <= 0) return true;
        if (metadata.codec == tparquet::CompressionCodec::UNCOMPRESSED) return true;
        // Cache decompressed data only when decompression does not inflate the page too much.
        double ratio = static_cast<double>(header->uncompressed_page_size) /
                       static_cast<double>(header->compressed_page_size);
        return ratio <= config::parquet_page_cache_decompress_threshold;
    }
Then reuse it both here and in be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp.
7cb3056 to c18b9a0
run buildall
TPC-H: Total hot run time: 32151 ms
TPC-DS: Total hot run time: 173930 ms
ClickBench: Total hot run time: 26.92 s
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
What problem does this PR solve?
Problem Summary:
Release note
[Feature] Implementation of Parquet File Page Cache and Integration with Unified Page Cache Framework
Solution Overview
This PR implements a page-level caching mechanism for Parquet files and integrates it with Apache Doris's existing unified page cache framework. Caching decompressed (or compressed) data pages in memory lets repeated reads of the same page skip file I/O and, for decompressed entries, decompression as well, which improves query performance.
Key Features
• Leverages Existing Framework: Directly integrates with Doris's StoragePageCache infrastructure used for internal tables
• Shared Resource Management: Parquet cache shares memory pool and eviction policies with internal table caches
• Consistent Monitoring: Reuses existing cache statistics and RuntimeProfile for unified performance monitoring
• Cache Type Identification: Uses segment_v2::DATA_PAGE as cache page type, consistent with internal table data page caching
• Compression Ratio Awareness: Automatically chooses between caching compressed or decompressed data based on parquet_page_cache_decompress_threshold (default: 1.5)
• Flexible Storage: Caches decompressed data when uncompressed_size/compressed_size ≤ threshold, otherwise caches compressed data if enable_parquet_cache_compressed_pages=true
• Cache Key Design: Uses file_path::mtime::offset as key to ensure cache consistency across file modifications
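The compression-ratio rule described in the feature list can be sketched as a three-way decision. This is a minimal illustration, not the PR's code: PageCacheMode, ToyPageCachePolicy, and choose_cache_mode are hypothetical names, the two policy fields merely mirror parquet_page_cache_decompress_threshold and enable_parquet_cache_compressed_pages, and the "do not cache" outcome when compressed caching is disabled is an assumption inferred from the wording above.

```cpp
#include <cstdint>

// Hypothetical outcome of the caching decision (assumption).
enum class PageCacheMode { kDecompressed, kCompressed, kNone };

// Hypothetical policy struct mirroring the two config options named above.
struct ToyPageCachePolicy {
    double decompress_threshold;  // parquet_page_cache_decompress_threshold
    bool cache_compressed_pages;  // enable_parquet_cache_compressed_pages
};

inline PageCacheMode choose_cache_mode(std::int32_t compressed_size,
                                       std::int32_t uncompressed_size,
                                       const ToyPageCachePolicy& policy) {
    // A page without a compressed size has nothing to decompress.
    if (compressed_size <= 0) return PageCacheMode::kDecompressed;

    double ratio = static_cast<double>(uncompressed_size) /
                   static_cast<double>(compressed_size);

    // Small expansion: keep the decompressed bytes and skip decompression on a hit.
    if (ratio <= policy.decompress_threshold) return PageCacheMode::kDecompressed;

    // Large expansion: keep the smaller compressed bytes if allowed.
    // Assumption: when compressed caching is disabled, the page is not cached.
    return policy.cache_compressed_pages ? PageCacheMode::kCompressed
                                         : PageCacheMode::kNone;
}
```

With the default threshold of 1.5, a page that grows from 100 to 150 bytes would be cached decompressed, while one that grows from 100 to 400 bytes would be cached compressed, trading CPU on each hit for a smaller memory footprint.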