
[FEA] Parquet metadata caching due to overhead in reader #18890

@JigaoLuo

Description

This is part of my ongoing series of studies on the Parquet reader #18892. Through this analysis, I've identified a key observation when reading Parquet files: notable idle-time gaps before I/O operations. A runtime breakdown reveals the root cause: metadata parsing overhead (specifically, what I refer to as "metadata", i.e., decoding the Thrift-encoded footer).

Metadata as overhead

The key insight into the cause: the cuDF Parquet reader currently (re-)materializes aggregate_reader_metadata from the file footer on every reader-object construction. Consider a case where the same file is read repeatedly, for different columns and different row groups, as in the sketch below.
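
A minimal sketch of the pattern that triggers this, using the public libcudf API (the file path and column names are placeholders):

    #include <cudf/io/parquet.hpp>
    #include <string>
    #include <vector>

    // Every read_parquet() call below re-parses the Thrift footer and rebuilds
    // the internal aggregate_reader_metadata from scratch, even though the file
    // is identical across iterations.
    void read_columns_repeatedly()
    {
      auto const source = cudf::io::source_info{"lineitem.parquet"};
      for (std::string const col : {"l_orderkey", "l_extendedprice"}) {
        auto const opts = cudf::io::parquet_reader_options::builder(source)
                            .columns(std::vector<std::string>{col})
                            .build();
        auto result = cudf::io::read_parquet(opts);  // footer parsed again here
      }
    }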

For example, when reading a single small column from a large Parquet file, metadata reading alone takes 8 seconds (inside the NVTX scope metadata), while the actual kernel execution time is negligible (~10 milliseconds in total). Moreover, no GPU kernels run during metadata reading, leaving the GPU idle the entire time while it waits for metadata processing.

[Figure: nsys timeline of repeated read_parquet calls into upstream libcudf, showing the metadata gap. Measured at steady state, excluding the initial warm-up run.]

For context, here are the Parquet file (TPC-H SF100 Lineitem table) and the column I read:
Column Size Analysis:
    column_name  PAGE_total_compressed_size_GB  PAGE_total_uncompressed_size_GB  Compress_Ratio  AVG_RowGroup_size_MB     datatype  AVG_RowGroup_Rows           column_encodings
     l_orderkey                       0.165914                         0.360433        2.172410              0.002720      [INT64]            9836687 [RLE, DELTA_BINARY_PACKED]
...other columns...

  created_by: parquet-rs version 54.1.0
  num_columns: 16
  num_rows: 600037902
  num_row_groups: 61
  format_version: 2.6
  serialized_size: 116904
Size of thrift-encoded metadata in KiB:  114.1640625

I also have the SF300 Lineitem table, where metadata reading takes ~30 s. TL;DR: it does not scale at all.

Metadata caching

The natural fix is caching, as for most problems in computer science: expose a way to reuse a pre-materialized aggregate_reader_metadata (currently an internal class in cpp/src/io/parquet/reader_impl_helpers.hpp) across repeated reads of the same file. I'll draft a PR with nsys profiles to illustrate this, noting that API design discussions may be needed. I'd be glad to be assigned to this issue, but I need help from the cuIO side.
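
To make the idea concrete, here is a hypothetical sketch. cudf::io::read_parquet_metadata() is the existing public footer-inspection entry point; the reader_metadata() builder option below is an illustrative name that does not exist yet (the actual PR would likely pass a pre-materialized aggregate_reader_metadata instead):

    #include <cudf/io/parquet.hpp>
    #include <cudf/io/parquet_metadata.hpp>
    #include <string>
    #include <vector>

    void read_with_cached_metadata()
    {
      auto const source = cudf::io::source_info{"lineitem.parquet"};
      // Parse the footer exactly once up front.
      auto const cached = cudf::io::read_parquet_metadata(source);
      for (std::string const col : {"l_orderkey", "l_extendedprice"}) {
        auto const opts = cudf::io::parquet_reader_options::builder(source)
                            .columns(std::vector<std::string>{col})
                            .reader_metadata(cached)  // hypothetical option: skip footer parsing
                            .build();
        auto result = cudf::io::read_parquet(opts);  // would not re-parse the footer
      }
    }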

I also found that the recent issue #17716 mentions:

Update the cuDF chunked parquet reader to accept aggregate_reader_metadata as a reader option to skip parsing the file footer (metadata)

This is exactly the idea in my draft PR, except that I did it for the normal reader instead of the chunked reader.

No metadata caching in libcudf

Here is my summary of the cuDF Parquet readers:
  • cudf::io::read_parquet: no metadata caching, as discussed above.
  • cudf::io::read_parquet with vector input: no metadata sharing, even if the same file appears multiple times in the vector.
  • cudf::io::chunked_parquet_reader: reads metadata once during the reader's lifetime, but does not allow reading multiple times or with different options during that lifetime (see the sketch after this list).
  • Hybrid Reader: the "stateless reader" and the parquet_metadata() function suggest metadata caching is supported, but it is still under development.
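
For illustration, a sketch of the chunked reader's current behavior (opts is assumed to be built as in the earlier sketch):

    #include <cudf/io/parquet.hpp>

    void read_in_chunks(cudf::io::parquet_reader_options const& opts)
    {
      // The footer is parsed once here, in the constructor
      // (0 means no per-chunk output size limit).
      auto reader = cudf::io::chunked_parquet_reader(0, opts);
      while (reader.has_next()) {
        auto chunk = reader.read_chunk();  // metadata is not re-parsed here
      }
      // Reading different columns requires constructing a new reader,
      // which parses the footer again from scratch.
    }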

Why metadata caching

There are more reasons to cache metadata:

  • Eliminates processing overhead: as shown in the figure, decoding Parquet metadata is computationally intensive, particularly for wide tables with numerous columns. Wide tables are prevalent in ML workloads, which are the primary sweet spot for GPU acceleration.
  • Reduces network latency for remote Parquet: every read of a remote file must first fetch and parse the footer; caching eliminates these redundant requests, saving both time and cloud egress fees (a minimal application-side cache is sketched after this list).
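
As a stopgap, an application can already cache the parsed footer per file path using only today's public API. A minimal sketch: this helps metadata-only inspection and avoids repeated remote footer fetches, but read_parquet() itself still re-parses the footer until a reuse hook exists:

    #include <cudf/io/parquet_metadata.hpp>
    #include <cudf/io/types.hpp>
    #include <map>
    #include <string>

    // Application-side footer cache, keyed by file path.
    std::map<std::string, cudf::io::parquet_metadata> footer_cache;

    cudf::io::parquet_metadata const& cached_footer(std::string const& path)
    {
      auto it = footer_cache.find(path);
      if (it == footer_cache.end()) {
        // First access: fetch and parse the footer exactly once.
        it = footer_cache
               .emplace(path, cudf::io::read_parquet_metadata(cudf::io::source_info{path}))
               .first;
      }
      return it->second;
    }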

I'll detail additional benefits and potential use cases in my draft PR and the linked story issue.

Reference: Metadata caching in other parquet readers & LanceV2

In fact, I've found other Parquet readers discussing this metadata-caching feature:

The I/O cost can be ignored because the metadata is loaded into memory once and then cached for many searches against the data. The cost of loading the data is amortized among these searches and becomes negligible. In our evaluation, we only consider warm searches where the metadata is already in memory. This more accurately models the real-world search use cases we have encountered.

These existing works further underscore the critical value of implementing metadata caching. While there are ongoing efforts to optimize metadata reading, these are orthogonal to the goal of metadata caching.
