
[FEA] Parquet metadata caching due to overhead in reader #18890

@JigaoLuo

Description

This is part of my ongoing series of studies on the Parquet reader #18892. Through this analysis, I've identified a key observation when reading Parquet files: notable idle-time gaps before I/O operations. A runtime breakdown reveals the root cause: metadata parsing overhead (specifically, what I refer to as "metadata", i.e., decoding the Thrift-encoded footer).

Metadata as overhead

The key insight into the cause: the cuDF Parquet reader currently (re-)materializes aggregate_reader_metadata from the file footer on every reader-object construction. Consider a case where the same file is read repeatedly, for different columns and different row groups, as in the sketch below.
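
A minimal sketch of the pattern that triggers this, using the public libcudf API (the file path and column names are placeholders):

    #include <cudf/io/parquet.hpp>
    #include <string>
    #include <vector>

    // Every read_parquet() call below re-parses the Thrift footer and rebuilds
    // the internal aggregate_reader_metadata from scratch, even though the file
    // is identical across iterations.
    void read_columns_repeatedly()
    {
      auto const source = cudf::io::source_info{"lineitem.parquet"};
      for (std::string const col : {"l_orderkey", "l_extendedprice"}) {
        auto const opts = cudf::io::parquet_reader_options::builder(source)
                            .columns(std::vector<std::string>{col})
                            .build();
        auto result = cudf::io::read_parquet(opts);  // footer parsed again here
      }
    }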

For example, when reading a single small column from a large Parquet file, metadata reading alone takes 8 seconds (inside the NVTX scope metadata), while the actual kernel execution time is negligible (~10 milliseconds in total). Moreover, no GPU kernels run during metadata reading, leaving the GPU idle the entire time while it waits for metadata processing.

[Figure: nsys timeline of repeated read_parquet calls into upstream libcudf, showing the metadata gap. Measured at steady state, excluding the initial warm-up run.]

For context, here are the Parquet file (TPC-H SF100 Lineitem table) and the column I read:
Column Size Analysis:
    column_name  PAGE_total_compressed_size_GB  PAGE_total_uncompressed_size_GB  Compress_Ratio  AVG_RowGroup_size_MB     datatype  AVG_RowGroup_Rows           column_encodings
     l_orderkey                       0.165914                         0.360433        2.172410              0.002720      [INT64]            9836687 [RLE, DELTA_BINARY_PACKED]
...other columns...

  created_by: parquet-rs version 54.1.0
  num_columns: 16
  num_rows: 600037902
  num_row_groups: 61
  format_version: 2.6
  serialized_size: 116904
Size of thrift-encoded metadata in KiB:  114.1640625

I also have the SF300 Lineitem table, where metadata reading takes ~30 s. TL;DR: it does not scale at all.

Metadata caching

The natural fix is caching, as for most problems in computer science: expose a way to reuse a pre-materialized aggregate_reader_metadata (currently an internal class in cpp/src/io/parquet/reader_impl_helpers.hpp) across repeated reads of the same file. I'll draft a PR with nsys profiles to illustrate this, noting that API design discussions may be needed. I'd be glad to be assigned to this issue, but I need help from the cuIO side.
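
To make the idea concrete, here is a hypothetical sketch. cudf::io::read_parquet_metadata() is the existing public footer-inspection entry point; the reader_metadata() builder option below is an illustrative name that does not exist yet (the actual PR would likely pass a pre-materialized aggregate_reader_metadata instead):

    #include <cudf/io/parquet.hpp>
    #include <cudf/io/parquet_metadata.hpp>
    #include <string>
    #include <vector>

    void read_with_cached_metadata()
    {
      auto const source = cudf::io::source_info{"lineitem.parquet"};
      // Parse the footer exactly once up front.
      auto const cached = cudf::io::read_parquet_metadata(source);
      for (std::string const col : {"l_orderkey", "l_extendedprice"}) {
        auto const opts = cudf::io::parquet_reader_options::builder(source)
                            .columns(std::vector<std::string>{col})
                            .reader_metadata(cached)  // hypothetical option: skip footer parsing
                            .build();
        auto result = cudf::io::read_parquet(opts);  // would not re-parse the footer
      }
    }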

I also found that the recent issue #17716 mentions:

Update the cuDF chunked parquet reader to accept aggregate_reader_metadata as a reader option to skip parsing the file footer (metadata)

This is exactly the idea in my draft PR, except that I did it for the normal reader instead of the chunked reader.

No metadata caching in libcudf

Here is my summary of the cuDF Parquet readers:
  • cudf::io::read_parquet: no metadata caching, as discussed above.
  • cudf::io::read_parquet with vector input: no metadata sharing, even if the same file appears multiple times in the vector.
  • cudf::io::chunked_parquet_reader: reads metadata once during the reader's lifetime, but does not allow reading multiple times or with different options during that lifetime (see the sketch after this list).
  • Hybrid Reader: the "stateless reader" and the parquet_metadata() function suggest metadata caching is supported, but it is still under development.
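
For illustration, a sketch of the chunked reader's current behavior (opts is assumed to be built as in the earlier sketch):

    #include <cudf/io/parquet.hpp>

    void read_in_chunks(cudf::io::parquet_reader_options const& opts)
    {
      // The footer is parsed once here, in the constructor
      // (0 means no per-chunk output size limit).
      auto reader = cudf::io::chunked_parquet_reader(0, opts);
      while (reader.has_next()) {
        auto chunk = reader.read_chunk();  // metadata is not re-parsed here
      }
      // Reading different columns requires constructing a new reader,
      // which parses the footer again from scratch.
    }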

Why metadata caching

There are more reasons to cache metadata:

  • Eliminates processing overhead: as shown in the figure, decoding Parquet metadata is computationally intensive, particularly for wide tables with numerous columns. Wide tables are prevalent in ML workloads, which are the primary sweet spot for GPU acceleration.
  • Reduces network latency for remote Parquet: every read of a remote file must first fetch and parse the footer; caching eliminates these redundant requests, saving both time and cloud egress fees (a minimal application-side cache is sketched after this list).
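
As a stopgap, an application can already cache the parsed footer per file path using only today's public API. A minimal sketch: this helps metadata-only inspection and avoids repeated remote footer fetches, but read_parquet() itself still re-parses the footer until a reuse hook exists:

    #include <cudf/io/parquet_metadata.hpp>
    #include <cudf/io/types.hpp>
    #include <map>
    #include <string>

    // Application-side footer cache, keyed by file path.
    std::map<std::string, cudf::io::parquet_metadata> footer_cache;

    cudf::io::parquet_metadata const& cached_footer(std::string const& path)
    {
      auto it = footer_cache.find(path);
      if (it == footer_cache.end()) {
        // First access: fetch and parse the footer exactly once.
        it = footer_cache
               .emplace(path, cudf::io::read_parquet_metadata(cudf::io::source_info{path}))
               .first;
      }
      return it->second;
    }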

I'll detail additional benefits and potential use cases in my draft PR and the linked story issue.

Reference: Metadata caching in other parquet readers & LanceV2

In fact, I've found other Parquet readers discussing this metadata-caching feature:

The I/O cost can be ignored because the metadata is loaded into memory once and then cached for many searches against the data. The cost of loading the data is amortized among these searches and becomes negligible. In our evaluation, we only consider warm searches where the metadata is already in memory. This more accurately models the real-world search use cases we have encountered.

These existing works further underscore the critical value of implementing metadata caching. While there are ongoing efforts to optimize metadata reading, these are orthogonal to the goal of metadata caching.
