Expose Parquet Data Pages information #11

@raulcd

It would be useful to show the data pages inside each ColumnChunk, together with their metadata.

The following information could be interesting:

  • Number of pages per column chunk
  • Per-page metadata:
    • Page type (DataPage, DataPageV2, DictionaryPage)
    • Number of values
    • Encoding (PLAIN, RLE_DICTIONARY, etc.)
    • Definition/repetition level encodings
    • Compressed/uncompressed sizes
  • Page indexes (if present):
    • Column Index: per-page min/max statistics, null counts
    • Offset Index: page locations, sizes, first row numbers
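One way to picture the list above is as a small data model. This is only a sketch: `PageInfo` and `ColumnChunkPages` are hypothetical names, not existing datanomy or PyArrow classes, and the page-index fields are optional because not every file carries them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class PageInfo:
    """Hypothetical per-page record mirroring the fields listed above."""
    page_type: str                      # e.g. "DATA_PAGE", "DATA_PAGE_V2", "DICTIONARY_PAGE"
    num_values: int
    encoding: str                       # e.g. "PLAIN", "RLE_DICTIONARY"
    compressed_size: int
    uncompressed_size: int
    definition_level_encoding: Optional[str] = None
    repetition_level_encoding: Optional[str] = None
    # Filled from the Column Index / Offset Index only when they are present.
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None
    null_count: Optional[int] = None
    offset: Optional[int] = None
    first_row_index: Optional[int] = None


@dataclass
class ColumnChunkPages:
    """Hypothetical container for all pages of one column chunk."""
    column_path: str
    pages: List[PageInfo] = field(default_factory=list)

    @property
    def num_pages(self) -> int:
        return len(self.pages)
```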

The main problem is that PyArrow does not expose this information.

PyArrow can read the Parquet footer metadata and report on the File, RowGroup and ColumnChunk levels, including whether page indexes exist.
By design, PyArrow focuses on reading data rather than inspecting the format. Page-level details and the contents of the page indexes are not part of the footer metadata, so they are not exposed.
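For reference, here is a rough sketch of what is reachable today through the chunk-level metadata. The helper names `summarize_chunk` and `summarize_file` are made up for illustration. The attributes are the ones I believe `ColumnChunkMetaData` provides; `has_column_index` and `has_offset_index` are read with `getattr` on the assumption that older PyArrow versions may not have them.

```python
def summarize_chunk(col):
    """Collect chunk-level fields from a ColumnChunkMetaData-like object.

    Nothing here describes individual pages; it only says where they start.
    """
    if col.has_dictionary_page:
        first_page_offset = col.dictionary_page_offset
    else:
        first_page_offset = col.data_page_offset
    return {
        "path": col.path_in_schema,
        "num_values": col.num_values,
        "encodings": tuple(col.encodings),
        "compressed_size": col.total_compressed_size,
        "uncompressed_size": col.total_uncompressed_size,
        "first_page_offset": first_page_offset,
        # Presence flags only; the index contents are not available.
        "has_column_index": getattr(col, "has_column_index", None),
        "has_offset_index": getattr(col, "has_offset_index", None),
    }


def summarize_file(path):
    """Summarize every column chunk of a Parquet file (requires pyarrow)."""
    import pyarrow.parquet as pq  # lazy import: only needed for real files

    meta = pq.ParquetFile(path).metadata
    return [
        summarize_chunk(meta.row_group(rg).column(c))
        for rg in range(meta.num_row_groups)
        for c in range(meta.num_columns)
    ]
```

Calling `summarize_file("example.parquet")` (a placeholder path) would return one dict per column chunk. The page count, per-page encodings and per-page statistics are all absent, which is the gap this issue is about.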

As datanomy is not expected to be used for high-performance reading, we could use custom Python Thrift bindings to parse the page headers manually and expose that information. This gives us access to format internals while still leveraging PyArrow for high-level operations.
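To get a feel for how much work manual parsing would be, here is a minimal sketch of a Thrift compact-protocol reader written with only the standard library, instead of generated bindings. It decodes structs into plain dicts keyed by field id and picks out a few `PageHeader` fields, using the field ids and enum values as I understand them from `parquet.thrift`. All function names are hypothetical, maps are not handled, and `iter_page_headers` assumes the chunk's pages are contiguous between `start` and `end`.

```python
import io
import struct

# Thrift compact-protocol type codes, in wire order (0..12).
STOP, TRUE, FALSE, BYTE, I16, I32, I64, DOUBLE, BINARY, LIST, SET, MAP, STRUCT = range(13)
PAGE_TYPES = {0: "DATA_PAGE", 1: "INDEX_PAGE", 2: "DICTIONARY_PAGE", 3: "DATA_PAGE_V2"}


def read_varint(buf):
    """Read an unsigned little-endian base-128 varint."""
    result = shift = 0
    while True:
        raw = buf.read(1)
        if not raw:
            raise EOFError("truncated varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7


def read_zigzag(buf):
    """Read a zigzag-encoded signed integer."""
    n = read_varint(buf)
    return (n >> 1) ^ -(n & 1)


def read_value(buf, ttype):
    if ttype in (TRUE, FALSE):  # bool as a list element: one byte each
        return buf.read(1) == b"\x01"
    if ttype == BYTE:
        return struct.unpack("b", buf.read(1))[0]
    if ttype in (I16, I32, I64):
        return read_zigzag(buf)
    if ttype == DOUBLE:
        return struct.unpack("<d", buf.read(8))[0]
    if ttype == BINARY:
        return buf.read(read_varint(buf))
    if ttype in (LIST, SET):
        header = buf.read(1)[0]
        size, elem_type = header >> 4, header & 0x0F
        if size == 15:  # long form: the real size follows as a varint
            size = read_varint(buf)
        return [read_value(buf, elem_type) for _ in range(size)]
    if ttype == STRUCT:
        return read_struct(buf)
    raise ValueError(f"unsupported compact type {ttype}")


def read_struct(buf):
    """Decode one struct into {field_id: value}."""
    fields, field_id = {}, 0
    while True:
        header = buf.read(1)[0]
        if header == STOP:
            return fields
        delta, ttype = header >> 4, header & 0x0F
        # Short form stores a delta from the previous id; long form stores the id itself.
        field_id = field_id + delta if delta else read_zigzag(buf)
        if ttype in (TRUE, FALSE):  # bool fields carry their value in the type code
            fields[field_id] = ttype == TRUE
        else:
            fields[field_id] = read_value(buf, ttype)


def read_page_header(buf):
    """Read one PageHeader; ids 5, 7 and 8 are the data, dictionary and v2 sub-headers."""
    raw = read_struct(buf)
    sub = raw.get(5) or raw.get(7) or raw.get(8) or {}
    return {
        "type": PAGE_TYPES.get(raw.get(1), raw.get(1)),
        "uncompressed_page_size": raw.get(2),
        "compressed_page_size": raw.get(3),
        "num_values": sub.get(1),
    }


def iter_page_headers(buf, start, end):
    """Yield the page headers found between two byte offsets of a column chunk."""
    buf.seek(start)
    while buf.tell() < end:
        header = read_page_header(buf)
        yield header
        buf.seek(header["compressed_page_size"], io.SEEK_CUR)  # skip the page body
```

For a real file, `start` could come from the chunk's `dictionary_page_offset` (or `data_page_offset` when there is no dictionary page) and `end` from `start + total_compressed_size`, both of which PyArrow already reports. That is an assumption about how writers lay out chunks and would need checking against real files.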

Labels: enhancement (New feature or request)
