
Conversation

@lhoestq (Member) commented Jul 7, 2025

continuation of #3199

Pretty important PR since sub-rowgroup loading lets us:

TODO:

  • write metadata with page index when possible
  • load rows using page index when possible

It works using page pruning from arrow-rs.
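
Roughly, the arrow-rs mechanism looks like this; a minimal sketch, assuming the parquet crate's ArrowReaderOptions / RowSelection API, with a made-up file name and row range:

```rust
use std::fs::File;
use parquet::arrow::arrow_reader::{
    ArrowReaderOptions, ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Ask the reader to load the page index (column + offset indexes) from the footer.
    let options = ArrowReaderOptions::new().with_page_index(true);
    let file = File::open("data.parquet")?; // hypothetical file
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;

    // Read rows 1000..1100: with the page index available, only the data pages
    // overlapping the selection are fetched and decoded, not the whole row groups.
    let selection = RowSelection::from(vec![RowSelector::skip(1000), RowSelector::select(100)]);
    let reader = builder.with_row_selection(selection).build()?;

    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```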

@severo (Collaborator) commented Aug 13, 2025

I saw that it's not currently implemented in PyArrow's reader, which is why we rely on arrow-rs, right?

Do you have an idea if the feature could be added to PyArrow @kszucs?

@lhoestq changed the title from "[Rows] sub-rowgroup loading using libviewer" to "[Rows] sub-rowgroup loading using arrow-rs + libviewer" on Aug 14, 2025
@lhoestq (Member, Author) commented Aug 22, 2025

TODO: fix the "Parquet error: Invalid offset in sparse column chunk data: 2475610" error that happens when libviewer reads a parquet file with nested data and tries to skip pages

@kszucs (Member) commented Sep 2, 2025

The issue is caused by incorrect indexing for nested types.

The offset index must be aware of the record offsets, which don't correspond to the number of values because of record shredding. Sadly, the metadata doesn't contain the number of records in each page. Additionally, records may span multiple pages depending on the writer: for example, pyarrow doesn't require page boundaries to fall on record boundaries by default (only if the offset index is enabled or data page version 2 is used).

In order to have an indexer that keeps the original parquet file unchanged, we can do the following:

  1. for non-nested types, use the existing solution and build up the offset index from the number of values in each page
  2. for nested types where records DO NOT span multiple pages, read and decode only the repetition levels and count where rep_level == 0, which indicates a record boundary (see the sketch after this list)
  3. for nested types where records DO span multiple pages, rewrite the column so that each page starts at a record boundary
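
A minimal sketch of case 2, assuming the parquet crate's typed ColumnReader API and an INT64 leaf column (the helper name is made up). It counts record boundaries across a whole column chunk; splitting the count per page is what required the internal low-level constructs mentioned below:

```rust
use std::fs::File;
use parquet::column::reader::ColumnReader;
use parquet::file::reader::{FileReader, SerializedFileReader};

// Count the records in one column chunk of a nested INT64 leaf column by
// scanning repetition levels: rep_level == 0 marks the start of a new record.
fn count_records(path: &str, row_group: usize, column: usize) -> Result<usize, Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let rg = reader.get_row_group(row_group)?;
    let mut total = 0;
    if let ColumnReader::Int64ColumnReader(mut col) = rg.get_column_reader(column)? {
        let (mut defs, mut reps, mut values) = (Vec::new(), Vec::new(), Vec::new());
        loop {
            // read_records appends to the buffers, so clear them each pass.
            defs.clear();
            reps.clear();
            values.clear();
            let (records, _, levels) =
                col.read_records(8192, Some(&mut defs), Some(&mut reps), &mut values)?;
            if records == 0 && levels == 0 {
                break;
            }
            // read_records also reports the record count; we recount from the
            // repetition levels to show the mechanism described in case 2.
            total += reps.iter().filter(|&&r| r == 0).count();
        }
    }
    Ok(total)
}
```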

I was able to update the indexer's code to do 1) and 2), but it required exposing some internal low-level constructs from parquet-rs, and for case 3) we would need to rewrite the parquet file anyway. Since pyarrow's defaults allow case 3), I suggest not bothering with a faster indexer for now and instead using pyarrow to rewrite parquet files with write_page_index=True. Meanwhile, we could contribute implementations 1) and 2) to parquet-rs, with a fallback to rewriting the parquet file in case 3). Once that feature is available upstream, we could use it directly.
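
For reference, a rough parquet-rs analogue of that rewrite, not the pyarrow call the comment recommends: a sketch assuming a recent parquet crate, whose default writer properties emit the column and offset indexes (the helper name is made up):

```rust
use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

// Hypothetical helper: copy `src` to `dst` batch by batch. With default writer
// properties, recent parquet-rs versions write the column and offset indexes,
// so the output is usable by the page-pruning read path.
fn rewrite_with_page_index(src: &str, dst: &str) -> Result<(), Box<dyn std::error::Error>> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open(src)?)?;
    let schema = builder.schema().clone();
    let reader = builder.build()?;

    let mut writer = ArrowWriter::try_new(File::create(dst)?, schema, None)?;
    for batch in reader {
        writer.write(&batch?)?;
    }
    writer.close()?;
    Ok(())
}
```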

Meanwhile, we can still profit from the page-pruning reader when parquet files do have offset indexes, and otherwise fall back to row-group pruning just like the viewer currently does. We should also advise users, and update the datasets library, to write parquet files with the page index enabled.
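
A small sketch of that gating, assuming a recent parquet crate (the helper name is made up): check whether a file actually carries an offset index before opting into page pruning:

```rust
use std::fs::File;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

// Returns true if the file was written with a page index, i.e. the page-pruning
// reader can be used; otherwise fall back to row-group pruning.
fn has_offset_index(path: &str) -> Result<bool, Box<dyn std::error::Error>> {
    let options = ArrowReaderOptions::new().with_page_index(true);
    let builder =
        ParquetRecordBatchReaderBuilder::try_new_with_options(File::open(path)?, options)?;
    Ok(builder.metadata().offset_index().is_some())
}
```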

Ideally, we should move away from pyarrow's default behavior of allowing records to span multiple pages, but that requires a conversation with the community.

@lhoestq (Member, Author) commented Nov 3, 2025

closing in favor of #3244

@lhoestq closed this Nov 3, 2025