
Conversation

@lhoestq (Member) commented Jul 7, 2025

continuation of #3199

Pretty important PR since sub-rowgroup loading lets us:

TODO:

  • write metadata with page index when possible
  • load rows using page index when possible

It works using page pruning from arrow-rs.
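
Roughly, the arrow-rs mechanism looks like this; a minimal sketch, assuming the parquet crate's ArrowReaderOptions / RowSelection API, with a made-up file name and row range:

```rust
use std::fs::File;
use parquet::arrow::arrow_reader::{
    ArrowReaderOptions, ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Ask the reader to load the page index (column + offset indexes) from the footer.
    let options = ArrowReaderOptions::new().with_page_index(true);
    let file = File::open("data.parquet")?; // hypothetical file
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;

    // Read rows 1000..1100: with the page index available, only the data pages
    // overlapping the selection are fetched and decoded, not the whole row groups.
    let selection = RowSelection::from(vec![RowSelector::skip(1000), RowSelector::select(100)]);
    let reader = builder.with_row_selection(selection).build()?;

    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```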

@severo (Collaborator) commented Aug 13, 2025

I saw that it's not currently implemented in PyArrow's reader, which is why we rely on arrow-rs, right?

Do you have an idea if the feature could be added to PyArrow @kszucs?

@lhoestq changed the title from "[Rows] sub-rowgroup loading using libviewer" to "[Rows] sub-rowgroup loading using arrow-rs + libviewer" on Aug 14, 2025
@lhoestq (Member, Author) commented Aug 22, 2025

TODO: fix the "Parquet error: Invalid offset in sparse column chunk data: 2475610" error that happens when libviewer reads a parquet file with nested data and tries to skip pages

@kszucs (Member) commented Sep 2, 2025

The issue is caused by incorrect indexing for nested types.

The offset index must be aware of the record offsets, which don't correspond to the number of values because of record shredding. Sadly, the metadata doesn't contain the number of records in each page. Additionally, records may span multiple pages depending on the writer: for example, pyarrow doesn't require page boundaries to fall on record boundaries by default (only if the offset index is enabled or data page version 2 is used).

In order to have an indexer that keeps the original parquet file unchanged, we can do the following:

  1. for non-nested types, use the existing solution and build up the offset index from the number of values in each page
  2. for nested types where records DO NOT span multiple pages, read and decode only the repetition levels and count where rep_level == 0, which indicates a record boundary (see the sketch after this list)
  3. for nested types where records DO span multiple pages, rewrite the column so that each page starts at a record boundary
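
A minimal sketch of case 2, assuming the parquet crate's typed ColumnReader API and an INT64 leaf column (the helper name is made up). It counts record boundaries across a whole column chunk; splitting the count per page is what required the internal low-level constructs mentioned below:

```rust
use std::fs::File;
use parquet::column::reader::ColumnReader;
use parquet::file::reader::{FileReader, SerializedFileReader};

// Count the records in one column chunk of a nested INT64 leaf column by
// scanning repetition levels: rep_level == 0 marks the start of a new record.
fn count_records(path: &str, row_group: usize, column: usize) -> Result<usize, Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let rg = reader.get_row_group(row_group)?;
    let mut total = 0;
    if let ColumnReader::Int64ColumnReader(mut col) = rg.get_column_reader(column)? {
        let (mut defs, mut reps, mut values) = (Vec::new(), Vec::new(), Vec::new());
        loop {
            // read_records appends to the buffers, so clear them each pass.
            defs.clear();
            reps.clear();
            values.clear();
            let (records, _, levels) =
                col.read_records(8192, Some(&mut defs), Some(&mut reps), &mut values)?;
            if records == 0 && levels == 0 {
                break;
            }
            // read_records also reports the record count; we recount from the
            // repetition levels to show the mechanism described in case 2.
            total += reps.iter().filter(|&&r| r == 0).count();
        }
    }
    Ok(total)
}
```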

I was able to update the indexer's code to do 1) and 2), but it required exposing some internal low-level constructs from parquet-rs, and for case 3) we would need to rewrite the parquet file anyway. Since pyarrow's defaults allow case 3), I suggest not bothering with a faster indexer for now and instead using pyarrow to rewrite parquet files with write_page_index=True. Meanwhile, we could contribute implementations 1) and 2) to parquet-rs, with a fallback to rewriting the parquet file in case 3). Once that feature is available upstream, we could use it directly.
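
For reference, a rough parquet-rs analogue of that rewrite, not the pyarrow call the comment recommends: a sketch assuming a recent parquet crate, whose default writer properties emit the column and offset indexes (the helper name is made up):

```rust
use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

// Hypothetical helper: copy `src` to `dst` batch by batch. With default writer
// properties, recent parquet-rs versions write the column and offset indexes,
// so the output is usable by the page-pruning read path.
fn rewrite_with_page_index(src: &str, dst: &str) -> Result<(), Box<dyn std::error::Error>> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open(src)?)?;
    let schema = builder.schema().clone();
    let reader = builder.build()?;

    let mut writer = ArrowWriter::try_new(File::create(dst)?, schema, None)?;
    for batch in reader {
        writer.write(&batch?)?;
    }
    writer.close()?;
    Ok(())
}
```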

Meanwhile, we can still profit from the page-pruning reader when parquet files do have offset indexes, and otherwise fall back to row-group pruning just like the viewer currently does. We should also advise users, and update the datasets library, to write parquet files with the page index enabled.
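
A small sketch of that gating, assuming a recent parquet crate (the helper name is made up): check whether a file actually carries an offset index before opting into page pruning:

```rust
use std::fs::File;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

// Returns true if the file was written with a page index, i.e. the page-pruning
// reader can be used; otherwise fall back to row-group pruning.
fn has_offset_index(path: &str) -> Result<bool, Box<dyn std::error::Error>> {
    let options = ArrowReaderOptions::new().with_page_index(true);
    let builder =
        ParquetRecordBatchReaderBuilder::try_new_with_options(File::open(path)?, options)?;
    Ok(builder.metadata().offset_index().is_some())
}
```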

Ideally, we should move away from pyarrow's default behavior of allowing records to span multiple pages, but that requires a conversation with the community.

@lhoestq (Member, Author) commented Nov 3, 2025

closing in favor of #3244

@lhoestq closed this Nov 3, 2025