@platypii

This PR changes the parquetDataFrame function. Previously, this function fetched and decoded data one row group at a time. That made sense, since we basically need to download the whole row group anyway.

The problem is that for files with very large row groups, even after downloading, decoding all that data can lag the browser and eat up memory. See hyparam/demos#22 (comment)

This PR changes parquetDataFrame to use virtual row groups of at most 1000 rows each. It also splits virtual row groups at the natural row group boundaries, to avoid fetching two natural row groups at once.
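
A minimal sketch of the splitting idea, for illustration only. The helper name `virtualRowGroups`, its parameters, and the returned shape are hypothetical and are not the code in this PR; the sketch also assumes each natural row group is chunked starting from its own first row.

```javascript
/**
 * Split natural row groups into virtual row groups of at most maxRows rows.
 * Hypothetical helper for illustration; not hyparquet's actual implementation.
 *
 * @param {number[]} rowGroupSizes number of rows in each natural row group
 * @param {number} [maxRows] maximum rows per virtual row group
 * @returns {{groupIndex: number, rowStart: number, rowEnd: number}[]}
 */
function virtualRowGroups(rowGroupSizes, maxRows = 1000) {
  const chunks = []
  let groupStart = 0 // absolute index of the first row of the current natural group
  rowGroupSizes.forEach((numRows, groupIndex) => {
    const groupEnd = groupStart + numRows
    // Walk the natural row group in steps of maxRows. The last chunk is
    // clamped to groupEnd, so no chunk ever crosses a natural boundary.
    for (let rowStart = groupStart; rowStart < groupEnd; rowStart += maxRows) {
      const rowEnd = Math.min(rowStart + maxRows, groupEnd)
      chunks.push({ groupIndex, rowStart, rowEnd }) // rowEnd is exclusive
    }
    groupStart = groupEnd
  })
  return chunks
}

// Example: two natural row groups of 2500 and 700 rows
const chunks = virtualRowGroups([2500, 700])
console.log(chunks.length) // 4
```

Because each chunk carries a single `groupIndex`, decoding one chunk only ever needs data from one natural row group.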

Fixes hyparam/demos#22

Previously, the parquet file referenced in that issue would take about 15 seconds to load and use 6.5 GB of memory in Chrome. After the change it loads in 1-2 seconds and uses 545 MB of memory! 🎉

@platypii requested a review from @severo on May 14, 2025 05:36

@severo left a comment


Awesome improvement!


It made me wonder: since each column in a row group is divided into pages, and the metadata gives us the page ranges, would it make sense to use those instead of arbitrary 1000-row virtual groups?
I'm not sure whether all the columns of the same row group have the same number of pages, or whether the pages are aligned. If not, this idea might not make sense. But better to ask just in case.

@platypii

> metadata give us the page ranges

Sadly, the parquet metadata does NOT include page boundaries. (*)

(*): Some parquet files include a ColumnOffsetIndex which DOES give page boundaries. But it's not in the metadata, so you need to make an additional request for it. In the future, I am considering adding a useColumnIndexes option that tells hyparquet whether or not to make that additional request.

@platypii merged commit 9f806c5 into master on May 14, 2025
4 checks passed
@platypii deleted the virtual-row-groups branch on May 14, 2025 19:57
