Virtual row groups of size 1000 #243

platypii · 2025-05-14T05:36:09Z

This PR changes the parquetDataFrame function. Previously, this function fetched and decoded one full row group at a time. That made sense, since we basically need to download the whole row group anyway.

The problem is that for files with very large row groups, even after downloading, decoding all that data can lag the browser and eat up memory. See hyparam/demos#22 (comment)

This PR changes parquetDataFrame to use virtual row groups of at most 1000 rows each. It also splits virtual row groups at the natural row group boundaries, so a single virtual row group never requires fetching two natural row groups at once.
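
As a rough illustration of the splitting rule described above, here is a minimal sketch. It is not the code from this PR: the `VirtualGroup` type and the `virtualRowGroups` function are hypothetical names, and the input is assumed to be just the row count of each natural row group.

```typescript
// Hypothetical sketch, not the PR's implementation.
// A virtual row group is a half-open range of row indices.
interface VirtualGroup {
  rowStart: number // inclusive
  rowEnd: number // exclusive
}

// Split natural row groups (given by their row counts) into virtual
// row groups of at most maxSize rows. A virtual group never crosses
// a natural row group boundary.
function virtualRowGroups(rowGroupSizes: number[], maxSize = 1000): VirtualGroup[] {
  if (!Number.isInteger(maxSize) || maxSize <= 0) {
    throw new Error('maxSize must be a positive integer')
  }
  const groups: VirtualGroup[] = []
  let groupStart = 0
  for (const numRows of rowGroupSizes) {
    const groupEnd = groupStart + numRows
    // Walk the natural row group in steps of maxSize,
    // clamping the last chunk to the natural boundary.
    for (let rowStart = groupStart; rowStart < groupEnd; rowStart += maxSize) {
      groups.push({ rowStart, rowEnd: Math.min(rowStart + maxSize, groupEnd) })
    }
    groupStart = groupEnd
  }
  return groups
}
```

For natural row groups of 2500 and 300 rows, `virtualRowGroups([2500, 300])` yields the ranges [0, 1000), [1000, 2000), [2000, 2500) and [2500, 2800): the third range is cut short at the natural boundary instead of spilling into the next row group.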

Fixes hyparam/demos#22

Previously, the parquet file referenced in that issue would take about 15 seconds to load and use 6.5 GB of memory on Chrome. After the change it loads in 1-2 seconds with 545 MB of memory! 🎉

severo

Awesome improvement!

It made me wonder: as each column in a row group is divided into pages, and as the metadata give us the page ranges, would it make sense to use them instead of arbitrary 1000-row virtual groups?
I'm not sure whether all the columns of the same row group have the same number of pages, or whether those pages are aligned. If not, this idea might not make sense. But better to ask just in case.

src/lib/tableProvider.ts

platypii · 2025-05-14T16:15:36Z

metadata give us the page ranges

Sadly, the parquet metadata does NOT include page boundaries. (*)

(*): Some parquet files include a ColumnOffsetIndex which DOES give page boundaries. But it's not in the metadata, so you need to make an additional request for it. I am considering adding a useColumnIndexes option in the future, which would tell hyparquet whether to make that additional request or not.
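
To make the "additional request" concrete, here is a small hedged sketch. In the Parquet format definition, a column chunk can carry the optional fields `offset_index_offset` and `offset_index_length`, which say where the page-location index is stored in the file. The `ColumnChunkLike` type and the `offsetIndexRange` helper below are hypothetical and only mirror those field names; this is not hyparquet's API, and it only computes the byte range that such an extra request would need to cover.

```typescript
// Hypothetical sketch: field names mirror the Parquet format's column chunk
// definition, but this type and helper are illustrative, not hyparquet's API.
interface ColumnChunkLike {
  // 64-bit file offset; some decoders may surface it as a bigint
  offset_index_offset?: number | bigint
  offset_index_length?: number
}

interface ByteRange {
  start: number // inclusive
  end: number // exclusive
}

// Return the byte range holding the page-location index for a column chunk,
// or undefined when the file was written without one.
function offsetIndexRange(column: ColumnChunkLike): ByteRange | undefined {
  const { offset_index_offset, offset_index_length } = column
  if (offset_index_offset === undefined || offset_index_length === undefined) {
    return undefined
  }
  const start = Number(offset_index_offset)
  return { start, end: start + offset_index_length }
}
```

When the helper returns undefined there are no page boundaries to be had, and fixed-size virtual row groups remain the fallback.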

Virtual row groups of size 1000

fd083e9

platypii requested a review from severo May 14, 2025 05:36

severo approved these changes May 14, 2025

PR feedback

e66984b

platypii merged commit 9f806c5 into master May 14, 2025
4 checks passed

platypii deleted the virtual-row-groups branch May 14, 2025 19:57
