@rjzamora (Member) commented Dec 10, 2025

Adds a new streaming node for reading Parquet data with uniform chunk distribution. Here, "uniform" does not mean that all chunks will be the same size. Rather, it means that every chunk will correspond to the same file count or file fraction. This approach assumes that the files and row-groups in the dataset are relatively uniform in size.
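To make the "same file count or file fraction" idea concrete, here is a minimal sketch of how such a plan could be computed. It is an illustration only, not the code in this PR: `ChunkPlan` and `plan_uniform_chunks` are hypothetical names, and the rounding rules are assumptions.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class ChunkPlan:
    """Hypothetical description of one chunk (not a type from this PR)."""

    file_indices: tuple[int, ...]  # files this chunk reads from
    split_index: int = 0  # which slice of the file, when a file is split
    splits_per_file: int = 1  # number of slices each file is cut into


def plan_uniform_chunks(num_files: int, target_num_chunks: int) -> list[ChunkPlan]:
    """Assign either N whole files or 1/N of a file to every chunk."""
    if num_files < 1 or target_num_chunks < 1:
        raise ValueError("num_files and target_num_chunks must be positive")

    if target_num_chunks <= num_files:
        # Same file count per chunk (the final chunk may hold fewer files
        # when num_files is not a multiple of files_per_chunk).
        files_per_chunk = ceil(num_files / target_num_chunks)
        return [
            ChunkPlan(tuple(range(start, min(start + files_per_chunk, num_files))))
            for start in range(0, num_files, files_per_chunk)
        ]

    # Same file fraction per chunk: every file is cut into the same
    # number of slices, regardless of its actual size.
    splits = ceil(target_num_chunks / num_files)
    return [
        ChunkPlan((file_index,), split_index, splits)
        for file_index in range(num_files)
        for split_index in range(splits)
    ]
```

With 8 files and a target of 4 chunks, every chunk covers 2 files. With 2 files and a target of 6 chunks, every chunk covers one third of a file. In neither case does the plan look at byte or row counts, which is why the approach relies on the dataset being roughly uniform.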

Motivation:

  • This partitioning approach is similar to the approach already used in cudf-polars. Therefore, it is easy to integrate this API with cudf-polars.
  • This approach produces file- or row-group-aligned reads in most cases.
    • The only exception is when a large file is split into more chunks than it has row-groups (which is rare).
  • This approach supports distributed chunking across a single large file. The read_parquet node doesn't support this yet (though it certainly could in the future; see: Support large-file splitting between ranks in read_parquet #736).
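The row-group alignment point can be pictured with a small sketch. This is an assumption about how a file fraction might map onto row-groups, not the logic implemented in this PR, and `split_row_groups` is a hypothetical helper.

```python
def split_row_groups(
    num_row_groups: int, splits_per_file: int
) -> tuple[list[range], bool]:
    """Map each slice of a file onto a contiguous range of whole row-groups.

    Returns the per-slice row-group ranges and a flag saying whether every
    slice can be served by whole row-groups (i.e. the reads stay aligned).
    """
    if num_row_groups < 1 or splits_per_file < 1:
        raise ValueError("num_row_groups and splits_per_file must be positive")

    # Integer boundaries spread the row-groups as evenly as possible.
    bounds = [
        i * num_row_groups // splits_per_file for i in range(splits_per_file + 1)
    ]
    slices = [range(bounds[i], bounds[i + 1]) for i in range(splits_per_file)]

    # With more slices than row-groups, some ranges come out empty, so a
    # reader would have to cut inside a row-group to give every slice data.
    aligned = splits_per_file <= num_row_groups
    return slices, aligned
```

A file with 8 row-groups split 4 ways gives each slice 2 whole row-groups. A file with 2 row-groups split 4 ways leaves two slices with no whole row-group, which is the rare unaligned case described above.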

Other considerations:

  • The main read_parquet node may be safer for datasets with non-uniform file and/or row-group sizes.
  • The estimate_target_num_chunks utility cannot account for the effects of filters. Therefore, each chunk may contain significantly fewer rows than the num_rows_per_chunk argument requests when a filter is applied at IO time (even if the dataset is "uniform"). The same may be true of row-count-based chunks (?)
  • The effect of aligning reads with row-group/file boundaries was almost negligible in my local testing (<5%). Therefore, using read_parquet_uniform is unlikely to provide a measurable performance improvement (more testing is needed to say for sure).
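The filter caveat is simple arithmetic, sketched below. `estimate_num_chunks` is a hypothetical stand-in and does not reflect the real signature or logic of estimate_target_num_chunks; the `selectivity` parameter is likewise an assumption used only for illustration.

```python
from math import ceil


def estimate_num_chunks(total_rows: int, num_rows_per_chunk: int) -> int:
    """Hypothetical stand-in: chunk count from unfiltered row counts only."""
    if total_rows < 0 or num_rows_per_chunk < 1:
        raise ValueError("invalid row counts")
    return max(1, ceil(total_rows / num_rows_per_chunk))


def rows_per_chunk_after_filter(
    total_rows: int, num_rows_per_chunk: int, selectivity: float
) -> float:
    """Average rows per chunk once an IO-time filter keeps `selectivity` of rows.

    The chunk count is fixed before the filter runs, so the filter shrinks
    every chunk instead of reducing the number of chunks.
    """
    num_chunks = estimate_num_chunks(total_rows, num_rows_per_chunk)
    return total_rows * selectivity / num_chunks
```

For a 10,000,000-row dataset with num_rows_per_chunk set to 1,000,000, the estimate is 10 chunks. If a filter keeps 25% of the rows, each chunk averages 250,000 rows, a quarter of what was requested.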

@rjzamora self-assigned this Dec 10, 2025
@rjzamora added the improvement (Improves an existing functionality) label Dec 10, 2025
@copy-pr-bot (bot) commented Dec 10, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@rjzamora added the non-breaking (Introduces a non-breaking change) label Dec 10, 2025
@rjzamora (Member, Author) commented:

/ok to test
