Introduce read_parquet_uniform node
#732
Draft
Adds a new streaming node for reading Parquet data with uniform chunk distribution. Here, "uniform" does not mean that all chunks will be a uniform size. Rather, it means that every chunk will correspond to the same file count or file fraction. This approach assumes the files and row-groups in the dataset have a relatively uniform size distribution.
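The core idea (splitting a file list so every chunk holds the same file count, independent of actual file sizes) can be sketched as follows. This is a hypothetical illustration, not the PR's implementation; the function name `uniform_file_chunks` is invented here:

```python
def uniform_file_chunks(files: list[str], num_chunks: int) -> list[list[str]]:
    """Split a file list into num_chunks chunks of (nearly) equal file count.

    Illustrative sketch of "uniform" distribution: chunk boundaries are
    chosen by file count alone, so chunks are only uniform in bytes/rows
    if the underlying files are themselves uniformly sized.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be >= 1")
    base, extra = divmod(len(files), num_chunks)
    chunks: list[list[str]] = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks absorb one leftover file each.
        size = base + (1 if i < extra else 0)
        chunks.append(files[start:start + size])
        start += size
    return chunks


# Example: 5 files split into 2 chunks -> 3 files, then 2 files.
print(uniform_file_chunks(["a", "b", "c", "d", "e"], 2))
```

When `num_chunks` exceeds the file count, a real implementation would instead assign each chunk a fraction of a file (e.g. a subset of row-groups); that case is omitted here for brevity.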
Motivation:

- The read_parquet node doesn't support this yet (though it certainly can in the future; see: Support large-file splitting between ranks in read_parquet #736).

Other considerations:

- The read_parquet node may be safer for datasets with non-uniform file and/or row-group sizes.
- The estimate_target_num_chunks utility cannot account for the effects of filters. Therefore, the size of each chunk may be significantly smaller than the corresponding num_rows_per_chunk argument when a filter is applied at IO time (even if the dataset is "uniform"). This may also be the case for row-count-based chunks (?).
- read_parquet_uniform is unlikely to provide a measurable performance improvement (more testing is needed to say for sure).