Support large-file splitting between ranks in read_parquet (#736)
base: main
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

/ok to test

/ok to test
cpp/src/streaming/cudf/parquet.cpp (outdated)

    std::back_inserter(chunk_files)
    bool rank_has_assigned_work = false;  // Track if this rank was assigned work
    if (files.size() >= size) {
I think it would be very helpful to move the two cases into separate helper functions, with docstrings describing the two different approaches.
Yeah, I definitely think that makes sense. I refactored the chunk-assignment logic into two distinct helper functions: assign_chunks_standard and assign_chunks_split_files.
@wence- - Let me know if this refactor is disruptive to anything you may have in flight.
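For context, the split described above could be dispatched on the relationship between file count and rank count. The sketch below is illustrative only: `Strategy` and `choose_strategy` are hypothetical names, not the PR's actual `assign_chunks_standard`/`assign_chunks_split_files` signatures.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical dispatch between the two chunk-assignment strategies:
// with at least as many files as ranks, each rank can be assigned whole
// files; otherwise, large files must be split across ranks.
enum class Strategy { Standard, SplitFiles };

Strategy choose_strategy(std::size_t num_files, std::size_t num_ranks) {
    return num_files >= num_ranks ? Strategy::Standard : Strategy::SplitFiles;
}
```

This mirrors the `files.size() >= size` branch visible in the outdated diff above.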
Addresses item 2 in #615
General Approach: When the number of files is smaller than the number of ranks, we estimate the number of chunks we need to produce for each file, and use this estimate to determine the (row-group-aligned) boundary between ranks.
This PR does not attempt to optimize metadata collection or to align chunks with row-group boundaries; I'd like to keep that work separate for now.
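One way to picture the estimation step is sketched below. This is not the PR's implementation: `estimate_chunks_per_file` is a hypothetical helper, and weighting chunk counts by file size (with at least one chunk per file) is an assumption about how the estimate might work. Row-group alignment is omitted, and the input is assumed to be a non-empty list of non-zero file sizes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical sketch: estimate how many chunks each file should
// contribute so that roughly num_ranks chunks exist in total, weighting
// each file by its share of the total bytes.
std::vector<std::size_t> estimate_chunks_per_file(
    std::vector<std::uint64_t> const& file_sizes, std::size_t num_ranks) {
    std::uint64_t total =
        std::accumulate(file_sizes.begin(), file_sizes.end(), std::uint64_t{0});
    std::vector<std::size_t> chunks;
    chunks.reserve(file_sizes.size());
    for (auto sz : file_sizes) {
        // Size-proportional share of the ranks, truncated, but at least
        // one chunk per file so every file is read.
        auto n = static_cast<std::size_t>(
            static_cast<double>(sz) / static_cast<double>(total) *
            static_cast<double>(num_ranks));
        chunks.push_back(std::max<std::size_t>(n, 1));
    }
    return chunks;
}
```

In the actual PR the boundaries derived from such an estimate would then be snapped to row-group boundaries, as described in the approach above.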