@rjzamora
Member

Addresses item 2 in #615

General Approach: When the number of files is smaller than the number of ranks, we estimate the number of chunks we need to produce for each file, and use this estimate to determine the (row-group-aligned) boundary between ranks.

This PR does not attempt to optimize metadata collection or to align chunks with row-group boundaries. I'd like to separate that work for now.
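To make the "fewer files than ranks" case concrete, here is a minimal sketch of how a rank-to-file boundary could be computed. It is not the code in this PR: `RankSlice` and `slice_for_rank` are hypothetical names, and an even split of each file's row groups stands in for the chunk-count estimate described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: the half-open row-group range one rank reads.
struct RankSlice {
    std::size_t file_index;
    std::size_t row_group_begin;
    std::size_t row_group_end;  // exclusive
};

// Hypothetical helper (not the PR's code). Assumes 0 < num_files <= num_ranks.
RankSlice slice_for_rank(std::size_t rank,
                         std::size_t num_ranks,
                         const std::vector<std::size_t>& row_groups_per_file) {
    const std::size_t num_files = row_groups_per_file.size();
    assert(num_files > 0 && num_files <= num_ranks && rank < num_ranks);

    for (std::size_t f = 0; f < num_files; ++f) {
        // Ranks [first, last) share file f; every file gets at least one rank.
        const std::size_t first = f * num_ranks / num_files;
        const std::size_t last = (f + 1) * num_ranks / num_files;
        if (rank < first || rank >= last) continue;

        const std::size_t sharers = last - first;
        const std::size_t local = rank - first;
        const std::size_t groups = row_groups_per_file[f];
        // Integer division keeps both ends on row-group boundaries.
        return {f, local * groups / sharers, (local + 1) * groups / sharers};
    }
    return {num_files, 0, 0};  // unreachable when the precondition holds
}
```

With row-group counts `{10, 4}` and 5 ranks, ranks 0 and 1 split file 0 as `[0, 5)` and `[5, 10)`, and ranks 2 to 4 split file 1 as `[0, 1)`, `[1, 2)`, and `[2, 4)`. A rank can end up with an empty range when a file has fewer row groups than ranks sharing it, so a caller would need to handle ranks with no assigned work.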

@rjzamora self-assigned this Dec 11, 2025
@rjzamora added the feature request (New feature or request) and non-breaking (Introduces a non-breaking change) labels Dec 11, 2025
@copy-pr-bot

copy-pr-bot bot commented Dec 11, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rjzamora
Member Author

/ok to test

@rjzamora
Member Author

/ok to test

@rjzamora marked this pull request as ready for review December 15, 2025 17:46
@rjzamora requested review from a team as code owners December 15, 2025 17:46
std::back_inserter(chunk_files)
bool rank_has_assigned_work = false; // Track if this rank was assigned work

if (files.size() >= size) {
Member

I think it would be very helpful to move the two cases into separate helper functions, with docstrings describing the two different approaches.

Member Author

Yeah, I definitely think that makes sense. I refactored the chunk-assignment logic into two distinct helper functions: assign_chunks_standard and assign_chunks_split_files.

@wence- - Let me know if this refactor is disruptive to anything you may have in flight.
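For readers following the thread, the shape of that split might look roughly like the sketch below. Only the names `assign_chunks_standard` and `assign_chunks_split_files` come from this conversation; the signatures, the `ChunkDesc` type, the `assign_chunks` dispatcher, and the bodies are assumptions made for illustration, not the refactored code itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for whatever describes one unit of read work.
struct ChunkDesc {
    std::string path;
    std::size_t part;       // which part of the file this rank reads
    std::size_t num_parts;  // how many ranks share the file
};

/// Standard case (files.size() >= size): each rank takes a contiguous
/// block of whole files.
std::vector<ChunkDesc> assign_chunks_standard(
    const std::vector<std::string>& files, std::size_t rank, std::size_t size) {
    std::vector<ChunkDesc> out;
    const std::size_t begin = rank * files.size() / size;
    const std::size_t end = (rank + 1) * files.size() / size;
    for (std::size_t i = begin; i < end; ++i) out.push_back({files[i], 0, 1});
    return out;
}

/// Split-files case (files.size() < size): several ranks share one file,
/// and each rank reads exactly one part of it.
std::vector<ChunkDesc> assign_chunks_split_files(
    const std::vector<std::string>& files, std::size_t rank, std::size_t size) {
    std::vector<ChunkDesc> out;
    for (std::size_t f = 0; f < files.size(); ++f) {
        const std::size_t first = f * size / files.size();
        const std::size_t last = (f + 1) * size / files.size();
        if (rank >= first && rank < last) {
            out.push_back({files[f], rank - first, last - first});
        }
    }
    return out;
}

// Hypothetical dispatcher mirroring the `files.size() >= size` branch.
std::vector<ChunkDesc> assign_chunks(
    const std::vector<std::string>& files, std::size_t rank, std::size_t size) {
    assert(size > 0 && rank < size);
    if (files.size() >= size) return assign_chunks_standard(files, rank, size);
    return assign_chunks_split_files(files, rank, size);
}
```

Keeping the branch in a thin dispatcher leaves each helper with a single documented strategy, which is the point of the review suggestion.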
