Skip to content

Conversation

NeoLegends
Copy link
Member

... but rely on _distrib_info instead.

Fixes #1678

@Judyxujj feel free to give this PR a test run.

@NeoLegends NeoLegends self-assigned this Feb 4, 2025
@NeoLegends NeoLegends changed the title DFDataset: do not pickle sharding info DFDataset: do not pickle sharding info Feb 4, 2025
@NeoLegends NeoLegends requested a review from Judyxujj February 4, 2025 13:11
@albertz
Copy link
Member

albertz commented Feb 4, 2025

I don't understand. We actually want that this is pickled/serialized. It doesn't make sense to not serialize this? We want/need this information.

@albertz
Copy link
Member

albertz commented Feb 4, 2025

How does this work together with #1676?

@NeoLegends
Copy link
Member Author

NeoLegends commented Feb 4, 2025

I don't understand. We actually want that this is pickled/serialized. It doesn't make sense to not serialize this? We want/need this information.

The dataset already handles the pickling of this information on its own via _distrib_info and then makes sure the user does not set the properties by hand. Alternatively we can remove that property and the assertion. I went the other way now because it seems like a good-enough fix that keeps the existing assertion intact until #1676 lands.

How does this work together with #1676?

#1676 removes the assertion, so the change here won't be necessary anymore.

@albertz
Copy link
Member

albertz commented Feb 4, 2025

So instead of removing/cleaning up the obsolete/duplicate code/logic with _distrib_info, you add another workaround? I think cleaning up makes much more sense than adding yet another workaround?

@NeoLegends
Copy link
Member Author

If it unblocks @Judyxujj quickly then I see value in that, it's not like this PR has a huge number of lines :D

@NeoLegends
Copy link
Member Author

Superseeded by #1676

@NeoLegends NeoLegends closed this Mar 5, 2025
@NeoLegends NeoLegends deleted the moritz-dfd-sharding-pickle branch March 5, 2025 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DistributeFilesDataset _num_shards issue

2 participants