@ArjunJagdale
Contributor

@ArjunJagdale ArjunJagdale commented Jul 28, 2025

(revival of #6832)

#7648 (comment)

Close #4101, and more


This PR is still a work in progress.

@ArjunJagdale
Contributor Author

ArjunJagdale commented Jul 28, 2025

Mario’s Patch (in PR #6832):

def _make_split_generators_kwargs(self, prepare_split_kwargs):
    # Pass `pipeline` into `_split_generators()` from `prepare_split_kwargs` if
    # it's in the call signature of `_split_generators()`.
    # This allows for global preprocessing in beam.
    split_generators_kwargs = {}
    if "pipeline" in inspect.signature(self._split_generators).parameters:
        split_generators_kwargs["pipeline"] = prepare_split_kwargs["pipeline"]
    split_generators_kwargs.update(super()._make_split_generators_kwargs(prepare_split_kwargs))
    return split_generators_kwargs
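
To make the signature check in this patch concrete, here is a minimal, self-contained sketch. The classes `BaseToyBuilder`, `BeamLikeToyBuilder`, and `PlainToyBuilder` are hypothetical stand-ins, not the real `datasets` builders; the only idea taken from the patch is forwarding `pipeline` when `_split_generators()` declares it.

```python
import inspect


class BaseToyBuilder:
    """Hypothetical base class; stands in for the parent builder in the patch."""

    def _make_split_generators_kwargs(self, prepare_split_kwargs):
        # The toy parent contributes no extra kwargs.
        return {}


class BeamLikeToyBuilder(BaseToyBuilder):
    """Hypothetical builder whose `_split_generators()` accepts `pipeline`."""

    def _split_generators(self, dl_manager, pipeline):
        return f"splits built with {pipeline}"

    def _make_split_generators_kwargs(self, prepare_split_kwargs):
        kwargs = {}
        # Forward `pipeline` only if the subclass actually declares it.
        if "pipeline" in inspect.signature(self._split_generators).parameters:
            kwargs["pipeline"] = prepare_split_kwargs["pipeline"]
        kwargs.update(super()._make_split_generators_kwargs(prepare_split_kwargs))
        return kwargs


class PlainToyBuilder(BeamLikeToyBuilder):
    """Hypothetical builder that does not take `pipeline`."""

    def _split_generators(self, dl_manager):
        return "splits built without a pipeline"


with_pipeline = BeamLikeToyBuilder()._make_split_generators_kwargs({"pipeline": "p"})
without_pipeline = PlainToyBuilder()._make_split_generators_kwargs({"pipeline": "p"})
print(with_pipeline, without_pipeline)
```

Because the check reads the bound method's signature, a subclass opts in simply by naming the parameter; nothing else needs to be registered.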

In the latest main (both in my fork and in the upstream repo):

def _make_split_generators_kwargs(self, prepare_split_kwargs):
    """Get kwargs for `self._split_generators()` from `prepare_split_kwargs`."""
    splits = prepare_split_kwargs.pop("splits", None)
    if self._supports_partial_generation():
        return {"splits": splits}
    return {}

If I'm not mistaken, this passes splits into _split_generators() only for builders that support partial generation. So I have ignored the Beam logic for now.
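
A rough sketch of how that gating behaves, under stated assumptions: `ToyBuilder`, `PartialToyBuilder`, and `generate` are hypothetical names, and `_supports_partial_generation()` is modelled here as a plain overridable hook, which may not match how the real method decides.

```python
class ToyBuilder:
    """Hypothetical builder that always generates every split."""

    ALL_SPLITS = ["train", "validation", "test"]

    def _supports_partial_generation(self):
        # Assumption: builders opt in by overriding this hook.
        return False

    def _split_generators(self):
        return list(self.ALL_SPLITS)

    def _make_split_generators_kwargs(self, prepare_split_kwargs):
        # Same shape as the method on main: `splits` is always popped,
        # but only forwarded when the builder supports it.
        splits = prepare_split_kwargs.pop("splits", None)
        if self._supports_partial_generation():
            return {"splits": splits}
        return {}


class PartialToyBuilder(ToyBuilder):
    """Hypothetical builder that can generate a subset of splits."""

    def _supports_partial_generation(self):
        return True

    def _split_generators(self, splits=None):
        if splits is None:
            return list(self.ALL_SPLITS)
        return [name for name in self.ALL_SPLITS if name in splits]


def generate(builder, **prepare_split_kwargs):
    """Hypothetical driver: build the kwargs, then call `_split_generators()`."""
    kwargs = builder._make_split_generators_kwargs(prepare_split_kwargs)
    return builder._split_generators(**kwargs)


full = generate(ToyBuilder(), splits=["train", "test"])
partial = generate(PartialToyBuilder(), splits=["train", "test"])
print(full, partial)
```

In this toy version, a builder that does not opt in silently ignores the requested splits and generates everything, which is the backward-compatible behaviour the method on main appears to aim for.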

@lhoestq
Member

lhoestq commented Sep 4, 2025

Awesome! By the way, we can modify GeneratorBasedBuilder and ArrowBasedBuilder if needed, now that custom loading scripts are no longer supported :)

I'll review this in a bit

@CloseChoice
Contributor

@lhoestq @ArjunJagdale is this still a work in progress, or is it just missing a review? Is there anything I can help with here? This would indeed be a cool feature.

@lhoestq
Member

lhoestq commented Oct 28, 2025

I did a preliminary pass and it looks good, but we should check the CI. @ArjunJagdale, could you run make style so we can run the CI?

@ArjunJagdale
Contributor Author

ArjunJagdale commented Oct 29, 2025

Done! Some parts may still be incomplete, because I had to focus on important exams and semester activities and couldn't finish the work fully. I will still try my best.

Development

Successfully merging this pull request may close these issues.

How can I download only the train and test split for full numbers using load_dataset()?

3 participants