
Conversation

@dushyantbehl
Collaborator

@dushyantbehl dushyantbehl commented Feb 13, 2025

Description of the change

This PR adds support for packing pretokenized datasets, a capability added in transformers>=4.46.

Adds DataCollatorForSeq2Seq with padding=False for use with pretokenized datasets when packing is enabled.

Removes skip_prepare_dataset, since prepare_dataset must run in order to enable packing on a tokenized dataset. A minimal sketch of the resulting setup is shown below.
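
A minimal sketch, assuming trl's SFTTrainer API around the time of this PR; the model name, toy dataset, and max_seq_length value are illustrative placeholders, not taken from this repository:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForSeq2Seq
from trl import SFTConfig, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A pretokenized dataset: rows already carry input_ids, no raw-text column.
train_dataset = Dataset.from_dict(
    {"input_ids": [tokenizer("hello world")["input_ids"]] * 8}
)

# padding=False: packing concatenates sequences itself, so pre-padded
# examples would only waste space in the packed context window.
collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=False)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="out", packing=True, max_seq_length=64),
    processing_class=tokenizer,
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```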

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@github-actions

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@kmehant
Collaborator

kmehant commented Feb 13, 2025

@dushyantbehl

  1. max_seq_len is too long in your test case; packing could not produce even one sample, so the test was failing (see the toy illustration below).
  2. You should pass the seq2seq collator with padding=False when the dataset is pretokenized, but it seems to pick up the completion-only collator instead.
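
A toy illustration of point 1 (my assumption about how greedy packing behaves, not the trainer's actual code): when max_seq_len exceeds the total number of tokens available, packing emits zero samples, so the test sees no data.

```python
def pack(token_lists, max_seq_len):
    """Greedily concatenate token lists into fixed-length chunks."""
    buffer, packed = [], []
    for tokens in token_lists:
        buffer.extend(tokens)
        while len(buffer) >= max_seq_len:
            packed.append(buffer[:max_seq_len])
            buffer = buffer[max_seq_len:]
    return packed  # any incomplete trailing buffer is dropped

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # 9 tokens in total
print(len(pack(examples, max_seq_len=4)))      # 2 packed samples
print(len(pack(examples, max_seq_len=100)))    # 0 -> test has no data
```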

@dushyantbehl dushyantbehl force-pushed the packing-for-pretokenised branch 5 times, most recently from 6d4e771 to 73b81c1 on March 3, 2025 at 11:27
Remove the skip_prepare_dataset flag, as we want the trainer to
pack the dataset when it is tokenized, which is done via prepare_dataset.

Signed-off-by: Dushyant Behl <[email protected]>
@dushyantbehl dushyantbehl force-pushed the packing-for-pretokenised branch from 73b81c1 to 5030dba on March 20, 2025 at 04:31
@dushyantbehl dushyantbehl changed the title from "feat: Packing for pretokenised" to "feat: Enable Packing for pretokenised dataset" on Mar 20, 2025
Comment on lines -399 to -402
```python
dataset_kwargs = {}
if is_tokenized_dataset:
    dataset_kwargs["skip_prepare_dataset"] = True
```

Collaborator

Any reason we don't want to skip anymore?

Collaborator Author

Yes, I added an explanation here: #468 (comment)
We need to remove this check so that prepare_dataset is called, enabling packing for the pretokenized dataset. A sketch of the effect is below.
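
A sketch of the effect of the removal (a paraphrase, not the exact PR diff; the is_tokenized_dataset flag is set here only for illustration):

```python
is_tokenized_dataset = True  # illustrative flag

# Before this PR: tokenized datasets set skip_prepare_dataset, so
# SFTTrainer never ran prepare_dataset on them and never packed them.
dataset_kwargs = {}
if is_tokenized_dataset:
    dataset_kwargs["skip_prepare_dataset"] = True

# After this PR: the flag is no longer set, so prepare_dataset runs for
# tokenized datasets too and can apply packing.
dataset_kwargs = {}
```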

Collaborator

Just checked; it looks like trl itself checks whether the dataset is tokenized and skips the corresponding prepare work. A rough paraphrase is sketched below.
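
A rough paraphrase of that behaviour as I read it (not trl's actual source; the helper arguments are hypothetical):

```python
def prepare_dataset(dataset, packing, tokenize_fn, pack_fn):
    # Hypothetical paraphrase: a pretokenized dataset is recognised by
    # its columns, so tokenization is skipped, but packing still applies.
    if "input_ids" not in dataset.column_names:
        dataset = tokenize_fn(dataset)
    if packing:
        dataset = pack_fn(dataset)
    return dataset
```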

Collaborator

@kmehant kmehant left a comment

LGTM

Collaborator

@willmj willmj left a comment

LGTM

Collaborator

@Abhishek-TAMU Abhishek-TAMU left a comment

LGTM

@willmj willmj merged commit 1c9f773 into foundation-model-stack:main Mar 20, 2025
9 checks passed