feat: Enable Packing for pretokenised dataset #468
Conversation
Thanks for making a pull request! 😃
Remove skip_prepare_dataset flag, as we want the trainer to pack the dataset if tokenized, which is done via prepare_dataset. Signed-off-by: Dushyant Behl <[email protected]>
```python
dataset_kwargs = {}
if is_tokenized_dataset:
    dataset_kwargs["skip_prepare_dataset"] = True
```
Any reason we don't want to skip anymore?
Yes, I added an explanation here: #468 (comment)
We need to remove this check to call prepare_dataset and enable packing for the pretokenized dataset.
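For context, a minimal sketch of the change under discussion (variable names follow the quoted hunk above; the note on trainer behavior summarizes this thread, not the TRL source):

```python
dataset_kwargs = {}
# Removed by this PR -- setting the flag made the trainer return a
# tokenized dataset untouched, so it was never packed:
#
#   if is_tokenized_dataset:
#       dataset_kwargs["skip_prepare_dataset"] = True
#
# Without the flag, the trainer runs its dataset preparation step on the
# tokenized dataset and can therefore pack it when packing is enabled.
```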
Just checked; it looks like TRL checks whether the dataset is tokenized and skips prepare.
This line is what we need - https://github.com/huggingface/trl/blob/a34987956cd5bf08ed7501da2510b9404bede695/trl/trainer/sft_trainer.py#L467
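As a rough paraphrase of the behavior the link points at (illustrative pseudocode only; the function name, structure, and block size below are assumptions, not the actual TRL source):

```python
from typing import Any, Dict, List

def prepare_dataset_sketch(
    dataset: List[Dict[str, Any]], packing: bool, dataset_kwargs: Dict[str, Any]
) -> List[Dict[str, Any]]:
    # Paraphrased control flow; consult the linked source line for the
    # authoritative version.
    if dataset_kwargs.get("skip_prepare_dataset"):
        return dataset  # used as-is: no tokenization and no packing
    if packing:
        # Stand-in for TRL's packing path: concatenate tokenized examples
        # into fixed-length blocks (block size here is arbitrary).
        block = 8
        ids = [tok for ex in dataset for tok in ex["input_ids"]]
        return [
            {"input_ids": ids[i : i + block]}
            for i in range(0, len(ids) - block + 1, block)
        ]
    return dataset
```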
kmehant left a comment
LGTM
willmj left a comment
LGTM
Abhishek-TAMU left a comment
LGTM
Description of the change
This PR adds support for packing for pretokenized datasets, which was added in `transformers>=4.46`. It:
- Adds `DataCollatorForSeq2Seq` for use with a pretokenized dataset when packing is enabled, with `padding=False` (see the sketch after this list).
- Removes `skip_prepare_dataset`, since calling `prepare_dataset` is what enables packing on a tokenized dataset.
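A minimal sketch of the collator configuration described above (the checkpoint name and the toy features are placeholder assumptions, not taken from the PR):

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

# Placeholder checkpoint; any tokenizer works for this illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# padding=False: packed pretokenized batches already share one length,
# so the collator should not pad them further.
collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=False)

features = [
    {"input_ids": [1, 2, 3, 4], "labels": [1, 2, 3, 4]},
    {"input_ids": [5, 6, 7, 8], "labels": [5, 6, 7, 8]},
]
batch = collator(features)
print(batch["input_ids"].shape)  # torch.Size([2, 4])
```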
Related issue number
How to verify the PR
Was the PR tested