
Conversation

@dushyantbehl
Collaborator

@dushyantbehl dushyantbehl commented Feb 13, 2025

Description of the change

This PR adds support for packing pretokenized datasets, a capability added in transformers>=4.46.

Adds DataCollatorForSeq2Seq with padding=False for use with pretokenized datasets when packing is enabled.

Removes skip_prepare_dataset, since prepare_dataset must run in order to enable packing on a tokenized dataset. A minimal sketch of the resulting setup is shown below.
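
A minimal sketch, assuming trl's SFTTrainer API around the time of this PR; the model name, toy dataset, and max_seq_length value are illustrative placeholders, not taken from this repository:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForSeq2Seq
from trl import SFTConfig, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A pretokenized dataset: rows already carry input_ids, no raw-text column.
train_dataset = Dataset.from_dict(
    {"input_ids": [tokenizer("hello world")["input_ids"]] * 8}
)

# padding=False: packing concatenates sequences itself, so pre-padded
# examples would only waste space in the packed context window.
collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=False)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="out", packing=True, max_seq_length=64),
    processing_class=tokenizer,
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```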

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@github-actions

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@kmehant
Collaborator

kmehant commented Feb 13, 2025

@dushyantbehl

  1. max_seq_len is too long in your test case; packing could not produce even one sample, so the test was failing (see the toy illustration below).
  2. You should pass the seq2seq collator with padding=False when the dataset is pretokenized, but it seems to pick up the completion-only collator instead.
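
A toy illustration of point 1 (my assumption about how greedy packing behaves, not the trainer's actual code): when max_seq_len exceeds the total number of tokens available, packing emits zero samples, so the test sees no data.

```python
def pack(token_lists, max_seq_len):
    """Greedily concatenate token lists into fixed-length chunks."""
    buffer, packed = [], []
    for tokens in token_lists:
        buffer.extend(tokens)
        while len(buffer) >= max_seq_len:
            packed.append(buffer[:max_seq_len])
            buffer = buffer[max_seq_len:]
    return packed  # any incomplete trailing buffer is dropped

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # 9 tokens in total
print(len(pack(examples, max_seq_len=4)))      # 2 packed samples
print(len(pack(examples, max_seq_len=100)))    # 0 -> test has no data
```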

@dushyantbehl dushyantbehl force-pushed the packing-for-pretokenised branch 5 times, most recently from 6d4e771 to 73b81c1 on March 3, 2025 at 11:27
Remove the skip_prepare_dataset flag, as we want the trainer to
pack the dataset when it is tokenized, which is done via prepare_dataset.

Signed-off-by: Dushyant Behl <[email protected]>
@dushyantbehl dushyantbehl force-pushed the packing-for-pretokenised branch from 73b81c1 to 5030dba on March 20, 2025 at 04:31
@dushyantbehl dushyantbehl changed the title from "feat: Packing for pretokenised" to "feat: Enable Packing for pretokenised dataset" on Mar 20, 2025
Comment on lines -399 to -402
```python
dataset_kwargs = {}
if is_tokenized_dataset:
    dataset_kwargs["skip_prepare_dataset"] = True
```

Collaborator

Any reason we don't want to skip anymore?

Collaborator Author

Yes, I added an explanation here: #468 (comment)
We need to remove this check so that prepare_dataset is called, enabling packing for the pretokenized dataset. A sketch of the effect is below.
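
A sketch of the effect of the removal (a paraphrase, not the exact PR diff; the is_tokenized_dataset flag is set here only for illustration):

```python
is_tokenized_dataset = True  # illustrative flag

# Before this PR: tokenized datasets set skip_prepare_dataset, so
# SFTTrainer never ran prepare_dataset on them and never packed them.
dataset_kwargs = {}
if is_tokenized_dataset:
    dataset_kwargs["skip_prepare_dataset"] = True

# After this PR: the flag is no longer set, so prepare_dataset runs for
# tokenized datasets too and can apply packing.
dataset_kwargs = {}
```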

Collaborator

Just checked; it looks like trl itself checks whether the dataset is tokenized and skips the corresponding prepare work. A rough paraphrase is sketched below.
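
A rough paraphrase of that behaviour as I read it (not trl's actual source; the helper arguments are hypothetical):

```python
def prepare_dataset(dataset, packing, tokenize_fn, pack_fn):
    # Hypothetical paraphrase: a pretokenized dataset is recognised by
    # its columns, so tokenization is skipped, but packing still applies.
    if "input_ids" not in dataset.column_names:
        dataset = tokenize_fn(dataset)
    if packing:
        dataset = pack_fn(dataset)
    return dataset
```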

Collaborator

@kmehant kmehant left a comment

LGTM

Collaborator

@willmj willmj left a comment

LGTM

Collaborator

@Abhishek-TAMU Abhishek-TAMU left a comment

LGTM

@willmj willmj merged commit 1c9f773 into foundation-model-stack:main Mar 20, 2025
9 checks passed