Skip to content

Add <|endoftext|> to end of documents#98

Open
J38 wants to merge 13 commits intomainfrom
add-eos-token-2
Open

Add <|endoftext|> to end of documents#98
J38 wants to merge 13 commits intomainfrom
add-eos-token-2

Conversation

@J38
Copy link
Contributor

@J38 J38 commented Sep 25, 2021

I modified the grouping function to add the eos_token_id from the given tokenizer to the end of each non-empty document, and to appropriately update the "attention_masks" field of each document with an extra 1.

There is also a data_preprocessing test that checks that the first two batches of 1000 docs from wikitext2 train processed by group() match expected results, and some other basic tests of correctness.

As I was saying in the Propulsion meeting, I am not sure this is the best approach to this, but I am submitting this pull request in compliance with the requested implementation. I think there is merit to actually just attaching the tokenizer's eos_token to the end of the strings before the grouping and tokenization occur (such as when the detokenization step happens).

For instance, right now I am just adding in 1 to the "attention_masks" for each document. It seems like it would be better if how that is set is handled properly by the tokenizer rather than manually and this could lead to issues down the road. If you just modify the string, then the same downstream processes will be applied and whatever is supposed to go in "attention_masks" would be set by the tokenizer rather than our code.

Then again maybe this is no big deal and we always want that to be all 1's or the logic to update properly won't be a big deal and can go in group().

@J38 J38 requested a review from siddk October 12, 2021 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant