About the WikiTextDataset in run_trainer.py #6
xszheng2020 started this conversation in General
Replies: 1 comment
You can play around as you please. It's meant to be an understandable starting point for large-scale training, so lots of "details" are stripped down.
Hi, @cloneofsimo
Thanks for your great work.
It seems WikiTextDataset processes the data unconventionally: it does not concatenate all the texts from the dataset and then split them into chunks of block_size.
Should we prepare the dataset in a way similar to the following code?
https://github.com/huggingface/transformers/blob/v4.23.1/examples/pytorch/language-modeling/run_clm_no_trainer.py#L418C1-L432C22
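For reference, the "concatenate then chunk" preprocessing in the linked run_clm_no_trainer.py can be sketched roughly as below. The function name group_texts and the block_size variable follow the HF example; the token lists in the usage example are illustrative stand-ins for tokenizer output, not real tokenized WikiText.

```python
from itertools import chain

block_size = 4  # tiny value for illustration; the HF example uses e.g. 1024

def group_texts(examples):
    # Concatenate every tokenized text into one long token stream.
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the trailing remainder so every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM, labels are a copy of the inputs (shifting happens in the model).
    result["labels"] = result["input_ids"].copy()
    return result

# Two "documents" of unequal length become fixed-size blocks:
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8]]}
chunks = group_texts(batch)
print(chunks["input_ids"])  # → [[1, 2, 3, 4], [5, 6, 7, 8]]
```

In the HF script this function is applied with `dataset.map(group_texts, batched=True)`, so chunks can span document boundaries and no padding is needed.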