About the WikiTextDataset in run_trainer.py #6
xszheng2020 started this conversation in General
Replies: 1 comment
You can play around as you please. It's meant to be an understandable starting point for large-scale training, so lots of "details" are stripped down.
Hi, @cloneofsimo
Thanks for your great work.
It seems WikiTextDataset processes the data unconventionally: it does not concatenate all the texts from the dataset and then split them into chunks of block_size.
Should we prepare the dataset in a way similar to the following code?
https://github.com/huggingface/transformers/blob/v4.23.1/examples/pytorch/language-modeling/run_clm_no_trainer.py#L418C1-L432C22
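For reference, the "concatenate then chunk" preprocessing in the linked run_clm_no_trainer.py can be sketched roughly as below. The function name group_texts and the block_size variable follow the HF example; the token lists in the usage example are illustrative stand-ins for tokenizer output, not real tokenized WikiText.

```python
from itertools import chain

block_size = 4  # tiny value for illustration; the HF example uses e.g. 1024

def group_texts(examples):
    # Concatenate every tokenized text into one long token stream.
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the trailing remainder so every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM, labels are a copy of the inputs (shifting happens in the model).
    result["labels"] = result["input_ids"].copy()
    return result

# Two "documents" of unequal length become fixed-size blocks:
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8]]}
chunks = group_texts(batch)
print(chunks["input_ids"])  # → [[1, 2, 3, 4], [5, 6, 7, 8]]
```

In the HF script this function is applied with `dataset.map(group_texts, batched=True)`, so chunks can span document boundaries and no padding is needed.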