Dataset interface #377
base: main
Great to carve this out! I'm wondering how to stage these changes. Switching to the new binary format is a breaking change, and we'd need to reprocess all currently used training data. This needs to be properly timed and announced. Do we have backwards compatibility for already processed data?
I'm hoping to have the new format ready this week and make a big announcement. I'm not currently planning backward compatibility (time issue), but could if that's a necessity.
Can we convert existing data to the new format? I could work on a simple converter tool: old binary in, new binary out.
I don't really think it's worth it; we might as well just redo the preparation. To help with the transition, I'm noticing that the intermediate memmap dataset I'm making in #378 will essentially support the old binary format with the updated code, so I could just keep it for a while as a backward-compatibility backup. (Except for vision datasets, of course.)
tscholak
left a comment
LGTM!
✨ Description
Part of the data rework:
- Sample and Batch constructs (placeholders for now)
- […] GPTData, since it's rarely used and not by the data itself. Instead, use separate tokenizers where needed (Fim, Preparator (already present), lm eval).
- […] Sample and Batch both use torch tensors.)

Note: since this is part of a bigger set of changes, it contains changes that don't immediately make sense but will be useful later, as well as messy temporary solutions. (See #376 for more info on where this is going.)
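
For illustration only, here is a minimal sketch of what placeholder Sample and Batch constructs built on torch tensors might look like. The field name `token_ids` and the `from_samples` helper are hypothetical and not taken from this PR; the sketch also assumes all samples have the same length, so no padding is involved.

```python
from dataclasses import dataclass

import torch


@dataclass
class Sample:
    """Hypothetical placeholder: a single tokenized sample."""

    token_ids: torch.Tensor  # 1-D integer tensor of token ids


@dataclass
class Batch:
    """Hypothetical placeholder: several samples stacked into one tensor."""

    token_ids: torch.Tensor  # 2-D tensor, shape (batch_size, sequence_length)

    @classmethod
    def from_samples(cls, samples: list[Sample]) -> "Batch":
        # Assumption: every sample has the same length, so a plain stack works.
        return cls(token_ids=torch.stack([s.token_ids for s in samples]))


samples = [
    Sample(torch.tensor([1, 2, 3])),
    Sample(torch.tensor([4, 5, 6])),
]
batch = Batch.from_samples(samples)
```

Keeping both constructs as thin wrappers over tensors would let more fields be added later without changing how a batch is assembled.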