Create dataset without additional copies of data (2.6 - Data sampling with a sliding window) #745
**Reply from @rasbt:**
Hi @labdmitriy, thanks for the suggestion! I should say that I prioritized code readability (and simplicity in terms of PyTorch commands) over pure efficiency, since many PyTorch newcomers read the book. I also think the data loader is lightweight enough (and runs in multiple processes) that the change wouldn't result in an observable speed-up during training. That being said, I actually really like your observation. It looks very clean and readable! If there's a new edition in 2-3 years, I'll make sure to revisit this. Thanks!
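(For context on the "multiple processes" point, here is a minimal, self-contained sketch of how the PyTorch `DataLoader`'s `num_workers` argument moves per-sample work into worker processes; the toy dataset, vocabulary size, and batch settings below are illustrative assumptions, not the book's exact setup.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the book's dataset: 1,000 windows of 256 token IDs
# (50,257 is the GPT-2 vocabulary size, used here only for illustration).
inputs = torch.randint(0, 50257, (1000, 256))
targets = torch.randint(0, 50257, (1000, 256))
dataset = TensorDataset(inputs, targets)

# With num_workers > 0, the DataLoader fetches samples in separate
# worker processes, so per-sample tensor work overlaps with training
# instead of blocking it.
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
```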
**Original post by @labdmitriy:**
Hi @rasbt,

The current implementation of `GPTDatasetV1` uses the `append` method of a Python list after converting each chunk from a list to a tensor. Because `torch.tensor()` always copies its data, as I understand it we use additional RAM/VRAM when constructing the `input_ids` and `target_ids` lists, as in the sketch below:
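(A condensed sketch of the pattern in question, paraphrased from the chapter; the variable names `txt`, `max_length`, and `stride` are assumptions and may not match the book's exact code.)

```python
import torch
from torch.utils.data import Dataset

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)

        # Slide a window over the token IDs; torch.tensor() copies
        # each chunk, so every window allocates new memory.
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
```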
What do you think: could we instead use the `unfold` method of a PyTorch tensor, which always returns a view of the original tensor, to get all the required data? Or does that implementation have any disadvantages? A sketch of the idea follows below.

Thank you.
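(A minimal sketch of what the `unfold`-based alternative could look like; this is an assumption about the intended approach, not a drop-in from the book.)

```python
import torch
from torch.utils.data import Dataset

class GPTDatasetV1Unfold(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        token_ids = torch.tensor(tokenizer.encode(txt))

        # unfold(dim, size, step) returns a *view* into token_ids,
        # so no per-window copies are made here. Targets are the
        # same windows shifted right by one token.
        self.input_ids = token_ids[:-1].unfold(0, max_length, stride)
        self.target_ids = token_ids[1:].unfold(0, max_length, stride)

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
```

One caveat worth noting (my assumption): since `unfold` returns views, the `DataLoader`'s default collate step will still copy when it stacks samples into a batch, but that copy happens one batch at a time rather than duplicating every window up front.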