-
Hi all! For the SMILES string part, I would like to use NLP techniques to learn it, so when I batch data together I need to do sequence padding across 'sentences' of different lengths. Say I have [[1, 2, 3], [4, 5, 6, 7]] (indices of each 'word' in the vocabulary); when I merge these two data points into a batch, every list should be padded to the maximum length, giving the matrix [[1, 2, 3, 0], [4, 5, 6, 7]], where 0 is the padding value. I don't think the PyG DataLoader can do this. On the other hand, if I batch with the plain PyTorch DataLoader and a custom collate_fn, I don't know how to replicate what the PyG DataLoader does for the graph part. So it would be convenient to add this sequence-padding functionality on top of the current PyG DataLoader, with all the graph batching staying the same, rather than writing everything from scratch. Can anyone help with this issue? Thank you!!!!
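In plain Python, the padding step described above can be sketched like this (a minimal sketch; `pad_batch` is a hypothetical helper name, not part of any library):

```python
def pad_batch(sequences, pad_value=0):
    """Pad a list of token-index lists to the maximum length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# The example from the question: two 'sentences' of different lengths.
batch = pad_batch([[1, 2, 3], [4, 5, 6, 7]])
print(batch)  # [[1, 2, 3, 0], [4, 5, 6, 7]]
```

In practice, `torch.nn.utils.rnn.pad_sequence` does the same thing on tensors and is what you would typically use inside a collate function.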
Replies: 1 comment
-
Thanks for the issue. I think you have two options here:

1. You can keep using the PyG `DataLoader` as is. For sentences, the PyG `DataLoader` should simply return a Python list of elements of varying size, on which you can apply `sequence_padding` afterwards.
2. You can write your own `collate_fn` for the `torch.utils.data.DataLoader` (similar to what we are doing in PyG as well, see here). In general, you can collate a list of `data` objects together by simply running `Batch.from_data_list`. In the `collate_fn`, you can then do the work of `sequence_padding` simultaneously.

I think this raises a valid point on how far we allow such things on top of the current PyG `DataLoader`. Currently, …