-
Hi all! For the SMILES string part, I would like to use NLP techniques to learn it, so when I batch data together I need to do sequence padding across 'sentences' of different lengths. Say I have [[1, 2, 3], [4, 5, 6, 7]] (indices of each 'word' in the vocabulary); when I merge these two data points into a batch, every list should be padded to the maximum length, giving the matrix [[1, 2, 3, 0], [4, 5, 6, 7]], where 0 is the padding value. I don't think the PyG DataLoader can do this. On the other hand, if I batch with the plain PyTorch DataLoader and a custom collate_fn, I don't know how to replicate what the PyG DataLoader does for the graph part. So it would be convenient to add this sequence-padding functionality on top of the current PyG DataLoader, with all the graph batching staying the same, rather than writing everything from scratch. Can anyone help with this issue? Thank you!!!!
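In plain Python, the padding step described above can be sketched like this (a minimal sketch; `pad_batch` is a hypothetical helper name, not part of any library):

```python
def pad_batch(sequences, pad_value=0):
    """Pad a list of token-index lists to the maximum length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# The example from the question: two 'sentences' of different lengths.
batch = pad_batch([[1, 2, 3], [4, 5, 6, 7]])
print(batch)  # [[1, 2, 3, 0], [4, 5, 6, 7]]
```

In practice, `torch.nn.utils.rnn.pad_sequence` does the same thing on tensors and is what you would typically use inside a collate function.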
Replies: 1 comment
-
Thanks for the issue. I think you have two options here:

1. You can keep using the PyG `DataLoader` as is. For sentences, the PyG `DataLoader` should simply return a Python list of elements of varying size, on which you can apply `sequence_padding` afterwards.
2. You can write your own `collate_fn` for the `torch.utils.data.DataLoader` (similar to what we are doing in PyG as well, see here). In general, you can collate a list of `data` objects together by simply running `Batch.from_data_list`. In the `collate_fn`, you can then do the work of `sequence_padding` simultaneously.

I think this raises a valid point on how far we allow such things on top of the current PyG `DataLoader`. Currently, …