-
Hello, I have a question regarding iterable/generator datasets. I receive data on the fly, as a constant stream of small graphs, and I am not sure how to do this "the correct" way. Planned pipeline:
**Idea 1: PyTorch `IterableDataset`**

My current approach is to prepare the keys for `torch_geometric.data.Data` in a PyTorch `IterableDataset`:

```python
from torch.utils.data import IterableDataset, DataLoader  # standard versions, not the PyTorch Geometric loader
from itertools import islice  # for testing a limited number of streamed graphs
import torch_geometric.data as tg_data

class TraceStreamDataset(IterableDataset):
    # ...
    def __iter__(self):
        # connects to the gRPC stream
        # on receiving a new message:
        # build dict(x=..., edge_index=...)
        yield dict_with_graph_data_keys
```

Then I use the standard PyTorch loader:

```python
iterable_dataset = TraceStreamDataset(...)
loader = DataLoader(iterable_dataset, batch_size=None)

graph_data_list = []  # for saving received graphs (test)
for graph_dict in islice(loader, 1):  # islice ends iteration after one test graph (stop receiving from the stream while testing)
    graph = tg_data.Data(x=graph_dict['x'], edge_index=graph_dict['edge_index'])
    graph_data_list.append(graph)  # not batched yet
    # if I want to batch graphs, I would have to build batched lists here on the fly in the loop
```

This works so far. But my goal is to create a simple iterable graph loader that I can use in my training loop. The standard PyTorch iterable dataset/loader combination does not allow yielding `torch_geometric.data.Data` objects directly:

```
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'torch_geometric.data.data.Data'>
```
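A minimal sketch of one possible workaround for this `default_collate` error, assuming `TraceStreamDataset` were changed to yield `Data` objects directly (this is not from the original post): keep the standard PyTorch loader but replace its collate function with PyG's `Batch.from_data_list`.

```python
# Sketch, under the assumption that iterable_dataset now yields
# torch_geometric.data.Data objects instead of dicts.
from torch.utils.data import DataLoader
from torch_geometric.data import Batch

# Batch.from_data_list receives the list of sampled Data objects and
# merges them into a single Batch, so default_collate is never called.
loader = DataLoader(iterable_dataset, batch_size=2,
                    collate_fn=Batch.from_data_list)
```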
**Idea 2: `torch_geometric.loader.DataLoader`**

In part 2 of the FAQ ("Do I really need to use these dataset interfaces?") I read about `torch_geometric.loader.DataLoader` for synthetic data created on the fly. So I also tried using it to create the mini-batches. But it only accepts a dataset with a length (which my streaming dataset of course does not have):

```python
from torch_geometric.loader import DataLoader as tg_dataloader

loader = DataLoader(iterable_dataset, batch_size=None)  # standard loader from Idea 1

def generate_graph_data(loader):
    for graph_dict in loader:
        graph = tg_data.Data(x=graph_dict['x'], edge_index=graph_dict['edge_index'])
        yield graph

graph_loader = tg_dataloader(dataset=generate_graph_data(loader), batch_size=2)

batched_graph_list = []
for graph_batch in graph_loader:
    batched_graph_list.append(graph_batch)
```

This fails with:

```
TypeError: object of type 'generator' has no len()
```
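For reference, the `len()` error comes from the bare generator: `torch.utils.data.DataLoader` (which the PyG loader builds on) only skips its length-based sampler when the dataset is an `IterableDataset` instance, and a generator is not one. A minimal sketch of a wrapper that would satisfy the loader, where `GeneratorWrapper` is a hypothetical helper name, not a PyG class:

```python
# Hypothetical wrapper: turns a generator function into an IterableDataset,
# so the DataLoader never asks for len(dataset).
from torch.utils.data import IterableDataset

class GeneratorWrapper(IterableDataset):
    def __init__(self, gen_fn, *args, **kwargs):
        self.gen_fn, self.args, self.kwargs = gen_fn, args, kwargs

    def __iter__(self):
        # a fresh generator per iteration
        return self.gen_fn(*self.args, **self.kwargs)

graph_loader = tg_dataloader(dataset=GeneratorWrapper(generate_graph_data, loader),
                             batch_size=2)
```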
**Idea 3:** One more idea I have:

**Question**

So my question is: how do I do this correctly? Perhaps there is another easy way I did not even see. Thanks in advance for your help. I am really impressed by this whole ecosystem you have built. :)
-
I think following the `IterableDataset` interface is the way to go, as this fits perfectly into your desired needs. I'm not sure, though, why you say that the `IterableDataset`/`DataLoader` approach does not work. IMO, this should work just fine given that you are utilizing the `torch_geometric.loader.DataLoader` rather than the `torch.utils.data.DataLoader`. Is this not the case? This also eliminates the need to manually batch `Data` objects after loading them. Idea 2 should work as well, but it feels like you are re-implementing a lot of functionality that is already present in `IterableDataset`.
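A minimal end-to-end sketch of this suggestion: the dataset yields `Data` objects directly, and `torch_geometric.loader.DataLoader` collates them into `Batch` objects. Here `fake_stream` is a purely hypothetical stand-in for the real gRPC stream, assumed to deliver messages with `x` and `edge_index` entries.

```python
import torch
from torch.utils.data import IterableDataset
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader  # PyG loader, not torch.utils.data

class TraceStreamDataset(IterableDataset):
    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        for msg in self.stream:  # e.g. messages arriving from the gRPC stream
            # yield Data objects directly; the PyG loader knows how to collate them
            yield Data(x=msg['x'], edge_index=msg['edge_index'])

def fake_stream(n=4):  # hypothetical stand-in for the real stream
    for _ in range(n):
        yield {'x': torch.randn(3, 8),
               'edge_index': torch.tensor([[0, 1, 2], [1, 2, 0]])}

loader = DataLoader(TraceStreamDataset(fake_stream()), batch_size=2)
for batch in loader:  # each `batch` is a torch_geometric.data.Batch of 2 graphs
    print(batch)
```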