Hello, all,

I want to do graph-level classification using GCNConv, but the problem is that each graph in the dataset is too large to fit on one GPU. The average number of nodes per graph is several million, and there may be hundreds of graphs. A single GPU does not have enough memory for the dataset, so I want to use multiple GPUs to train the model.
I have read the related tutorials, such as Distributed Training, and the example code under example/multi_gpu/. I found that, besides model parallelism, there are two ways to do multi-GPU training with data parallelism:
1. For node- or edge-level tasks, using NeighborLoader or LinkNeighborLoader to create mini-batches within one large graph;
2. For graph-level tasks, splitting each batch of graphs into mini-batches, since the individual graphs are small (roughly as in the sketch below).
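For completeness, this is a minimal sketch of how I understand option 2 (`MyGNN` and `dataset` are placeholder names I made up; it assumes each rank runs this in its own process with `rank`/`world_size` set up as usual for DDP):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel
from torch_geometric.loader import DataLoader

dist.init_process_group('nccl', rank=rank, world_size=world_size)
model = DistributedDataParallel(MyGNN().to(rank), device_ids=[rank])
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Each rank trains on a disjoint shard of the (small) graphs.
shard = dataset[torch.arange(rank, len(dataset), world_size)]
loader = DataLoader(shard, batch_size=32, shuffle=True)

for batch in loader:
    batch = batch.to(rank)
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index, batch.batch)  # one row per graph
    loss = F.cross_entropy(out, batch.y)
    loss.backward()  # DDP averages the gradients across ranks here
    optimizer.step()
```

This only works because a whole mini-batch of graphs fits on a single GPU, which is exactly what does not hold in my case.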
But neither of those methods suits my problem. My idea is to use NeighborLoader to process every graph in the dataset separately and to replace GCNConv with SAGEConv:
```python
import torch
from torch_geometric.loader import NeighborLoader

for graph in dataset:
    # Split the seed nodes of this graph evenly across the ranks.
    node_idx = torch.arange(graph.num_nodes)
    rank_idx = node_idx.tensor_split(world_size)[rank]
    single_graph_loader = NeighborLoader(
        graph,
        num_neighbors=[-1] * 3,  # sample all neighbors, 3 hops deep
        input_nodes=rank_idx,
        batch_size=128,
    )
    for batch in single_graph_loader:
        ...
```
I am not sure whether this code is feasible, so please correct me if I am wrong.
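On the model side, replacing GCNConv with SAGEConv is something I picture roughly like this (only a sketch; the layer sizes are placeholders, and only the seed-node rows are returned because NeighborLoader puts the seed nodes first in each mini-batch):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class SAGE(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, hidden_dim)
        self.conv3 = SAGEConv(hidden_dim, hidden_dim)

    def forward(self, x, edge_index, batch_size):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = self.conv3(x, edge_index)
        # NeighborLoader places the seed nodes at the front of the batch,
        # so only the first `batch_size` rows are the embeddings we want.
        return x[:batch_size]
```

Each mini-batch from the loader above would then be consumed as `model(batch.x, batch.edge_index, batch.batch_size)`, i.e. one embedding per seed node on that rank.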
Another problem I am encountering is how to calculate the loss. Traditional DDP computes the loss on each mini-batch and only synchronizes the gradients. In my case, the loss can only be computed on the whole graph, not on a mini-batch: each rank can only produce a "subgraph" feature by applying a READOUT over the node features it holds. If I reduce all of these "subgraph" features into the graph-level feature on rank 0, will that GPU run out of memory?
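To make the readout step concrete, this is roughly what I mean (only a sketch with a sum readout; `model`, `classifier`, `hidden_dim`, and `y` are placeholder names for illustration):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

# Per-rank partial readout: sum the seed-node embeddings this rank produced
# for the current graph, accumulated over its NeighborLoader mini-batches.
partial = torch.zeros(hidden_dim, device=f'cuda:{rank}')
for batch in single_graph_loader:
    batch = batch.to(rank)
    emb = model(batch.x, batch.edge_index, batch.batch_size)
    partial = partial + emb.sum(dim=0)

# Combine the per-rank partial sums into one graph-level feature vector;
# only `hidden_dim` floats per graph are communicated here.
# NOTE: plain dist.all_reduce is not autograd-aware, so for gradients to flow
# back through the readout an autograd-enabled collective (e.g.
# torch.distributed.nn.functional.all_reduce) would likely be needed instead.
dist.all_reduce(partial, op=dist.ReduceOp.SUM)

logits = classifier(partial.unsqueeze(0))  # graph-level prediction
loss = F.cross_entropy(logits, y.view(1))
```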
The new Distributed Training in PyG tutorial looks like a cure, but it cannot run on GPUs for now. Does anyone have any suggestions?