Replies: 2 comments 2 replies
-
If you are hitting CUDA OOM with 48GB, there is definitely something suspicious going on. Are you sure you are not holding the whole dataset in GPU memory? Are you using …?
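As a rough illustration of that point, here is a minimal sketch (not a drop-in solution) of mini-batch neighbor sampling where the full graph stays in CPU memory and only the sampled subgraphs are moved to the GPU. The toy graph, the `GraphSAGE` model, the fan-out, the batch size, and the `train_mask` are all placeholder assumptions:

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader
from torch_geometric.nn import GraphSAGE

# Toy stand-in for the real graph; in practice `data` is your own Data object,
# kept in CPU memory (do NOT call data.to('cuda') on the full graph).
num_nodes = 10_000
data = Data(
    x=torch.randn(num_nodes, 32),
    edge_index=torch.randint(0, num_nodes, (2, 200_000)),
    y=torch.randn(num_nodes),
    train_mask=torch.rand(num_nodes) < 0.8,   # assumed training mask
)

device = torch.device('cuda')
model = GraphSAGE(data.num_features, hidden_channels=64,
                  num_layers=2, out_channels=1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],        # fan-out per layer (placeholder values)
    batch_size=1024,               # seed nodes per mini-batch
    input_nodes=data.train_mask,
    shuffle=True,
)

model.train()
for batch in loader:
    batch = batch.to(device)       # only the sampled subgraph goes to the GPU
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)
    # Only the first `batch_size` nodes of each sampled batch are seed nodes.
    loss = F.mse_loss(out[:batch.batch_size].squeeze(-1),
                      batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()
```

With this pattern, GPU memory scales with the batch size and fan-out rather than with the size of the full graph.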
-
Thank you so much for your reply. Our dataset has over 10K large graphs, and each graph has over 200K nodes. I believe I do hold the whole dataset in GPU memory, and I did not use …. I am not familiar with …. Is it correct? Is …?
Thank you.
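For a dataset of many separate graphs like this, one option is to keep the whole dataset on the CPU (or on disk) and let a `DataLoader` move one graph at a time to the GPU. A hedged sketch, with a toy dataset and a placeholder two-layer GCN standing in for the real model:

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GCNConv

# Toy stand-in for a dataset of many graphs; in practice this would be your
# own Dataset that keeps graphs on disk or in CPU memory.
dataset = [
    Data(x=torch.randn(500, 16),
         edge_index=torch.randint(0, 500, (2, 2_000)),
         y=torch.randn(500))
    for _ in range(100)
]

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(16, 64)
        self.conv2 = GCNConv(64, 1)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

device = torch.device('cuda')
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loader = DataLoader(dataset, batch_size=1, shuffle=True)  # one graph per step

model.train()
for batch in loader:
    batch = batch.to(device)   # only this graph lives on the GPU
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index).squeeze(-1)
    loss = F.mse_loss(out, batch.y)
    loss.backward()
    optimizer.step()
    # `batch` goes out of scope here, freeing its GPU memory before the next graph.
```

If a single 200K-node graph still does not fit, neighbor sampling within each graph (as in the sketch above) can reduce memory further.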
-
Hi,
Thank you so much for this helpful package. May I ask how to use multiple GPUs to train one GNN model with PyG? My task is node regression on a large homogeneous undirected graph.
The batch size is 1 and my GPU is an A6000 (48 GB). Training shows CUDA Out of Memory. So I followed the multi-GPU training docs and examples, but it still shows CUDA Out of Memory. The code works well on an A100 (80 GB) but fails on two A6000s. May I ask how to use multiple GPUs to train one GNN model for node regression on a large homogeneous undirected graph?
As I use a Slurm cluster to train models, how should I set up the sbatch file? The example has only two lines; do I need to modify it for my cluster?
Also, do I need to modify the example code?
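For the multi-GPU part, here is a hedged sketch that roughly follows the structure of PyG's multi-GPU examples: one process per GPU with DistributedDataParallel, and the training seed nodes split across ranks. The toy graph, the `GraphSAGE` model, the port, and all hyperparameters are placeholder assumptions; under Slurm you would request the GPUs in the sbatch file and launch this script once with srun.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader
from torch_geometric.nn import GraphSAGE


def run(rank, world_size, data):
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '12355')   # placeholder port
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    # Split the training seed nodes across the GPUs.
    train_idx = data.train_mask.nonzero(as_tuple=False).view(-1)
    train_idx = train_idx.split(train_idx.size(0) // world_size)[rank]

    loader = NeighborLoader(data, num_neighbors=[15, 10], batch_size=1024,
                            input_nodes=train_idx, shuffle=True)

    torch.cuda.set_device(rank)
    model = GraphSAGE(data.num_features, hidden_channels=64,
                      num_layers=2, out_channels=1).to(rank)
    model = DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    for epoch in range(10):
        model.train()
        for batch in loader:
            batch = batch.to(rank)          # only the sampled subgraph on GPU
            optimizer.zero_grad()
            out = model(batch.x, batch.edge_index)[:batch.batch_size]
            loss = F.mse_loss(out.squeeze(-1), batch.y[:batch.batch_size])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    # Toy graph; replace with your real Data object (kept in CPU memory).
    num_nodes = 10_000
    data = Data(x=torch.randn(num_nodes, 32),
                edge_index=torch.randint(0, num_nodes, (2, 200_000)),
                y=torch.randn(num_nodes),
                train_mask=torch.rand(num_nodes) < 0.8)

    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size, data), nprocs=world_size, join=True)
```

Note that DDP replicates the model on each GPU and splits the training nodes across them; it does not pool the memory of the two cards, so each mini-batch still has to fit on a single 48 GB GPU. That is why full-graph training that fits on an 80 GB A100 can still OOM on two A6000s.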
Thank you so much!