Replies: 5 comments 7 replies
-
If your model is too large to fit on a single GPU, then FSDP (reference1, reference2) should help.
Are you not using subgraph sampling because you want to make graph-level predictions?
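For reference, by FSDP I mean the plain torch.distributed.fsdp wrapper around a PyG model. A minimal sketch, assuming one process per GPU and a made-up two-layer GCN (the layer sizes are just placeholders):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch_geometric.nn import GCNConv


class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)


def setup_fsdp_model(rank: int) -> FSDP:
    # Assumes dist.init_process_group('nccl', ...) has already been called,
    # one process per GPU. FSDP then shards parameters, gradients and
    # optimizer state across ranks instead of replicating them as DDP does.
    torch.cuda.set_device(rank)
    model = GCN(in_channels=128, hidden_channels=256, out_channels=10).to(rank)
    return FSDP(model)
```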
-
Yes, I think so. There is basically all-to-all message passing. But that means I should be able to use FSDP with PyTorch Geometric, or are there known support issues or anything? I was not able to find any examples. Thank you very much.
-
From https://discuss.pytorch.org/t/fsdp-issue-with-pytorch-geometric/174628: when trying to adapt the distributed_batching example (https://github.com/pyg-team/pytorch_geometric/blob/master/examples/multi_gpu/distributed_batching.py) to FSDP, any idea what is going wrong?
Source code:
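For readers without the snippet: as an illustration only, and assuming the usual structure of that multi-GPU example (mp.spawn, a DistributedSampler, a DDP-wrapped model), the adaptation amounts to swapping DistributedDataParallel for FSDP inside the spawned worker. The model, dataset and hyperparameters below are placeholders, not the code from the post:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.data.distributed import DistributedSampler
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GraphConv, global_mean_pool


class Net(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        self.conv1 = GraphConv(in_channels, hidden_channels)
        self.conv2 = GraphConv(hidden_channels, hidden_channels)
        self.lin = torch.nn.Linear(hidden_channels, num_classes)

    def forward(self, x, edge_index, batch):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        x = global_mean_pool(x, batch)  # graph-level prediction
        return self.lin(x)


def run(rank, world_size, dataset):
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '12355')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Each rank sees a different shard of the graphs.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = Net(dataset.num_features, 64, dataset.num_classes).to(rank)
    model = FSDP(model)  # previously: DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    for epoch in range(1, 11):
        sampler.set_epoch(epoch)
        for data in loader:
            data = data.to(rank)
            optimizer.zero_grad()
            out = model(data.x, data.edge_index, data.batch)
            loss = F.cross_entropy(out, data.y)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    dataset = TUDataset(root='data/TUDataset', name='PROTEINS')
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size, dataset), nprocs=world_size, join=True)
```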
-
pytorch/pytorch#95791 (comment) seems similar.
-
OK, now with PyTorch 2.0 and PyG 2.3: with one GPU, FSDP switches sharding off and then it works (distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use ...). But with GPU count > 1 it fails: Traceback (most recent call last): ... Process 0 terminated with the following error: ... The failing routine has nothing to do with pytorch_geometric, though?
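If it helps narrow things down: that warning is FSDP falling back to no sharding because the world size is 1. One way to check whether sharding itself is the problem on multiple GPUs is to force the same non-sharded behaviour explicitly via the sharding_strategy argument. A minimal sketch, assuming the process group is already initialized and using a placeholder module:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Placeholder module; substitute your GNN. Assumes dist.init_process_group()
# has already been called in this process.
model = torch.nn.Linear(16, 4).cuda()

# NO_SHARD keeps a full copy of the parameters on every rank (DDP-like),
# which is the behaviour FSDP falls back to when the world size is 1.
# If multi-GPU training works with NO_SHARD but not with the default
# FULL_SHARD, the problem is in the sharding path rather than in PyG.
model = FSDP(model, sharding_strategy=ShardingStrategy.NO_SHARD)
# model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)  # default
```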
-
Hi, I am new to PyTorch Geometric, so sorry if this is a stupid question: I need to reduce the memory footprint as much as possible and was wondering whether FSDP or ddp_sharded is an option for distributed training. I always have the full graph in a sample, i.e. no subgraph sampling. Or is only ddp_spawn supported? Can I use ddp-spawn-sharded? Thank you!
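For context on the strategy names: with PyTorch Lightning these options are selected through the Trainer's strategy argument. A rough sketch (the exact strategy strings depend on the Lightning version; ddp_sharded was the older FairScale-based option, and "fsdp" is the native one in Lightning 2.x):

```python
import pytorch_lightning as pl

# Example Trainer configuration; strategy strings vary across Lightning
# versions, so treat these as illustrations rather than a support statement.
trainer = pl.Trainer(
    accelerator='gpu',
    devices=2,
    strategy='ddp_spawn',   # plain data parallelism, one spawned process per GPU
    # strategy='fsdp',      # shards params/grads/optimizer state (Lightning >= 2.0)
)
```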