Training with Graph Mini-Batches #6355
Hello, I would like to perform graph classification, where each graph is of the following format: As such, my whole dataset has the following format, where I have n=108221 such graphs: Given the size of my dataset, the code I previously used for smaller tasks no longer works on my single RTX 3070 GPU, as I run out of memory. I would therefore like to train with PyTorch mini-batches, but wanted to ask the community for opinions on how to achieve this using best practices. Before implementing mini-batches, my working code was the following (although on bigger datasets I do not have enough memory to run it):

```python
%%time
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GATConvModel(in_features=graph_batch.x.shape[1], hidden_features=8, num_classes=len(classes), pool_type="mean").to(device)
data = graph_batch.to(device)
train_percentage = 0.8
train_count = ceil(train_percentage * graph_batch.num_graphs)
test_count = graph_batch.num_graphs - train_count
data.train_mask = torch.tensor([True] * train_count + [False] * test_count)
# Randomly shuffle the train_mask
data.train_mask = data.train_mask[torch.randperm(data.train_mask.size(0))]
data.test_mask = ~data.train_mask
data.y = torch.tensor(y_dataset).type(torch.LongTensor)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)
model.train()
losses = []
for epoch in range(1000):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index, data.batch)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
    losses.append(loss.item())
    loss.backward()
    optimizer.step()
# Plot the loss over the epochs
plt.plot(losses)
plt.show()
```

How can I fix the following code to achieve what I am looking for?

```python
%%time
device = torch.device('cuda' if torch.cuda.is_available() and args.use_cuda else 'cpu')
model = GATConvModel(in_features=graph_batch.x.shape[1],
hidden_features=8, num_classes=len(classes), pool_type="mean").to(device)
train_percentage = 0.8
train_count = ceil(train_percentage * graph_batch.num_graphs)
test_count = graph_batch.num_graphs - train_count
graph_batch.train_mask = torch.tensor([True] * train_count + [False] * test_count)
# Randomly shuffle the train_mask
graph_batch.train_mask = graph_batch.train_mask[torch.randperm(graph_batch.train_mask.size(0))]
graph_batch.test_mask = ~graph_batch.train_mask
graph_batch.y = torch.tensor(y_dataset).type(torch.LongTensor)
graph_batch.to(device)
dataloader = DataLoader(graph_batch, batch_size=args.batch_size)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)
model.train()
losses = []
for epoch in range(1000):
    optimizer.zero_grad()
    for data in dataloader:
        out = model(data.x, data.edge_index, data.batch)
        # What should this line look like?
        loss = F.nll_loss(out[graph_batch.train_mask], graph_batch.y[graph_batch.train_mask])
        if epoch % 10 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
        losses.append(loss.item())
        loss.backward()
        optimizer.step()
# Plot the loss over the epochs
plt.plot(losses)
plt.show()
```

Thanks in advance for your inputs :)
Firstly, don't combine your list of `Data` objects into a single `Data` object. `DataLoader` can work with a list of `Data` objects directly, and you can use these data loaders in the code above.
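
As a minimal sketch of that suggestion (assuming `dataset` is the original Python list of `Data` objects, each with `data.y` set, that was previously merged into `graph_batch`; the batch size of 64 is arbitrary):

```python
import torch
from torch_geometric.loader import DataLoader

# Assumption: `dataset` is the list of per-graph `Data` objects.
# Split the list itself into train/test subsets instead of masking
# one giant combined graph.
perm = torch.randperm(len(dataset)).tolist()
train_count = int(0.8 * len(dataset))
train_dataset = [dataset[i] for i in perm[:train_count]]
test_dataset = [dataset[i] for i in perm[train_count:]]

# DataLoader collates each mini-batch of graphs into one disconnected
# graph and builds the `batch` assignment vector used by the pooling layer.
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64)
```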
Refer to the graph classification examples in PyG for more information: https://github.com/pyg-team/pytorch_geometric/blob/master/examples/proteins_topk_pool.py, https://github.com/pyg-team/pytorch_geometric/blob/master/examples/mem_pool.py, etc.
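
The training loop then computes the loss per mini-batch rather than indexing a global tensor with a mask. A rough sketch in the style of those examples, reusing the asker's `GATConvModel`, `device`, and `classes` (all assumed to be in scope) together with the loaders from the snippet above:

```python
import torch.nn.functional as F

model = GATConvModel(in_features=train_dataset[0].num_node_features,
                     hidden_features=8,
                     num_classes=len(classes),
                     pool_type="mean").to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)

model.train()
for epoch in range(100):
    total_loss = 0.0
    for data in train_loader:           # one mini-batch of graphs at a time
        data = data.to(device)          # only the current batch is moved to the GPU
        optimizer.zero_grad()           # reset gradients per batch, not per epoch
        out = model(data.x, data.edge_index, data.batch)
        loss = F.nll_loss(out, data.y)  # `out` and `data.y` are aligned per graph,
                                        # so no train_mask indexing is needed
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * data.num_graphs
    print(f"Epoch {epoch}, Loss: {total_loss / len(train_dataset):.4f}")
```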