Multiple datasets and batch setting for graph-based deep learning model #8810
Replies: 2 comments · 3 replies
- Sorry if I am misunderstanding: do you want to train 5 individual models and then use their ensemble to make the final prediction? Or do you want to share the backbone across these different models?
- I closed this discussion since it looks too dependent on others' opinions, rather than being a discussion about PyG-related warnings or errors.
Original post:

Dear PyG community,

Greetings, and thank you, as always, for your effort in updating the package. Although I asked about this briefly here, I'm still unsure about my logic, and it would be helpful if anyone could give feedback or direction. Before starting, I apologize for the messy sentences and figures :(
My objective is link prediction (i.e., edge classification) using a backbone network and multiple datasets that contain only node features and binary edge labels. For reproducibility, I made example datasets; note that `edge_index` in these datasets is not used for message passing. I want to build one end-to-end model that learns the meaningful patterns in each dataset and predicts links when I get new node features.
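A minimal sketch of what such example datasets could look like; all sizes and names here are hypothetical, and `edge_index` stores the labeled pairs rather than message-passing edges, matching the note above:

```python
import torch
from torch_geometric.data import Data

def make_example_dataset(num_nodes=100, num_feats=16, num_pairs=500, seed=0):
    # Toy dataset: node features plus binary-labeled node pairs.
    # Here `edge_index` holds the pairs to classify; it is NOT a
    # message-passing graph.
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(num_nodes, num_feats, generator=g)
    edge_index = torch.randint(0, num_nodes, (2, num_pairs), generator=g)
    edge_label = torch.randint(0, 2, (num_pairs,), generator=g).float()
    return Data(x=x, edge_index=edge_index, edge_label=edge_label)

datasets = [make_example_dataset(seed=i) for i in range(5)]  # e.g., 5 datasets
```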

Fortunately, I got this working for a single dataset with 5-fold cross-validation. Then I simply sketched a model for the multiple datasets as below; all classification models have the same GAT structure.

However, I have some questions about the logic.

Q1. If I should allocate one GAT model per dataset, how should I load the datasets under 5-fold CV?
In a previous trial, I tried `DataLoader` or `Batch.from_data_list()` (both returned the same result) to get the multiple datasets. However, it is difficult to combine `DataLoader` with 5-fold CV, since I already use `LinkNeighborLoader` for edge classification and the two would conflict. For a single dataset, I ran code along the lines of the sketch below:
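A minimal sketch of such a per-fold setup, assuming scikit-learn's `KFold` over the labeled pairs and a hypothetical message-passing graph built with `knn_graph` (needed because `edge_index` in the toy data is not a message-passing graph):

```python
import numpy as np
from sklearn.model_selection import KFold
from torch_geometric.data import Data
from torch_geometric.loader import LinkNeighborLoader
from torch_geometric.nn import knn_graph

dataset = datasets[0]                        # one toy dataset from above
mp_edge_index = knn_graph(dataset.x, k=10)   # hypothetical message-passing graph
graph = Data(x=dataset.x, edge_index=mp_edge_index)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
num_pairs = dataset.edge_index.size(1)

for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(num_pairs))):
    # Sample subgraphs around the training pairs of this fold.
    train_loader = LinkNeighborLoader(
        graph,
        num_neighbors=[10, 10],
        edge_label_index=dataset.edge_index[:, train_idx],
        edge_label=dataset.edge_label[train_idx],
        batch_size=128,
        shuffle=True,
    )
    # ... train on train_loader; evaluate on the held-out val_idx pairs ...
```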
Q2. Assume that I successfully load the multiple datasets with 5-fold CV. Then, how should I declare the model per dataset?
I considered PyTorch utilities such as `nn.ModuleList` or a `for` loop, as in similar cases 1, 2, or 3, but it looks hard because of GPU memory. Moreover, I'm not sure it is correct to instantiate the model inside the training loop as in the code above; usually, the model is declared outside of the training process (I guess).
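A minimal sketch of declaring one model per dataset once, outside the fold and epoch loops, using `nn.ModuleList`; the `GAT` class here is a hypothetical stand-in for the actual classification model, and `datasets` is the toy list from above:

```python
import torch
from torch import nn
from torch_geometric.nn import GATConv

class GAT(nn.Module):
    # Minimal 2-layer GAT encoder with a dot-product link decoder.
    def __init__(self, in_dim, hidden_dim=64, heads=4):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden_dim, heads=heads)
        self.conv2 = GATConv(hidden_dim * heads, hidden_dim, heads=1)

    def forward(self, x, edge_index, edge_label_index):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index)
        src, dst = edge_label_index
        return (h[src] * h[dst]).sum(dim=-1)   # one logit per candidate pair

# One model per dataset, declared once; nn.ModuleList registers every
# submodule's parameters so a single optimizer can see them all.
models = nn.ModuleList([GAT(in_dim=16) for _ in datasets])
optimizer = torch.optim.Adam(models.parameters(), lr=1e-3)
```

With this layout, only the model for the currently active dataset has to be moved to the GPU at any one time, which may ease the memory concern.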
Q3. I guess that there should be a lighter version for my task, since all datasets share the same backbone network, as in the figure below. If it is possible, can you give me any example cases?

I think that this version is much harder to construct but more similar to my task; a sketch of what I imagine is below. (If this is impossible, please let me know.)
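A minimal sketch of this lighter variant, assuming one shared GAT backbone and a small per-dataset link head; all names and sizes are hypothetical:

```python
import torch
from torch import nn
from torch_geometric.nn import GATConv

class SharedBackbone(nn.Module):
    # GAT encoder shared across all datasets.
    def __init__(self, in_dim, hidden_dim=64, heads=4):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden_dim, heads=heads)
        self.conv2 = GATConv(hidden_dim * heads, hidden_dim, heads=1)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)

class LinkHead(nn.Module):
    # Small per-dataset classifier over pairs of node embeddings.
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                                 nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, h, edge_label_index):
        src, dst = edge_label_index
        return self.mlp(torch.cat([h[src], h[dst]], dim=-1)).squeeze(-1)

backbone = SharedBackbone(in_dim=16)
heads = nn.ModuleList([LinkHead() for _ in datasets])  # one head per dataset
params = list(backbone.parameters()) + list(heads.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
```

Here the backbone weights are shared and updated by every dataset, while each head stays dataset-specific.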
Since I'm still trying to fix my code and the idea is not fully clear yet, the example codes are not perfect. Thank you so much for reading this long question, and have a nice day!