Problem with training graph embedding network #2520
bsmietanka asked this question in Q&A (unanswered)

Hi,
I wanted to create a plagiarism classifier for source code functions. I have a custom dataset, and I convert each function's text representation into a program dependence graph. First, I used the Weisfeiler-Lehman kernel to compute similarity between these graphs, to set up a baseline for my plagiarism classifier. I got really good results: around 95% accuracy on a balanced test set of a couple thousand pairs (50% plagiarism, 50% non-plagiarism).
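(For concreteness, a minimal sketch of such a WL-kernel baseline, assuming the grakel library; `to_grakel` is a hypothetical helper that converts a program dependence graph into grakel's input format, and `graphs` stands in for the dataset:)

```python
from grakel.kernels import WeisfeilerLehman, VertexHistogram

# Sketch of a Weisfeiler-Lehman kernel baseline, assuming grakel.
# `to_grakel` is a hypothetical helper that turns a program dependence
# graph into grakel's (edges, node_labels) input format.
wl = WeisfeilerLehman(n_iter=5, base_graph_kernel=VertexHistogram, normalize=True)
K = wl.fit_transform([to_grakel(g) for g in graphs])  # K[i, j] in [0, 1]
# A pair (i, j) is then classified as plagiarism when K[i, j] exceeds
# a threshold tuned on a validation set.
```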
After this experiment I thought that training a GNN to achieve similar results wouldn't be a problem. From what I read in "How Powerful are Graph Neural Networks?" (https://arxiv.org/abs/1810.00826), GNNs can achieve discriminative power as good as the WL test.
This is the code for my model (strongly inspired by some great examples I found in this repository, thank you for that):
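(A minimal sketch of a model of this kind, assuming PyTorch Geometric: a GIN encoder with sum pooling, as suggested by the paper above. Layer sizes and names are illustrative, not the original code.)

```python
import torch
import torch.nn.functional as F
from torch.nn import Linear, ReLU, Sequential
from torch_geometric.nn import GINConv, global_add_pool

class GraphEmbedder(torch.nn.Module):
    """Sketch of a GIN encoder producing one embedding vector per graph."""
    def __init__(self, in_dim, hidden_dim=64, out_dim=64, num_layers=3):
        super().__init__()
        self.convs = torch.nn.ModuleList()
        dims = [in_dim] + [hidden_dim] * num_layers
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            # Each GIN layer wraps a small MLP, as in the GIN paper.
            mlp = Sequential(Linear(d_in, d_out), ReLU(), Linear(d_out, d_out))
            self.convs.append(GINConv(mlp))
        self.lin = Linear(hidden_dim, out_dim)

    def forward(self, x, edge_index, batch):
        for conv in self.convs:
            x = F.relu(conv(x, edge_index))
        x = global_add_pool(x, batch)  # sum node features -> one vector per graph
        return self.lin(x)
```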
I tried different approaches to train this model:
Both approaches failed miserably. After a lot of effort tuning hyperparameters like the learning rate and batch size, and changing the pooling method, I was still stuck in the same place: the model output is more or less random. Even though the training loss gets smaller, this doesn't translate into better results during evaluation. I was not even able to overfit the training dataset; I reduced it to a toy dataset consisting of only 10 unique functions (probably around 1000 source code files), but with no result.
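(For reference, a minimal sketch of one plausible pairwise setup, assuming the GraphEmbedder sketch above: both graphs are embedded and a cosine embedding loss pulls plagiarism pairs together. `pair_loader` and `num_node_features` are hypothetical stand-ins for the actual data pipeline, not the original training code.)

```python
import torch

# Sketch of pairwise training with a cosine embedding loss, assuming the
# GraphEmbedder sketch above. `pair_loader` (yielding two batched graphs
# plus labels of +1 for plagiarism pairs, -1 otherwise) and
# `num_node_features` are hypothetical stand-ins.
model = GraphEmbedder(in_dim=num_node_features)
criterion = torch.nn.CosineEmbeddingLoss(margin=0.5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(100):
    for batch_a, batch_b, labels in pair_loader:
        optimizer.zero_grad()
        emb_a = model(batch_a.x, batch_a.edge_index, batch_a.batch)
        emb_b = model(batch_b.x, batch_b.edge_index, batch_b.batch)
        loss = criterion(emb_a, emb_b, labels)  # pulls +1 pairs together
        loss.backward()
        optimizer.step()
```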
So my questions are:
If you have any other suggestions for what I could try, I would really appreciate it.
Thanks in advance for your help!
Replies: 1 comment

I think one problem of your model implementation is that you apply a final …