Is there a better way to have train-val-test split in HeteroData , with edge_attr and edge_labels, instead of node-level or graph based tasks #3869

snknitin · 2022-01-17T09:56:58Z

Hi PyG community,

I have a HeteroData object with edge labels. The only way I could find to split it into train, val, and test sets is RandomLinkSplit, since most DataLoaders are node-oriented. For example:

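import torch_geometric.transforms as T

def get_train_val_test_split(data):
    # Remember the original edge types before reverse edges are added.
    edge_types = data.edge_types
    data = T.ToUndirected()(data)
    data = T.AddSelfLoops()(data)
    data = T.NormalizeFeatures()(data)
    # Edge types that were not present before ToUndirected.
    rev_edge_types = [e for e in data.edge_types if e not in edge_types]

    # Remove edge_label from the reverse edge types so that only the
    # original edge types carry labels.
    for e in rev_edge_types:
        del data[e].edge_label

    transform = T.RandomLinkSplit(
        num_val=0.1,
        num_test=0.2,
        is_undirected=True,
        neg_sampling_ratio=0.0,
        edge_types=edge_types,
        rev_edge_types=rev_edge_types,
    )
    train_data, val_data, test_data = transform(data)

    return train_data, val_data, test_data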

This is the example data that needs splitting:

HeteroData(
  product={
    x=[846, 392],
    num_nodes=846
  },
  customer={
    x=[84281, 65],
    num_nodes=84281
  },
  shipnode={
    x=[261, 53],
    num_nodes=261
  },
  (customer, orders, product)={
    edge_index=[2, 133319],
    edge_attr=[133319, 6]
  },
  (shipnode, delivers, customer)={
    edge_index=[2, 133319],
    edge_attr=[133319, 18],
    edge_label=[133319, 1]
  }
)

It needs to be split into train, val, and test sets (see the attached images).

In my network, I want to predict the value of edge_label by concatenating edge_attr with the node features. However, there are no masks, so I'm unsure how to index the correct edge_index connections in val_data. I'm trying to hack my way through with conditionals during model.eval() that change which part of the predictions I consider, but there has to be a better way to mask the validation set, as there is for nodes.

For train, I can see that the excess labels are just 0, so I can compare pred[:93325] against the target labels. For val and test, there seem to be more labels than edge connections, so for my RMSE loss I'd need to compare a very specific subset of the edge label predictions. I cannot create a mask beforehand in the HeteroData itself, since I'll need to add edge connections for message passing.

Is there a way to create a mask for these, or a better way to split the data into train-val-test based on edge_labels that I'm missing? Perhaps even using some loader? HGT and neighbor sampling are more for node tasks at the moment.

rusty1s (Maintainer) · 2022-01-18T08:36:07Z

As far as I understand, the problem is that you do not actually want to split links in an inductive fashion. Instead, you want to make use of all edge_index and edge_attr to predict edge_label. Correct?

I think in this case, RandomLinkSplit may not be the best choice. What about simply creating some masks by yourself?

num_edges = data[...].num_edges
perm = torch.randperm(num_edges)

train_idx = perm[:int(0.8 * num_edges)]
val_idx = perm[int(0.8 * num_edges):int(0.9 * num_edges)]
test_idx = perm[int(0.9 * num_edges):]

# Convert to mask and add to the corresponding edge type.
...
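One possible way to carry out that last step, as a minimal sketch rather than the maintainer's own code: turn each index tensor into a boolean mask with one entry per edge. The helper name `make_edge_masks`, its parameters, and the mask names `train_mask`, `val_mask`, and `test_mask` are assumptions made for illustration; only `torch.randperm` and the 80/10/10 ratios come from the snippet above.

```python
import torch


def make_edge_masks(num_edges, train_frac=0.8, val_frac=0.1, seed=0):
    """Assign every edge to exactly one of train/val/test (hypothetical helper)."""
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_edges, generator=generator)

    # Split boundaries inside the random permutation.
    train_end = int(train_frac * num_edges)
    val_end = train_end + int(val_frac * num_edges)

    splits = {
        "train_mask": perm[:train_end],
        "val_mask": perm[train_end:val_end],
        "test_mask": perm[val_end:],
    }

    # Convert each index tensor into a boolean mask of length num_edges.
    masks = {}
    for name, idx in splits.items():
        mask = torch.zeros(num_edges, dtype=torch.bool)
        mask[idx] = True
        masks[name] = mask
    return masks


masks = make_edge_masks(num_edges=100)
```

With a HeteroData object, each mask could then be stored on the labeled edge type, for example `data['shipnode', 'delivers', 'customer'].train_mask = masks['train_mask']`. The full edge_index and edge_attr stay available for message passing, and the masks only select which edges contribute to the loss.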

snknitin Jan 20, 2022
Author

Yes. The number of edge_attr rows available in each split doesn't match the number of edge_labels, and edge_index and edge_label_index don't match either, so it's harder to pass them to an EdgeDecoder model, even though edge_attr contains the most relevant information for making predictions on the edges. I agree, and I'll work on creating an edge splitter class. I was initially planning to make it temporal, with daily graphs as batches that I append to a data_list before sending it to collate as a dataset. That way splitting would also become easier and would include a temporal aspect. I'll try both ways.
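For readers following along, here is a rough sketch of how an edge-level decoder might combine node embeddings with edge_attr and evaluate RMSE only on masked edges. Everything in it is an assumption for illustration: the class name `EdgeRegressor`, the helper `masked_rmse`, the layer sizes, and the random toy tensors, which stand in for embeddings that a heterogeneous GNN encoder would produce from the full edge_index. It is not the model used in this thread.

```python
import torch
from torch import nn


class EdgeRegressor(nn.Module):
    """Hypothetical decoder: one prediction per edge from [src, dst, edge_attr]."""

    def __init__(self, src_dim, dst_dim, edge_dim, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(src_dim + dst_dim + edge_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, z_src, z_dst, edge_index, edge_attr):
        row, col = edge_index
        # Concatenate both endpoint embeddings with the edge features.
        h = torch.cat([z_src[row], z_dst[col], edge_attr], dim=-1)
        return self.mlp(h)


def masked_rmse(pred, target, mask):
    """RMSE over the edges selected by a boolean mask."""
    return torch.sqrt(torch.mean((pred[mask] - target[mask]) ** 2))


# Toy stand-ins for encoder outputs and one labeled edge type.
torch.manual_seed(0)
num_src, num_dst, num_edges = 5, 7, 20
z_src = torch.randn(num_src, 8)
z_dst = torch.randn(num_dst, 8)
edge_index = torch.stack([
    torch.randint(0, num_src, (num_edges,)),
    torch.randint(0, num_dst, (num_edges,)),
])
edge_attr = torch.randn(num_edges, 18)
edge_label = torch.randn(num_edges, 1)

# Pretend the first four edges form the validation split.
val_mask = torch.zeros(num_edges, dtype=torch.bool)
val_mask[:4] = True

model = EdgeRegressor(src_dim=8, dst_dim=8, edge_dim=18)
pred = model(z_src, z_dst, edge_index, edge_attr)
val_rmse = masked_rmse(pred, edge_label, val_mask)
```

Because the prediction is computed for every edge and the mask is applied only inside the loss, the number of predictions always matches the number of edge_attr rows, which avoids the edge_index versus edge_label_index mismatch described above.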
