Creating Dataset #2328

JindeShubhamA · 2021-03-30T10:33:15Z


Hello Matt,

I am working on anomaly detection in financial transactions using graph-based methods. I am currently using AMLSim from the IBM repository to create the dataset. The idea is to represent transactions as a graph and use node classification or graph classification methods from PyG to classify fraudulent accounts (node classification) or series of fraudulent transactions (graph classification). The dataset is currently in CSV format, which I was thinking of converting to numpy arrays. Can you guide me on what the numpy arrays should look like?

The dataset has information for each account in accounts.csv and all the transactions between them in transaction.csv. Each row in accounts.csv describes one unique account, and each row in transaction.csv describes one transaction between two accounts. accounts.csv will form my node features and transaction.csv will form my edge index, but I would like to know what the numpy array shapes should be.

E.g., what would the numpy array shapes be, for node classification as well as graph classification, if there are 10 accounts, each with 5 features, and 25 transactions among the 10 accounts?

Also, there will be a lot of disconnected subgraphs: for example, 4 nodes (accounts) may form one cluster because there are transactions between them, while 3 other nodes (accounts) form another cluster. Will I need to specify all these subgraphs if I want to do graph classification?

rusty1s · 2021-03-30T19:39:42Z

rusty1s
Mar 30, 2021
Maintainer

In your example, the node features would have shape [10, 5], and your edge connectivity can be either represented as a sparse [10, 10] matrix with 25 non-zero entries, or as an edge list of shape [2, 25].
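
To make these shapes concrete, here is a minimal NumPy sketch. The account IDs, the random features, and the random sender/receiver pairs are made up for illustration and are not taken from AMLSim; the only point is the mapping from raw account IDs to row positions and the resulting array shapes.

```python
import numpy as np

num_accounts, num_features, num_transactions = 10, 5, 25
rng = np.random.default_rng(0)

# Hypothetical account IDs as they might appear in a CSV (not contiguous).
account_ids = np.arange(1000, 1000 + num_accounts)
x = rng.random((num_accounts, num_features)).astype(np.float32)  # node features

# Hypothetical (sender, receiver) account IDs, one pair per transaction.
senders = rng.choice(account_ids, size=num_transactions)
receivers = rng.choice(account_ids, size=num_transactions)

# Map raw account IDs to row positions 0..9 of x.
id_to_index = {acc: i for i, acc in enumerate(account_ids)}
edge_index = np.array(
    [[id_to_index[s] for s in senders],
     [id_to_index[r] for r in receivers]],
    dtype=np.int64,
)

# Dense [10, 10] view of the same connectivity (a sparse format would be
# used in practice). Repeated (sender, receiver) pairs are counted.
adjacency = np.zeros((num_accounts, num_accounts), dtype=np.int64)
np.add.at(adjacency, (edge_index[0], edge_index[1]), 1)

print(x.shape)           # (10, 5)
print(edge_index.shape)  # (2, 25)
```

Note that the adjacency matrix has exactly 25 non-zero entries only if no (sender, receiver) pair repeats; the edge list keeps every transaction either way.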

If you want to do graph-classification on subgraphs, you need to specify which nodes belong to a given cluster/label. You can think of this as a (sparse) assignment matrix, where each node is mapped to a given cluster. In case you have 4 nodes in one cluster, and 3 nodes in another, this can be represented as a sparse matrix of shape [n, 2] with 7 entries, or as an edge list of shape [2, 7]. You can then use this matrix to aggregate node features to specific clusters:

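from torch_scatter import scatter_add

x = ...  # GNN output node features, shape [num_nodes, num_features]
row, col = cluster_assignment_index  # row: node indices, col: cluster indices
z = scatter_add(x[row], col, dim=0)  # z has shape [num_clusters, num_features]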
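
The same aggregation can be sketched in plain NumPy to check the shapes by hand. This is only an illustration of the idea, with made-up feature values standing in for the GNN output; `np.add.at` plays the role of `scatter_add` here.

```python
import numpy as np

# Stand-in for the GNN output: 7 nodes with 3 features each.
x = np.arange(21, dtype=np.float64).reshape(7, 3)

# Row 0: node indices, row 1: cluster index of each node.
# Nodes 0-2 belong to cluster 0, nodes 3-6 to cluster 1.
cluster_assignment_index = np.array([
    [0, 1, 2, 3, 4, 5, 6],
    [0, 0, 0, 1, 1, 1, 1],
])
row, col = cluster_assignment_index
num_clusters = int(col.max()) + 1

# Sum the node features of each cluster (unbuffered add along axis 0).
z = np.zeros((num_clusters, x.shape[1]))
np.add.at(z, col, x[row])

print(z.shape)  # (2, 3)
```

Because the assignment is stored as (node, cluster) pairs, a node can appear in more than one column if it should contribute to several clusters.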


JindeShubhamA · 2021-03-31T11:02:24Z

JindeShubhamA
Mar 31, 2021
Author

transaction_data_100.npz.zip
Thanks for the response,

The shapes of the numpy arrays make sense. I will also be adding edge features, but I guess those will also be a numpy array of shape (no. of edges, no. of edge features). Do I need to save all these arrays in a single file (.npz)? I was following discussion #2129.
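
For reference, bundling several arrays into one .npz file could look like the sketch below. The key names follow the attribute names used in this thread, while the file name, the number of edge features (3), and the random contents are placeholders chosen for illustration.

```python
import os
import tempfile

import numpy as np

num_nodes, num_edges = 10, 25
rng = np.random.default_rng(0)

arrays = {
    "x": rng.random((num_nodes, 5)).astype(np.float32),          # node features
    "edge_index": rng.integers(0, num_nodes, (2, num_edges)),    # edge list
    "edge_attr": rng.random((num_edges, 3)).astype(np.float32),  # edge features
    "y": rng.integers(0, 2, num_nodes),                          # node labels
}

# Placeholder path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "transaction_data_example.npz")
np.savez(path, **arrays)

# np.load returns a lazy, dict-like NpzFile; .files lists the stored keys.
with np.load(path) as loaded:
    shapes = {key: loaded[key].shape for key in loaded.files}

print(shapes)
```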

I was able to create the 'npz' object and have attached it here. I am now working on the MyDataset class, but I am getting an error in the process function:

Traceback (most recent call last):
  File "amlSim_benchmark.py", line 51, in <module>
    dataset = MyDataset(root='./')
  File "node_classification/myDataset.py", line 10, in __init__
    self.data, self.slices = torch.load(self.processed_paths[0])
  File "anaconda3/envs/my_env/lib/python3.8/site-packages/torch/serialization.py", line 594, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "anaconda3/envs/my_env/lib/python3.8/site-packages/torch/serialization.py", line 850, in _load
    data_file = io.BytesIO(zip_file.get_record(pickle_file))
RuntimeError: [enforce fail at inline_container.cc:222] . file not found: archive/data.pkl

I actually don't want to do any processing, because the 'npz' object already contains the edge_index, train_mask, test_mask, y, and edge_attr arrays, and I do see a processed folder with a data.pt file. Below is the MyDataset class:

import torch
import numpy as np
from torch_geometric.data import InMemoryDataset, Data


class MyDataset(InMemoryDataset):
    def __init__(self, root, transform=None):
        super(MyDataset, self).__init__(root, transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return "transaction_data_100.npz"

    @property
    def processed_file_names(self):
        return "data.pt"

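    def process(self):
        # Load the raw arrays and wrap them in a single Data object.
        raw = np.load(self.raw_paths[0])
        # Every array stored in the .npz becomes an attribute of the same name.
        data = Data(**{key: torch.from_numpy(raw[key]) for key in raw.files})
        # collate() expects a list of Data objects, not the raw NpzFile.
        torch.save(self.collate([data]), self.processed_paths[0])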

For the graph classification dataset, you are saying that I should have a label for each node. How is that different from the dataset for node classification (in terms of the shapes of the numpy arrays)? I thought there would be a label for each subgraph. For a subgraph with 3 nodes (node1, node2, node3) belonging to class 1, and a subgraph with 4 nodes (node4, node5, node6, node7) belonging to class 2, will I have a matrix of shape [2, 2] (2 subgraphs and their corresponding classes) or [7, 2] (the class of each node)?

P.S. If I generate a dataset for node classification, what changes do I need to make to use the same dataset for graph classification? The dataset I am currently trying to convert has a label for each node. If a subgraph belongs to, say, class 1, then all the nodes in that subgraph have label 1, so node classification can be applied easily. I also want to do graph classification. For that, I am first separating out all the subgraphs and assigning the label to the subgraph rather than to its nodes. Can you suggest how I should transform the node classification dataset into a graph classification dataset?
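
One possible way to derive graph-level labels from node-level labels is sketched below in NumPy. It assumes that every node already carries the ID of the subgraph it belongs to (the `subgraph_id` array is hypothetical) and that all nodes of a subgraph share one label, as described above. Under these assumptions the graph labels form a vector with one entry per subgraph, and the assignment uses the [2, 7] layout suggested earlier in the thread.

```python
import numpy as np

# Hypothetical node-level data: 7 nodes, nodes 0-2 form subgraph 0,
# nodes 3-6 form subgraph 1.
node_labels = np.array([1, 1, 1, 2, 2, 2, 2])
subgraph_id = np.array([0, 0, 0, 1, 1, 1, 1])

num_subgraphs = int(subgraph_id.max()) + 1
y_graph = np.empty(num_subgraphs, dtype=node_labels.dtype)
for g in range(num_subgraphs):
    labels_in_g = np.unique(node_labels[subgraph_id == g])
    # Assumption from the post: all nodes of a subgraph share one label.
    assert labels_in_g.size == 1, f"subgraph {g} has mixed labels"
    y_graph[g] = labels_in_g[0]

# Assignment index in the [2, num_assignments] layout:
# row 0 holds node indices, row 1 holds subgraph indices.
assign_index = np.stack([np.arange(node_labels.size), subgraph_id])

print(y_graph)             # [1 2]
print(assign_index.shape)  # (2, 7)
```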


rusty1s Apr 15, 2021
Maintainer

CrossEntropy is the correct choice for node classification. It also provides a weight argument, which can help to re-weight the loss in case of highly skewed class distribution.
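
To illustrate the re-weighting idea, here is a small NumPy sketch. The inverse-frequency formula for the class weights is one common heuristic and an assumption on my part, not something prescribed in this thread; the label counts are made up. In PyTorch, such a weight vector would be passed as a tensor to the `weight` argument of the cross-entropy loss.

```python
import numpy as np

# Hypothetical, heavily imbalanced node labels: 95 normal (0), 5 fraudulent (1).
y = np.array([0] * 95 + [1] * 5)
num_classes = 2

# Heuristic (assumption): weight_c = num_samples / (num_classes * count_c).
counts = np.bincount(y, minlength=num_classes)
class_weight = y.size / (num_classes * counts)


def weighted_cross_entropy(logits, targets, weight):
    """Weighted mean of the per-sample negative log-likelihoods."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(targets.size), targets]
    w = weight[targets]  # weight of each sample's true class
    return float((w * nll).sum() / w.sum())


logits = np.zeros((y.size, num_classes))  # an uninformative model
loss = weighted_cross_entropy(logits, y, class_weight)

print(class_weight)  # roughly [0.526, 10.0]
print(loss)          # log(2) for uniform predictions
```

With these weights, each misclassified fraudulent node contributes about 19 times as much to the loss as a misclassified normal node.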

JindeShubhamA Apr 23, 2021
Author

Thanks Matt,

I was able to get results for the node classification task.

I do have a couple of questions about the graph classification task:

1. Can a node belong to more than one subgraph, such that the subgraphs it belongs to have different classes? (I am sampling from a large graph to generate small subgraphs, so a node may be sampled more than once, and my class labels follow the structure of the subgraphs, e.g. closed, bipartite, straight line.)
2. Is the x in the data object mentioned above the list of subgraphs or the list of nodes?

Thanks in advance

JindeShubhamA Apr 23, 2021
Author

Also, I was exploring TUDataset (MUTAG) and found that the dataset object is actually a list (I am assuming it is a list of subgraphs). dataset[0] looks like this:

Data(edge_attr=[28, 4], edge_index=[2, 28], x=[13, 7], y=[1])

Here it is clear that 'edge_attr' holds the edge features, 'edge_index' is the list of edges, 'x' contains the features of the nodes in the subgraph, and 'y' is the label of the graph, and that all these attributes are local to the graph. However, I am not sure how the entries of edge_index are numbered.

To simplify, say I have 10 nodes in the graph which form 2 subgraphs, with nodes [1-5] in subGraph1 and nodes [6-10] in subGraph2. For subGraph2, 'x' will hold the features of nodes [6,7,8,9,10]. If there is an edge from node 6 to node 7, should it be saved in the edge index as [[6,7],...] or as [[0,1],...], since node 6 is the 1st node in subGraph2 and node 7 is the 2nd?

Also, it looks like TUDataset (MUTAG) has a different structure than the one you mentioned above, as it does not contain assign_index. Can you confirm this? If they are different, which way of creating the dataset would you recommend, given that I want nodes to belong to multiple subgraphs?

data = Data(x=x, edge_index=edge_index, y_graph=y_graph, assign_index=assign_index)

Thank you for your constant help.

rusty1s Apr 23, 2021
Maintainer

1. Yes, of course. In the end, you compute a representation for the given subgraph, not for a specific node.
2. It's the node feature matrix. It holds the feature vectors of all nodes in the graph, i.e., a tensor of shape [num_nodes, num_features].

I'm not sure what you mean by how edge_index is numbered. It's a tensor of shape [2, num_edges], where each column holds the node indices of the source and target node of one edge.

In general, different data objects point to different nodes, and each data object only knows about the existence of the node in its graph (or subgraph). So, the second edge_index will again start from node index 0.
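
A minimal NumPy sketch of this local re-indexing follows. The helper name `extract_subgraph` and the toy graph are hypothetical, and nodes are numbered from 0 here, so the subgraph of global nodes 5 to 9 corresponds to "nodes 6 to 10" in the 1-based example above.

```python
import numpy as np

# Global graph: 10 nodes (0-9) and a global edge list of shape [2, num_edges].
x = np.arange(20, dtype=np.float32).reshape(10, 2)
edge_index = np.array([[0, 1, 5, 6, 8],
                       [1, 2, 6, 7, 9]])


def extract_subgraph(x, edge_index, node_ids):
    """Return node features and a locally re-indexed edge list for node_ids."""
    node_ids = np.asarray(node_ids)
    # Global index -> local position; -1 marks nodes outside the subgraph.
    local = np.full(x.shape[0], -1, dtype=np.int64)
    local[node_ids] = np.arange(node_ids.size)
    # Keep only edges whose two endpoints are both inside the subgraph.
    keep = (local[edge_index[0]] >= 0) & (local[edge_index[1]] >= 0)
    return x[node_ids], local[edge_index[:, keep]]


x_sub, edge_index_sub = extract_subgraph(x, edge_index, [5, 6, 7, 8, 9])

print(x_sub.shape)     # (5, 2)
print(edge_index_sub)  # [[0 1 3]
                       #  [1 2 4]]
```

The global edge (5, 6) becomes (0, 1) in the subgraph. A node that is sampled into several subgraphs simply gets its own local copy in each of them.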

Yes, there is no notion of assign_index in MUTAG. In your case, I would just save each subgraph in its own Data object and iterate over them in mini-batches via torch_geometric.data.DataLoader, as long as you do not care about the mapping from global nodes to subgraphs.
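
Conceptually, mini-batching several small graphs amounts to stacking them into one large disconnected graph and remembering which node came from which graph. The NumPy sketch below illustrates that idea only; it is not PyG's actual implementation, and the `collate` function and the dict layout are hypothetical.

```python
import numpy as np

# Two hypothetical subgraphs, each with locally numbered nodes starting at 0.
graphs = [
    {"x": np.ones((3, 2)), "edge_index": np.array([[0, 1], [1, 2]]), "y": 0},
    {"x": np.zeros((4, 2)), "edge_index": np.array([[0, 1, 2], [1, 2, 3]]), "y": 1},
]


def collate(graphs):
    """Merge graphs into one disconnected graph plus a node-to-graph vector."""
    xs, edges, batch, ys = [], [], [], []
    offset = 0
    for graph_idx, g in enumerate(graphs):
        num_nodes = g["x"].shape[0]
        xs.append(g["x"])
        edges.append(g["edge_index"] + offset)  # shift local node indices
        batch.append(np.full(num_nodes, graph_idx))
        ys.append(g["y"])
        offset += num_nodes
    return (np.concatenate(xs), np.concatenate(edges, axis=1),
            np.concatenate(batch), np.array(ys))


x, edge_index, batch, y = collate(graphs)

print(x.shape)     # (7, 2)
print(edge_index)  # [[0 1 3 4 5]
                   #  [1 2 4 5 6]]
print(batch)       # [0 0 0 1 1 1 1]
```

The `batch` vector is what allows node features to be pooled per graph afterwards, giving one representation and one label per subgraph.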

JindeShubhamA Apr 29, 2021
Author

Thanks,
I was able to create the dataset and do some benchmarking.
