-
Hey, I have a question regarding heterogeneous data. My dataset is transactional and contains several node types, such as customers, transactions (customer-customer), devices, etc. The transactions have a temporal aspect: each one was performed at a specific point in the past. The objective is node classification on the transactions only, without predicting labels for the other node types. I have implemented the train-test split along the temporal axis to enable inductive learning. However, I'm wondering how to handle the other nodes, as they can also change over time. For instance, new customers or devices may be added. In the heterogeneous example provided on GitHub (https://github.com/pyg-team/pytorch_geometric/blob/master/examples/hetero/to_hetero_mag.py), the train-test split is only applied to the "paper" nodes. In my case, it would be beneficial to split the other node types as well, so that no information from the test set influences training. Otherwise, the training graph would contain customers who do not yet exist at that point in time. Is it possible to implement this, and if so, how can it be done?
-
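There are two ways to achieve this:
(1) Create separate data objects for training, validation, and testing, each containing only the information visible up to its point in time.
(2) Use temporal sampling on a single data object.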
-
Thanks for the fast response! If I understand your answer correctly, I need to create a mask for all node types.
Sorry if my answer was not clear enough. In particular, for point (1) you don't want to create a `mask` for every node type, but you want to create separate `data` objects for training, validation, and testing. Each of these data objects then only contains the information visible up to its point in time.

If you want to use temporal sampling from point (2), then the idea is to operate on a single `data` object, and let temporal sampling take care of avoiding data leakage. That is, all nodes have a `time` attribute, and temporal sampling will then only sample nodes that have a timestamp less than or equal to the seed timestamp.
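
To make the difference between the two options concrete, here is a minimal sketch in plain Python. It is not PyG code: the toy graph, the edge type name `made_by`, and the helpers `snapshot` and `temporal_neighbors` are all made up for illustration. It only shows the filtering rule behind each option, assuming every node carries an integer timestamp.

```python
from typing import Dict, List, Tuple

EdgeType = Tuple[str, str, str]

# Hypothetical toy graph: one timestamp per node, edges as (src, dst) index pairs.
node_time: Dict[str, List[int]] = {
    "customer": [1, 2, 5],
    "transaction": [2, 3, 6],
}
made_by: EdgeType = ("transaction", "made_by", "customer")
edges: Dict[EdgeType, List[Tuple[int, int]]] = {
    made_by: [(0, 0), (1, 0), (1, 1), (2, 0), (2, 2)],
}


def snapshot(node_time, edges, cutoff):
    """Option (1): build a graph that only contains nodes with time <= cutoff."""
    # Map the old index of every surviving node to a new contiguous index.
    index_map = {
        ntype: {old: new for new, old in enumerate(
            i for i, t in enumerate(times) if t <= cutoff)}
        for ntype, times in node_time.items()
    }
    # Keep an edge only if both of its endpoints survive, and relabel it.
    kept_edges = {
        (src_t, rel, dst_t): [
            (index_map[src_t][s], index_map[dst_t][d])
            for s, d in pairs
            if s in index_map[src_t] and d in index_map[dst_t]
        ]
        for (src_t, rel, dst_t), pairs in edges.items()
    }
    return index_map, kept_edges


def temporal_neighbors(node_time, edges, edge_type, seed, seed_time):
    """Option (2): neighbors of `seed` whose timestamp is <= seed_time."""
    src_t = edge_type[0]
    return [s for s, d in edges[edge_type]
            if d == seed and node_time[src_t][s] <= seed_time]


# Option (1): the training snapshot at cutoff 3 drops customer 2 and transaction 2.
train_index_map, train_edges = snapshot(node_time, edges, cutoff=3)

# Option (2): on the full graph, customer 0 only "sees" transactions up to time 3.
visible = temporal_neighbors(node_time, edges, made_by, seed=0, seed_time=3)
```

With option (1) you would call `snapshot` once per split (with increasing cutoffs) and build one data object from each result. With option (2) the graph stays whole and the filtering happens per seed at sampling time. In PyG the neighbor loaders expose this through their `time_attr` argument, so you would not write the filter yourself.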