How can I correctly index multiple node types and accurately construct their edge index lists? #7961

SHamda · 2023-08-31T12:26:06Z

I have a heterogeneous graph that contains multiple node and edge types. I used two different methods to create the node indices and the edge lists.
Method 1:

import torch
from torch_geometric.data import Data, HeteroData

# Get the starting index for each node type
user_start_index = 0
tweet_start_index = user_start_index + len(df_users)
lemma_start_index = tweet_start_index + len(temp_df)
root_start_index = lemma_start_index + len(df_w_l)
word_start_index = root_start_index + len(df_l_r)

# Create node indices
user_node_indices = {node_id: index + user_start_index for index, node_id in enumerate(df_users['userId'])}
tweet_node_indices = {node_id: index + tweet_start_index for index, node_id in enumerate(temp_df['id'])}
lemma_node_indices = {node_id: index + lemma_start_index for index, node_id in enumerate(df_w_l['Source'])}
root_node_indices = {node_id: index + root_start_index for index, node_id in enumerate(df_l_r['Source'])}
word_node_indices = {node_id: index + word_start_index for index, node_id in enumerate(df_word_instances['Source'])}
tf_node_indices = {node_id: index + tweet_start_index for index, node_id in enumerate(df_word_instances['Target'])}


# Create node features
user_features = torch.tensor(df_users[['followers_count', 'friends_count', 'favourites_count', 'statuses_count']].values, dtype=torch.long)
#tweet_features = torch.tensor(df_tweets[['truncated', 'is_quote_status', 'retweet_count', 'favorite_count']].values, dtype=torch.long)
tweet_features = torch.tensor(temp_df[embedding_cols].values)

# Create edge indices
edge_index_t_u = []
for index, row in temp_df.iterrows():
    edge_index_t_u.append((user_node_indices[row['userId']], tweet_node_indices[row['id']]))
list1 = edge_index_t_u
edge_index_t_u = torch.tensor(edge_index_t_u, dtype=torch.long).t().contiguous()

#link between words and lemmas
edge_index_w_l = []
for index, row in df_w_l.iterrows():
    edge_index_w_l.append((word_node_indices[row['Target']], lemma_node_indices[row['Source']]))
list2 = edge_index_w_l
edge_index_w_l = torch.tensor(edge_index_w_l, dtype=torch.long).t().contiguous()

#link between lemmas and roots
edge_index_l_r = []
for index, row in df_l_r.iterrows():
    edge_index_l_r.append((root_node_indices[row['Source']], lemma_node_indices[row['Target']]))
edge_index_l_r = torch.tensor(edge_index_l_r, dtype=torch.long).t().contiguous()

#link between words and tweets
edge_index_w_t = []
for index, row in df_word_instances.iterrows():
    edge_index_w_t.append((word_node_indices[row['Source']], tf_node_indices[row['Target']]))
edge_index_w_t = torch.tensor(edge_index_w_t, dtype=torch.long).t().contiguous()

Method 2:

import torch
from torch_geometric.data import Data, HeteroData

# Create mapping from node ids to their indices
user_node_indices = {node_id: index for index, node_id in enumerate(df_users['userId'])}
tweet_node_indices = {node_id: index for index, node_id in enumerate(temp_df['id'])}
tf_node_indices = {node_id: index for index, node_id in enumerate(df_word_instances['Target'])}
lemma_node_indices = {node_id: index for index, node_id in enumerate(df_w_l['Source'])}
word_node_indices = {node_id: index for index, node_id in enumerate(df_word_instances['Source'])}
root_node_indices = {node_id: index for index, node_id in enumerate(df_l_r['Source'])}

# Create node features
user_features = torch.tensor(df_users[['followers_count', 'friends_count', 'favourites_count', 'statuses_count']].values, dtype=torch.long)
#tweet_features = torch.tensor(df_tweets[['truncated', 'is_quote_status', 'retweet_count', 'favorite_count']].values, dtype=torch.long)
tweet_features = torch.tensor(temp_df[embedding_cols].values)

# Create edge indices
edge_index_t_u = []
for index, row in temp_df.iterrows():
    edge_index_t_u.append((user_node_indices[row['userId']], tweet_node_indices[row['id']]))
list1 = edge_index_t_u
edge_index_t_u = torch.tensor(edge_index_t_u, dtype=torch.long).t().contiguous()


#link between words and lemmas
edge_index_w_l = []
for index, row in df_w_l.iterrows():
    edge_index_w_l.append((word_node_indices[row['Target']], lemma_node_indices[row['Source']]))
list2 = edge_index_w_l
edge_index_w_l = torch.tensor(edge_index_w_l, dtype=torch.long).t().contiguous()

#link between lemmas and roots
edge_index_l_r = []
for index, row in df_l_r.iterrows():
    edge_index_l_r.append((root_node_indices[row['Source']], lemma_node_indices[row['Target']]))
edge_index_l_r = torch.tensor(edge_index_l_r, dtype=torch.long).t().contiguous()

#link between words and tweets
edge_index_w_t = []
for index, row in df_word_instances.iterrows():
    edge_index_w_t.append((word_node_indices[row['Source']], tf_node_indices[row['Target']]))
edge_index_w_t = torch.tensor(edge_index_w_t, dtype=torch.long).t().contiguous()

The main difference between the two methods is how the node indices are initialized. In Method 1, I offset the indices of each node type so that they are sequential across types and never overlap. I assume this is the conventional way to create node indices, and that Method 2 should not be used, since it inevitably assigns the same index value to nodes of different types, which would misrepresent the graph. (Correct me if I'm wrong.)

However, Method 1 produces the following error:

IndexError                                Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/torch_geometric/nn/conv/message_passing.py in _lift(self, src, edge_index, dim)
    271                 index = edge_index[dim]
--> 272                 return src.index_select(self.node_dim, index)
    273             except (IndexError, RuntimeError) as e:

IndexError: index out of range in self


During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)

7 frames

/usr/local/lib/python3.10/dist-packages/torch_geometric/nn/conv/message_passing.py in _lift(self, src, edge_index, dim)
    273             except (IndexError, RuntimeError) as e:
    274                 if index.min() < 0 or index.max() >= src.size(self.node_dim):
--> 275                     raise IndexError(
    276                         f"Encountered an index error. Please ensure that all "
    277                         f"indices in 'edge_index' point to valid indices in "

IndexError: Encountered an index error. Please ensure that all indices in 'edge_index' point to valid indices in the interval [0, 2757] (got interval [2128, 4885])

If my understanding is correct, the interval [0, 2757] corresponds to the number of tweets in the dataset, and the interval [2128, 4885] should be read as [number of users, number of users + number of tweets], as the code below illustrates:

# Get the starting index for each node type
user_start_index = 0
tweet_start_index = user_start_index + len(df_users)
lemma_start_index = tweet_start_index + len(temp_df)
root_start_index = lemma_start_index + len(df_w_l)
word_start_index = root_start_index + len(df_l_r)

# Create node indices
user_node_indices = {node_id: index + user_start_index for index, node_id in enumerate(df_users['userId'])}
tweet_node_indices = {node_id: index + tweet_start_index for index, node_id in enumerate(temp_df['id'])}
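This reading can be checked with a small sketch. It assumes 2128 users and 2758 tweets, counts inferred from the two intervals in the error message rather than stated in the post, and it uses a plain Python list as a stand-in for the tweet feature matrix. The helper `can_gather` is hypothetical and only mimics the row lookup that fails in the traceback.

```python
# Counts inferred from the error message (assumptions, not given in the post).
num_users = 2128
num_tweets = 2758  # valid local tweet indices are 0..2757

# Stand-in for the tweet feature matrix: one row per tweet.
tweet_features = [[0.0] * 4 for _ in range(num_tweets)]

# Method 1: tweet indices are shifted by the number of users.
global_tweet_ids = range(num_users, num_users + num_tweets)
# Method 2: tweet indices start at 0 within the tweet type.
local_tweet_ids = range(num_tweets)


def can_gather(rows, indices):
    """Return True if every index addresses an existing row."""
    try:
        for i in indices:
            rows[i]
        return True
    except IndexError:
        return False


print(min(global_tweet_ids), max(global_tweet_ids))  # 2128 4885
print(can_gather(tweet_features, global_tweet_ids))   # False
print(can_gather(tweet_features, local_tweet_ids))    # True
```

Under these assumptions, the shifted indices reproduce the reported interval [2128, 4885], and they fail because the lookup runs against the tweet rows only, which cover [0, 2757].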

My question, and the source of my confusion, is why the valid range in torch geometric depends on the first index range it encounters rather than on the overall range, i.e. the sum of the counts of all node types [tweet + user + lemma + root + word]. Given the error message, it should know that the tweet indices lie in [2128, 4885].

How can I correctly index multiple node types and accurately construct their edge index lists?

Answered by rusty1s

rusty1s (Maintainer) · 2023-09-01T10:25:03Z

The second method is correct. Each node type has its own index range [0, num_nodes - 1], so that we can efficiently fetch its feature vectors.
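A minimal sketch of per-type indexing in plain Python, using made-up IDs. The helpers `build_local_index` and `build_edge_index` are hypothetical names, not part of PyG. The sketch also deduplicates IDs: a dictionary comprehension over `enumerate(column)` keeps the last position of a repeated ID, so a column with repeated IDs can yield indices that are not contiguous from 0.

```python
def build_local_index(node_ids):
    """Map each distinct node ID to a position in [0, num_nodes - 1]."""
    index = {}
    for node_id in node_ids:
        if node_id not in index:
            index[node_id] = len(index)
    return index


def build_edge_index(pairs, src_index, dst_index):
    """Return [src_row, dst_row], each using its own type's local indices."""
    src_row = [src_index[s] for s, _ in pairs]
    dst_row = [dst_index[d] for _, d in pairs]
    return [src_row, dst_row]


# Toy data with hypothetical IDs; "u10" is repeated on purpose.
user_ids = ["u10", "u20", "u10"]
tweet_ids = ["t7", "t8", "t9"]
posts = [("u10", "t7"), ("u10", "t8"), ("u20", "t9")]

user_index = build_local_index(user_ids)    # {'u10': 0, 'u20': 1}
tweet_index = build_local_index(tweet_ids)  # {'t7': 0, 't8': 1, 't9': 2}
edge_index = build_edge_index(posts, user_index, tweet_index)

# Each row must stay inside the range of its own node type.
assert max(edge_index[0]) < len(user_index)
assert max(edge_index[1]) < len(tweet_index)

print(edge_index)  # [[0, 0, 1], [0, 1, 2]]
```

The result already has the two-row layout used for `edge_index`, so it can be converted with `torch.tensor(edge_index, dtype=torch.long)` without a transpose. In a `HeteroData` object it would be stored under an edge type such as `data['user', 'posts', 'tweet'].edge_index`, where the relation name `posts` is invented for this example.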
