Subtoken embedding as node initialization #5460
tehranixyz started this conversation in General
Replies: 2 comments
-
Sorry, I don't have a good answer for you, but I do feel this is a topic you might find more information on if you look at NLP libraries. Hugging Face has a lot of good support for different tokenizers: https://huggingface.co/docs/transformers/main_classes/tokenizer
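For illustration, here is a minimal sketch of splitting a token into subtokens with a pretrained Hugging Face tokenizer; the `bert-base-uncased` checkpoint is only an example, not a recommendation from this reply.

```python
# Minimal sketch: subword-tokenize an identifier with a pretrained tokenizer.
# The checkpoint name is illustrative; any subword tokenizer (WordPiece, BPE,
# SentencePiece) from the Hugging Face hub is used the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

subtokens = tokenizer.tokenize("hello_world")              # e.g. ['hello', '_', 'world']
subtoken_ids = tokenizer.convert_tokens_to_ids(subtokens)  # integer ids in the subtoken vocabulary
print(subtokens, subtoken_ids)
```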
-
Hi,
In the problem I'm working on, each node contains a token. I map each unique token to an integer, so the feature of each node in my graphs is a single integer. The first layer in my model is then `torch.nn.Embedding`, which maps each node's integer id to a fixed-length tensor. This approach works; however, the number of unique tokens is sometimes far too large.
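For concreteness, a minimal sketch of that setup; the vocabulary size, embedding dimension, and ids below are illustrative, not the actual values.

```python
# Current setup: one integer token id per node, embedded by nn.Embedding.
# Sizes are illustrative; the real vocabulary grows with every unique token.
import torch
import torch.nn as nn

num_unique_tokens = 50_000
embedding_dim = 64

embedding = nn.Embedding(num_unique_tokens, embedding_dim)

node_token_ids = torch.tensor([0, 42, 7])   # one token id per node
node_features = embedding(node_token_ids)   # shape: [num_nodes, embedding_dim]
```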
The problem is that `hello`, `world`, and `hello_world` are each treated as unique tokens. I want to know an efficient way to split tokens into subtokens (for example, `hello_world` would become the subtokens `hello`, `_`, and `world`), look up the embedding of each subtoken, stack these embeddings, and take their average or sum. So, in the end, some nodes would have one subtoken and others two or more, but whenever a node has more than one subtoken, the subtoken embeddings are retrieved, stacked, and averaged. I think this way I can limit the number of unique tokens; however, I'm not sure what an efficient way of implementing this approach is.
Thanks for your help.
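One way to implement this kind of pooling is `torch.nn.EmbeddingBag`, which looks up a variable-length bag of indices per node and reduces them (mean or sum) in a single call. The sketch below assumes a subtoken vocabulary has already been built; the vocabulary size, ids, and offsets are illustrative.

```python
# Sketch: per-node subtoken embedding with mean pooling via nn.EmbeddingBag.
# Assumes tokens have already been split into subtokens and mapped to ids.
import torch
import torch.nn as nn

num_subtokens = 1_000   # illustrative subtoken vocabulary size
embedding_dim = 64

embedding = nn.EmbeddingBag(num_subtokens, embedding_dim, mode="mean")

# Flattened subtoken ids for all nodes, plus the offset where each node's bag starts:
# node 0 -> [3, 17, 5], node 1 -> [8], node 2 -> [2, 9]
subtoken_ids = torch.tensor([3, 17, 5, 8, 2, 9])
offsets = torch.tensor([0, 3, 4])

node_features = embedding(subtoken_ids, offsets)   # shape: [num_nodes, embedding_dim]
print(node_features.shape)                         # torch.Size([3, 64])
```

Nodes with a single subtoken are handled the same way (a bag of length one), so no special-casing is needed, and switching `mode` to `"sum"` gives summation instead of averaging.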