Subtoken embedding as node initialization #5460
tehranixyz started this conversation in General
Replies: 2 comments
-
Sorry, I don't have a good answer for you, but I do feel this is a topic you might find more information on if you look at NLP libraries. Hugging Face has a lot of good support for different tokenizers: https://huggingface.co/docs/transformers/main_classes/tokenizer
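For illustration, here is a minimal sketch of splitting a token into subtokens with a pretrained Hugging Face tokenizer; the `bert-base-uncased` checkpoint is only an example, not a recommendation from this reply.

```python
# Minimal sketch: subword-tokenize an identifier with a pretrained tokenizer.
# The checkpoint name is illustrative; any subword tokenizer (WordPiece, BPE,
# SentencePiece) from the Hugging Face hub is used the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

subtokens = tokenizer.tokenize("hello_world")              # e.g. ['hello', '_', 'world']
subtoken_ids = tokenizer.convert_tokens_to_ids(subtokens)  # integer ids in the subtoken vocabulary
print(subtokens, subtoken_ids)
```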
-
Hi,
In the problem I'm working on, each node contains a token. I map each unique token to an integer, so the feature of each node in my graphs is a single integer. The first layer in my model is then `torch.nn.Embedding`, which maps each node's integer id to a fixed-length tensor. This approach works; however, the number of unique tokens is sometimes far too large.
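For concreteness, a minimal sketch of that setup; the vocabulary size, embedding dimension, and ids below are illustrative, not the actual values.

```python
# Current setup: one integer token id per node, embedded by nn.Embedding.
# Sizes are illustrative; the real vocabulary grows with every unique token.
import torch
import torch.nn as nn

num_unique_tokens = 50_000
embedding_dim = 64

embedding = nn.Embedding(num_unique_tokens, embedding_dim)

node_token_ids = torch.tensor([0, 42, 7])   # one token id per node
node_features = embedding(node_token_ids)   # shape: [num_nodes, embedding_dim]
```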
The problem is that `hello`, `world`, and `hello_world` are each treated as unique tokens. I want to know an efficient way to split tokens into subtokens (for example, `hello_world` would become the subtokens `hello`, `_`, and `world`), look up the embedding of each subtoken, stack these embeddings, and take their average or sum. So, in the end, some nodes would have one subtoken and others two or more, but whenever a node has more than one subtoken, the subtoken embeddings are retrieved, stacked, and averaged. I think this way I can limit the number of unique tokens; however, I'm not sure what an efficient way of implementing this approach is.
Thanks for your help.
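One way to implement this kind of pooling is `torch.nn.EmbeddingBag`, which looks up a variable-length bag of indices per node and reduces them (mean or sum) in a single call. The sketch below assumes a subtoken vocabulary has already been built; the vocabulary size, ids, and offsets are illustrative.

```python
# Sketch: per-node subtoken embedding with mean pooling via nn.EmbeddingBag.
# Assumes tokens have already been split into subtokens and mapped to ids.
import torch
import torch.nn as nn

num_subtokens = 1_000   # illustrative subtoken vocabulary size
embedding_dim = 64

embedding = nn.EmbeddingBag(num_subtokens, embedding_dim, mode="mean")

# Flattened subtoken ids for all nodes, plus the offset where each node's bag starts:
# node 0 -> [3, 17, 5], node 1 -> [8], node 2 -> [2, 9]
subtoken_ids = torch.tensor([3, 17, 5, 8, 2, 9])
offsets = torch.tensor([0, 3, 4])

node_features = embedding(subtoken_ids, offsets)   # shape: [num_nodes, embedding_dim]
print(node_features.shape)                         # torch.Size([3, 64])
```

Nodes with a single subtoken are handled the same way (a bag of length one), so no special-casing is needed, and switching `mode` to `"sum"` gives summation instead of averaging.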