OOV Handling #10591
-
Basically, I'm looking for a consistent way to represent non-conventional words, so they don't mess up the predictions.
-
Or, for the transformer models, is it possible to use the [MASK] token for OOV words? Could I replace an OOV word with a special token like [MASK]? It looks like there is a mask token in the tokenizer's vocab:

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("<s> Apple shares rose on the <mask>. Apple pie is delicious.")
print(doc._.trf_data.wordpieces.strings)
# [['<s>', '<s>', 'ĠApple', 'Ġshares', 'Ġrose', 'Ġon', 'Ġthe', '<mask>', '.', 'ĠApple', 'Ġpie', 'Ġis', 'Ġdelicious', '.', '</s>']]

vocab = nlp.get_pipe("transformer").model.tokenizer.vocab
print([s for s in vocab.keys() if s[0] == "<" and s[-1] == ">"])
# ['<mask>', '<unk>', '<pad>', '<s>', '</s>', '<|endoftext|>']
```
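For instance, I was imagining a rough preprocessing step along these lines (with a made-up list of "known" words purely for illustration), swapping anything I consider OOV for the mask token before the text ever reaches the pipeline:

```python
import spacy

nlp = spacy.load("en_core_web_trf")

# Made-up list of words treated as "known", just for illustration.
KNOWN = {"apple", "shares", "rose", "on", "the", "pie", "is", "delicious"}

def mask_oov(text):
    # Replace anything not in KNOWN with the tokenizer's mask token.
    # Punctuation handling is glossed over here; this is only a sketch.
    return " ".join(
        w if w.lower().strip(".,") in KNOWN else "<mask>" for w in text.split()
    )

doc = nlp(mask_oov("Apple shares rose on the Glorbnax. Apple pie is delicious."))
print(doc._.trf_data.wordpieces.strings)
```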
-
Short answer: to do what you want, you should add a hook that returns zero vectors for OOV terms (there's a rough sketch of this below). For the transformer pipelines there may not be a reasonable way to do this.

In more detail: aside from the recent floret releases, spaCy pipelines don't use subwords for word vectors. However, if word vectors are not present, the tok2vec contextual embedding can be returned instead, which is what happens in the small and transformer pipelines.

For transformers, token vectors are calculated by mapping the subwords generated by the underlying HuggingFace tokenizer to spaCy tokens and then combining the embeddings of those subwords (see the docs). The transformer tokenizers are designed to use subwords in such a way that there's no such thing as an OOV token, so I don't think there's a simple way to do what you want - you'd have to invent your own definition of what it means to be OOV.

You can't modify tokens in spaCy Docs - that's a design decision - so there's no easy way to swap in the mask token unless you do it as a preprocessing step (and I'm not sure it would do what you want).

If you have a lot of OOV words, generally the right thing to do is train your own model. For word vectors that's straightforward. If your text is still English, then for transformers I would suggest just using the model as-is first to check whether there's actually a problem - it might just work.

If you're really concerned about the tokens and have a good way to detect them, you can just preprocess them out of your text. You could do this using a lightweight spaCy pipeline to identify them before passing them on to the full pipeline, for example.
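As a rough sketch of that hook approach (assuming a pipeline with static vectors like en_core_web_md, and taking token.is_oov, i.e. "no entry in the vectors table", as the OOV definition), a custom component could look something like this:

```python
import numpy
import spacy
from spacy.language import Language

@Language.component("zero_oov_vectors")
def zero_oov_vectors(doc):
    # Override the Token.vector hook so tokens without an entry in the static
    # vectors table come back as zero vectors instead of whatever fallback
    # the pipeline would otherwise return.
    width = doc.vocab.vectors_length or 300  # arbitrary width if no vectors are loaded

    def token_vector(token):
        if token.is_oov:
            return numpy.zeros((width,), dtype="float32")
        # Look up the static vector directly so we don't recurse into this hook.
        return token.vocab.get_vector(token.orth)

    doc.user_token_hooks["vector"] = token_vector
    return doc

nlp = spacy.load("en_core_web_md")
nlp.add_pipe("zero_oov_vectors", last=True)
doc = nlp("flibbertigibbetish words are probably OOV")
print(doc[0].is_oov, doc[0].vector.sum())  # True 0.0 if the word has no vector
```

Doc and Span have matching user_hooks / user_span_hooks entries for "vector" if you want the same behaviour at those levels too.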
You can also modify which attributes are used in your embedding config.

As a small note, your examples of weird tokens will all be multiple tokens in spaCy. Hyphenated terms are usually split into multiple tokens by the English tokenizer (see the quick example below).

I've written fairly generally about this, but if you described what kind of task and text you actually have (dictionary entries?) it might be possible to give better advice.
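To illustrate the hyphenation point above with a blank English pipeline (tokenizer only):

```python
import spacy

nlp = spacy.blank("en")  # just the English tokenizer, no trained components needed
print([t.text for t in nlp("a state-of-the-art copper-silver alloy")])
# ['a', 'state', '-', 'of', '-', 'the', '-', 'art', 'copper', '-', 'silver', 'alloy']
```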