OOV Handling #10591
-
Basically, I'm looking for a consistent way to represent non-conventional words, so they don't mess up the predictions.
-
Or, for the transformer models, is it possible to use the [MASK] token for OOV words? Could I replace an OOV word with a special token like [MASK]? It looks like there is a mask token in the tokenizer's vocab:

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("<s> Apple shares rose on the <mask>. Apple pie is delicious.")
print(doc._.trf_data.wordpieces.strings)
# [['<s>', '<s>', 'ĠApple', 'Ġshares', 'Ġrose', 'Ġon', 'Ġthe', '<mask>', '.', 'ĠApple', 'Ġpie', 'Ġis', 'Ġdelicious', '.', '</s>']]

vocab = nlp.get_pipe("transformer").model.tokenizer.vocab
print([s for s in vocab.keys() if s[0] == "<" and s[-1] == ">"])
# ['<mask>', '<unk>', '<pad>', '<s>', '</s>', '<|endoftext|>']
```
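For instance, I was imagining a rough preprocessing step along these lines (with a made-up list of "known" words purely for illustration), swapping anything I consider OOV for the mask token before the text ever reaches the pipeline:

```python
import spacy

nlp = spacy.load("en_core_web_trf")

# Made-up list of words treated as "known", just for illustration.
KNOWN = {"apple", "shares", "rose", "on", "the", "pie", "is", "delicious"}

def mask_oov(text):
    # Replace anything not in KNOWN with the tokenizer's mask token.
    # Punctuation handling is glossed over here; this is only a sketch.
    return " ".join(
        w if w.lower().strip(".,") in KNOWN else "<mask>" for w in text.split()
    )

doc = nlp(mask_oov("Apple shares rose on the Glorbnax. Apple pie is delicious."))
print(doc._.trf_data.wordpieces.strings)
```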
-
Short answer: to do what you want, you should add a hook that returns zero vectors for OOV terms (there's a rough sketch of this below). For the transformer pipelines there may not be a reasonable way to do this.

In more detail: aside from the recent floret releases, spaCy pipelines don't use subwords for word vectors. However, if word vectors are not present, the tok2vec contextual embedding can be returned instead, which is what happens in the small and transformer pipelines.

For transformers, token vectors are calculated by mapping the subwords generated by the underlying HuggingFace tokenizer to spaCy tokens and then combining the embeddings of those subwords (see the docs). The transformer tokenizers are designed to use subwords in such a way that there's no such thing as an OOV token, so I don't think there's a simple way to do what you want - you'd have to invent your own definition of what it means to be OOV.

You can't modify tokens in spaCy Docs - that's a design decision - so there's no easy way to swap in the mask token unless you do it as a preprocessing step (and I'm not sure it would do what you want).

If you have a lot of OOV words, generally the right thing to do is train your own model. For word vectors that's straightforward. If your text is still English, then for transformers I would suggest just using the model as-is first to check whether there's actually a problem - it might just work.

If you're really concerned about the tokens and have a good way to detect them, you can just preprocess them out of your text. You could do this using a lightweight spaCy pipeline to identify them before passing them on to the full pipeline, for example.
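As a rough sketch of that hook approach (assuming a pipeline with static vectors like en_core_web_md, and taking token.is_oov, i.e. "no entry in the vectors table", as the OOV definition), a custom component could look something like this:

```python
import numpy
import spacy
from spacy.language import Language

@Language.component("zero_oov_vectors")
def zero_oov_vectors(doc):
    # Override the Token.vector hook so tokens without an entry in the static
    # vectors table come back as zero vectors instead of whatever fallback
    # the pipeline would otherwise return.
    width = doc.vocab.vectors_length or 300  # arbitrary width if no vectors are loaded

    def token_vector(token):
        if token.is_oov:
            return numpy.zeros((width,), dtype="float32")
        # Look up the static vector directly so we don't recurse into this hook.
        return token.vocab.get_vector(token.orth)

    doc.user_token_hooks["vector"] = token_vector
    return doc

nlp = spacy.load("en_core_web_md")
nlp.add_pipe("zero_oov_vectors", last=True)
doc = nlp("flibbertigibbetish words are probably OOV")
print(doc[0].is_oov, doc[0].vector.sum())  # True 0.0 if the word has no vector
```

Doc and Span have matching user_hooks / user_span_hooks entries for "vector" if you want the same behaviour at those levels too.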
You can also modify which attributes are used in your embedding config.

As a small note, your examples of weird tokens will all be multiple tokens in spaCy. Hyphenated terms are usually split into multiple tokens by the English tokenizer (see the quick example below).

I've written fairly generally about this, but if you described what kind of task and text you actually have (dictionary entries?) it might be possible to give better advice.
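To illustrate the hyphenation point above with a blank English pipeline (tokenizer only):

```python
import spacy

nlp = spacy.blank("en")  # just the English tokenizer, no trained components needed
print([t.text for t in nlp("a state-of-the-art copper-silver alloy")])
# ['a', 'state', '-', 'of', '-', 'the', '-', 'art', 'copper', '-', 'silver', 'alloy']
```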