Performance with misspellings #6277
Replies: 2 comments
-
I can only speak from my experience training a Spanish NER model on a specific domain, with not that many examples per label, so take it for what it is. My NER models simply fail on a misspelled word unless that exact misspelling occurs in the training data. I tested different training configurations of noise-level and orth-variant-level, but they made no difference at all in my experiments. One approach I tried was running the input through a spellchecker (e.g. hunspell or pyspell) before passing it to the NER model. This looks very promising and is fairly easy to implement, but depending on your domain it may not be as simple.
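A minimal sketch of the "spellcheck first, then NER" idea, for illustration only. It does not use hunspell; it swaps in the standard library's `difflib` fuzzy matching against a small, hypothetical domain vocabulary (`VOCAB`), and the helper names `correct_token` and `correct_text` are made up for this example. The corrected string is what you would then pass to your NER pipeline.

```python
import difflib
import re

# Hypothetical domain vocabulary. In practice this could come from a real
# dictionary (e.g. hunspell) or from the tokens seen in your training data.
VOCAB = sorted({"el", "paciente", "tiene", "diabetes", "hipertensión", "y"})


def correct_token(token, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary word, or the token unchanged."""
    lower = token.lower()
    # Leave known words, punctuation, whitespace and numbers untouched.
    if not token.isalpha() or lower in vocab:
        return token
    matches = difflib.get_close_matches(lower, vocab, n=1, cutoff=cutoff)
    # Note: a corrected token comes back lowercased in this sketch.
    return matches[0] if matches else token


def correct_text(text, vocab=VOCAB):
    """Correct each word while preserving whitespace and punctuation."""
    parts = re.findall(r"\w+|\W+", text)
    return "".join(correct_token(p, vocab) for p in parts)


if __name__ == "__main__":
    cleaned = correct_text("El paciente tiene diabtes.")
    print(cleaned)
    # `cleaned` is the text you would hand to the NER model.
```

The `cutoff` value controls how aggressive the correction is. In a narrow domain, a vocabulary that is too small can "correct" valid rare terms into the wrong word, which is probably the main reason this gets harder depending on the domain.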
-
There may be a small degree of robustness to misspellings in the middle of words because the default models use prefix (default: 1 char) and suffix (3 chars) features, but it depends so much on what your data looks like that you'll have to evaluate it for your task. There is no built-in data augmentation for misspellings or word-internal noise, but you can try augmenting your training data outside of spaCy to make your model more robust. The orth variants are variants of full tokens (like ASCII quotes vs. Unicode quotes), so that augmenter is probably not particularly useful for misspellings. I've thought about adding an augmenter for character variants (like
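As a rough illustration of augmenting training data outside of spaCy, here is a sketch that adds word-internal typos to NER examples. It is not a spaCy augmenter; the function names (`swap_inner_chars`, `add_typos`, `augment`) are hypothetical, and it assumes examples are `(text, {"entities": [(start, end, label)]})` tuples with character offsets. The noise only swaps adjacent characters inside a word, so the text length and the entity offsets stay valid.

```python
import random
import re


def swap_inner_chars(word, rng):
    """Swap two adjacent inner characters; first and last stay in place."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def add_typos(text, prob=0.1, seed=0):
    """Corrupt each alphabetic word with probability `prob`."""
    rng = random.Random(seed)

    def repl(match):
        word = match.group(0)
        return swap_inner_chars(word, rng) if rng.random() < prob else word

    # [^\W\d_]+ matches runs of letters only (no digits, no underscores).
    return re.sub(r"[^\W\d_]+", repl, text)


def augment(examples, prob=0.1, seed=0):
    """Return the original examples plus noisy copies with the same labels."""
    out = []
    for n, (text, annots) in enumerate(examples):
        out.append((text, annots))
        noisy = add_typos(text, prob, seed + n)
        if noisy != text:
            # Offsets are reused because the swap never changes the length.
            out.append((noisy, annots))
    return out


if __name__ == "__main__":
    data = [("Maria vive en Barcelona", {"entities": [(0, 5, "PER"), (14, 23, "LOC")]})]
    for text, annots in augment(data, prob=0.5, seed=1):
        print(text, annots)
```

Swaps are only one kind of typo. Insertions and deletions are closer to the "random characters in the middle of a sentence" case, but they shift character offsets, so the entity spans would need to be recomputed. Whether any of this helps is something to evaluate on your own data.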
-
How robust is spaCy to misspellings? We are considering adding spaCy to a production system; however, it's not clear how robust spaCy is to misspelled words. For example, if the user inserted random characters in the middle of a valid sentence, would spaCy handle this well? Or would spaCy's performance degrade significantly?
I think the answer to this depends on how much data is corrupted or misspelled in spaCy's training data.
Which page or section is this issue related to?
https://spacy.io/usage/facts-figures