spaCy named entity recognition does not seem to work if the entity is at the beginning of the string #12612
-
I'm using spaCy named entity recognition (NER) to parse names out of English sentences, and I've noticed that if the named entity is at the very beginning of the sentence, it isn't picked up. Let me show you an example:
This works as expected and gives the output ('Dumbledore', 'PERSON'). However, if I use the sentence s = "Dumbledore, however, was choosing another lemon drop and did not answer." (or any other sentence that starts with "Dumbledore"), the code above does not pick up Dumbledore as a person. Any suggestions on what could be the problem here?
Replies: 4 comments
-
Hello @vrunm, thanks for your question. Have you noticed this issue arising often? In the meantime, try replacing …
-
@bdura This happens often; it is a problem of the small model. Should I then use the trf and large models?
-
The lemmatization of the following sentence gave a somewhat confusing result:

['not', 'finished', 'not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

The thing is that the first "finished" and the second "finished" were detected as different parts of speech:

['PART', 'AUX', 'PART', 'VERB', 'DET', 'NOUN', 'ADP', 'PRON', 'VERB', 'NOUN', 'PROPN', 'VERB', 'ADP', 'PROPN', 'NOUN', 'CCONJ', 'DET', 'NOUN', 'AUX', 'DET', 'ADJ', 'NOUN']

The first one was detected as an auxiliary, while the second one was detected as a verb, as expected. Adding a third "Not finished!" gave the following result:

['not', 'finish', 'not', 'finish', 'not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

As did removing one of them:

['not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

Even four repetitions gave the expected result:

['not', 'finish', 'not', 'finish', 'not', 'finish', 'not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

I find it hard to come up with either a logical explanation or a workaround that would solve the problem in general, not only for this specific example.
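One narrow workaround (not a general fix) is to pin the analysis for the problematic token pattern with spaCy's attribute_ruler. A minimal sketch, shown on a blank pipeline so it runs without any trained model; in practice you would add the same rule on top of en_core_web_sm, and the pattern and attrs here are illustrative:

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical components, so the
# rule below is the sole source of lemma/POS information in this sketch.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Force the lemma and coarse POS for every occurrence of "finished".
ruler.add(
    patterns=[[{"LOWER": "finished"}]],
    attrs={"LEMMA": "finish", "POS": "VERB"},
)

doc = nlp("Not finished! Not finished!")
print([(t.text, t.lemma_) for t in doc if t.lower_ == "finished"])
# → [('finished', 'finish'), ('finished', 'finish')]
```

This guarantees a consistent lemma for that token however many times "Not finished!" repeats, at the cost of overriding the statistical tagger everywhere the pattern matches.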
-
Both of your questions can be answered in a similar way. Both the named entity recognition and part-of-speech tagging pipelines use machine learning models. Such models will make mistakes, and these mistakes are hard for us to correct in general, because models are not a deterministic set of rules. The accuracy of a model depends on several factors, including the size of the model and the data it was trained on.

Taking your named entity recognition example: the en_core_web_sm model is (as the name suggests) a small model. It uses a relatively small convolutional network and does not use static embeddings that are pretrained on a large corpus. You'd have to dive deeper to understand why the model behaves this way in this particular case. Similar reasoning applies to your second question: the models are a trade-off between size, speed and accuracy.

So what does this mean in practice? First, models make mistakes. Second, if the error rate is not acceptable, you may want to look at larger models (such as md/lg/trf) or, if you are working in a very specific domain, at annotating more training data. Finally, do not underestimate the power of a set of rules. If you are working in a particular domain, say processing Harry Potter novels, you could get a lot of mileage out of a small set of rules to recognize names, since they form a finite set (using e.g. the attribute ruler).
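For the Harry Potter scenario, the usual component for rule-based entities is the entity_ruler (a sibling of the attribute ruler mentioned above). A minimal sketch on a blank pipeline, so it runs without any trained model; the pattern list is illustrative:

```python
import spacy

# Rule-based NER: matches are deterministic, so sentence-initial position
# cannot affect whether the entity is found.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Dumbledore"},  # phrase pattern
    {"label": "PERSON", "pattern": [{"LOWER": "harry"}, {"LOWER": "potter"}]},
])

doc = nlp("Dumbledore, however, was choosing another lemon drop.")
print([(ent.text, ent.label_) for ent in doc.ents])
# → [('Dumbledore', 'PERSON')]
```

When combined with a trained pipeline, the ruler can be placed before the statistical component (nlp.add_pipe("entity_ruler", before="ner")) so its matches take precedence over the model's predictions.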