spaCy named entity recognition does not seem to work if the entity is at the beginning of the string #12612
-
I'm using spaCy named entity recognition (NER) to parse names out of English sentences, and I've noticed that if the named entity is at the very beginning of the sentence, it isn't picked up. Let me show you an example:
This works as expected and gives the output ('Dumbledore', 'PERSON'). However, if I use the sentence s = "Dumbledore, however, was choosing another lemon drop and did not answer." (or any other sentence that starts with "Dumbledore"), the code above does not pick up Dumbledore as a person. Any suggestions on what could be the problem here?
Replies: 4 comments
-
Hello @vrunm, thanks for your question. Have you noticed this issue arising often? In the meantime, try replacing …
-
@bdura This happens often; it is a problem of the small model. Should I then use the trf and large models?
-
The lemmatization of the following sentence gave a somewhat confusing result:

['not', 'finished', 'not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

The thing is that the first "finished" and the second "finished" were detected as different parts of speech:

['PART', 'AUX', 'PART', 'VERB', 'DET', 'NOUN', 'ADP', 'PRON', 'VERB', 'NOUN', 'PROPN', 'VERB', 'ADP', 'PROPN', 'NOUN', 'CCONJ', 'DET', 'NOUN', 'AUX', 'DET', 'ADJ', 'NOUN']

The first one was detected as an auxiliary, while the second one was detected as a verb, as expected. Adding a third "Not finished!" gave the following result:

['not', 'finish', 'not', 'finish', 'not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

As did removing one of them:

['not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

Even four repetitions gave the expected result:

['not', 'finish', 'not', 'finish', 'not', 'finish', 'not', 'finish', 'a', 'gem', 'for', 'our', 'adopt', 'daughter', 'bear', 'of', 'avatar', 'and', 'whose', 'conception', 'be', 'a', 'complete', 'mystery']

I find it hard to come up with either a logical explanation or a workaround that would solve the problem in general, not only for this specific example.
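One narrow workaround (not a general fix) is to pin the analysis for the problematic token pattern with spaCy's attribute_ruler. A minimal sketch, shown on a blank pipeline so it runs without any trained model; in practice you would add the same rule on top of en_core_web_sm, and the pattern and attrs here are illustrative:

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical components, so the
# rule below is the sole source of lemma/POS information in this sketch.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Force the lemma and coarse POS for every occurrence of "finished".
ruler.add(
    patterns=[[{"LOWER": "finished"}]],
    attrs={"LEMMA": "finish", "POS": "VERB"},
)

doc = nlp("Not finished! Not finished!")
print([(t.text, t.lemma_) for t in doc if t.lower_ == "finished"])
# → [('finished', 'finish'), ('finished', 'finish')]
```

This guarantees a consistent lemma for that token however many times "Not finished!" repeats, at the cost of overriding the statistical tagger everywhere the pattern matches.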
-
Both of your questions can be answered in a similar way. Both the named entity recognition and part-of-speech tagging pipelines use machine learning models. Such models will make mistakes, and these mistakes are hard for us to correct in general, because models are not a deterministic set of rules. The accuracy of a model depends on several factors, including the size of the model and the data it was trained on.

Taking your named entity recognition example: the en_core_web_sm model is (as the name suggests) a small model. It uses a relatively small convolutional network and does not use static embeddings that are pretrained on a large corpus. You'd have to dive deeper to understand why the model behaves this way in this particular case. Similar reasoning applies to your second question: the models are a trade-off between size, speed and accuracy.

So what does this mean in practice? First, models make mistakes. Second, if the error rate is not acceptable, you may want to look at larger models (such as md/lg/trf) or, if you are working in a very specific domain, at annotating more training data. Finally, do not underestimate the power of a set of rules. If you are working in a particular domain, say processing Harry Potter novels, you could get a lot of mileage out of a small set of rules to recognize names, since they form a finite set (using e.g. the attribute ruler).
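For the Harry Potter scenario, the usual component for rule-based entities is the entity_ruler (a sibling of the attribute ruler mentioned above). A minimal sketch on a blank pipeline, so it runs without any trained model; the pattern list is illustrative:

```python
import spacy

# Rule-based NER: matches are deterministic, so sentence-initial position
# cannot affect whether the entity is found.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Dumbledore"},  # phrase pattern
    {"label": "PERSON", "pattern": [{"LOWER": "harry"}, {"LOWER": "potter"}]},
])

doc = nlp("Dumbledore, however, was choosing another lemon drop.")
print([(ent.text, ent.label_) for ent in doc.ents])
# → [('Dumbledore', 'PERSON')]
```

When combined with a trained pipeline, the ruler can be placed before the statistical component (nlp.add_pipe("entity_ruler", before="ner")) so its matches take precedence over the model's predictions.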