Identifying entities in lowercase for spacy's NER #11931

n-srinidhi · 2022-12-05T11:21:05Z

Hi! I am having issues detecting entities written in lowercase. For example, HDFC Bank is recognized as an entity by a custom-trained spaCy NER model, while hdfc bank returns {}. I did go through this thread, but haven't found a suitable solution. Any update on this?

Answered by polm

polm · 2022-12-06T04:07:16Z

The spaCy models are trained on newspaper-style text, which is properly capitalized and punctuated for the most part, so they haven't seen lowercase text like this before and won't do very well on it. I believe we do some augmentation to help with this, but being lowercase does make it harder. They also won't do very well on isolated entity names ("HDFC Bank" rather than "I opened an account with HDFC Bank").

If this is a common problem for you, it might make sense to train your own NER models, or to use a truecasing model, as mentioned in the issue you linked. (The particular truecaser linked there seems to be abandoned and somewhat old, so I would look for something more recent.)
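To make the truecasing idea concrete, here is a minimal sketch of a dictionary-based truecaser. It is not the truecaser from the linked issue and not part of spaCy; the function names `build_case_map` and `truecase` are made up for illustration, and it assumes you have some properly cased text from your own domain to learn casings from.

```python
import re
from collections import Counter, defaultdict

# Splits text into alternating runs of word and non-word characters,
# so joining the pieces reproduces the original string exactly.
TOKEN_RE = re.compile(r"\w+|\W+")


def build_case_map(cased_texts):
    """Map each lowercased token to its most frequent casing (hypothetical helper)."""
    counts = defaultdict(Counter)
    for text in cased_texts:
        tokens = re.findall(r"\w+", text)
        # Skip the first token of each text: an initial capital there
        # says little about the word's usual casing.
        for tok in tokens[1:]:
            counts[tok.lower()][tok] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}


def truecase(text, case_map):
    """Restore the learned casing for known tokens; leave everything else as-is."""
    pieces = TOKEN_RE.findall(text)
    return "".join(case_map.get(piece.lower(), piece) for piece in pieces)


cased_texts = [
    "I opened an account with HDFC Bank.",
    "She moved her savings to HDFC Bank last year.",
]
case_map = build_case_map(cased_texts)
restored = truecase("i opened an account with hdfc bank", case_map)
```

The restored text would then be passed to the NER model instead of the raw lowercase input. A lookup like this is context-blind (it would also capitalize "bank" in "river bank"), which is one reason a trained truecasing model is likely the better choice if this matters for your data.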

5 replies

n-srinidhi Dec 6, 2022
Author

Hi! Thanks for your prompt response! My current model is a custom-trained NER model with custom entities. Will including more examples of lowercase text and isolated entity names like HDFC Bank help with this?

polm Dec 6, 2022

Adding more examples of lowercase text should help.

If your training data includes isolated entities, then the model will learn to predict them, but it may have limited ability to generalize. What is your input text? Do you actually provide text like "HDFC Bank" to the model and need to predict if the whole string is an entity or not?

n-srinidhi Dec 6, 2022
Author

The text does not contain standalone entities; they are always accompanied by surrounding text. For example:
The losses were partly stemmed by a pick up in Nifty PSU Bank (.NIFTYPSU) index, the top sectoral gainer that climbed 0.39%, after a Morgan Stanley report said that public sector banks would continue their strong performance on the back of higher margins.
Here, Nifty PSU Bank is an entity. There are cases, although rare, where the text contains the same entity in lowercase, as nifty psu bank.

polm Dec 6, 2022

OK, in that case training data consisting of isolated entities won't be useful, since the model won't be able to learn to pick the entity out of a sentence. Your training data should be as much like your input as possible, so it should also consist of complete sentences.

If you have training data that is properly cased, you can try lowercasing all of it, or making lowercased copies, so the model can learn that case isn't critical - look up "data augmentation".
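As a rough illustration of making lowercased copies, here is a small sketch in plain Python. It assumes the training data is a list of `(text, {"entities": [(start, end, label)]})` tuples with character offsets; the helper name `add_lowercased_copies` and the `ORG` label are placeholders, not anything from the original poster's setup.

```python
def add_lowercased_copies(examples):
    """Return the original examples plus a lowercased copy of each one.

    Assumes each example is (text, {"entities": [(start, end, label), ...]})
    with character offsets into the text.
    """
    augmented = list(examples)
    for text, annotations in examples:
        lowered = text.lower()
        # Skip texts that are already lowercase, and the rare Unicode cases
        # where lowercasing changes the length and would invalidate offsets.
        if lowered == text or len(lowered) != len(text):
            continue
        augmented.append((lowered, {"entities": list(annotations["entities"])}))
    return augmented


text = "The losses were partly stemmed by a pick up in Nifty PSU Bank index."
start = text.index("Nifty PSU Bank")
end = start + len("Nifty PSU Bank")

# "ORG" is a placeholder label for this sketch.
TRAIN_DATA = [(text, {"entities": [(start, end, "ORG")]})]
augmented = add_lowercased_copies(TRAIN_DATA)
```

Because lowercasing normally keeps the string length the same, the entity offsets carry over unchanged to the copy. If you train with a spaCy v3 config, there is also a built-in lowercasing augmenter (registered as `spacy.lower_case.v1`) that can be set on the training corpus, so check the augmentation docs for your version before hand-rolling this.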

n-srinidhi Dec 6, 2022
Author

This has been super helpful! Thank you!
