Custom NER on a large dataset #8748
Replies: 2 comments 11 replies
-
I'm a little unclear on what kind of data you have and what kind of training you've done. Could you give some example sentences? Also try running `spacy debug data`. I assume that each of your labels is an NER label and that each "data" item is a sentence. If you have example sentences that contain each label separately, but none that contain them together, that's a problem: the model will have trouble learning how your labels fit together.

When you say you trained it for 100 iterations, do you mean 100 epochs? I'm a little surprised that it would learn C better than the other labels if it has fewer examples. Also, even a good model won't be perfect.

Besides your training data, you should have validation data to check that the model isn't overfitting. As you add more training data or tweak your model, you can use your validation data to confirm that the model is actually getting better.
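To illustrate the held-out validation data mentioned above, here is a minimal sketch of an 80/20 train/dev split, assuming your annotations are `(text, annotations)` tuples (the helper name `train_dev_split` and the 80/20 ratio are illustrative, not part of spaCy's API):

```python
import random

def train_dev_split(examples, dev_fraction=0.2, seed=42):
    """Shuffle and split annotated examples into train and dev sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Toy data standing in for real (text, annotations) pairs.
examples = [(f"sentence {i}", {"entities": []}) for i in range(100)]
train, dev = train_dev_split(examples)
print(len(train), len(dev))  # 80 20
```

Training should only ever see `train`; evaluate on `dev` after each change to check for overfitting.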
-
From your example (`Udaykati --> B`), all of these appear to be locations (unions?) in Bangladesh; I used this site to check. polm gave a good example of entity extraction on a sentence: NER is meant for identifying spans within whole sentences. In this case, I assume "Udaykati Dhansagar Guthia" is not a sentence. If you are only feeding in one word at a time, entity extraction is not going to perform well; NER is meant to be used on complete sentences. Can you share what A, B, and C refer to?

While spaCy is a fantastic tool for natural language processing, NER works better if the class division is based on information found online, such as on Wikipedia, because word vectors are often pre-trained on web text. For example, if you wanted to extract 'Bangladesh', 'Barisal', and 'Guthia' from the sentence "I grew up in Guthia in Barisal which is in Bangladesh", you may be able to train on sentences of that type to identify [('Guthia','UNION'), ('Barisal','DISTRICT'), ('Bangladesh','COUNTRY')], but it will be harder if the distinction is more specific to your use case.
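spaCy's NER training format annotates entities as character offsets into the full sentence, so the example above would need to be converted into `(start, end, label)` spans. A minimal sketch, assuming each entity surface form appears exactly once in the text (the helper `char_spans` is illustrative, not a spaCy function):

```python
def char_spans(text, entities):
    """Convert (surface_form, label) pairs into (start, end, label)
    character spans, the shape spaCy's NER training data expects."""
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} not found in text")
        spans.append((start, start + len(surface), label))
    return spans

text = "I grew up in Guthia in Barisal which is in Bangladesh"
print(char_spans(text, [("Guthia", "UNION"),
                        ("Barisal", "DISTRICT"),
                        ("Bangladesh", "COUNTRY")]))
# [(13, 19, 'UNION'), (23, 30, 'DISTRICT'), (43, 53, 'COUNTRY')]
```

Note that the entities are spans inside a real sentence; a bare list of place names like "Udaykati Dhansagar Guthia" gives the model no surrounding context to learn from.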
-
Let's say I have a JSON file containing 3 labels: A, B, C.
Label A has around 4,000 examples in it.
Label B has 400 examples and label C has 60.
What is the best practice for training all the labels at once, so that if I form a sentence containing all 3 labels, the model can predict them correctly?
I have already trained on the whole dataset for 100 iterations, but on my test cases it only worked for label C.
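A 4,000/400/60 split is heavily imbalanced, and it helps to verify the actual per-label counts in the training data before debugging the model. A minimal sketch, assuming the data is a list of `(text, {"entities": [(start, end, label), ...]})` tuples (the helper `label_counts` is illustrative, not a spaCy function):

```python
from collections import Counter

def label_counts(examples):
    """Count how many entity annotations each label has."""
    counts = Counter()
    for _text, annotations in examples:
        for _start, _end, label in annotations["entities"]:
            counts[label] += 1
    return counts

# Toy data standing in for the real JSON training set.
examples = [
    ("x", {"entities": [(0, 1, "A"), (0, 1, "A")]}),
    ("y", {"entities": [(0, 1, "B")]}),
    ("z", {"entities": [(0, 1, "C")]}),
]
print(label_counts(examples))  # Counter({'A': 2, 'B': 1, 'C': 1})
```

`spacy debug data` reports similar statistics, along with warnings about labels that have too few examples.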