Custom NER on a large dataset #8748
Replies: 2 comments 11 replies
-
I'm a little unclear on what kind of data you have and what kind of training you've done. Could you give some example sentences? Also try running `spacy debug data`. I assume that each of your labels is an NER label and that each "data" item is a sentence. If you have example sentences that contain each label separately, but none that contain them together, that's a problem: the model will have trouble learning how your labels fit together.

When you say you trained it for 100 iterations, do you mean 100 epochs? I'm a little surprised that it would learn C better than the other labels if it has fewer examples. Also, even a good model won't be perfect.

Besides your training data, you should have validation data to check that the model isn't overfitting. As you add more training data or tweak your model, you can use your validation data to confirm that the model is actually getting better.
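To illustrate the held-out validation data mentioned above, here is a minimal sketch of an 80/20 train/dev split, assuming your annotations are `(text, annotations)` tuples (the helper name `train_dev_split` and the 80/20 ratio are illustrative, not part of spaCy's API):

```python
import random

def train_dev_split(examples, dev_fraction=0.2, seed=42):
    """Shuffle and split annotated examples into train and dev sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

# Toy data standing in for real (text, annotations) pairs.
examples = [(f"sentence {i}", {"entities": []}) for i in range(100)]
train, dev = train_dev_split(examples)
print(len(train), len(dev))  # 80 20
```

Training should only ever see `train`; evaluate on `dev` after each change to check for overfitting.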
-
From your example (`Udaykati --> B`), all of these appear to be locations (unions?) in Bangladesh; I used this site to check. polm gave a good example of entity extraction on a sentence: NER is meant for identifying spans within whole sentences. In this case, I assume "Udaykati Dhansagar Guthia" is not a sentence. If you are only feeding in one word at a time, entity extraction is not going to perform well; NER is meant to be used on complete sentences. Can you share what A, B, and C refer to?

While spaCy is a fantastic tool for natural language processing, NER works better if the class division is based on information found online, such as on Wikipedia, because word vectors are often pre-trained on web text. For example, if you wanted to extract 'Bangladesh', 'Barisal', and 'Guthia' from the sentence "I grew up in Guthia in Barisal which is in Bangladesh", you may be able to train on sentences of that type to identify [('Guthia','UNION'), ('Barisal','DISTRICT'), ('Bangladesh','COUNTRY')], but it will be harder if the distinction is more specific to your use case.
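spaCy's NER training format annotates entities as character offsets into the full sentence, so the example above would need to be converted into `(start, end, label)` spans. A minimal sketch, assuming each entity surface form appears exactly once in the text (the helper `char_spans` is illustrative, not a spaCy function):

```python
def char_spans(text, entities):
    """Convert (surface_form, label) pairs into (start, end, label)
    character spans, the shape spaCy's NER training data expects."""
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} not found in text")
        spans.append((start, start + len(surface), label))
    return spans

text = "I grew up in Guthia in Barisal which is in Bangladesh"
print(char_spans(text, [("Guthia", "UNION"),
                        ("Barisal", "DISTRICT"),
                        ("Bangladesh", "COUNTRY")]))
# [(13, 19, 'UNION'), (23, 30, 'DISTRICT'), (43, 53, 'COUNTRY')]
```

Note that the entities are spans inside a real sentence; a bare list of place names like "Udaykati Dhansagar Guthia" gives the model no surrounding context to learn from.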
-
Let's say I have a JSON file containing 3 labels: A, B, C.
Label A has around 4,000 examples in it.
Label B has 400 examples and label C has 60.
What is the best practice for training all the labels at once, so that if I form a sentence containing all 3 labels, the model can predict them correctly?
I have already trained on the whole dataset for 100 iterations, but on my test cases it only worked for label C.
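A 4,000/400/60 split is heavily imbalanced, and it helps to verify the actual per-label counts in the training data before debugging the model. A minimal sketch, assuming the data is a list of `(text, {"entities": [(start, end, label), ...]})` tuples (the helper `label_counts` is illustrative, not a spaCy function):

```python
from collections import Counter

def label_counts(examples):
    """Count how many entity annotations each label has."""
    counts = Counter()
    for _text, annotations in examples:
        for _start, _end, label in annotations["entities"]:
            counts[label] += 1
    return counts

# Toy data standing in for the real JSON training set.
examples = [
    ("x", {"entities": [(0, 1, "A"), (0, 1, "A")]}),
    ("y", {"entities": [(0, 1, "B")]}),
    ("z", {"entities": [(0, 1, "C")]}),
]
print(label_counts(examples))  # Counter({'A': 2, 'B': 1, 'C': 1})
```

`spacy debug data` reports similar statistics, along with warnings about labels that have too few examples.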