Interaction between entity ruler and training, and other conceptual queries #7591
Replies: 1 comment
Hey, sorry for the delayed reply on this. As a general point, for high-level conceptual questions like "is it better to do X or Y?", often the best advice is to try both approaches and see what the difference is. Depending on the details of the problem it may be possible to give good advice up front, but a lot of ML is empirical, and you just have to try and see.
Currently components do not update examples during training so you can't have this kind of dependence. Allowing this kind of interaction is something we're working on actively now.
This is mostly a try-and-see kind of problem. For MOBILE NUM specifically, I could see it go either way - maybe mobile numbers look sufficiently distinctive that they don't interact with other entities, or maybe having them labeled helps the model learn to differentiate house numbers in an address from mobile numbers.
If you do fine-tuning without some entities you run the risk of catastrophic forgetting, see here for details. Typical options here would be to train an address-only model, or to train a model with all the annotations. I would expect a lot of interaction between normal entities and addresses ("Coca-Cola Lane"), so I would try both.
Hi!
First off, great work on spaCy 3 =) I'm new to it but it's an amazing treasure trove.
While working on a project, I've got a couple conceptual questions that I wanted to seek clarification on.
My goal is to fine-tune an existing pre-trained model (en_core_web_trf) to perform NER. Some of my categories, such as PERSON and ORG, are already part of the label scheme of the pre-trained model. I've also got new NER categories, some of which require statistical, model-based NER (e.g. ADDRESS) and some of which (e.g. MOBILE NUM) I intend to extract solely using regex, using the entity ruler.
In the training data, I labelled all these categories, PERSON, ORG, ADDRESS and MOBILE NUM, regardless of whether they are to be extracted statistically or via the entity ruler.
When I simply load and use the pre-trained en_core_web_trf model, the results are already pretty neat. I also realised that adding the entity ruler before the ner component in the pipeline increased the performance, even of those categories that were statistically obtained via the ner component.
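For readers following along, the rule-based part of this setup might look roughly like the sketch below. It uses a blank English pipeline so it runs without downloading en_core_web_trf, and it assumes, purely for illustration, that a mobile number is a single token of exactly 10 digits; the example sentence is also made up.

```python
import spacy

# Blank English pipeline so the sketch runs without downloading a model.
nlp = spacy.blank("en")

# With a pretrained pipeline, nlp.add_pipe("entity_ruler", before="ner")
# would place the ruler ahead of the statistical ner component instead.
ruler = nlp.add_pipe("entity_ruler")

# Assumption: a mobile number is a single token of exactly 10 digits.
# The anchors matter because the REGEX operator does a search, not a full match.
ruler.add_patterns([
    {"label": "MOBILE NUM", "pattern": [{"TEXT": {"REGEX": r"^\d{10}$"}}]}
])

doc = nlp("Call me on 9876543210 tomorrow.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Real mobile numbers with spaces, dashes or country codes are split into several tokens, so they would need a multi-token pattern rather than this single-token regex.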
When moving on to fine-tuning using the CLI, I've now got a few questions:
Thanks so much!