Model forgetting previously trained entities #5229
Replies: 2 comments
-
Hi Team, sorry for asking again. If possible, could we get access to the training/labelled data you used to train the en_core_web_lg model? Thanks, Regards,
-
No, sorry, the license doesn't allow us to share the data (OntoNotes). You can create your own "silver" corpus by running the model on your own texts that contain the full range of entity types and mixing that data in to prevent it from forgetting as much. It's even better if you have the resources to hand-correct some of the entities, since performance, especially on the rare entity types, may still degrade to some degree. The other alternative, training a second NER model with just your PERSON type, could also be a good option. Given that you need to retrain frequently, this might be the better choice because you may be able to retrain more quickly.
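The "silver corpus" idea above could be sketched roughly as follows. The function names, the `annotate` callable, and the gold/silver mixing step are illustrative assumptions, not part of the original answer; with spaCy, `annotate` would wrap the pretrained pipeline's predictions.

```python
import random

def make_silver_corpus(texts, annotate):
    """Build 'silver' training examples by trusting the current model's
    own predictions on your texts.

    annotate(text) must return a list of (start_char, end_char, label)
    tuples. With spaCy that could be (assuming en_core_web_lg is installed):
        nlp = spacy.load("en_core_web_lg")
        annotate = lambda t: [(e.start_char, e.end_char, e.label_)
                              for e in nlp(t).ents]
    """
    return [(t, {"entities": annotate(t)}) for t in texts]

def mix(gold, silver, seed=0):
    """Shuffle gold and silver examples together so each training batch
    rehearses the old entity types alongside the new PERSON data."""
    data = list(gold) + list(silver)
    random.Random(seed).shuffle(data)
    return data

# Stub annotator standing in for the real model (hypothetical):
silver = make_silver_corpus(["Paris is nice."],
                            lambda t: [(0, 5, "GPE")])
training_data = mix([("Aravind called.", {"entities": [(0, 7, "PERSON")]})],
                    silver)
```

Hand-correcting a sample of the silver annotations before mixing, as suggested above, would improve on this sketch.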
-
Hi Team,
I am referring to the catastrophic forgetting problem.
We collected a custom list of around 95k names and then tested how many of them the spaCy model redacts. For the remaining 35k names that spaCy did not redact, we are retraining the model in batches.
Initially I took 1,000 names and used the template + data-generator function (with some small changes, like defining PERSON from just the FIRST_NAME column, because I copied my names list into that column) and trained the model:
# define PERSON as FIRST_NAME + LAST_NAME
df["PERSON"] = df["FIRST_NAME"]  # + " " + df["LAST_NAME"]
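A template + data-generator step like the one mentioned above might look roughly like this. The templates and the function name are illustrative, not the actual generator used; the output follows spaCy's `(text, {"entities": [...]})` training format.

```python
import random

# Hypothetical templates; the real generator's templates may differ.
TEMPLATES = [
    "Please contact {name} for details.",
    "{name} joined the meeting late.",
    "The invoice was approved by {name}.",
]

def make_examples(names, templates=TEMPLATES, seed=0):
    """Drop each name into a random template and record the PERSON
    span as (start_char, end_char, label), spaCy's training format."""
    rng = random.Random(seed)
    examples = []
    for name in names:
        text = rng.choice(templates).format(name=name)
        start = text.index(name)
        examples.append(
            (text, {"entities": [(start, start + len(name), "PERSON")]})
        )
    return examples
```

One caveat: if every example contains only PERSON spans, training on this data alone is exactly what causes the forgetting described below.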
After training the model, I tested it with sample data and observed the forgetting problem. I think the reason may be the small number of training examples or the limited coverage of entity types.
Another problem is that our company will receive labelled data (entities) periodically, so I need to set up a job that retrains the model on a regular basis.
I have a couple of ideas for this; could you share your comments on them?
1. Create another model (starting from spacy.blank) with just our labelled data and add both models (en_core_web_lg and our model) to the data-redaction pipeline, assuming they don't add much runtime.
2. Alternatively, append our data to the data you used for training en_core_web_lg. So, if possible, can we get access to that data? Whenever we retrain the model with our labelled data set, we would take some samples from the previous training data set (balanced across all entities).
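Idea 1 above could be sketched like this: run both models over the text, take the union of their entity spans, and redact the merged spans. The span-merging logic is an assumption about how the two outputs would be combined, not an existing spaCy API.

```python
def merge_spans(span_lists):
    """Union character spans from several models, merging overlaps."""
    spans = sorted(s for spans in span_lists for s in spans)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def redact(text, span_lists, mask="[REDACTED]"):
    """Replace every merged span with a mask token."""
    out, last = [], 0
    for start, end in merge_spans(span_lists):
        out.append(text[last:start])
        out.append(mask)
        last = end
    out.append(text[last:])
    return "".join(out)

# In practice each span list would come from one pipeline, e.g.
# [(e.start_char, e.end_char) for e in nlp(text).ents]
# once for en_core_web_lg and once for the blank-model custom NER.
```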
Thanks,
Regards,
Aravind