Model forgetting previously trained entities #5229
Replies: 2 comments
-
Hi Team, sorry for asking again. If possible, could we get access to the training/labelled data you used to train the en_core_web_lg model? Thanks, Regards,
-
No, sorry, the license doesn't allow us to share the data (OntoNotes). You can create your own "silver" corpus by running the model on your own texts that contain the full range of entity types and mixing that data in to prevent it from forgetting as much. It's even better if you have the resources to hand-correct some of the entities, since performance, especially on the rare entity types, may still degrade to some degree. The other alternative, training a second NER model with just your PERSON type, could also be a good option. Given that you need to retrain frequently, this might be the better choice because you may be able to retrain more quickly.
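The "silver corpus" idea above could be sketched roughly as follows. The function names, the `annotate` callable, and the gold/silver mixing step are illustrative assumptions, not part of the original answer; with spaCy, `annotate` would wrap the pretrained pipeline's predictions.

```python
import random

def make_silver_corpus(texts, annotate):
    """Build 'silver' training examples by trusting the current model's
    own predictions on your texts.

    annotate(text) must return a list of (start_char, end_char, label)
    tuples. With spaCy that could be (assuming en_core_web_lg is installed):
        nlp = spacy.load("en_core_web_lg")
        annotate = lambda t: [(e.start_char, e.end_char, e.label_)
                              for e in nlp(t).ents]
    """
    return [(t, {"entities": annotate(t)}) for t in texts]

def mix(gold, silver, seed=0):
    """Shuffle gold and silver examples together so each training batch
    rehearses the old entity types alongside the new PERSON data."""
    data = list(gold) + list(silver)
    random.Random(seed).shuffle(data)
    return data

# Stub annotator standing in for the real model (hypothetical):
silver = make_silver_corpus(["Paris is nice."],
                            lambda t: [(0, 5, "GPE")])
training_data = mix([("Aravind called.", {"entities": [(0, 7, "PERSON")]})],
                    silver)
```

Hand-correcting a sample of the silver annotations before mixing, as suggested above, would improve on this sketch.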
-
Hi Team,
I am referring to the catastrophic forgetting problem.
We collected a custom list of around 95k names and then tested how many of them the spaCy model redacts. For the remaining 35k names that spaCy did not redact, we are retraining the model in batches.
Initially I took 1,000 names and used the template + data-generator function (with some small changes, like defining PERSON from just the FIRST_NAME column, because I copied my names list into that column) and trained the model:
# define PERSON as FIRST_NAME + LAST_NAME
df["PERSON"] = df["FIRST_NAME"]  # + " " + df["LAST_NAME"]
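A template + data-generator step like the one mentioned above might look roughly like this. The templates and the function name are illustrative, not the actual generator used; the output follows spaCy's `(text, {"entities": [...]})` training format.

```python
import random

# Hypothetical templates; the real generator's templates may differ.
TEMPLATES = [
    "Please contact {name} for details.",
    "{name} joined the meeting late.",
    "The invoice was approved by {name}.",
]

def make_examples(names, templates=TEMPLATES, seed=0):
    """Drop each name into a random template and record the PERSON
    span as (start_char, end_char, label), spaCy's training format."""
    rng = random.Random(seed)
    examples = []
    for name in names:
        text = rng.choice(templates).format(name=name)
        start = text.index(name)
        examples.append(
            (text, {"entities": [(start, start + len(name), "PERSON")]})
        )
    return examples
```

One caveat: if every example contains only PERSON spans, training on this data alone is exactly what causes the forgetting described below.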
After training the model, I tested it with sample data and observed the forgetting problem. I think the reason may be the small number of training examples or the limited coverage of entity types.
Another problem is that our company will receive labelled data (entities) periodically, so I need to set up a job that retrains the model on a regular basis.
I have a couple of ideas for this; could you share your comments on them?
1. Create another model (starting from spacy.blank) with just our labelled data and add both models (en_core_web_lg and our model) to the data-redaction pipeline, assuming they don't add much runtime.
2. Alternatively, append our data to the data you used for training en_core_web_lg. So, if possible, can we get access to that data? Whenever we retrain the model with our labelled data set, we would take some samples from the previous training data set (balanced across all entities).
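Idea 1 above could be sketched like this: run both models over the text, take the union of their entity spans, and redact the merged spans. The span-merging logic is an assumption about how the two outputs would be combined, not an existing spaCy API.

```python
def merge_spans(span_lists):
    """Union character spans from several models, merging overlaps."""
    spans = sorted(s for spans in span_lists for s in spans)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def redact(text, span_lists, mask="[REDACTED]"):
    """Replace every merged span with a mask token."""
    out, last = [], 0
    for start, end in merge_spans(span_lists):
        out.append(text[last:start])
        out.append(mask)
        last = end
    out.append(text[last:])
    return "".join(out)

# In practice each span list would come from one pipeline, e.g.
# [(e.start_char, e.end_char) for e in nlp(text).ents]
# once for en_core_web_lg and once for the blank-model custom NER.
```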
Thanks,
Regards,
Aravind