Adding a new label to a trained Spancategorizer model #10995

marpavakav · 2022-06-21T10:47:42Z

marpavakav
Jun 21, 2022

Hello! Is there a way to introduce a new label when training a spancategorizer on top of an already trained spancat model that did not originally include this label?

Background: I want to train a spancategorizer on 1 label first, then use that trained model as the base for a second model. This second time my training dataset has 2 labels: the original one and a new one. Is that possible? If yes, how?

If I just introduce a new trainset with 2 labels, spaCy complains about the new label. I have even tried to initialise both the first and the second model with both labels, using a labels.json file and the "[initialize.components.spancat.labels]" option in the config file, but that didn't do the trick.

I suspect what I am trying to do is not possible at the moment. Is that right?
Thank you!

pmbaumgartner · 2022-06-21T15:18:17Z

pmbaumgartner
Jun 21, 2022

If I just introduce a new trainset with 2 labels, spaCy complains about the new label.

What error do you get when you try this? There shouldn't be a problem here if you're training the model from scratch on this dataset.

1 reply

marpavakav Jun 22, 2022
Author

Hello! Thank you for trying to help me!

No, I am not trying to train from scratch. Let me explain a bit more. I trained my first model with a trainset that has only 1 label, “LABEL1”, and in the config I get “spancat” from factory like this:

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

That runs fine, and I now have a trained model1. Next I want to use model1 as the base, but my new trainset has 2 labels, “LABEL1” and “LABEL2”. I have edited the config to read model1 as the base:

[components.spancat]
source="model1/model-best"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

But when I try to train with this config and the new trainset, I get this error:
"Aborting and saving the final best model. Encountered exception: KeyError('LABEL2')"

I think this means that spaCy doesn’t like the second label.

Then I tried to initialise the labels in the config like this:

[initialize.components]

[initialize.components.spancat]

[initialize.components.spancat.labels]
@readers = "spacy.read_labels.v1"
path = "labels.json"

labels.json looks like this:

{
"spancat":[
"LABEL1",
"LABEL2"
]
}

But I still get the same error.
Does this mean I cannot introduce new labels later, and that I always need to go back to the factory spancat and train from scratch whenever I have new labels?

pmbaumgartner · 2022-06-22T16:24:45Z

pmbaumgartner
Jun 22, 2022

Thanks for the detail, that's helpful.

The main issue is that only some components are resizable, and SpanCategorizer is not one of them. You can check this with the component's .is_resizable attribute.
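As a rough illustration of that check, here is a minimal sketch that lists the is_resizable flag for every component in a pipeline. The helper name resizable_components is made up for this example, and the stand-in pipeline below (including its flag values) is an assumption so the snippet runs without spaCy installed; it only mimics the shape of nlp.pipeline, which is a list of (name, component) pairs.

```python
from types import SimpleNamespace


def resizable_components(nlp):
    """Return {pipe_name: is_resizable} for every component in a pipeline.

    Components that do not expose `is_resizable` are reported as False.
    """
    return {
        name: bool(getattr(pipe, "is_resizable", False))
        for name, pipe in nlp.pipeline
    }


# Stand-in pipeline so the sketch runs without spaCy installed. The flag
# values are made up for illustration, not read from a real model.
fake_nlp = SimpleNamespace(
    pipeline=[
        ("spancat", SimpleNamespace(is_resizable=False)),
        ("other_component", SimpleNamespace(is_resizable=True)),
    ]
)
print(resizable_components(fake_nlp))
```

With spaCy installed, the same helper could presumably be called on a loaded pipeline, for example resizable_components(spacy.load("model1/model-best")).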

I don't think that means you're out of options for dealing with this problem. If you could tell me a bit more about your data, maybe we can come up with something that'll work. Between your two training sets, are they the same examples in both just with different labels? Or does training set 1 contain examples that are unique from training set 2?

5 replies

marpavakav Jun 23, 2022
Author

Hello again!
So, the 2 datasets come from the same bigger pool of texts. They will not be exactly the same texts, but they will have the same entities/labels to extract.

Let me explain more. Say we have 1 million legal documents and a schema of 10 entities we need to extract from each document (some texts will have all 10 entities, some will have only a few of them). Rather than annotating, say, 1000 documents with all 10 entities in one go, we thought we could try this:

1. First, annotate 100 documents only on entity1 --> train with spancategorizer from factory and get model1.
2. Then find another (different) 100 documents that will have entity1 prefilled with what model1 says; now the annotator will label only entity2 (they will also correct entity1 if the model got it wrong) --> use model1 as base and the new trainset that now has 2 entities/labels and train to get model2.
3. Repeat step 2 for all 10 entities of the schema, each time using the previous model as base and also for prefilling all the entities/labels that we have already looked at.

We thought that doing it this way would speed up the annotation process. We also thought that using the previously trained model as base each time would mean we don't lose the annotations that were done before. Another reason for doing it like this is that, as we work on this project, our stakeholders might decide to add more entities/labels to the schema.
But of course I don't know if this idea is doable.

Thank you for your time and help!

pmbaumgartner Jun 23, 2022

Thanks for the explanation. I think your situation is pretty common; I've encountered it myself. I think of it as more of a data problem and less of a model architecture problem.

From my perspective, the underlying problem arises between steps 1 and 2, when you move from dataset1 (entity1 labeled) to dataset2 (entity1 predicted + corrected, entity2 labeled). If you were to now train a model on dataset1 and dataset2 together, the model wouldn't know how to treat entity2 in the examples from dataset1, since those 100 examples were never labeled for it. The reason is a bit nuanced, but essentially when you label something you're telling the model what is relevant, and when you leave something unlabeled you're telling it what isn't relevant. So if you have 100 examples that contain entity2 spans that aren't labeled, you'd be "misinforming" the model about the characteristics that make up that entity. Think of it this way: if dataset1 and dataset2 happened to contain the same example, but entity2 was only labeled in dataset2, you would be confusing the model, because the same span is relevant in one case and not in the other.
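To make the "same example, different labels" case concrete, here is a small sketch that looks for texts appearing in both datasets and reports spans annotated in only one of them. The data layout (a dict with "text" and a list of (start, end, label) tuples), the sample sentence, and the function name conflicting_spans are all assumptions made up for this illustration; they are not a spaCy format.

```python
def conflicting_spans(dataset_a, dataset_b):
    """Find spans annotated in only one of two datasets for a shared text.

    Each example is a dict: {"text": str, "spans": [(start, end, label), ...]}.
    Returns a list of (text, span) pairs.
    """
    spans_a = {ex["text"]: set(ex["spans"]) for ex in dataset_a}
    conflicts = []
    for ex in dataset_b:
        if ex["text"] in spans_a:
            # Symmetric difference: spans present in exactly one dataset.
            for span in sorted(set(ex["spans"]) ^ spans_a[ex["text"]]):
                conflicts.append((ex["text"], span))
    return conflicts


text = "The lessee shall pay rent to the lessor."
dataset1 = [{"text": text, "spans": [(0, 10, "LABEL1")]}]
dataset2 = [{"text": text, "spans": [(0, 10, "LABEL1"), (29, 39, "LABEL2")]}]

# The LABEL2 span is a positive example in dataset2 but an implicit
# negative in dataset1, which is the contradiction described above.
print(conflicting_spans(dataset1, dataset2))
```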

The solution is unfortunately a little more work but worth it. Essentially you'll need to re-label each of your incremental datasets with the new entities. So as you expand to dataset2 and entity2, dataset2 is really dataset1 + newdata. If you're able to use the predictions from the model (or your past annotations) for each of your existing entities as you relabel it's not really that much additional work - plus it gives you the opportunity to correct any labeling errors you made along the way and correct labels if your definition of the entities in the schema changed.
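One way to keep track of this could look like the sketch below: each example records which labels it has been reviewed for, and anything not yet reviewed for the full schema is sent back for another pass instead of going into training. The "reviewed_labels" field, the sample texts, and the function name split_by_coverage are hypothetical bookkeeping invented for this example.

```python
def split_by_coverage(examples, schema):
    """Separate examples reviewed for every label in `schema` from the rest.

    Each example is a dict with "text" and "reviewed_labels" (the labels an
    annotator has looked at for this text, whether or not any span was found).
    Returns (ready, needs_relabel), where needs_relabel is a list of
    (text, missing_labels) pairs.
    """
    ready, needs_relabel = [], []
    for ex in examples:
        missing = sorted(set(schema) - set(ex["reviewed_labels"]))
        if missing:
            needs_relabel.append((ex["text"], missing))
        else:
            ready.append(ex)
    return ready, needs_relabel


schema = ["LABEL1", "LABEL2"]
examples = [
    # From dataset1: only ever annotated for LABEL1.
    {"text": "old document", "reviewed_labels": ["LABEL1"]},
    # From the new batch: prefilled LABEL1 was corrected, LABEL2 was labeled.
    {"text": "new document", "reviewed_labels": ["LABEL1", "LABEL2"]},
]

ready, needs_relabel = split_by_coverage(examples, schema)
print(len(ready), needs_relabel)
```

Only the examples in ready would go into the training set for the 2-label model; the ones in needs_relabel would first get the extra annotation pass for the missing labels.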

marpavakav Jun 24, 2022
Author

Yes, that was the idea. But I think we are saying that I should take the new dataset, which will now have entity1 (predicted by the previous model) + entity2 (from manual annotation), and train from scratch, i.e. use the spancat from factory again, and that I cannot use the already trained model as base. Have I understood correctly?

pmbaumgartner Jun 24, 2022

Yes, that's correct.

marpavakav Jun 24, 2022
Author

Thank you very much for your help and your time!
