Adding new category to multi-label classification problem #8171
drisspg
started this conversation in Help: Best practices
Replies: 1 comment
- Hi! I just wanted to add one more option for the future to consider. We're currently working on implementing a resizable |
Scenario:
I have generated dataset X0 for a multi-label classification problem with N categories. I train a model, everything looks good, and then product comes back and says they want another category. So I generate dataset X1 with N+1 categories. X0 and X1 are disjoint in this scenario.
Problem:
It is expensive to generate annotated datasets. Instead of relabeling X0 and only adding the extra category where relevant, X1 is a separate text dataset that has been labeled both to improve performance on the original N categories and to provide labeled training data for the new N+1 category. If a new dataset X = X0 U X1 were made, the problem arises that some examples in the original X0 dataset should have been labeled with the N+1 category but were not. Training a model on X may not perform well, because a similar instance in X0 did not get labeled with the N+1 category while one in X1 did. I think this would likely hurt the recall of the new category.
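To make the conflict concrete, here is a toy sketch (texts and label names are invented, not from any real dataset):

```python
NEW = "category_n_plus_1"  # hypothetical name for the added label

# Two near-duplicate texts carry conflicting labels across the datasets:
x0_example = ("my payment failed twice", {"billing"})       # from X0, labeled before NEW existed
x1_example = ("my payment failed again", {"billing", NEW})  # from X1, labeled with NEW

# Trained on X0 U X1, the model sees NEW both absent and present on
# very similar inputs, which pulls its confidence for NEW down and
# tends to hurt recall on the new category.
conflict = NEW in x1_example[1] and NEW not in x0_example[1]
```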
Potential solutions:
Create a new dataset X = X0 U {x ∈ X1 | N+1 category is not present} and Y = {x ∈ X1 | N+1 category is present}. Train the existing model on dataset X and create a new model on dataset Y. I have updated to spaCy 3.0 and already have a pipeline with 2 textcats (two separately trained models merged into one pipeline). The main limitation of this is that we lose any useful correlations that may exist between the previous N categories and the new N+1 category.
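The dataset split in the first approach can be sketched as follows; this assumes each instance is a (text, labels) pair, and `NEW` stands in for the hypothetical name of the added category:

```python
NEW = "category_n_plus_1"  # hypothetical name for the added label

# Toy stand-ins for the two annotated datasets.
X0 = [("old text a", {"cat1"}), ("old text b", {"cat2", "cat3"})]
X1 = [("new text a", {"cat1", NEW}), ("new text b", {"cat2"})]

# X trains the existing N-label model: X0 plus the X1 examples
# where the new category is absent.
X = X0 + [ex for ex in X1 if NEW not in ex[1]]

# Y trains a separate model for just the new category.
Y = [ex for ex in X1 if NEW in ex[1]]
```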
Create a new dataset X = X0 U X1 and retrain the model. Accept the potential loss in performance due to conflicting instance labels, but hopefully as X1 grows in size the loss in performance will decrease.
Train a model on X1. Use this new model to run inference on all of X0. If the prediction for an instance in X0 is confident for the N+1 category, amend that instance's labels so the N+1 category is present. After this has been done, form X = X0 U X1 and retrain the model, with the hope that this beats solution 2.
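The third approach is essentially self-training / pseudo-labeling. A minimal sketch, where `score_new_cat` is a stand-in for the X1-trained model's score and the 0.5 threshold is an assumed hyperparameter, not anything from the original post:

```python
NEW = "category_n_plus_1"  # hypothetical name for the added label
THRESHOLD = 0.5            # assumed confidence cutoff for amending a label

def score_new_cat(text):
    """Stand-in for the X1-trained model's probability for the new category."""
    return 0.9 if "refund" in text else 0.1  # toy heuristic for illustration

X0 = [("please refund my order", {"billing"}), ("great product", {"praise"})]
X1 = [("refund request", {"billing", NEW})]

# Amend X0 labels wherever the X1-trained model is confident.
X0_amended = [
    (text, labels | {NEW} if score_new_cat(text) >= THRESHOLD else labels)
    for text, labels in X0
]

# Retrain the full (N+1)-label model on the union.
X = X0_amended + X1
```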
Has anyone had a similar experience where their multi-label classification problem is learned continually and new labels potentially overlap with previous datasets? I am also in a similar scenario for NER.