Adding new category to multi-label classification problem #8171
drisspg
started this conversation in Help: Best practices
Replies: 1 comment
- Hi! I just wanted to add one more option for the future to consider. We're currently working on implementing a resizable |
Scenario:
I have generated dataset X0 for a multi-label classification problem with N categories. I train a model, everything looks good, and then product comes back and says they want another category. So I generate dataset X1 with N+1 categories. X0 and X1 are disjoint in this scenario.
Problem:
It is expensive to generate annotated datasets. Instead of relabeling X0 and only adding the extra category where relevant, X1 is a separate text dataset that has been labeled both to improve performance on the original N categories and to provide labeled training data for the new N+1 category. If a new dataset X = X0 U X1 were made, the problem arises that some examples in the original X0 dataset should have been labeled with the N+1 category but were not. Training a model on X may not perform well, because a similar instance in X0 did not get labeled with the N+1 category while one in X1 did. I think this would likely hurt the recall of the new category.
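To make the conflict concrete, here is a toy sketch (texts and label names are invented, not from any real dataset):

```python
NEW = "category_n_plus_1"  # hypothetical name for the added label

# Two near-duplicate texts carry conflicting labels across the datasets:
x0_example = ("my payment failed twice", {"billing"})       # from X0, labeled before NEW existed
x1_example = ("my payment failed again", {"billing", NEW})  # from X1, labeled with NEW

# Trained on X0 U X1, the model sees NEW both absent and present on
# very similar inputs, which pulls its confidence for NEW down and
# tends to hurt recall on the new category.
conflict = NEW in x1_example[1] and NEW not in x0_example[1]
```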
Potential solutions:
Create a new dataset X = X0 U {x ∈ X1 | N+1 category is not present} and Y = {x ∈ X1 | N+1 category is present}. Train the existing model on dataset X and create a new model on dataset Y. I have updated to spaCy 3.0 and already have a pipeline with 2 textcats (two separately trained models merged into one pipeline). The main limitation of this is that we lose any useful correlations that may exist between the previous N categories and the new N+1 category.
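The dataset split in the first approach can be sketched as follows; this assumes each instance is a (text, labels) pair, and `NEW` stands in for the hypothetical name of the added category:

```python
NEW = "category_n_plus_1"  # hypothetical name for the added label

# Toy stand-ins for the two annotated datasets.
X0 = [("old text a", {"cat1"}), ("old text b", {"cat2", "cat3"})]
X1 = [("new text a", {"cat1", NEW}), ("new text b", {"cat2"})]

# X trains the existing N-label model: X0 plus the X1 examples
# where the new category is absent.
X = X0 + [ex for ex in X1 if NEW not in ex[1]]

# Y trains a separate model for just the new category.
Y = [ex for ex in X1 if NEW in ex[1]]
```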
Create a new dataset X = X0 U X1 and retrain the model. Accept the potential loss in performance due to conflicting instance labels, but hopefully as X1 grows in size the loss in performance will decrease.
Train a model on X1. Use this new model to run inference on all of X0. If the prediction for an instance in X0 is confident for the N+1 category, amend that instance's labels so the N+1 category is present. After this has been done, form X = X0 U X1 and retrain the model, with the hope that this beats solution 2.
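The third approach is essentially self-training / pseudo-labeling. A minimal sketch, where `score_new_cat` is a stand-in for the X1-trained model's score and the 0.5 threshold is an assumed hyperparameter, not anything from the original post:

```python
NEW = "category_n_plus_1"  # hypothetical name for the added label
THRESHOLD = 0.5            # assumed confidence cutoff for amending a label

def score_new_cat(text):
    """Stand-in for the X1-trained model's probability for the new category."""
    return 0.9 if "refund" in text else 0.1  # toy heuristic for illustration

X0 = [("please refund my order", {"billing"}), ("great product", {"praise"})]
X1 = [("refund request", {"billing", NEW})]

# Amend X0 labels wherever the X1-trained model is confident.
X0_amended = [
    (text, labels | {NEW} if score_new_cat(text) >= THRESHOLD else labels)
    for text, labels in X0
]

# Retrain the full (N+1)-label model on the union.
X = X0_amended + X1
```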
Has anyone had a similar experience where their multi-label classification problem is learned continually and new labels potentially overlap with previous datasets? I am also in a similar scenario for NER.