Textcat Model with Large Dataset + Number of Labels #10578
-
Currently training a multilabel textcat model on 27 labels, with a dataset of 32,000 examples averaging around 3,000 words per example. The issue is that the model only seems to be training on roughly 5-8 of the possible labels, plateauing around a 0.30 F1-score before training ends due to the patience limit being hit.

What adjustments should or could I be making to fields such as the batch size, the patience, etc.? Is the model just not seeing certain labels during training before the patience limit is hit? Should I increase the batch size limit for training and/or evaluation so that it is over, or at least closer to, the average length of the docs?

A few of the labels only occur a handful of times (5-20) in the training and test sets; however, with 20+ labels that appear not to be trained or evaluated on at all, I feel like there are adjustments I should be making. Unfortunately, the development is happening offline on a different system, so I cannot post the exact output and code. Apologies for this, and thank you in advance!
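For reference, the fields I'm asking about are the usual ones from a quickstart-style training config. The values below are roughly the generated defaults, shown only for illustration since I can't post my actual settings:

```ini
[training]
patience = 1600
eval_frequency = 200
max_steps = 20000
max_epochs = 0

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
```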
-
Have you tried just increasing your patience limit? The default value is not special; if it's much smaller than the size of your epoch, you should change it.

It sounds like you might be having issues with imbalanced data. Somewhat imbalanced data should be OK, but it would be better if you could balance your dataset. You can do that by finding more examples of rare classes, using augmentation to create artificial examples of rare classes, and reducing the number of examples of common classes.

If you have 32k examples but only 20 of a given class, I don't think there's any way for the model to learn to handle that minority class reasonably. One thing you could do is use a classifier to sort documents into "rare classes" and "non-rare classes", and then train separate classifiers for each of those cases, where the dataset could be more balanced. (Currently this would require three textcats. We have plans to make a hierarchical classifier that would make this easier, but are not working on it yet.)
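As a rough illustration of rebalancing on the data side, here's a minimal sketch of oversampling rare labels before building your training corpus. It assumes your examples are `(text, {"cats": {...}})` pairs and that you've already decided which labels count as rare; the function name and the `factor` value are made up for the example:

```python
import random


def oversample_rare(examples, rare_labels, factor=5, seed=0):
    """Duplicate any example that carries at least one rare label."""
    random.seed(seed)
    extra = []
    for text, annots in examples:
        cats = annots.get("cats", {})
        if any(cats.get(label, 0.0) >= 0.5 for label in rare_labels):
            # Add (factor - 1) extra copies so the example appears `factor` times.
            extra.extend([(text, annots)] * (factor - 1))
    combined = examples + extra
    random.shuffle(combined)
    return combined
```

Plain duplication is the crudest form of this; swapping in genuinely augmented text for the duplicated copies would be better.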
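For the three-textcat idea, here is one possible way to wire it up at inference time. The model names are placeholders for three separately trained pipelines (a binary rare/non-rare router plus two multilabel textcats), and the `RARE` label and threshold are likewise just for illustration:

```python
import spacy

# Placeholder names for three separately trained pipelines.
router_nlp = spacy.load("router_model")    # binary: RARE vs. not
rare_nlp = spacy.load("rare_textcat")      # multilabel over the rare labels
common_nlp = spacy.load("common_textcat")  # multilabel over the common labels


def classify(text, threshold=0.5):
    doc = router_nlp(text)
    if doc.cats.get("RARE", 0.0) >= threshold:
        return rare_nlp(text).cats
    return common_nlp(text).cats
```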