Textcat Model with Large Dataset + Number of Labels #10578
-
Currently training a multilabel textcat model on 27 labels, with a dataset of 32,000 examples averaging around 3,000 words per example. The issue is that the model only seems to be training on roughly 5-8 of the possible labels, plateauing around a 0.30 F1-score before training ends due to the patience limit being hit.

What adjustments should or could I be making to fields such as the batch size, the patience, etc.? Is the model just not seeing certain labels during training before the patience limit is hit? Should I increase the batch size limit for training and/or evaluation so that it is over, or at least closer to, the average length of the docs?

A few of the labels only occur a handful of times (5-20) in the training and test sets; however, with 20+ labels that appear not to be trained or evaluated on at all, I feel like there are adjustments I should be making. Unfortunately, the development is happening offline on a different system, so I cannot post the exact output and code. Apologies for this, and thank you in advance!
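For reference, the fields I'm asking about are the usual ones from a quickstart-style training config. The values below are roughly the generated defaults, shown only for illustration since I can't post my actual settings:

```ini
[training]
patience = 1600
eval_frequency = 200
max_steps = 20000
max_epochs = 0

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
```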
-
Have you tried just increasing your patience limit? The default value is not special; if it's much smaller than the size of your epoch, you should change it.

It sounds like you might be having issues with imbalanced data. Somewhat imbalanced data should be OK, but it would be better if you could balance your dataset. You can do that by finding more examples of rare classes, using augmentation to create artificial examples of rare classes, and reducing the number of examples of common classes.

If you have 32k examples but only 20 of a given class, I don't think there's any way for the model to learn to handle that minority class reasonably. One thing you could do is use a classifier to sort documents into "rare classes" and "non-rare classes", and then train separate classifiers for each of those cases, where the dataset could be more balanced. (Currently this would require three textcats. We have plans to make a hierarchical classifier that would make this easier, but are not working on it yet.)
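As a rough illustration of rebalancing on the data side, here's a minimal sketch of oversampling rare labels before building your training corpus. It assumes your examples are `(text, {"cats": {...}})` pairs and that you've already decided which labels count as rare; the function name and the `factor` value are made up for the example:

```python
import random


def oversample_rare(examples, rare_labels, factor=5, seed=0):
    """Duplicate any example that carries at least one rare label."""
    random.seed(seed)
    extra = []
    for text, annots in examples:
        cats = annots.get("cats", {})
        if any(cats.get(label, 0.0) >= 0.5 for label in rare_labels):
            # Add (factor - 1) extra copies so the example appears `factor` times.
            extra.extend([(text, annots)] * (factor - 1))
    combined = examples + extra
    random.shuffle(combined)
    return combined
```

Plain duplication is the crudest form of this; swapping in genuinely augmented text for the duplicated copies would be better.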
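For the three-textcat idea, here is one possible way to wire it up at inference time. The model names are placeholders for three separately trained pipelines (a binary rare/non-rare router plus two multilabel textcats), and the `RARE` label and threshold are likewise just for illustration:

```python
import spacy

# Placeholder names for three separately trained pipelines.
router_nlp = spacy.load("router_model")    # binary: RARE vs. not
rare_nlp = spacy.load("rare_textcat")      # multilabel over the rare labels
common_nlp = spacy.load("common_textcat")  # multilabel over the common labels


def classify(text, threshold=0.5):
    doc = router_nlp(text)
    if doc.cats.get("RARE", 0.0) >= threshold:
        return rare_nlp(text).cats
    return common_nlp(text).cats
```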